Introduction to the Analysis of Electronic Health Records: A Short Course

A 3-Day Livestream Seminar Taught by Jesse Gronsbell, Ph.D.

The widespread adoption of electronic health records (EHRs) has generated massive amounts of clinical data with the potential to improve healthcare delivery and advance biomedical research. EHRs contain comprehensive patient-level information collected over time, including demographics, disease diagnoses, medical procedures, and vital signs. Large-scale EHR databases are also increasingly being linked across healthcare systems and to biobanks containing detailed genetic data, making it possible to characterize individual health at unprecedented scale and precision.

However, EHR data is complex and heterogeneous. Effective data analysis requires a deep understanding of the data as well as familiarity with modern statistical and machine learning methods. This course will provide a broad overview of the analysis of EHR data for participants with little or no prior experience with the topic. We will start with the opportunities and challenges associated with the analysis of EHR data. We will then build an understanding of data provenance and structure. Finally, we will cover basic and advanced methods for EHR data analysis and their use in various research applications.

We will cover a full suite of methods for processing EHR data, developing phenotyping models, generating real-world evidence, and developing fair and privacy-preserving predictive models. You will also be introduced to publicly available datasets, software packages for statistical analyses, and tools for clinical natural language processing. The course will be hands-on and will use R and the RStudio computing environment. After completing the course, you will be prepared to analyze your own EHR dataset and deepen your knowledge of the topic.

Starting February 29, we are offering this seminar as a 3-day synchronous*, livestream workshop held via the free video-conferencing software Zoom. Each day will consist of two lecture sessions that include hands-on exercises, separated by a 1-hour break. You are encouraged to join the lectures live but will have the opportunity to view the recorded sessions later in the day if you are unable to attend at the scheduled time.

*We understand that finding time to participate in livestream courses can be difficult. If you prefer, you may take all or part of the course asynchronously. The video recordings will be made available within 24 hours of each session and will be accessible for four weeks after the seminar, meaning that you will get all of the class content and discussions even if you cannot participate synchronously.

Closed captioning is available for all live and recorded sessions. Live captions can be translated into a variety of languages, including Spanish, Korean, and Italian.


This seminar will use R as the base software and incorporate publicly available clinical natural language processing software such as MetaMap. All of the datasets used for exercises are openly available, and detailed instructions will be provided for any additional software.

Basic familiarity with R is highly desirable, but even novice R coders should be able to follow the presentation and do the exercises.

If you’d like to take this course but are concerned that you don’t know enough R, there are excellent online resources for learning the basics.

Who Should Register?

This course is for you if you want to learn the fundamentals of EHR data analysis and apply them to your own biomedical research questions. While no prior knowledge of EHR data is necessary, knowledge of linear and logistic regression is required for the course.


Day 1

1. Introduction to electronic health record (EHR) data

    • Types of EHR systems
    • EHR terminology
    • Data structure and provenance

2. Opportunities and challenges for EHR-based applications

    • Opportunities: comparative effectiveness studies, clinical decision support, biobank analyses, etc.
    • Challenges: selection bias, missing data, measurement error, etc.

Day 2

3. Curating research quality data

    • Code mapping
    • Free-text processing

4. EHR-based phenotyping

    • Rule-based algorithms
    • Machine learning methods

Day 3

5. Real-world evidence generation with EHRs

6. Predictive modeling with EHRs

    • Fairness considerations
    • Privacy-preserving algorithms

Seminar Information

Thursday, February 29 – Saturday, March 2, 2024

Daily Schedule: All sessions are held live via Zoom. All times are ET (New York time).

10:00am-12:30pm

Payment Information

The fee of $995 includes all course materials.

PayPal and all major credit cards are accepted.

Our Tax ID number is 26-4576270.