Introduction to Python for Data Analysis: A Short Course
A 4-Day Livestream Seminar Taught by Edwin Dalmaijer, DPhil
NOTE: this course is designed for those who have no previous experience with Python. If you are looking to learn more advanced methods, join Dr. Dalmaijer for Unsupervised Statistical Learning Using Python on August 12-13 to discover how to use the scikit-learn package in Python to uncover subgroups and latent components in datasets with unsupervised machine-learning techniques.
Python is one of the most popular programming languages in the world. It is a general-purpose language, but also highly user-friendly. This makes Python a very powerful tool with a relatively gentle learning curve. It is also open-source and backed by a large international community of users who support each other and continue to develop additional functionality.
In the field of data science, Python has become indispensable. It is used for quick prototyping of statistical models and machine-learning pipelines, and you can even find highly mature Python applications in production environments! It has also become a go-to language in science, from astrophysics (e.g. black hole imaging) to zoology (e.g. evolution simulation).
This course is aimed at beginners, including those who are new to programming altogether. We will start with the basics of coding, including variables, logic, loops, functions, and object-oriented programming. In addition, we will discuss reading and writing data files, how to process large quantities of data quickly, and data visualization. Finally, the course will cover statistics, regression, models, and a bit of machine learning. No prior knowledge of any of these topics is assumed.
Starting June 25, we are offering this seminar as a 4-day synchronous*, livestream workshop held via the free video-conferencing software Zoom. Each day will consist of two lecture sessions which include hands-on exercises, separated by a 1-hour break. You are encouraged to join the lecture live, but will have the opportunity to view the recorded session later in the day if you are unable to attend at the scheduled time.
*We understand that finding time to participate in livestream courses can be difficult. If you prefer, you may take all or part of the course asynchronously. The video recordings will be made available within 24 hours of each session and will be accessible for four weeks after the seminar, meaning that you will get all of the class content and discussions even if you cannot participate synchronously.
Closed captioning is available for all live and recorded sessions. Captions can be translated to a variety of languages including Spanish, Korean, and Italian.
More Details About the Course Content
More specifically, the course will cover how to write your own functions and classes, using variables, statements, and loops. These building blocks make up the majority of most codebases, so writing them is a crucial skill to master. You will also be introduced to some of the most commonly used packages: NumPy and SciPy for fast computing, Matplotlib for publication-quality visualizations, and scikit-learn for machine learning.
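As a rough illustration of these building blocks (not taken from the course materials), the sketch below combines a variable, a for loop, an if statement, a function, and a small class. The names `mean`, `Sample`, and `reaction_times` are hypothetical and chosen only for this example.

```python
def mean(values):
    """Return the arithmetic mean of a sequence of numbers."""
    total = 0.0
    for value in values:
        total += value
    return total / len(values)


class Sample:
    """A hypothetical container for a named list of measurements."""

    def __init__(self, name, values):
        self.name = name
        self.values = list(values)

    def summary(self):
        # An if statement guards against an empty sample.
        if len(self.values) == 0:
            return f"{self.name}: no data"
        return f"{self.name}: mean = {mean(self.values):.2f}"


reaction_times = Sample("reaction_times", [0.5, 0.25, 0.75])
print(reaction_times.summary())  # prints "reaction_times: mean = 0.50"
```

The function holds a reusable calculation, while the class bundles data with the operations that act on it.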
The course will be very hands-on. It will run through interactive notebooks in your internet browser, so you won’t have to download anything. However, we will provide advice on how to install Python and additional packages, so that you can continue using it at home and at work.
At the end of this course, you should be able to find your own way in Python. You will be equipped to handle datasets and to write full analyses. You will also be well-equipped to start deepening your Python knowledge, as this course will have introduced some of the most commonly used tools.
Computing
To run hands-on exercises, we will be using carefully crafted interactive notebooks via Google Colaboratory. For this, you only need an internet browser (like Firefox) and a Google account.
Alternatively, you are welcome to install Python on your own computer. In addition to Python (version 3.7 or higher), you will need the packages NumPy, SciPy, Matplotlib, and scikit-learn. Python package installation can be a bit tricky for those who aren’t familiar with it. We will cover installing Python packages on the first day of the course, so you might want to wait to install anything until then.
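If you do set up a local installation, a short script along the following lines could confirm that the requirements above are met. This is only a sketch: the helper name `check_environment` is hypothetical, and it assumes the usual import names for the four packages (scikit-learn is imported as `sklearn`).

```python
import importlib.util
import sys

# Minimum Python version stated in the course description.
MIN_PYTHON = (3, 7)

# Import names for NumPy, SciPy, Matplotlib, and scikit-learn.
REQUIRED_MODULES = ["numpy", "scipy", "matplotlib", "sklearn"]


def check_environment():
    """Return (python_ok, availability) without importing the packages."""
    python_ok = sys.version_info[:2] >= MIN_PYTHON
    availability = {}
    for module_name in REQUIRED_MODULES:
        # find_spec returns None when a top-level module cannot be found.
        spec = importlib.util.find_spec(module_name)
        availability[module_name] = spec is not None
    return python_ok, availability


python_ok, availability = check_environment()
print("Python version OK:", python_ok)
for module_name, found in availability.items():
    print(module_name, "found" if found else "missing")
```

Any package reported as missing can be installed after the first day's session on package installation.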
Who Should Register?
This course is for everyone who would like to learn Python, or to dip their toes into programming. The content leans towards data science, so this course will be especially useful to those who would like to expand their expertise in data handling, visualization, statistics, and basic machine learning. No prior knowledge of coding or statistics is necessary: we’ll start with the basics, and work our way up from there.
Outline
Day 1: Programming basics
- Variables
- Numerical values (int and float), operations, and functions
- Text values (str), operations, and functions
- Booleans and logical operations
- Collections (tuples, lists, dictionaries), operations, and functions
- If statements
- While loops
- For loops
- Functions
- What is a function?
- Input and output
Day 2: Data processing
- NumPy
- Arrays: fast, scalable, fantastic
- Useful array manipulation functions
- Random data generation
- Loading and writing data
- Paths and the os module
- Writing a CSV file
- Loading a CSV file
- Managing big data with memory-mapped arrays
Day 3: Data visualization
- Data visualization
- Matplotlib
- The basics: scatter plots, lines, error bars, and bar charts
- Better than bars: box plots and violin plots
- Drawing distributions
- Heatmaps
Day 4: Statistics
- Basic statistical tests
- Tests of relations
- Tests of differences
- Model fitting
- Scikit-learn
- Linear regression
- Multivariable regression
- Cross-validation
- Data for training, and data for testing
- N-fold cross-validation
Reviews of Introduction to Python for Data Analysis
“This course was one of the best trainings I’ve attended. It covered a number of different areas that are needed to really grasp the content, including saving files, different commands, running calculations, plotting data, and more.”
Juan Carlos Torres, San Diego County Office of Education
“Professor Dalmaijer was outstanding. Period. This is tough-going for rudimentary ‘coders’ but Dalmaijer had a ‘gentle’ way of taking us through it all.”
Warren Laskey, University of New Mexico
“The presentation style by the instructor and his knowledge about the subject were both excellent. Day 3 of the presentation materials was my favorite.”
Denis Nyongesa, Kaiser Permanente Center for Health Research
“I learned a lot from Edwin. He is very knowledgeable about Python.”
Aris Kaloudis, Norwegian University of Science and Technology
“The instructor was very nice and patient. I liked that each chapter had some assignments at the end to practice.”
Junyi Zhou, Johns Hopkins University
“I liked the course materials and instructor. Well done!”
Sarah Olson, Johns Hopkins University School of Medicine
Seminar Information
Tuesday, June 25 – Friday, June 28, 2024
Daily Schedule: All sessions are held live via Zoom. All times are ET (New York time).
10:30am-12:30pm
1:30pm-3:00pm
Payment Information
The fee of $995 includes all course materials.
PayPal and all major credit cards are accepted.
Our Tax ID number is 26-4576270.