Python for Data Analysis

A 4-Day Remote Seminar Taught by Jason Anastasopoulos, Ph.D.

Python is a premier language for modern data science and data analysis. It is a free, open-source language that has a simple, easy-to-understand syntax and an incredible range of data analysis and visualization libraries. This four-day seminar combines an introductory and an intermediate course in Python. The goal is to get participants to fully understand many of the basic elements of Python and immediately apply them to practical data analysis and data collection problems.

Starting May 26, we are offering this seminar as a 4-day synchronous*, remote workshop for the first time. Each day will consist of a 3-hour, live morning lecture held via the free video-conferencing software Zoom. Participants are encouraged to join the lecture live, but will have the opportunity to view the recorded session later in the day if they are unable to attend at the scheduled time. Each lecture session will conclude with a hands-on exercise reviewing the content covered, to be completed on one’s own that afternoon. A final session will be held each evening as an “office hour”, where participants can review the exercise results with the instructor and ask any questions.

*We understand that scheduling is difficult during this unpredictable time. If you prefer, you may take all or part of the course asynchronously. The video recordings will be available for 3 weeks after the course ends, meaning that you will get all of the class discussion and exercise solutions even if you cannot participate synchronously.

More Details About the Course Content

Python is rapidly becoming the preferred language of data scientists in both industry and academia. It’s used by Google, Facebook and other tech giants to perform data analysis and run machine learning algorithms that can handle hundreds of thousands of terabytes of data per day.

Python can be used for:

  • Storing and analyzing large and small datasets.
  • Web scraping and data collection using APIs.
  • Beautiful data visualization.
  • Natural language processing and text analysis.
  • General machine learning.
  • Deep learning.
  • Image analysis and much, much more…

By the end of this seminar you will be able to:

  • Program using Python (Jupyter) notebooks and IDEs.
  • Understand and use basic data analysis and visualization libraries such as NumPy, Pandas, Matplotlib, SciPy and statsmodels, among others.
  • Use the basic data structures and language constructs needed to do data analysis: variables, lists, loops, dictionaries, Boolean operators, functions.
  • Perform data analysis and basic statistical inference: GLMs, ANOVA, hypothesis testing.
  • Produce beautiful data visualizations.
  • Scrape and parse semi-structured data, including HTML, XML, and JSON.
  • Create and extract information from databases with Python.
  • Grasp the basics of unstructured data and natural language processing.

Computing

This remote seminar is held via Zoom, a free video conferencing application. Instructions for joining a session via Zoom are available here. Prior to each session, participants will receive an email with the meeting code they must use to join.

This is a hands-on class involving several structured and supervised assignments. To ensure that you are prepared, please have your own laptop available with Anaconda Python installed.

Prior to joining the seminar, please download and install Anaconda Python for your operating system here: https://www.anaconda.com/distribution/.

You should also know how to access the command prompt (Windows users) or the terminal (Mac users). We will briefly review how to access these in class, but it will save you time and effort if you come already knowing these basics. You can get resources on the internet that will help you get started with the Windows Command Prompt or the Mac Terminal.

Who Should Register?

This seminar is designed for anyone who wants to quickly and efficiently obtain a solid foundation in the Python language that will allow them to begin using the language for their research, data analysis or visualization needs.

No prior experience in Python is assumed, though basic knowledge of programming would be helpful. Those at an intermediate or advanced level in other packages or languages can also benefit greatly from this course.

Outline

Day 1:

  1. Getting started with Python:
    • Why Python?
    • Introduction to Anaconda Python.
    • Introduction to Python (Jupyter) notebooks.
    • Overview of basic libraries: NumPy, Pandas, Matplotlib, SciPy,
      statsmodels.
  2. Python basics and data structures:
    • Variables: numbers, string values, using variables.
    • Lists and loops: list basics, simple loops, Pythonic loops.
    • Logical statements in Python.
    • Using and creating dictionaries.
    • Creating functions.
  3. Evening: Python Basics Assignment Solutions and Review.
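
To give a flavor of the Day 1 topics, here is a minimal sketch of variables, lists, loops, logical tests, dictionaries and functions. It is an illustration only and is not taken from the course materials; all names and values (scores, enrollment, mean) are hypothetical.

```python
# Illustrative sketch only; names and data are hypothetical, not course material.

# Variables: a string and a number
course = "Python for Data Analysis"
days = 4

# A list and a simple loop that accumulates a total
scores = [72, 88, 95, 61]
total = 0
for s in scores:
    total += s

# A more "Pythonic" loop: a list comprehension with a logical test
passing = [s for s in scores if s >= 70]

# A dictionary mapping keys to values
enrollment = {"Day 1": 30, "Day 2": 28}


# A function with a docstring
def mean(values):
    """Return the arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


print(course, days)
print(total, passing, mean(scores))
print(enrollment["Day 1"])
```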

Day 2:

  1. Data analysis and statistical inference:
    • Handling arrays with Pandas and NumPy.
    • Basic data analysis:
      • Summary statistics: mean, median, mode, variance and standard
        deviation.
      • Hypothesis testing: t-tests, confidence intervals.
      • Basic statistical models: linear regression, logistic regression, ANOVA.
    • Advanced data analysis: statistical inference and models with very large
      datasets.
  2. Data visualization:
    • Distributions: densities, box plots, histograms.
    • Correlations: scatterplots, line plots, heatmaps.
    • Special topics: plotting maps.
  3. Evening: Data Analysis and Visualization Assignment Solutions and Review.

Day 3:

  1. Semi-structured data:
    • HTML and XML parsing.
    • JSON parsing.
  2. Database creation and extraction:
    • Introduction to SQL.
    • Introduction to MongoDB.
    • Using MongoDB and SQL to store and retrieve data.
  3. Evening: Semi-structured Data and Database Assignment Solutions and Review.
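
As a rough illustration of the Day 3 topics, the sketch below parses a small JSON string and stores and retrieves the records with SQL. It assumes SQLite via the standard-library sqlite3 module, which may not be the database used in the seminar; MongoDB is not shown because it requires a running server. The data, table name and column names are hypothetical.

```python
import json
import sqlite3

# Hypothetical JSON payload, e.g. of the kind a web API might return
raw = '[{"name": "Ada", "score": 91}, {"name": "Grace", "score": 85}]'
records = json.loads(raw)  # parsed into a list of dictionaries

# In-memory SQLite database (assumption: nothing is written to disk)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (name TEXT, score INTEGER)")

# Named placeholders are filled from each dictionary's keys
conn.executemany(
    "INSERT INTO results (name, score) VALUES (:name, :score)", records
)
conn.commit()

# Retrieve rows with a parameterized query
rows = conn.execute(
    "SELECT name, score FROM results WHERE score > ? ORDER BY score DESC",
    (88,),
).fetchall()
print(rows)
conn.close()
```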

Day 4:

  1. Unstructured data and natural language processing:
    • Introduction to text processing in Python: tokenization and text cleaning.
    • Preparing text data for analysis with the document-term matrix.
    • Sentiment analysis.
  2. Evening: Unstructured Data and Natural Language Processing Assignment Solutions and Review.
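
For readers unfamiliar with the document-term matrix mentioned under Day 4, here is a minimal sketch built with the standard library only. It is an assumption-laden illustration, not the seminar's own code: the example documents and the tokenize helper are hypothetical, and the seminar may well use dedicated text-processing libraries instead.

```python
import re
from collections import Counter

# Hypothetical example documents
docs = [
    "I love Python!",
    "Python is great, great fun.",
    "I do not love bugs.",
]


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens (basic cleaning)."""
    return re.findall(r"[a-z]+", text.lower())


tokenized = [tokenize(d) for d in docs]

# Vocabulary: every distinct token across all documents, in sorted order
vocab = sorted({tok for doc in tokenized for tok in doc})

# Document-term matrix: one row per document, one column per vocabulary term,
# each cell holding the number of times the term occurs in the document
dtm = []
for doc in tokenized:
    counts = Counter(doc)
    dtm.append([counts[term] for term in vocab])

print(vocab)
for row in dtm:
    print(row)
```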

Seminar information

Tuesday, May 26, 2020 –
Friday, May 29, 2020

Each day will follow the schedule below:

11:00am-2:00pm EST: Live lecture via Zoom

After 2:00pm EST: Exercise assignment to be completed on one’s own

8:00pm-9:00pm EST: Live “office hour” via Zoom to review exercises and ask questions

Payment Information

The fee of $795 includes all course materials.

PayPal and all major credit cards are accepted.

Group discount rates are available for this course. All inquiries can be sent to info@statisticalhorizons.com.

Our Tax ID number is 26-4576270.