Basics of Data Analysis Using Python

Course leader: Dr Vitalii Naumov

Home institution: Cracow University of Technology, Poland

 

Course pre-requisite(s): Basics of calculus; basic programming skills (helpful but not required)

Course Overview

In the article “Data Scientist: The Sexiest Job of the 21st Century”, published by Harvard Business Review in October 2012, T.H. Davenport and D.J. Patil predicted that data scientists would become the most in-demand specialists in every market, owing to the development of information and communication technologies. This trend continues today: despite the numerous courses and specializations launched at universities and online, data science professionals remain among the most sought-after specialists in every field.
The course is intended for those who want to acquire essential data analysis skills and, in this way, catch the wave and become in-demand professionals.
During the course, students will become acquainted with the theoretical basis of data science: statistical analysis. We will begin with the description of a random variable, its distribution functions, and its numeric characteristics. I will then present the basics of distribution fitting and more advanced techniques of mathematical statistics: correlation and regression analysis.
All the presented methods and techniques will be supported by the corresponding tools in the Python programming language. Students will learn the basics of Python and will become acquainted with the most popular libraries for data analysis: pandas, NumPy, Matplotlib, and scikit-learn.
In the last part of the course, I will present essential machine learning tools: simple classifiers and neural networks. The implementation of these tools and their features will be explained with the help of examples in Python.

Learning Outcomes

By the end of the course, students will be able to use the Python language and the functionality of its libraries to perform basic data processing operations. They will be proficient in statistical inference, including distribution fitting, correlation, and regression analysis. Students will also have basic skills in data visualization using Python libraries.

Course Content

1. Introduction: data analysis skills are a must in the modern world
2. Basics of Python: data types, conditions, loops, functions
3. Random variable: distribution functions, numeric characteristics
4. Using pandas for data representation in Python
5. Creating simple Python functions for basic data analysis
6. Basic distributions of random variables: discrete and continuous distributions
7. Data visualization in Python: basic tools of the matplotlib library
8. Using Python for distribution fitting: Pearson’s chi-squared test and Kolmogorov-Smirnov test
9. NumPy library: the most important functions for data processing and analysis
10. Correlation analysis: Pearson’s product-moment coefficient, rank correlation coefficients and correlation matrices
11. Machine learning introduction: simple classifiers with the scikit-learn library (decision trees and k-nearest neighbors method)
12. Regression analysis using Python: estimation of regression coefficients and significance tests
13. Basics of neural networks with Python: linear classification using the perceptron

Instructional Method

Course time will be split equally (50/50) between lectures and individual projects.

Required Course Materials

All the required materials will be provided by the instructor during the course.
Recommended additional reading:
Madsen, B.S. Statistics for Non-Statisticians, Springer, 2016
Downey, A.B. Think Python: How to Think Like a Computer Scientist, O'Reilly, 2015
Raschka, S., Mirjalili, V. Python Machine Learning, Packt, 2017

Assessment

The final grade will be based on two tests (midterm and final) and the project developed during the course. The tests will contribute 80% of the final grade, and the project the remaining 20%.