Introduction to Data Analysis in R
- The objective of the course is for students to be able to: work comfortably in R and know the fundamentals of R syntax; import data into R, perform basic manipulations to prepare it for calculations, and export the results; visualize data; apply basic methods of preliminary data analysis; and understand the limitations and relevance of these methods.
- Knows basic data types and R syntax. Is able to transform datasets. Has data visualization skills.
- Knows the basics of parametric and nonparametric hypothesis testing. Is able to resample data. Has skills in hypothesis testing with the bootstrap.
- Knows the methodology of PCA and clustering. Is able to perform PCA and clustering in R. Has skills in evaluating clustering quality.
- Introduction to data analysis with R
  1. Importing and cleaning data. Fundamentals of R syntax. Ways to import data. Introducing and exploring raw data. Tidying data. Preparing data for analysis.
  2. Data manipulation. Data wrangling. The select and mutate functions. The filter and arrange functions. Summarise and the pipe operator. Joining data. Intermediate operations in R.
  3. Data visualization. Exploring ggplot2. Plot aesthetics. Plot geometries. Applying statistical methods. Themes. Plots for specific data types.
- Hypothesis testing
  4. Hypothesis testing: parametric vs nonparametric. Main advantages and limitations of parametric hypothesis testing. t-test for comparing means in independent and dependent samples. z-test for comparing proportions in independent and dependent samples. Main advantages and limitations of nonparametric hypothesis testing. Distribution-free tests: the sign test, the Wilcoxon signed-rank test, the Mann-Whitney U test. ANOVA tests.
  5. Hypothesis testing with the bootstrap. Main advantages of resampling. Bootstrap: estimating regression coefficients with the bootstrap. Bootstrapping data and bootstrapping residuals. The jackknife resampling method: advantages and disadvantages compared with the bootstrap. Hypothesis testing with the bootstrap and the jackknife.
- PCA and clustering
  6. Principal component analysis. Main objectives of principal component analysis (PCA). Mathematical model for finding components. Algorithms for implementing PCA. Latent variables; criteria for choosing the number of components. Rotation; interpretation of the results.
  7. Clustering. Main objectives of clustering; geometrical interpretation. Measures of distance between objects and measures of distance between clusters. Hierarchical clustering: objective, algorithm, interpretation of results, dendrogram. k-means and k-median clustering: objective, algorithm, interpretation of results. Criteria for choosing the number of clusters and assessing clustering quality.
- Self-study (DataCamp)
- Exam
  The exam is held on the platform http://trajectory.hse.perm.ru/. During the exam, students must also be in a Zoom session with a working camera. Requirements for the exam: a laptop, a webcam, and a reliable internet connection.
- Interim assessment (module 3)
  0.3 * Exam + 0.15 * Microtests + 0.15 * Reports + 0.1 * Self-study (DataCamp) + 0.3 * Test
- Spector, P. (2008). Data Manipulation with R. New York: Springer. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=229058
- Corder, G. W., & Foreman, D. I. (2014). Nonparametric Statistics: A Step-by-Step Approach (2nd ed.). Hoboken, New Jersey: Wiley. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=798830
- Gatignon, H. (2013). Statistical Analysis of Management Data (3rd ed.). New York: Springer. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1073815
- Govaert, G. (2009). Data Analysis. London: Wiley-ISTE. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=310759
- Rahlf, T. (2017). Data Visualisation with R: 100 Examples. Cham, Switzerland: Springer. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1377904