ECONOMIA E IMPRESA
Data Science
Academic year 2022/2023

9793875 - DATA ANALYSIS AND STATISTICAL LEARNING
Module: DATA ANALYSIS

Instructor: ANTONIO PUNZO

Expected learning outcomes

1. Knowledge and understanding. The first “Statistical Learning” module mainly concerns the fundamentals of two of the main methods used in unsupervised learning: principal component analysis and cluster analysis.

2. Applying knowledge and understanding. On completion, the student will be able to: i) implement the main methods used in unsupervised learning; ii) summarize the main features of a dataset and properly extract knowledge from data.

3. Making judgements. On completion, the student will be able to choose a suitable statistical model, apply it, and perform the analysis using statistical software.

4. Communication skills. On completion, the student will be able to present the results of the statistical analysis and the conclusions that can be drawn from them.

5. Learning skills. On completion, the student will be able to understand the structure of unsupervised learning.

Course delivery

The exam aims to evaluate the achievement of the learning objectives. It is carried out through an oral exam that includes questions related to the program in addition to the discussion of a report concerning a real data analysis performed using both the methodologies treated during the course and the R statistical software.

Prerequisites

Basic notions in statistics, linear algebra, and computing.

Class attendance

Lectures via slides. The freely available R statistical software will also be used.

Course contents

Statistical Models for Univariate Random Variables. Discrete and continuous random variables. Basic distribution functions. Expectation and variance. Statistical models for random variables. Parametric Inference: classical properties of estimators; the maximum likelihood approach and its properties. Goodness-of-fit tests. R functions and packages. Illustration in R. (Slides) 
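
As an illustration of the maximum likelihood approach, here is a minimal sketch in Python (the course itself works in R). It assumes an i.i.d. exponential model, for which the estimate of the rate has the closed form 1 / sample mean. The sample values and function names are hypothetical.

```python
import math

def exp_log_likelihood(rate, sample):
    """Log-likelihood of an i.i.d. exponential sample at a given rate."""
    n = len(sample)
    return n * math.log(rate) - rate * sum(sample)

def exp_mle(sample):
    """Closed-form maximum likelihood estimate of the rate: n / sum(x)."""
    return len(sample) / sum(sample)

sample = [0.5, 1.5, 2.0, 4.0]   # hypothetical observations
rate_hat = exp_mle(sample)

# The log-likelihood at the estimate should exceed that at nearby rates.
print(rate_hat, exp_log_likelihood(rate_hat, sample))
```

Comparing the log-likelihood at the estimate with its value at neighbouring rates is a quick sanity check that the closed form does sit at the maximum.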

Basics of Matrices. Matrices. Special matrices. Basic matrix identities. Trace. Inverse and determinant. Eigen-decomposition. Quadratic forms and definite matrices. (Bishop 2007, Appendix C) 

Basics of Multivariate Modelling. Random vectors and their distributions. Mean vector, covariance, and correlation matrices. Multivariate normal distribution: properties and effect of the covariance matrix on the shape of the contours. Data Matrix, centered data matrix, and standardized data matrix. (McNeil, Frey, and Embrechts 2005, Chapter 3) 

Principal Component Analysis (PCA). The goal of PCA. PCA as a tool for data visualization. Definition of principal components (PCs). PCA and Eigen-decomposition. Computing PCs. PCA: Geometrical interpretation. Choosing the number of PCs. Biplot. Illustration of PCA in R. (James, Witten, Hastie, and Tibshirani 2017, Chapter 10)
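
The link between PCA and the eigen-decomposition can be sketched as follows, in Python rather than the R used in the course. This is an illustration under stated assumptions, not the course's implementation: the first principal component is taken as the leading eigenvector of the sample covariance matrix and is approximated by power iteration (a technique chosen here only to avoid external libraries). The data and function names are hypothetical.

```python
import math

def center(data):
    """Subtract the column means from every row (centered data matrix)."""
    n, p = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(p)]
    return [[row[j] - means[j] for j in range(p)] for row in data]

def covariance(centered):
    """Sample covariance matrix (divisor n - 1) of a centered data matrix."""
    n, p = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
            for i in range(p)]

def first_pc(cov, iterations=200):
    """Leading eigenvalue and unit eigenvector of cov via power iteration."""
    p = len(cov)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    cov_v = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
    eigenvalue = sum(v[i] * cov_v[i] for i in range(p))
    return eigenvalue, v

data = [[2.0, 1.0], [3.0, 3.0], [4.0, 3.0], [5.0, 5.0]]  # hypothetical data
centered = center(data)
cov = covariance(centered)
eigenvalue, loading = first_pc(cov)

# Scores on the first PC and the proportion of variance it explains.
scores = [sum(x * l for x, l in zip(row, loading)) for row in centered]
explained = eigenvalue / sum(cov[i][i] for i in range(len(cov)))
print(loading, explained)
```

The proportion of variance explained is the leading eigenvalue divided by the trace of the covariance matrix, which is one common input when choosing the number of PCs.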

Cluster Analysis (CA). Clustering distance/dissimilarity measures. Data types in CA. Data standardization. Distance matrix computation. R functions and packages. (Kassambara 2017, Chapter 3) 
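
Data standardization and distance matrix computation can be sketched as below. The sketch is in Python for illustration (the course uses R, where `scale()` and `dist()` play the corresponding roles). It assumes numeric variables only and the Euclidean distance; the data and function names are hypothetical.

```python
import math
import statistics

def standardize(data):
    """Center each column and divide by its sample standard deviation."""
    p = len(data[0])
    columns = [[row[j] for row in data] for j in range(p)]
    means = [statistics.mean(c) for c in columns]
    sds = [statistics.stdev(c) for c in columns]  # divisor n - 1
    return [[(row[j] - means[j]) / sds[j] for j in range(p)] for row in data]

def euclidean_distance_matrix(data):
    """Full symmetric matrix of pairwise Euclidean distances between rows."""
    return [[math.dist(a, b) for b in data] for a in data]

# Hypothetical data: the second variable is on a much larger scale.
data = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
z = standardize(data)
d = euclidean_distance_matrix(z)
print(d)
```

Without standardization the second variable would dominate every distance; after it, both variables contribute equally.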

Hierarchical clustering methods. Peculiarities. Agglomerative hierarchical clustering. Algorithm. Dendrogram. Linkage methods. Simplified example. Agglomerative hierarchical clustering methods using the data matrix. Illustration in R. (Kassambara 2017, Chapter 7) 
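
The agglomerative algorithm can be sketched as below, in Python for illustration only. The sketch assumes single linkage and the Euclidean distance, and stops when a requested number of clusters remains instead of building the full dendrogram. The points and function name are hypothetical.

```python
import math

def single_linkage(points, n_clusters):
    """Merge the two closest clusters repeatedly until n_clusters remain.

    Returns the final clusters (lists of point indices) and the merge
    history as (cluster_a, cluster_b, merge_height) tuples.
    """
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                dist = min(math.dist(points[i], points[j])
                           for i in clusters[a] for j in clusters[b])
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        dist, a, b = best
        merges.append((clusters[a], clusters[b], dist))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return clusters, merges

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
clusters, merges = single_linkage(points, 3)
print(sorted(sorted(c) for c in clusters))
```

The merge heights recorded in `merges` are the values a dendrogram would plot on its vertical axis; other linkage methods only change how the between-cluster distance is computed.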

Partitioning (or partitional) clustering methods. Peculiarities. K-means clustering. Algorithm. R functions and packages. Illustration in R. K-medoids clustering. PAM Algorithm. R functions and packages. Illustration in R. (Kassambara 2017, Chapters 4–5) 
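
The K-means algorithm can be sketched as follows, in Python for illustration (the course uses R). This is a minimal version of the standard alternating scheme (often called Lloyd's algorithm), assuming the Euclidean distance and initial centroids supplied by the caller, whereas practical implementations usually try several random starts. The data and function name are hypothetical.

```python
import math

def k_means(points, centroids, max_iter=100):
    """Alternate assignment and centroid-update steps until nothing changes."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(len(centroids)),
                      key=lambda k: math.dist(p, centroids[k]))
                  for p in points]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[k])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids

points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
labels, centroids = k_means(points, [(1.0, 1.0), (9.0, 8.0)])
print(labels, centroids)
```

K-medoids differs mainly in the update step: each cluster is represented by one of its own observations instead of the mean.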

Cluster Validation. Overview. Assessing Clustering Tendency. R functions and packages. Illustration in R. Determining the Optimal Number of Clusters. R functions and packages. Illustration in R. Cluster Validation Statistics: Internal and external measures. R functions and packages. Illustration in R. Choosing the Best Clustering Algorithm(s). Measures for comparing clustering algorithms. Cluster stability measures. R functions and packages. Illustration in R. (Kassambara 2017, Chapters 11–14)
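
One of the internal validation measures, the silhouette width, can be sketched as below in Python (for illustration; the course uses R). For observation i, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to the members of any other cluster, and the width is (b - a) / max(a, b). The sketch assumes the Euclidean distance, at least two clusters, and a width of 0 for singleton clusters; the data and function name are hypothetical.

```python
import math

def silhouette_widths(points, labels):
    """Silhouette width of every observation, given cluster labels."""
    widths = []
    for i, p in enumerate(points):
        own = [math.dist(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:
            widths.append(0.0)  # convention for a singleton cluster
            continue
        a = sum(own) / len(own)
        # Mean distance to each other cluster; keep the nearest one.
        b = min(
            sum(math.dist(p, q) for j, q in enumerate(points)
                if labels[j] == other)
            / sum(1 for lab in labels if lab == other)
            for other in set(labels) if other != labels[i]
        )
        widths.append((b - a) / max(a, b))
    return widths

points = [(0.0,), (1.0,), (10.0,), (11.0,)]  # hypothetical 1-D data
labels = [0, 0, 1, 1]
widths = silhouette_widths(points, labels)
average_width = sum(widths) / len(widths)
print(widths, average_width)
```

Widths close to 1 indicate well-separated observations; the average width is often compared across candidate numbers of clusters.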

Model-Based Clustering. Preliminaries. Mixture models. Clustering with mixture models. Maximum a posteriori probability criterion. Gaussian mixtures. Parsimonious modeling via eigendecomposition. Choosing the number of mixture components and the best parsimonious configuration: the Bayesian information criterion. R functions and packages. Illustration in R. (Kassambara 2017, Chapter 18) 
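
The maximum a posteriori probability criterion can be sketched as follows, in Python for illustration. The sketch assumes a univariate Gaussian mixture whose parameters are already known; in practice they would first be estimated from the data (for example with the EM algorithm). All parameter values and function names are hypothetical.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a univariate normal distribution."""
    return (math.exp(-0.5 * ((x - mean) / sd) ** 2)
            / (sd * math.sqrt(2 * math.pi)))

def posterior_probabilities(x, weights, means, sds):
    """Posterior probability of each mixture component given x (Bayes' rule)."""
    joint = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
    total = sum(joint)
    return [j / total for j in joint]

def map_label(x, weights, means, sds):
    """Assign x to the component with the largest posterior probability."""
    post = posterior_probabilities(x, weights, means, sds)
    return max(range(len(post)), key=post.__getitem__)

# Hypothetical two-component mixture with equal weights.
weights, means, sds = [0.5, 0.5], [0.0, 4.0], [1.0, 1.0]
for x in (1.0, 3.5):
    print(x, map_label(x, weights, means, sds))
```

The posterior probabilities also give a measure of classification uncertainty: an observation halfway between the two means has posterior 0.5 for each component.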

Reference texts

· Bishop C. M. (2007). Pattern Recognition and Machine Learning. Springer, Cambridge.

· Hastie T., Tibshirani R., Friedman J. (2008). The Elements of Statistical Learning. Springer, New York.

· James G., Witten D., Hastie T., Tibshirani R. (2017). An Introduction to Statistical Learning with Applications in R. Springer, New York.

· Kassambara A. (2017). Practical Guide to Cluster Analysis in R.

· McNeil A. J., Frey R., Embrechts P. (2005). Quantitative Risk Management: Concepts, Techniques and Tools. Princeton University Press, Princeton, New Jersey.

Assessment

Assessment methods

The exam aims to evaluate the achievement of the learning objectives. It is carried out through an oral exam that includes questions related to the program in addition to the discussion of a report concerning a real data analysis performed using both the methodologies treated during the course and the R statistical software.

Examples of frequent questions and/or exercises

· Maximum likelihood

· K-means algorithm

· Silhouette width

· Model-based clustering

· Dunn index

· Likelihood-ratio test

