DATA ANALYSIS AND STATISTICAL LEARNING

12 CFU - 1st and 2nd semester

Course instructors

ANTONIO PUNZO - DATA ANALYSIS module - SECS-S/01 - 6 CFU
SALVATORE INGRASSIA - STATISTICAL LEARNING module - SECS-S/01 - 6 CFU

Learning objectives

DATA ANALYSIS

1. Knowledge and understanding. The first module, “Data Analysis”, mainly concerns the fundamentals of two of the main methods used in unsupervised learning: principal component analysis and cluster analysis.

2. Applying knowledge and understanding. On completion, the student will be able to: i) implement the main methods used in unsupervised learning; ii) summarize the main features of a dataset and properly extract knowledge from data.

3. Making judgements. On completion, the student will be able to choose a suitable statistical model, apply it, and perform the analysis using statistical software.

4. Communication skills. On completion, the student will be able to present the results of the statistical analysis and the conclusions that can be drawn from it.

5. Learning skills. On completion, the student will be able to understand the structure of unsupervised learning.

STATISTICAL LEARNING
1. Knowledge and understanding. The module aims to provide knowledge of: i) the setting of the learning problem and the general model of the risk functional based on empirical data; ii) the main statistical learning techniques for regression and data classification.
2. Applying knowledge and understanding. On completion, students will be able to: i) implement the main statistical models for supervised and unsupervised learning; ii) summarize the main features of a dataset and properly extract knowledge from data.
3. Making judgements. On completion, students will be able to choose a suitable statistical model, apply sound statistical methods, and perform the analyses using statistical software.
4. Communication skills. On completion, students will be able to present the results of the statistical analyses and the conclusions that can be drawn from them.
5. Learning skills. On completion, students will be able to understand the structure of statistical learning.

Teaching methods

DATA ANALYSIS
Lectures via slides. The freely available R statistical software will also be used.

Should teaching be carried out in mixed mode or remotely, it may be necessary to introduce changes with respect to previous statements, in line with the programme planned and outlined in the syllabus.
STATISTICAL LEARNING
Lectures and practical data modeling in R.

Should teaching be carried out in mixed mode or remotely, it may be necessary to introduce changes with respect to previous statements, in line with the programme planned and outlined in the syllabus.

Prerequisites

DATA ANALYSIS
Although there is no formal prerequisite, knowledge of the basic notions of statistics, linear algebra, and computing is essential.
STATISTICAL LEARNING
Basics of statistics and data analysis, matrix algebra, and calculus.

Class attendance

DATA ANALYSIS
Recommended.
STATISTICAL LEARNING
Mandatory.

Course content

DATA ANALYSIS
1. Statistical Models for Univariate Random Variables. Discrete and continuous random variables. Basic distribution functions. Expectation and variance. Statistical models for random variables. Parametric Inference: classical properties of estimators; the maximum likelihood approach and its properties. Goodness-of-fit tests. R functions and packages. Illustration in R.
2. Basics of Matrices. Matrices. Special matrices. Basic matrix identities. Trace. Inverse and determinant. Eigen-decomposition. Quadratic forms and definite matrices.
3. Basics of Multivariate Modelling. Random vectors and their distributions. Mean vector, covariance and correlation matrices. Multivariate normal distribution: properties and effect of the covariance matrix on the shape of the contours. Data Matrix, centered data matrix and standardized data matrix.
4. Principal Component Analysis (PCA).
5. Cluster Analysis (CA). Clustering distance/dissimilarity measures. Data types in CA. Data standardization. Distance matrix computation. R functions and packages.
6. Hierarchical clustering methods. Peculiarities. Agglomerative hierarchical clustering. Algorithm. Dendrogram. Linkage methods. Simplified example. Agglomerative hierarchical clustering methods using the data matrix. Illustration in R.
7. Partitioning (or partitional) clustering methods. Peculiarities. K-means clustering. Algorithm. R functions and packages. Illustration in R. K-medoids clustering. PAM Algorithm. R functions and packages. Illustration in R.
8. Cluster Validation. Overview. Assessing Clustering Tendency. R functions and packages. Illustration in R. Determining the Optimal Number of Clusters. R functions and packages. Illustration in R. Cluster Validation Statistics: Internal and external measures. R functions and packages. Illustration in R. Choosing the Best Clustering Algorithm(s). Measures for comparing clustering algorithms. Cluster stability measures. R functions and packages. Illustration in R.
9. Model-Based Clustering. Preliminaries. Mixture models. Clustering with mixture models. Maximum a posteriori probability criterion. Gaussian mixtures. Parsimonious modelling via eigen-decomposition. Choosing the number of mixture components and the best parsimonious configuration: the Bayesian information criterion. R functions and packages. Illustration in R.
STATISTICAL LEARNING
Statistical Learning. Estimation of dependences based on empirical data. Supervised and Unsupervised Learning. Regression and Classification problems. Parametric and non-parametric models. Assessing Model Accuracy.

Linear Regression. Simple linear regression. Multiple linear regression. Least squares criterion and parameter estimation. Assessing the accuracy of the coefficient estimates and of the model. Use of qualitative predictors. Extension of the linear model and non-linear relationships.

Classification. Logistic regression; parameter estimation. Linear and quadratic discriminant analysis.

Resampling methods. Cross-validation, Bootstrap.

Linear Model Selection and Regularization. Variable selection. Dimension reduction methods.

Tree-based Methods. Regression Trees and Classification Trees. Bagging, Random Forest, Boosting.

Support Vector Machines and Neural Networks. Support vector classifiers. Deep learning and multilayer perceptrons.

Reference texts

DATA ANALYSIS
1. Bishop C. M. (2007). Pattern Recognition and Machine Learning, Springer, Cambridge.
2. Giordani P., Ferraro M. B., Martella F. (2020). An Introduction to Clustering with R, Springer, New York.
3. James G., Witten D., Hastie T., Tibshirani R. (2017). An Introduction to Statistical Learning with Applications in R, Springer, New York.
4. Kassambara A. (2017a). Practical Guide to Cluster Analysis in R.
5. Kassambara A. (2017b). Practical Guide to Principal Component Methods in R.
6. McNeil A. J., Frey R., Embrechts P. (2005). Quantitative Risk Management Concepts, Techniques and Tools. Princeton University Press, Princeton, New Jersey.
STATISTICAL LEARNING
1. James G., Witten D., Hastie T., Tibshirani R. (2021). An Introduction to Statistical Learning with Applications in R, 2nd Edition, Springer, New York.
2. Hastie T., Tibshirani R., Friedman J. (2008). The Elements of Statistical Learning, Springer, New York.
3. Course notes

Other teaching material

DATA ANALYSIS
http://studium.unict.it/dokeos/
STATISTICAL LEARNING
See the Studium website: http://studium.unict.it/dokeos/2019/

Course schedule

DATA ANALYSIS
	Topics	Text references
1	Statistical Models for Univariate Random Variables	Slides
2	Basics of Matrices	Bishop (2007, Appendix C)
3	Basics of Multivariate Modelling	McNeil, Frey and Embrechts (2005, Chapter 3)
4	Principal Component Analysis (PCA)	James, Witten, Hastie, Tibshirani (2017, Chapter 10)
5	Cluster Analysis (CA)	Kassambara (2017a, Chapter 3)
6	Hierarchical clustering methods	Kassambara (2017a, Chapter 7)
7	Partitioning (or partitional) clustering methods	Kassambara (2017a, Chapters 4–5)
8	Cluster Validation	Kassambara (2017a, Chapters 11–14)
9	Model-Based Clustering	Kassambara (2017a, Chapter 18) and Giordani, Ferraro, Martella (2020, Part IV)
STATISTICAL LEARNING
	Topics	Text references
1	Fundamentals of Statistical Learning: estimation of dependencies based on empirical data; supervised and unsupervised learning; regression and classification problems	Textbook #1: Chap. 1 and Chap. 2, Sect. 2.1
2	Fundamentals of Statistical Learning: parametric and non-parametric models; assessing model accuracy; Lab: introduction to R	Textbook #1: Chap. 2, Sect. 2.2 and 2.3
3	Linear Regression: simple linear regression and multiple linear regression; least squares criterion and parameter estimation	Textbook #1: Chap. 3, Sect. 3.1 and 3.2
4	Linear Regression: assessment of model fit; qualitative predictors; extension of the linear model and non-linear relationships; Lab with R	Textbook #1: Chap. 3, Sect. 3.3 and 3.6
5	Classification: logistic regression; linear and quadratic discriminant analysis; Lab with R	Textbook #1: Chap. 4
6	Resampling Methods: cross-validation, bootstrap; Lab with R	Textbook #1: Chap. 5
7	Linear Model Selection and Regularization: variable selection; dimension reduction methods; Lab with R	Textbook #1: Chap. 6, Sect. 6.1 and 6.3
8	Tree-Based Methods: regression trees; classification trees; bagging; random forests; boosting; Lab with R	Textbook #1: Chap. 8
9	Support Vector Machines: support vector classifiers; Lab with R	Textbook #1: Chap. 9
10	Neural Networks: deep learning and multilayer perceptrons; Lab with R	Course notes

Learning assessment

LEARNING ASSESSMENT METHODS

DATA ANALYSIS
The exam aims to evaluate the achievement of the learning objectives. It consists of an oral exam that includes questions on the program and the discussion of a report on a real data analysis, carried out using both the methodologies covered during the course and the R statistical software.

Learning assessment may also be carried out online if the conditions require it.
STATISTICAL LEARNING
Practical activities (data analysis and modeling with R) and oral exam.

Learning assessment may also be carried out online, should the conditions require it.

EXAMPLES OF FREQUENTLY ASKED QUESTIONS AND/OR EXERCISES

DATA ANALYSIS
1. Distributions on a compact support;
2. Maximum likelihood estimation;
3. Goodness-of-fit tests;
4. Principal component analysis;
5. Hierarchical clustering;
6. K-means;
7. Model-based clustering;
8. Cluster validation;
9. Dunn index;
10. Model selection in mixture modeling.
STATISTICAL LEARNING
Regression models, support vector machines, decision trees, multilayer perceptrons.
