The course introduces the main data mining techniques, focusing on the mining of large amounts of data that do not fit in main memory. The examples presented during the course cover the web, social networks, and Next Generation Sequencing data produced in the biomedical field. The course also approaches the subject from an algorithmic point of view, emphasizing the difference from machine learning. The topics include tools such as MapReduce for working with large amounts of data in a distributed environment; this topic is the common thread running through all the mining problems presented. Next, the course introduces similarity search and the use of hashing techniques for large volumes of data. It also addresses the classical problem of mining high-support (frequent) itemsets, describing the A-Priori algorithm and its variants. Recommendation systems are then introduced; in this context, the course also addresses the problem of high-dimensional data and dimensionality reduction techniques such as SVD, CUR, and NNMF.

The course then introduces the main themes in network analysis. Centrality measures for networks are presented, with particular emphasis on PageRank and its variants. Null network models are introduced as models that maintain selected network characteristics, such as degree distribution and clustering coefficient; the models presented include Erdős-Rényi, Chung-Lu, and Preferential Attachment. The clustering problem is addressed through modularity and spectral clustering techniques.
General learning objectives, expressed as expected learning outcomes.
Knowledge and understanding: The course aims to provide the knowledge and the basic and advanced skills needed to analyze large amounts of data.
Applying knowledge and understanding: The student will acquire knowledge of models and algorithms for data analysis, such as high-support (frequent itemset) mining, recommendation systems, similarity search in high dimensions, MapReduce and Spark, complex network analysis, text mining, and document tagging systems.
Making judgments: Through concrete examples and case studies, the student will be able to independently develop solutions to specific problems related to big data.
Communication skills: The student will acquire the necessary communication skills and the ability to use technical language appropriately in the general area of big data.
Learning skills: The course aims to provide students with the theoretical and practical methods needed to independently tackle and solve new problems that may arise in professional practice. To this end, various topics will be covered in class by involving students in the search for possible solutions to real problems, using benchmarks available in the literature.
Lectures and laboratory
High-Support Data Mining. Recommendation Systems. Map-Reduce. Beyond Map-Reduce. Similarity search in high dimensions: shingling, Min-Hashing, LSH, Min-LSH. Dimensionality reduction: SVD, CUR, application to LSI, Johnson-Lindenstrauss theorem. Link Analysis: PageRank, link spam, Hubs and Authorities, applications on Map-Reduce. Web Advertising: online algorithms, AdWords and its implementations. Graph mining: subgraph matching, motif finding, community detection, network alignment, and network analysis. Text mining: TF.IDF, Bag-of-Words, entity annotation.
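As an informal illustration of the similarity search topics listed above (shingling and Min-Hashing), the following is a minimal sketch and not course material. It assumes character shingles of length 3, simulates a family of hash functions by salting SHA-1 with a seed, and uses hypothetical helper names (shingles, minhash_signature, estimate_similarity) chosen only for this example.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-character shingles of a string."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


def _hash(seed, item):
    """Simulate the seed-th hash function by salting SHA-1 with the seed."""
    digest = hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items, num_hashes=200):
    """For each simulated hash function, keep the minimum hash over the set."""
    return [min(_hash(seed, item) for item in items)
            for seed in range(num_hashes)]


def estimate_similarity(sig_a, sig_b):
    """Fraction of signature positions that agree; estimates Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


doc_a = shingles("the quick brown fox jumps over the lazy dog")
doc_b = shingles("the quick brown fox jumped over a lazy dog")

exact = jaccard(doc_a, doc_b)
approx = estimate_similarity(minhash_signature(doc_a), minhash_signature(doc_b))
print(f"exact Jaccard = {exact:.3f}, MinHash estimate = {approx:.3f}")
```

The signatures are much smaller than the shingle sets they summarize, which is what makes the approach attractive when the documents themselves do not fit in main memory; the estimate becomes more accurate as the number of hash functions grows.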
Mining of Massive Datasets
Jure Leskovec, Anand Rajaraman, Jeff Ullman