Lectures will mainly consist of live sessions on using the Cloud for data analysis and machine learning. The lecturer will carry out these sessions, and students will replicate them, with suggested variations, on the available equipment. Laboratory practice aims to help students refine their understanding of the technologies presented and acquire autonomous operating skills. As a framework and guide, lecture notes will be displayed during lectures and shared with students. The notes will provide a precise record of the material presented, as well as pointers to the required technical reference documentation.
Should teaching be carried out in mixed mode or remotely, it may be necessary to introduce changes with respect to the statements above, so as to keep to the programme planned and outlined in the syllabus. Learning assessment may also be carried out online, should the conditions require it.
Fundamentals of data analysis and machine learning. Basic skills in using a desktop computing environment and the Web.
Attending classes is not mandatory but strongly recommended.
This course aims to enable the data scientist to put into practice, on the public Cloud, the principles and methodologies learnt in courses on data storage, processing, analysis, and machine learning. In these areas, present-day industrial and enterprise applications typically require storage volumes, computing power and bandwidth at a scale that is impossible or (even for large organizations) impractical to attain with proprietary on-premises equipment. In realistic Data Science scenarios, the data scientist can therefore hardly avoid resorting to the Cloud, i.e. storage and computing services offered by third-party providers over the public Internet under a pay-per-use cost model.
In a nutshell, quoting reference [2]: “The Cloud turbocharges Data Science”.
Google Cloud Platform (GCP) is the platform of choice, for its ease of use and free availability to students.
A list of the main topics treated in the course follows.
No. | Topics | Text references |
---|---|---|
1 | Google Cloud (GC): Performing structured queries on BigQuery | Lecture notes |
2 | GC: Performing structured queries on Cloud SQL | Lecture notes |
3 | Processing big data with a cloud (Unix) shell | Lecture notes |
4 | GC: Importing big data from CSV files | Lecture notes |
5 | Downloading large public data sets to GC | Lecture notes |
6 | Processing data with the Google App Engine | Lecture notes |
7 | GC Dataflow: processing a real-time, real-world data set | Lecture notes |
8 | Case study: real-time geospatial data on GC | Lecture notes |
9 | GC Data Studio: Visualizing data from Google Cloud SQL | Lecture notes |
10 | GC Datalab: Data Analysis and Google BigQuery | Lecture notes |
11 | GC Datalab notebooks for rapid exploratory data analysis | Lecture notes |
12 | GC AI Platform: queries and data presentation | Lecture notes |
13 | Machine Learning (ML) with Spark on GC | Lecture notes |
14 | ML with Spark on GC | Lecture notes |
15 | ML with TensorFlow on GC: developing and evaluating predictive models | Lecture notes |
16 | MapReduce and Hadoop on Google Cloud: exploiting parallelism and machine clusters | Lecture notes |
An individual laboratory session performed by the student in the presence of the lecturer. The student will be required to carry out the Cloud-based procedures demonstrated during the lectures, to discuss their significance, and to critically assess their outcomes. Learning assessment may also be carried out online, should the conditions require it.
The student will choose one or more datasets and prepare a project demonstrating the technologies presented in the course. Typically, BigQuery queries, notebooks, and data ingestion procedures are expected. The datasets and the project contents should be agreed in advance with the course instructor.