Faculdade de Ciências e Tecnologia

Systems for Big Data Processing

Code

12078

Academic unit

Faculdade de Ciências e Tecnologia

Department

Departamento de Informática

Credits

6.0

Teacher in charge

Nuno Manuel Ribeiro Preguiça

Weekly hours

4

Total hours

48

Teaching language

English

Objectives

This course focuses on programming models for Big Data processing and their use in solving concrete problems.

The main goals are the following:

Knowledge

- Know the different facets of processing large volumes of data.
- Know the main classes of systems for storage of large volumes of data.
- Know the dominant programming models for Big Data.
- Know solutions for specific problem domains.

Application

- Be capable of identifying the best system class for solving a specific problem.
- Be capable of coding a specific problem solution in the most suitable programming model.
- Be capable of executing a big data application in a distributed platform.

Prerequisites

Knowledge of programming.

Subject matter

1. Overview
a. Motivation, Applications
b. Challenges

2. Programming models
a. Batch vs. Incremental vs. Real-time
b. Structured data vs. Unstructured data
c. Declarative programming vs. General-purpose

3. Data storage
a. Distributed file systems (e.g. HDFS)
b. Relational databases
c. NoSQL databases (e.g. key-value stores, document stores)
d. Integration of multiple data sources (e.g. Hive)

4. Generic processing platforms
a. Infrastructure: context, properties and implications
b. Map-reduce model and supporting platform (e.g. Hadoop)
c. Second-generation platforms (e.g. Pig, Spark)

5. Processing for specific domains
a. Machine learning libraries (e.g. Spark MLlib)
b. Platforms for graph processing (e.g. GraphX)

6. Introduction to real-time processing platforms
a. Data sources (e.g. Flume, Kafka)
b. Data models: micro-batch vs. continuous
c. Processing platforms (e.g. Storm, Spark Streaming)

Bibliography

A selected set of book chapters and papers; these materials will be made available on CLIP.

Teaching method

In the lectures, the topics that comprise the course syllabus are presented and discussed, using existing systems and platforms to highlight the issues and present concrete examples.

In the labs, students gain experience in developing solutions for large-scale data processing problems, using a selection of current platforms and systems. Classes comprise demos, exercises and support for the two programming assignments.

Grading is based on the following components: two quizzes (25% each) and two team programming assignments (25% each).

Evaluation method

2 quizzes (25% + 25%) or exam (50%)

2 programming assignments (25% + 25%)

Groups of 3 students

