NOVA Information Management School

Big Data

Code

200144

Academic unit

NOVA Information Management School

Credits

7.5

Teacher in charge

Flávio Luís Portas Pinheiro

Teaching language

Portuguese. If Erasmus students are enrolled, classes will be taught in English.

Objectives

Big data refers to collections of data so large or heterogeneous that they cannot be processed with traditional computing techniques. Hadoop, in turn, is an ecosystem comprising various tools, techniques, and frameworks for handling such data. The course provides in-depth knowledge of Big Data and Hadoop technologies.

At the end of the course, students should be able to process and analyze vast amounts of heterogeneous data in order to extract useful insights.

Prerequisites

Basic knowledge of at least one programming language. 

 

Subject matter

The course consists of different modules:

- Introduction to Big Data: challenges and motivations.

- Hadoop: Hadoop enables applications to run on clusters of thousands of commodity machines and to handle huge volumes of data. At its core is a distributed file system that allows rapid data transfer among nodes and keeps the system operating seamlessly when a node fails. Hadoop is a preferred computing technology, especially for smaller enterprises looking to leverage analytics to draw valuable big data insights, which means many employers seek candidates with knowledge and expertise in Hadoop. Hadoop makes it easier for organizations to base decisions on complete analyses across multiple variables and data sets. This module of the course describes the Hadoop architecture and its advantages compared to traditional data-analysis techniques.

- MapReduce: MapReduce is the heart of Hadoop: a programming model that provides vast scalability across the servers of a cluster. A MapReduce job is split into two tasks. The map task takes a set of data and converts it into another set in which the individual elements are broken down into key/value pairs. The reduce task takes the output of the map as its input and combines those pairs into a smaller set of pairs; as the name states, the reduce task is performed only after the map task. Many organizations solve their problems with designs built on MapReduce in order to yield performance gains of several orders of magnitude. In the course, the logic of MapReduce is explained and several common tasks are presented.
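The map and reduce tasks described above can be sketched in a single Python process. The classic word-count flow below is only an illustration of the model, not of Hadoop's actual API: the function names are made up, and a real job would distribute each phase across the cluster.

```python
# A minimal, single-process sketch of the MapReduce word-count pattern.
# map_phase, shuffle, and reduce_phase are illustrative names, not Hadoop APIs.
from collections import defaultdict

def map_phase(lines):
    """Map: break each line into (word, 1) key/value pairs."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine each key's values into a smaller set of (key, total) pairs."""
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big insights", "big clusters"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'big': 3, 'data': 1, 'insights': 1, 'clusters': 1}
```

Because each (key, list-of-values) group can be reduced independently, the reduce work parallelizes naturally across servers, which is where the scalability of the model comes from.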

- Sqoop: Sqoop is a Big Data tool that extracts data from non-Hadoop data stores, transforms it into a form usable by Hadoop, and loads it into the Hadoop Distributed File System. This process is known as ETL, for Extract, Transform, and Load. Sqoop can also move data out of Hadoop into an external data store for use by other kinds of applications.
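Sqoop itself is a command-line tool, but the extract–transform–load flow it performs can be illustrated in miniature with plain Python: here sqlite3 stands in for the external relational store and a local file stands in for HDFS. The table, column, and file names are invented for the sketch.

```python
# Toy ETL sketch: sqlite3 stands in for an external RDBMS and a local
# file stands in for HDFS; Sqoop performs this same flow at cluster scale.
import sqlite3

# Extract: read rows from a (hypothetical) relational table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("laptop", 999.0), ("mouse", 25.0)])
rows = conn.execute("SELECT product, amount FROM sales").fetchall()

# Transform: render each row as a tab-separated line, a flat text
# format that Hadoop tools can consume directly.
lines = ["\t".join(str(field) for field in row) for row in rows]

# Load: write the lines out (into HDFS, in the real Sqoop workflow).
with open("sales.tsv", "w") as out:
    out.write("\n".join(lines))

print(lines)  # ['laptop\t999.0', 'mouse\t25.0']
```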

- HIVE: Hadoop was built to organize and store massive amounts of data of all shapes, sizes, and formats. Hive, a data warehouse system for Hadoop originally developed at Facebook, is used by data analysts to query, summarize, explore, and analyze that data and turn it into actionable business insight. It supports a Data Definition Language (DDL), a Data Manipulation Language (DML), and user-defined functions. Hive projects structure onto largely unstructured data, offering SQL-like access much as a traditional database does. It is built on Hadoop and MapReduce operations and is a read-oriented technology.
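HiveQL is close to standard SQL, so a query in the style below gives the flavor of working with Hive. The sketch runs against sqlite3 purely for illustration (Hive itself requires a Hadoop cluster); in Hive, the same GROUP BY would be compiled into MapReduce jobs over files stored in HDFS. The table and column names are made up.

```python
# Illustrative SQL-style query; sqlite3 stands in for Hive, which would
# compile this GROUP BY into MapReduce jobs over data in HDFS.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (user TEXT, url TEXT)")
conn.executemany("INSERT INTO page_views VALUES (?, ?)",
                 [("ana", "/home"), ("ana", "/cart"), ("rui", "/home")])

# Count page views per user, the kind of summarization analysts run in Hive.
query = "SELECT user, COUNT(*) FROM page_views GROUP BY user ORDER BY user"
result = conn.execute(query).fetchall()
print(result)  # [('ana', 2), ('rui', 1)]
```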

- PIG: Pig is a high-level scripting platform used with Hadoop. It enables data workers to write complex data transformations, even those who do not know the Java programming language. Its simple SQL-like scripting language is known as Pig Latin, and it appeals to developers who are already familiar with scripting languages and SQL. Pig Latin is expressive enough to perform all the needed data manipulations in Apache Hadoop. One of the major benefits of Pig is that its scripts can be embedded in other languages, so Pig is often used as a component of larger, more complex applications that solve business problems. Pig works with data from many sources, structured and unstructured, and stores the results in the Hadoop Distributed File System.
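A Pig Latin script is a sequence of named transformations (LOAD, FILTER, GROUP, FOREACH ... GENERATE). As a rough illustration of that dataflow style, and not of Pig syntax itself, the same pipeline is sketched below in plain Python; the records and field names are invented.

```python
# Dataflow in the style of a Pig Latin script, sketched with plain Python.
from itertools import groupby

# LOAD: hypothetical records, as Pig would read them from a file in HDFS.
records = [
    {"city": "Lisboa", "amount": 10.0},
    {"city": "Porto", "amount": 5.0},
    {"city": "Lisboa", "amount": 7.5},
]

# FILTER: keep only rows with a positive amount.
filtered = [r for r in records if r["amount"] > 0]

# GROUP BY city (itertools.groupby needs sorted input).
filtered.sort(key=lambda r: r["city"])
grouped = groupby(filtered, key=lambda r: r["city"])

# FOREACH ... GENERATE: emit one (city, total amount) pair per group.
totals = [(city, sum(r["amount"] for r in rows)) for city, rows in grouped]
print(totals)  # [('Lisboa', 17.5), ('Porto', 5.0)]
```

Each named intermediate result (filtered, grouped, totals) mirrors how a Pig script builds a pipeline of relations, one transformation at a time.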

Bibliography

Hadoop: The Definitive Guide, Tom White.

Teaching method

Theoretical classes where the professor will present the main technologies in the field of big data.

After each one of the theoretical classes, students will be asked to work on a practical assignment.

Evaluation method

First epoch: two tests during the semester (each test contributes 50% for the final grade).

Second epoch: project (70%) and the grade of the tests of the first epoch (30%).
