Faculdade de Ciências e Tecnologia

Data Analytics and Mining

Code

11563

Academic unit

Faculdade de Ciências e Tecnologia

Department

Departamento de Informática

Credits

6.0

Teacher in charge

Joaquim Francisco Ferreira da Silva, Pedro Manuel Corrêa Calvente Barahona

Weekly hours

Teaching language

Portuguese

Objectives

Knowledge

Understand the paradigms and challenges of Data Analytics and Text Mining.
Learn the fundamental methods and their applications in the extraction of patterns from data. Understand data features, model selection, and the interpretation of model results.
Understand the advantages and disadvantages of the different methods.

Skills

Implement and adapt Data Analytics and Text Mining algorithms.
Model real data experimentally.
Interpret and evaluate experimental results.

Competences

Evaluate the suitability of each method for case studies.
Critically evaluate the results.
Show autonomy and self-reliance in applying and furthering studies in Data Analytics and Text Mining.

Subject matter

Introduction

Data Analytics

What is data? Examples of data analytics tasks and various perspectives on them

Visualization as a convenient tool for business analytics

Text Mining

Structured or unstructured data? Why mining texts?

What types of problems can be solved?

Module I

Data Understanding

1D Summarization and Visualization of a Single Feature
2D Analysis: Correlation and Visualization of Two Quantitative Features
Verification of structure in data

Data Preparation

Variable cleaning
Feature creation
Why normalization matters

Descriptive Modeling I

Principal Component Analysis (PCA): Model and Method

Summarization versus Correlation
Matrix spectrum and Singular Value Decomposition (SVD)
PCA as SVD. Conventional PCA.

PCA: Applications

Descriptive Modeling II

K‐means, Anomalous clusters, Intelligent K‐Means
Spectral clustering
Fuzzy clustering

Interpreting Descriptive Models

Conventional Cluster Model Interpretation
Assessing Cluster Tendency
Least squares principle induced interpretation aids

Data Analytics Case Studies

Module II: Text Mining

Relevant Information Extraction

Relevant Expressions: Multi‐words and single‐words
Statistical vs symbolic extractors. Algorithms and metrics
Language‐independence

Symbolic and Statistical Analysis of texts

Tokenization, Stemming and Part‐Of‐Speech Tagging
Word and multi-word distribution in a Big Data context. Zipf's law
Metrics for word association and retrieval
Document correlation
Word Sense Disambiguation

Document Descriptors

Language‐independent Mining of Explicit and Implicit Keywords from documents.
Semantic Scope of Documents
Document Summarization

Document Classification

Relevant Expressions as features for document characterization. Feature selection and reduction.
Document Similarity
Supervised vs unsupervised Document Clustering.
Prediction and evaluation

Text Mining Case Studies (some examples)

Extraction of Named Entities
Email filtering
Language detection
Efficient Extraction of Multiwords
Polarity Detection

Bibliography

D. T. Larose, C. D. Larose (2015), Data Mining and Predictive Analytics, 2nd Edition, Wiley.
B. Mirkin (2011), Core Concepts in Data Analysis: Summarization, Correlation, Visualization. Undergraduate Topics for Computer Science Series, Springer, London.
Weiss, S.M., Indurkhya, N., Zhang, T., Damerau, F. (2005), Text mining: Predictive Methods for Analyzing

Evaluation method

Continuous assessment
The laboratory grade, NL, is the arithmetic mean of the grades of the 2 practical assignments, one per module, i.e. NL = (TP1 + TP2) / 2. Frequency ("frequência") is granted in this course to students whose laboratory grade is at least 8.5 values.
The theoretical grade, NT, is obtained during continuous assessment as the arithmetic mean of the scores of the 2 tests, one per module, i.e. NT = (T1 + T2) / 2.
The final grade of the course, NF, is the average of the theoretical and laboratory grades, i.e. NF = (NT + NL) / 2.
To pass the course (UC), a student must cumulatively:
have a theoretical grade of at least 8.5 values, NT ≥ 8.5;
have frequency, i.e. a laboratory grade of at least 8.5 values, NL ≥ 8.5;
have a final grade of at least 9.5 values, NF ≥ 9.5.
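
The continuous assessment rules can be sketched as a small calculation. This is only an illustrative sketch: the function name `final_grade` is hypothetical, grades are assumed to be on the usual 0 to 20 scale, and no rounding is applied because the text does not state a rounding rule.

```python
def final_grade(tp1, tp2, t1, t2):
    """Return (NL, NT, NF, passed) for continuous assessment.

    tp1, tp2: grades of the 2 practical assignments.
    t1, t2:   grades of the 2 tests.
    Assumption: grades are plain numbers on a 0-20 scale, no rounding.
    """
    nl = (tp1 + tp2) / 2   # laboratory grade, NL = (TP1 + TP2) / 2
    nt = (t1 + t2) / 2     # theoretical grade, NT = (T1 + T2) / 2
    nf = (nt + nl) / 2     # final grade, NF = (NT + NL) / 2
    # All three conditions must hold cumulatively.
    passed = nt >= 8.5 and nl >= 8.5 and nf >= 9.5
    return nl, nt, nf, passed


if __name__ == "__main__":
    # Hypothetical student: assignments 12 and 14, tests 9 and 10.
    print(final_grade(12, 14, 9, 10))
```

Note that a high laboratory grade does not compensate for a theoretical grade below 8.5: with assignments of 16 and 16 and tests of 8 and 8, NF is 12 but the student still does not pass.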

Exam
Students who have obtained frequency ("frequência") are admitted to the exam, either to pass the course or to improve their grade. The exam has 2 parts, corresponding to the 2 tests, one per module. For the purpose of calculating the final grade, the grade of each part of the exam replaces, if better, the grade obtained in the corresponding test.
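
The replacement rule for the exam can be illustrated as follows. This is a sketch under assumptions: the function name `theoretical_grade_after_exam` is hypothetical, and an exam part that was not taken is represented as `None`, a convention chosen here for illustration only.

```python
def theoretical_grade_after_exam(t1, t2, e1=None, e2=None):
    """Return NT after each exam part replaces, if better, its test.

    t1, t2: grades of the 2 tests (one per module).
    e1, e2: grades of the matching exam parts, or None if not taken
            (an assumption of this sketch).
    """
    best1 = t1 if e1 is None else max(t1, e1)
    best2 = t2 if e2 is None else max(t2, e2)
    return (best1 + best2) / 2   # NT = (T1 + T2) / 2 with the best grades


if __name__ == "__main__":
    # Hypothetical case: the exam improves module I but not module II.
    print(theoretical_grade_after_exam(8, 12, 11, 10))
```

In the hypothetical case above, only the first exam part replaces its test, so NT is computed from 11 and 12.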

Universidade Nova de Lisboa
