Faculdade de Ciências e Tecnologia

Data Analytics and Mining

Code

11563

Academic unit

Faculdade de Ciências e Tecnologia

Department

Departamento de Informática

Credits

6.0

Teacher in charge

Joaquim Francisco Ferreira da Silva, Pedro Manuel Corrêa Calvente Barahona

Weekly hours

4

Teaching language

Portuguese

Objectives

Knowledge

  • Understand the paradigms and challenges of Data Analytics and Text Mining.
  • Learn the fundamental methods and their applications in the extraction of patterns from data. Understand data features, model selection, and the interpretation of model results.
  • Understand the advantages and disadvantages of the different methods.

Skills

  • Implement and adapt Data Analytics and Text Mining algorithms.
  • Model real data experimentally.
  • Interpret and evaluate experimental results.

Competences

  • Evaluate the suitability of each method for specific case studies.
  • Critically evaluate the results.
  • Show autonomy and self-reliance in applying, and pursuing further studies in, Data Analytics and Text Mining.

Subject matter

Introduction

Data Analytics

What is data? Examples of data analytics tasks and various perspectives on them

Visualization as a convenient tool for business analytics

Text Mining

Structured or unstructured data? Why mine texts?

What types of problems can be solved?

  • Module I: Data Analytics

Data Understanding

  • 1D Summarization and Visualization of a Single Feature
  • 2D Analysis: Correlation and Visualization of Two Quantitative Features
  • Verification of structure in data

Data Preparation

  • Variable cleaning
  • Feature creation
  • Why normalization matters

Descriptive Modeling I

Principal Component Analysis (PCA): Model and Method

  • Summarization versus Correlation
  • Matrix spectrum and Singular Value Decomposition (SVD)
  • PCA as SVD. Conventional PCA.

PCA: Applications

Descriptive Modeling II

  • K‐means, Anomalous clusters, Intelligent K‐Means
  • Spectral clustering
  • Relational clustering (if time permits)

Interpreting Descriptive Models

  • Conventional Cluster Model Interpretation
  • Assessing Cluster Tendency
  • Interpretation aids induced by the least-squares principle

Data Analytics Case Studies

  • Module II: Text Mining

Relevant Information Extraction

  • Relevant Expressions: Multi‐words and single‐words
  • Statistical vs symbolic extractors. Algorithms and metrics
  • Language‐independence

Symbolic and Statistical Analysis of texts

  • Tokenization, Stemming and Part‐Of‐Speech Tagging
  • Word distribution in texts and Zipf's law
  • Metrics for word association and retrieval
  • Document correlation
  • Word Sense Disambiguation

Document Descriptors

  • Language‐independent Mining of Explicit and Implicit Keywords from documents.
  • Semantic Scope of Documents
  • Document Summarization

Document Classification

  • Relevant Expressions as features for document characterization. Feature selection and reduction.
  • Document Similarity
  • Supervised vs unsupervised Document Clustering.
  • Prediction and evaluation

Text Mining Case Studies (some examples)

  • Extraction of Named Entities
  • Email filtering
  • Language detection
  • Efficient Extraction of Multiwords
  • Polarity Detection

Bibliography

  • D. T. Larose, C. D. Larose (2015), Data Mining and Predictive Analytics, 2nd Edition, Wiley.
  • B. Mirkin (2011), Core Concepts in Data Analysis: Summarization, Correlation, Visualization. Undergraduate Topics for Computer Science Series, Springer, London.
  • Weiss, S.M., Indurkhya, N., Zhang, T., Damerau, F. (2005), Predictive Methods for Analyzing

Evaluation method

The evaluation of this curricular unit has two components: theoretical/problems (T) and project (P). Each component contributes 50% to the final grade.

To pass, the student must obtain a grade of at least 10 points (out of 20) in each of the two components, theoretical/problems and project. The final grade is the weighted average of the two components.
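As a rough illustration of the grading rule above, the following Python sketch computes the final grade from the two component grades. The function name `final_grade` is hypothetical, and leaving the result unrounded is an assumption, since the rules do not state how the final grade is rounded.

```python
def final_grade(theory: float, project: float) -> tuple[float, bool]:
    """Return (final grade, passed) for component grades on a 0-20 scale.

    Assumption: no rounding is applied; the course rules do not specify one.
    """
    for grade in (theory, project):
        if not 0 <= grade <= 20:
            raise ValueError("grades must be between 0 and 20")
    # Each component (T and P) contributes 50% to the final grade.
    average = 0.5 * theory + 0.5 * project
    # Passing requires at least 10 points in each component separately.
    passed = theory >= 10 and project >= 10
    return average, passed
```

For example, under these assumptions `final_grade(12, 16)` gives a final grade of 14.0 and a pass, while `final_grade(9, 18)` averages 13.5 but does not pass, because the theoretical component is below 10.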

The theoretical part consists of two individual written tests; alternatively, this component can be evaluated by a written exam.

The project component is evaluated through a set of assignments and two programming projects accompanied by written reports.

Attendance at a minimum of 2/3 of the lectures, theoretical or practical, is required.
