
Data Analytics and Mining
Code
11563
Academic unit
Faculdade de Ciências e Tecnologia
Department
Departamento de Informática
Credits
6.0
Teacher in charge
Joaquim Francisco Ferreira da Silva, Pedro Manuel Corrêa Calvente Barahona
Weekly hours
4
Teaching language
Portuguese
Objectives
Knowledge
- Understand the paradigms and challenges of Data Analytics and Text Mining.
- Learn the fundamental methods and their applications in extracting patterns from data. Understand data features, model selection, and the interpretation of model results.
- Understand the advantages and disadvantages of the different methods.
Skills
- Implement and adapt Data Analytics and Text Mining algorithms.
- Model real data experimentally.
- Interpret and evaluate experimental results.
Competences
- Evaluate the suitability of each method for case studies.
- Critically evaluate the results.
- Autonomy and self-reliance in applying Data Analytics and Text Mining and in pursuing further studies in these areas.
Subject matter
Introduction
Data Analytics
What is data? Examples of data analytics tasks and various perspectives on them
Visualization as a convenient tool for business analytics
Text Mining
Structured or unstructured data? Why mine texts?
What types of problems can be solved?
- Module I
Data Understanding
- 1D Summarization and Visualization of a Single Feature
- 2D Analysis: Correlation and Visualization of Two Quantitative Features
- Verification of structure in data
Data Preparation
- Variable cleaning
- Feature creation
- Why normalization matters
Descriptive Modeling I
Principal Component Analysis (PCA): Model and Method
- Summarization versus Correlation
- Matrix spectrum and Singular Value Decomposition (SVD)
- PCA as SVD. Conventional PCA.
PCA: Applications
Descriptive Modeling II
- K‐means, Anomalous clusters, Intelligent K‐Means
- Spectral clustering
- Relational clustering (if time permits)
Interpreting Descriptive Models
- Conventional Cluster Model Interpretation
- Assessing Cluster Tendency
- Least squares principle induced interpretation aids
Data Analytics Case Studies
- Module II: Text Mining
Relevant Information Extraction
- Relevant Expressions: Multi‐words and single‐words
- Statistical vs symbolic extractors. Algorithms and metrics
- Language‐independence
Symbolic and Statistical Analysis of texts
- Tokenization, Stemming and Part‐Of‐Speech Tagging
- Word distribution in texts and Zipf's Law
- Metrics for word association and retrieval
- Document correlation
- Word Sense Disambiguation
Document Descriptors
- Language‐independent Mining of Explicit and Implicit Keywords from documents.
- Semantic Scope of Documents
- Document Summarization
Document Classification
- Relevant Expressions as features for document characterization. Feature selection and reduction.
- Document Similarity
- Supervised vs unsupervised Document Clustering.
- Prediction and evaluation
Text Mining Case Studies (some examples)
- Extraction of Named Entities
- Email filtering
- Language detection
- Efficient Extraction of Multiwords
- Polarity Detection
Bibliography
- D. T. Larose, C. D. Larose (2015), Data Mining and Predictive Analytics, 2nd Edition, Wiley.
- B. Mirkin (2011), Core Concepts in Data Analysis: Summarization, Correlation, Visualization. Undergraduate Topics for Computer Science Series, Springer, London.
- Weiss, S.M., Indurkhya, N., Zhang, T., Damerau, F. (2005), Predictive Methods for Analyzing
Evaluation method
The evaluation of this curricular unit has two components: theoretical/problems (T) and project (P). Each component contributes 50% to the final grade.
To pass, the student must obtain a grade of at least 10 points (out of 20) in each of the two components. The final grade is the weighted average of the two components.
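The grading rule above can be sketched in a few lines of Python. This is only an illustration: the function name `final_grade` and the `PASS_MARK` constant are hypothetical, no rounding is applied because the text does not specify one, and the attendance requirement is not modeled.

```python
from typing import Optional

PASS_MARK = 10.0  # minimum grade (out of 20) required in each component


def final_grade(theory: float, project: float) -> Optional[float]:
    """Return the final grade on the 0-20 scale, or None if the student fails.

    Assumes equal 50% weights for T and P, as stated in the evaluation rules.
    Rounding and attendance are intentionally left out (not specified here).
    """
    if theory < PASS_MARK or project < PASS_MARK:
        return None  # a component below 10 means the student does not pass
    return 0.5 * theory + 0.5 * project


if __name__ == "__main__":
    print(final_grade(12.0, 16.0))  # both components pass
    print(final_grade(9.5, 18.0))   # theory component below 10
```

For example, T = 12 and P = 16 give a final grade of 14, while T = 9.5 fails regardless of the project grade.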
The theoretical component consists of two individual written tests; alternatively, this component can be evaluated by a written exam.
The project component is evaluated through a set of assignments and two programming projects accompanied by written reports.
Attendance at a minimum of 2/3 of the lectures, whether theoretical or practical, is required.