
Data Analytics and Mining
Code
11563
Academic unit
Faculdade de Ciências e Tecnologia
Department
Departamento de Informática
Credits
6.0
Teacher in charge
Joaquim Francisco Ferreira da Silva, Pedro Manuel Corrêa Calvente Barahona
Weekly hours
4
Teaching language
Portuguese
Objectives
Knowledge
- Understand the paradigms and challenges of Data Analytics and Text Mining.
- Learn the fundamental methods and their applications in the extraction of patterns from data. Understand data features, the selection of models and the interpretation of model results.
- Understand the advantages and disadvantages of the different methods.
Skills
- Implement and adapt Data Analytics and Text Mining algorithms.
- Model real data experimentally.
- Interpret and evaluate experimental results.
Competences
- Evaluate the suitability of each method for case studies.
- Critically evaluate the results.
- Autonomy and self-reliance in applying Data Analytics and Text Mining and in pursuing further studies in these areas.
Subject matter
Introduction
Data Analytics
What is data? Examples of data analytics tasks and various perspectives on them
Visualization as a convenient tool for business analytics
Text Mining
Structured or unstructured data? Why mine texts?
What types of problems can be solved?
- Module I: Data Analytics
Data Understanding
- 1D Summarization and Visualization of a Single Feature
- 2D Analysis: Correlation and Visualization of Two Quantitative Features
- Verification of structure in data
Data Preparation
- Variable cleaning
- Feature creation
- Why normalization matters
Descriptive Modeling I
Principal Component Analysis (PCA): Model and Method
- Summarization versus Correlation
- Matrix spectrum and Singular Value Decomposition (SVD)
- PCA as SVD. Conventional PCA.
PCA: Applications
Descriptive Modeling II
- K‐means, Anomalous clusters, Intelligent K‐Means
- Spectral clustering
- Fuzzy clustering
Interpreting Descriptive Models
- Conventional Cluster Model Interpretation
- Assessing Cluster Tendency
- Interpretation aids induced by the least-squares principle
Data Analytics Case Studies
- Module II: Text Mining
Relevant Information Extraction
- Relevant Expressions: Multi‐words and single‐words
- Statistical vs symbolic extractors. Algorithms and metrics
- Language‐independence
Symbolic and Statistical Analysis of texts
- Tokenization, Stemming and Part‐Of‐Speech Tagging
- Word and multi-word distribution in a Big Data context. Zipf's law
- Metrics for word association and retrieval
- Document correlation
- Word Sense Disambiguation
Document Descriptors
- Language‐independent Mining of Explicit and Implicit Keywords from documents.
- Semantic Scope of Documents
- Document Summarization
Document Classification
- Relevant Expressions as features for document characterization. Feature selection and reduction.
- Document Similarity
- Supervised vs unsupervised Document Clustering.
- Prediction and evaluation
Text Mining Case Studies (some examples)
- Extraction of Named Entities
- Email filtering
- Language detection
- Efficient Extraction of Multiwords
- Polarity Detection
Bibliography
- D. T. Larose, C. D. Larose (2015), Data Mining and Predictive Analytics, 2nd Edition, Wiley.
- B. Mirkin (2011), Core Concepts in Data Analysis: Summarization, Correlation, Visualization. Undergraduate Topics in Computer Science series, Springer, London.
- Weiss, S.M., Indurkhya, N., Zhang, T., Damerau, F. (2005), Text mining: Predictive Methods for Analyzing
Evaluation method
Continuous assessment
The laboratory grade, NL, is the arithmetic mean of the grades of the 2 practical assignments, one per module, i.e. NL = (TP1 + TP2) / 2. Frequency ("frequência") is granted to students whose laboratory grade is at least 8.5 points.
The theoretical grade, NT, is obtained during continuous assessment as the arithmetic mean of the grades of the 2 tests, one per module, i.e. NT = (T1 + T2) / 2.
The final grade of the course, NF, is the average of the theoretical and laboratory grades, i.e. NF = (NT + NL) / 2.
To pass the course (UC), a student must cumulatively:
- Have a theoretical grade of at least 8.5 points, NT ≥ 8.5.
- Have frequency, i.e. a laboratory grade of at least 8.5 points, NL ≥ 8.5.
- Have a final grade of at least 9.5 points, NF ≥ 9.5.
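The continuous assessment rules can be illustrated with a small calculation. This is only a sketch: the function name continuous_assessment and its argument names are chosen here for illustration, and it assumes that no rounding is applied before the thresholds are checked, which the rules do not specify.

```python
def continuous_assessment(tp1, tp2, t1, t2):
    """Return (NL, NT, NF, passed) for two assignment grades and two test grades.

    Hypothetical helper for illustration; no rounding is applied (assumption).
    """
    nl = (tp1 + tp2) / 2   # laboratory grade: mean of the 2 practical assignments
    nt = (t1 + t2) / 2     # theoretical grade: mean of the 2 tests
    nf = (nt + nl) / 2     # final grade: mean of theoretical and laboratory grades
    # All three conditions must hold cumulatively.
    passed = nl >= 8.5 and nt >= 8.5 and nf >= 9.5
    return nl, nt, nf, passed


# Example with made-up grades: NL = 13.0, NT = 9.5, NF = 11.25, so the student passes.
print(continuous_assessment(12, 14, 9, 10))

# A high laboratory grade does not compensate for NT below 8.5.
print(continuous_assessment(16, 16, 8, 8))
```

In the second example NF is 12.0, yet the student does not pass because NT = 8.0 is below the 8.5 minimum.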
Exam
Students who have obtained frequency ("frequência") are admitted to the exam, either to pass the course or to improve their grade.
The exam has 2 parts, corresponding to the 2 tests, one per module. For the purpose of calculating the final grade, the grade of each exam part replaces the grade of the corresponding test if it is better.
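The effect of the exam on the final grade can be sketched as follows. The function name grade_after_exam and its argument names are hypothetical, and the sketch assumes that, after the replacement, NT and NF are computed with the same formulas and thresholds as in continuous assessment, that the student sits both exam parts, and that no rounding is applied.

```python
def grade_after_exam(tp1, tp2, t1, t2, e1, e2):
    """Return (NT, NF, passed) after an exam with part grades e1 and e2.

    Hypothetical helper for illustration; assumes both exam parts are taken.
    """
    nl = (tp1 + tp2) / 2            # laboratory grade is unchanged by the exam
    if nl < 8.5:
        # Without frequency the student is not admitted to the exam.
        raise ValueError("no frequency: laboratory grade below 8.5")
    best1 = max(t1, e1)             # exam part 1 replaces test 1 only if better
    best2 = max(t2, e2)             # exam part 2 replaces test 2 only if better
    nt = (best1 + best2) / 2
    nf = (nt + nl) / 2
    passed = nt >= 8.5 and nf >= 9.5
    return nt, nf, passed


# Made-up grades: test 1 (7) is replaced by exam part 1 (11),
# while test 2 (10) is kept because exam part 2 (9) is lower.
print(grade_after_exam(12, 14, 7, 10, 11, 9))
```

With these grades NT rises from 8.5 to 10.5 and NF becomes 11.75.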