DS-U5-ARQ

Clustering Algorithms

1. What are the key steps involved in the K-Means clustering algorithm?

  • Answer:
    1. Initialize K centroids randomly.
    2. Assign each data point to the nearest centroid.
    3. Recalculate the centroids as the mean of all points assigned to each centroid.
    4. Repeat the assignment and recalculation steps until the centroids no longer change significantly or a maximum number of iterations is reached.
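
The four steps above can be sketched in plain Python. This is a minimal illustration, not a production implementation: the function names, the toy points, the tolerance, and the choice to draw the initial centroids from the data points themselves are all assumptions made for the example.

```python
import random

def squared_distance(p, q):
    # Squared Euclidean distance between two points given as tuples.
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    # Coordinate-wise mean of a non-empty list of points.
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, k, max_iters=100, tol=1e-9, seed=0):
    rng = random.Random(seed)
    # Step 1: pick k distinct data points as the initial centroids (an assumption).
    centroids = rng.sample(points, k)
    labels = []
    for _ in range(max_iters):
        # Step 2: assign each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Step 3: recompute each centroid as the mean of its assigned points.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # An empty cluster keeps its previous centroid.
            new_centroids.append(mean_point(members) if members else centroids[j])
        # Step 4: stop once no centroid moves by more than the tolerance.
        shift = max(squared_distance(c, n) for c, n in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= tol:
            break
    return centroids, labels

points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centroids, labels = k_means(points, k=2)
print(centroids)
print(labels)
```

Because the starting centroids are random, different seeds can give different label numbering, and on harder data they can give different clusters, which is why K-Means is often run several times.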

2. How does hierarchical clustering differ from K-Means clustering?

  • Answer:
    • Hierarchical Clustering: Builds a hierarchy of clusters through either an agglomerative (bottom-up) or a divisive (top-down) approach. It does not require specifying the number of clusters in advance; the resulting tree (dendrogram) can be cut at any level to obtain as many clusters as needed.
    • K-Means Clustering: Requires specifying the number of clusters (K) beforehand and partitions data into K non-overlapping subsets.
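
To make the bottom-up idea concrete, here is a small sketch of agglomerative clustering on one-dimensional values. It assumes single linkage (distance between the closest members of two clusters), which is only one of several linkage rules; the function names and the sample values are made up for illustration.

```python
def single_linkage(a, b):
    # Distance between two clusters = distance between their closest members.
    return min(abs(x - y) for x in a for y in b)

def agglomerative(values):
    # Start with every value in its own cluster.
    clusters = [[v] for v in values]
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters that are closest to each other.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda pair: single_linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        dist = single_linkage(clusters[i], clusters[j])
        merges.append((clusters[i], clusters[j], dist))
        merged = clusters[i] + clusters[j]
        # Replace the two clusters with their union.
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return merges

merges = agglomerative([1.0, 1.5, 5.0, 5.2, 9.0])
for left, right, dist in merges:
    print(left, "+", right, "at distance", round(dist, 2))
```

The list of merges is the hierarchy. Stopping the loop early, when K clusters remain, is the equivalent of cutting the dendrogram, so no K has to be chosen before the algorithm runs.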

Time-Series Analysis

3. What are the key characteristics to consider in time-series data?

  • Answer:
    • Trend: Long-term increase or decrease in the data.
    • Seasonality: A regular pattern that repeats at a fixed interval, such as daily, weekly, or yearly.
    • Noise: Random variations or irregularities in the data.
    • Stationarity: Statistical properties like mean and variance remain constant over time.
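
A quick way to see these characteristics is to build a toy series and take it apart. The sketch below assumes a made-up series with a linear trend and a period-4 seasonal pattern, and leaves noise out so the numbers stay easy to check by hand. The moving average and seasonal differencing used here are common basic tools, not the only options.

```python
period = 4
seasonal_pattern = [2.0, 0.0, -2.0, 0.0]  # sums to zero over one period

# Toy series: linear trend (0.5 per step) plus the repeating seasonal pattern.
series = [0.5 * t + seasonal_pattern[t % period] for t in range(24)]

def moving_average(values, window):
    # Average of each run of `window` consecutive values.
    return [sum(values[i:i + window]) / window for i in range(len(values) - window + 1)]

# Averaging over one full period cancels the seasonal pattern and exposes the trend.
trend = moving_average(series, period)

# Seasonal differencing removes both the pattern and the linear trend,
# leaving a series whose mean no longer changes over time.
seasonal_diff = [series[t] - series[t - period] for t in range(period, len(series))]

print(trend[:5])
print(seasonal_diff[:5])
```

With real data the differenced series would still contain noise, so stationarity is judged approximately rather than by checking for a constant.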

Text Analysis

4. What is text preprocessing and why is it important in text analysis?

  • Answer:
    • Text preprocessing involves cleaning and preparing text data for analysis by performing steps such as tokenization, stemming, lemmatization, and removing stop words. It is important because it standardizes the text data, reduces noise, and ensures that the analysis focuses on meaningful information.
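
Here is a rough sketch of such a pipeline using only the standard library. The stop-word list is a tiny hand-picked sample, and `naive_stem` is a hypothetical toy suffix stripper written for this example, not a real stemming algorithm such as Porter's.

```python
import re

# A tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "is", "a", "an", "and", "of", "to", "in", "are", "on"}

def naive_stem(token):
    # Toy stemmer: strip one common suffix if at least 3 letters remain.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    # Tokenization: lowercase the text and keep runs of letters.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Stop-word removal.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Stemming.
    return [naive_stem(t) for t in tokens]

tokens = preprocess("The cats are chasing the mice and jumped on the tables.")
print(tokens)  # ['cat', 'chas', 'mice', 'jump', 'table']
```

The output shows why stemming is considered crude: "chasing" becomes "chas", and "mice" is left alone, whereas lemmatization would map them to the dictionary forms "chase" and "mouse".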

5. Explain the term TF-IDF and its significance in text analysis.

  • Answer:
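    • TF-IDF stands for Term Frequency-Inverse Document Frequency. It is a statistical measure of how important a word is to a document relative to a collection of documents (a corpus). The score is the product of two parts: term frequency (how often the word appears in the document) and inverse document frequency (which shrinks as the word appears in more documents). TF-IDF therefore highlights words that are distinctive for a document while reducing the weight of common words that appear in most documents.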
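
The calculation can be sketched by hand on a toy corpus. This example assumes the basic textbook definitions, tf = count / document length and idf = log(N / df); libraries often use smoothed or otherwise adjusted variants, so their numbers may differ. The documents are made up for illustration.

```python
import math
from collections import Counter

# Toy corpus: each document is already tokenized.
docs = [
    ["data", "science", "is", "fun"],
    ["data", "analysis", "is", "useful"],
    ["clustering", "groups", "data"],
]

def tf_idf(docs):
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            # tf = share of the document taken by the term; idf = log(N / df).
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

scores = tf_idf(docs)
for term, score in sorted(scores[0].items(), key=lambda item: -item[1]):
    print(term, round(score, 3))
```

"data" appears in every document, so its idf is log(1) = 0 and its score is zero, while "science" and "fun" appear in only one document and get the highest scores in the first document.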

Social Network Analysis

6. What are the primary goals of social network analysis?

  • Answer:
    • To uncover patterns of relationships and interactions among entities in a network.
    • To identify influential nodes, communities, and sub-networks.
    • To understand the spread of information, behaviors, or diseases within the network.
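
As a small illustration of finding an influential node, the sketch below computes degree centrality, one simple notion of influence among several, for a made-up friendship network. The names and edges are hypothetical.

```python
from collections import defaultdict

# Hypothetical undirected friendship network given as an edge list.
edges = [("Ana", "Ben"), ("Ana", "Cai"), ("Ana", "Dee"), ("Ben", "Cai"), ("Dee", "Eli")]

# Build the adjacency sets.
neighbors = defaultdict(set)
for u, v in edges:
    neighbors[u].add(v)
    neighbors[v].add(u)

# Degree centrality: fraction of the other nodes a node is directly connected to.
n = len(neighbors)
degree_centrality = {node: len(adj) / (n - 1) for node, adj in neighbors.items()}

most_central = max(degree_centrality, key=degree_centrality.get)
print(degree_centrality)
print(most_central)  # Ana
```

Ana is linked to three of the four other people, so she has the highest degree centrality (0.75) in this toy network.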

Business Analysis

7. How is data analytics used in business decision-making?

  • Answer:
    • Data analytics helps businesses understand market trends, customer behavior, and operational efficiencies. It supports decision-making by providing insights through descriptive, predictive, and prescriptive analytics, thereby enabling strategic planning and competitive advantage.

Model Evaluation and Selection

8. What metrics can be used to evaluate the performance of a classifier?

  • Answer:
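    • Common choices are accuracy, precision, recall, F1-score, and ROC-AUC, usually read alongside the confusion matrix. Each gives a different view: accuracy is the overall rate of correct classifications, precision is the share of predicted positives that are correct (it drops with false positives), recall is the share of actual positives that are found (it drops with false negatives), and the F1-score is the harmonic mean of precision and recall.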

9. Describe the holdout method and its purpose in model evaluation.

  • Answer:
    • The holdout method involves splitting the dataset into two separate sets: a training set and a testing set. The model is trained on the training set and evaluated on the testing set to assess its performance on unseen data. This helps detect overfitting (a large gap between training and test performance) and gives an honest estimate of the model's generalization ability, provided the test set is not used for training or tuning.
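
A holdout split is only a shuffle followed by a cut. Here is a minimal sketch; the function name, the default 80/20 ratio, and the seed are assumptions made for illustration.

```python
import random

def holdout_split(rows, test_fraction=0.2, seed=42):
    # Shuffle a copy so the original order is left untouched.
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    # The first n_test shuffled rows become the test set, the rest the training set.
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(10))
train_rows, test_rows = holdout_split(rows, test_fraction=0.3)
print(len(train_rows), len(test_rows))  # 7 3
```

Every row ends up in exactly one of the two sets, which is what keeps the test rows unseen during training. Fixing the seed makes the split reproducible.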

10. What is the purpose of the elbow plot in K-Means clustering?

  • Answer:
    • The elbow plot helps determine a suitable number of clusters (K) by plotting the within-cluster sum of squares (WCSS, also called inertia) against the number of clusters. WCSS always tends to fall as K grows, so the point of interest is the "elbow", where adding another cluster stops reducing it sharply. The K at the elbow is a reasonable choice.
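
The numbers behind an elbow plot can be produced with a small experiment. The sketch below computes the within-cluster sum of squares (inertia) for K = 1 to 5 on made-up one-dimensional data with three obvious groups. The helper `kmeans_1d_inertia` is hypothetical, and it uses a deterministic initialization (an assumption made so the example is reproducible) instead of the usual random one.

```python
def kmeans_1d_inertia(values, k, max_iters=100):
    data = sorted(values)
    n = len(data)
    # Deterministic init (assumption): the middle value of each of k equal chunks.
    centroids = [data[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(max_iters):
        # Assign each value to its nearest centroid.
        groups = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda j: (x - centroids[j]) ** 2)
            groups[nearest].append(x)
        # Recompute centroids; an empty group keeps its old centroid.
        updated = [sum(g) / len(g) if g else c for g, c in zip(groups, centroids)]
        if updated == centroids:
            break
        centroids = updated
    # Inertia: sum of squared distances to the nearest centroid.
    return sum(min((x - c) ** 2 for c in centroids) for x in data)

values = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.2, 8.8]
inertia = {k: kmeans_1d_inertia(values, k) for k in range(1, 6)}
for k, value in inertia.items():
    print(k, round(value, 2))
```

On this data the inertia falls steeply from K = 1 to K = 3 and then barely changes, so the elbow sits at K = 3, matching the three groups in the values.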

Practical Implementation with Scikit-learn

11. How do you interpret a confusion matrix?

  • Answer:
    • A confusion matrix is a table used to evaluate the performance of a classification model. It shows the true positive (TP), true negative (TN), false positive (FP), and false negative (FN) counts. From these, metrics like accuracy, precision, recall, and F1-score can be derived to assess the model's performance.
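
The sketch below counts the four cells by hand and derives the usual metrics from them. The labels are made-up example data, and the helper name `confusion_counts` is hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    # Count each cell of the binary confusion matrix.
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp, tn, fp, fn

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are right
precision = tp / (tp + fp)                   # share of predicted positives that are right
recall = tp / (tp + fn)                      # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)

print(tp, tn, fp, fn)                    # 3 5 1 1
print(accuracy, precision, recall, f1)   # 0.8 0.75 0.75 0.75
```

Reading the counts: one positive was missed (FN = 1) and one negative was flagged by mistake (FP = 1), so precision and recall are both 3/4 even though accuracy is 0.8.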

12. Explain the significance of ROC-AUC curves in model evaluation.

  • Answer:
    • The ROC (Receiver Operating Characteristic) curve plots the true positive rate (sensitivity) against the false positive rate (1 - specificity) at various threshold settings. The AUC (Area Under the Curve) summarizes the curve in a single number that measures the model's ability to distinguish between classes: 0.5 corresponds to random guessing and 1.0 to perfect separation. A higher AUC indicates better model performance.
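
One way to build intuition for AUC is its ranking interpretation: it equals the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it that way instead of tracing the curve; the labels and scores are made-up example data.

```python
def roc_auc(y_true, scores):
    # Split the scores by true class.
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    # Count positive/negative pairs ranked correctly (ties count as half).
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, scores)
print(auc)  # 0.75
```

Three of the four positive/negative pairs are ordered correctly (only 0.35 versus 0.4 is wrong), which gives 0.75. The pairwise version compares every positive with every negative, so it is fine for study examples but slow on large datasets.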

By practicing these active recall questions and understanding their answers, you'll be well-prepared to tackle the exam questions related to Data Analytics and Model Evaluation.