Classification Algorithms: Naïve Bayes and Decision Trees
Naïve Bayes Classifier
Overview: The Naïve Bayes classifier is a probabilistic machine learning model based on Bayes' Theorem, particularly useful for classification tasks. Despite its simplicity and the "naïve" assumption of independence among features, it often performs surprisingly well in various applications such as spam filtering, text classification, and sentiment analysis.
Bayes' Theorem: Bayes' Theorem provides a way to update the probability estimate for a hypothesis as more evidence or information becomes available; a worked numerical example follows the definitions below. \[ P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)} \] where:
- \( P(A|B) \) is the posterior probability of class \( A \) given predictor \( B \).
- \( P(B|A) \) is the likelihood: the probability of predictor \( B \) given class \( A \).
- \( P(A) \) is the prior probability of class \( A \).
- \( P(B) \) is the prior probability of predictor \( B \) (the evidence).
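As a quick worked example with assumed numbers: suppose 30% of emails are spam (\( P(\text{spam}) = 0.3 \)), the word "free" appears in 60% of spam emails (\( P(\text{free}|\text{spam}) = 0.6 \)) and in 10% of non-spam emails. By the law of total probability, \( P(\text{free}) = 0.6 \times 0.3 + 0.1 \times 0.7 = 0.25 \), so \[ P(\text{spam}|\text{free}) = \frac{P(\text{free}|\text{spam}) \cdot P(\text{spam})}{P(\text{free})} = \frac{0.6 \times 0.3}{0.25} = 0.72 \] Observing the word "free" raises the spam probability from the 30% prior to a 72% posterior.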
Assumption of Independence: The Naïve Bayes classifier assumes that each feature is conditionally independent of every other feature, given the class. Under this assumption, the posterior factorizes as \[ P(A|X_1, X_2, \ldots, X_n) = \frac{P(X_1|A) \cdot P(X_2|A) \cdots P(X_n|A) \cdot P(A)}{P(X_1, X_2, \ldots, X_n)} \]
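To make this product-of-likelihoods form concrete, here is a minimal sketch that scores two classes for a two-word message; all probabilities are assumed values chosen for illustration, not estimated from data.
import numpy as np
# Assumed per-word likelihoods for illustration (not learned from data)
p_word_given_spam = {'free': 0.6, 'meeting': 0.1}
p_word_given_ham  = {'free': 0.1, 'meeting': 0.5}
p_spam, p_ham = 0.3, 0.7  # assumed class priors
words = ['free', 'meeting']
# Numerator of Bayes' theorem for each class: prior times product of likelihoods
score_spam = p_spam * np.prod([p_word_given_spam[w] for w in words])
score_ham  = p_ham  * np.prod([p_word_given_ham[w]  for w in words])
# The denominator P(X1, ..., Xn) is identical for both classes, so
# normalizing the two scores yields the posterior probabilities
posterior_spam = score_spam / (score_spam + score_ham)
print(f'P(spam | words) = {posterior_spam:.3f}')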
Types of Naïve Bayes Classifiers (a short scikit-learn sketch follows this list):
- Gaussian Naïve Bayes: Assumes that the continuous values associated with each class are distributed according to a Gaussian distribution.
- Multinomial Naïve Bayes: Suitable for discrete count features, commonly used in text classification.
- Bernoulli Naïve Bayes: Assumes binary features (0s and 1s).
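All three variants share the same estimator interface in scikit-learn; the sketch below uses small made-up arrays, with values chosen only to match each variant's expected feature type.
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
# Continuous features -> GaussianNB
GaussianNB().fit([[1.2, 0.5], [3.1, 2.2], [0.9, 0.7], [2.8, 1.9]], [0, 1, 0, 1])
# Discrete counts (e.g., word counts) -> MultinomialNB
MultinomialNB().fit([[2, 0, 1], [0, 3, 0], [1, 0, 2], [0, 2, 1]], [0, 1, 0, 1])
# Binary presence/absence features -> BernoulliNB
BernoulliNB().fit([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]], [0, 1, 0, 1])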
Implementation in Python using Scikit-learn:
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Sample data
X = [[1, 2], [2, 3], [3, 4], [4, 5]]
y = [0, 1, 0, 1]
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)
# Initialize the classifier
model = GaussianNB()
# Train the model
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')
Decision Trees
Overview: Decision Trees are a non-parametric supervised learning method used for classification and regression. They create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. They are intuitive and easy to interpret.
Structure of a Decision Tree:
- Root Node: The topmost node representing the entire dataset, which gets split into two or more homogeneous sets.
- Internal Nodes: Nodes where an attribute is tested and the data is split by the outcome of the test.
- Leaf Nodes: Terminal nodes representing the outcome or class labels.
Splitting Criteria:
- Gini Impurity: Measures the impurity of a node. \[ Gini = 1 - \sum_i p_i^2 \] where \( p_i \) is the probability of an element belonging to class \( i \).
- Entropy: Measures the randomness in the information being processed. \[ Entropy = - \sum_i p_i \log_2 p_i \] Both criteria are computed from class proportions, as shown in the sketch after this list.
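A minimal sketch computing both measures for a list of class labels:
import numpy as np
def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)
def entropy(labels):
    """Shannon entropy in bits: -sum of p * log2(p) over classes."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))
labels = [0, 0, 1, 1, 1]
print(f'Gini: {gini(labels):.3f}')       # 1 - (0.4^2 + 0.6^2) = 0.48
print(f'Entropy: {entropy(labels):.3f}') # about 0.971 bits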
Algorithm:
- Select the Best Attribute: Choose the attribute (and split point) that minimizes Gini impurity or maximizes information gain.
- Splitting: Partition the dataset into subsets according to the values of the selected attribute.
- Repeat: Recursively split each subset until a stopping criterion is met (e.g., all samples in a node belong to one class, or no attributes remain to split on); a minimal recursive sketch follows this list.
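Below is a minimal recursive sketch of this procedure, assuming numeric features with threshold splits and reusing the gini helper defined above; real implementations (such as scikit-learn's CART) add many refinements.
import numpy as np
def best_split(X, y):
    """Exhaustively search every (feature, threshold) pair for the split
    with the lowest weighted Gini impurity."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    best, best_score = None, float('inf')
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            if len(left) == 0 or len(right) == 0:
                continue  # skip degenerate splits
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best_score:
                best, best_score = (j, t), score
    return best
def build_tree(X, y, depth=0, max_depth=3):
    """Recursively split until the node is pure, max_depth is reached,
    or no useful split exists; leaves predict the majority class."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    split = best_split(X, y)
    if len(np.unique(y)) == 1 or depth == max_depth or split is None:
        return {'leaf': np.bincount(y).argmax()}
    j, t = split
    mask = X[:, j] <= t
    return {'feature': j, 'threshold': t,
            'left': build_tree(X[mask], y[mask], depth + 1, max_depth),
            'right': build_tree(X[~mask], y[~mask], depth + 1, max_depth)}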
Pruning: Reduces the size of the decision tree by removing sections that contribute little predictive power, which helps prevent overfitting. In scikit-learn, post-pruning is available as minimal cost-complexity pruning, illustrated below.
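In scikit-learn, cost_complexity_pruning_path returns the candidate ccp_alpha values for a dataset, and refitting with a larger ccp_alpha produces a smaller tree. A brief sketch on made-up data:
from sklearn.tree import DecisionTreeClassifier
X = [[0, 0], [1, 1], [2, 2], [3, 3], [4, 4], [5, 5]]
y = [0, 0, 1, 1, 0, 1]
# Candidate effective alphas from weakest-link pruning
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
for alpha in path.ccp_alphas:
    # Larger alpha penalizes tree size more, yielding fewer leaves
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X, y)
    print(f'alpha={alpha:.3f}, leaves={pruned.get_n_leaves()}')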
Implementation in Python using Scikit-learn:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn import tree
# Sample data
X = [[0, 0], [1, 1], [2, 2], [3, 3]]
y = [0, 1, 1, 0]
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)
# Initialize the classifier
model = DecisionTreeClassifier()
# Train the model
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')
# Visualize the tree (rendering requires matplotlib)
import matplotlib.pyplot as plt
tree.plot_tree(model)
plt.show()
Comparison of Naïve Bayes and Decision Trees:
- Naïve Bayes:
  - Simple and fast.
  - Performs well with high-dimensional data.
  - Based on a strong assumption of feature independence, which may not always be true.
- Decision Trees:
  - Intuitive and easy to interpret.
  - Can handle both numerical and categorical data.
  - Prone to overfitting, but this can be mitigated with techniques like pruning or ensemble methods (e.g., Random Forests).
Both classifiers have their strengths and weaknesses, and the choice between them often depends on the specific characteristics of the dataset and the problem at hand.
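One practical way to compare them is to fit both on the same train/test split; the sketch below uses the bundled Iris dataset, and the exact accuracies will vary with the split and parameters.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Fit each classifier on the same split and report test accuracy
for name, model in [('Naive Bayes', GaussianNB()),
                    ('Decision Tree', DecisionTreeClassifier(random_state=42))]:
    y_pred = model.fit(X_train, y_train).predict(X_test)
    print(f'{name}: {accuracy_score(y_test, y_pred):.3f}')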