My Blog.

Regression and Classification using Scikit-learn

Let's dive into regression and classification using Scikit-learn, two fundamental techniques in predictive analytics.

Regression

Regression analysis is used to model the relationship between a dependent variable (target) and one or more independent variables (features). The goal is to predict the value of the dependent variable based on the values of the independent variables.

Linear Regression

Linear regression is a simple yet powerful technique that assumes a linear relationship between the dependent and independent variables.

  1. Simple Linear Regression

Here's how to perform simple linear regression using Scikit-learn:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Sample data
data = {
    'X': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Y': [3, 4, 2, 5, 6, 7, 8, 9, 10, 12]
}
df = pd.DataFrame(data)

# Splitting the data into training and testing sets
X = df[['X']]
y = df['Y']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Creating and training the model
model = LinearRegression()
model.fit(X_train, y_train)

# Making predictions
y_pred = model.predict(X_test)

# Evaluating the model
print(f'Mean Squared Error: {mean_squared_error(y_test, y_pred)}')
print(f'R^2 Score: {r2_score(y_test, y_pred)}')

# Plotting the results
plt.scatter(X, y, color='blue')
plt.plot(X, model.predict(X), color='red', linewidth=2)  # fitted line over the full data range
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Simple Linear Regression')
plt.show()
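
Beyond predictions, the fitted model exposes the learned slope and intercept directly through its coef_ and intercept_ attributes. A quick sketch, refitting on all ten toy points rather than the train split so the numbers are easy to check by hand:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same toy data as above, fitted on all ten points this time
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]])
y = np.array([3, 4, 2, 5, 6, 7, 8, 9, 10, 12])

model = LinearRegression().fit(X, y)
slope = model.coef_[0]        # m in y = m*x + c
intercept = model.intercept_  # c
print(f'y = {slope:.3f}*x + {intercept:.3f}')  # y = 1.018*x + 1.000
```

Reading the coefficients back out like this is a good sanity check that the model has learned something sensible before you trust its predictions.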

  2. Multiple Linear Regression

For multiple linear regression, you have more than one independent variable:

# Sample data with multiple features
data = {
    'X1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'X2': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
    'Y': [3, 4, 2, 5, 6, 7, 8, 9, 10, 12]
}
df = pd.DataFrame(data)

# Splitting the data into training and testing sets
X = df[['X1', 'X2']]
y = df['Y']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Creating and training the model
model = LinearRegression()
model.fit(X_train, y_train)

# Making predictions
y_pred = model.predict(X_test)

# Evaluating the model
print(f'Mean Squared Error: {mean_squared_error(y_test, y_pred)}')
print(f'R^2 Score: {r2_score(y_test, y_pred)}')

# Plotting predicted vs. actual values (optional, for illustration);
# a single 2-D line cannot faithfully show a two-feature fit
plt.scatter(y_test, y_pred, color='blue')
plt.xlabel('Actual Y')
plt.ylabel('Predicted Y')
plt.title('Multiple Linear Regression')
plt.show()
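
With multiple features, coef_ holds one weight per feature column. A minimal sketch on the same toy data; note that in this data X2 is just X1 + 1, so the two columns are perfectly collinear:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same two-feature toy data; X2 = X1 + 1, so the features are perfectly collinear
X = np.array([[x1, x1 + 1] for x1 in range(1, 11)])
y = np.array([3, 4, 2, 5, 6, 7, 8, 9, 10, 12])

model = LinearRegression().fit(X, y)
print(model.coef_)       # one weight per feature column
print(model.intercept_)
```

Because the columns are collinear, the individual coefficients are not uniquely identified (many weight pairs yield identical predictions), although the fitted values themselves are well-defined. This is worth keeping in mind before interpreting coefficients on real data.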

Logistic Regression

Logistic regression is used for binary classification problems. It predicts the probability that a given input point belongs to a certain class.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

# Loading the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# For simplicity, let's consider only two classes
X = X[y != 2]
y = y[y != 2]

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Creating and training the model
model = LogisticRegression()
model.fit(X_train, y_train)

# Making predictions
y_pred = model.predict(X_test)

# Evaluating the model
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
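
Since logistic regression is a probabilistic model, you can also inspect the class probabilities themselves via predict_proba rather than only the hard labels. A short sketch on the same two-class Iris subset:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Same two-class Iris subset as above
iris = load_iris()
X, y = iris.data[iris.target != 2], iris.target[iris.target != 2]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression().fit(X_train, y_train)
proba = model.predict_proba(X_test)  # shape (n_test, 2); each row sums to 1
print(proba[:3])
```

The hard prediction from predict() is simply the class with the larger probability; looking at the probabilities is useful when you want to apply a custom decision threshold.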

Classification

Classification involves predicting a categorical label. Scikit-learn provides various algorithms for classification, such as Naive Bayes, Decision Trees, and more.

Naive Bayes

Naive Bayes classifiers are based on Bayes' theorem and are particularly suited for high-dimensional data.

from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

# Loading the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Creating and training the model
model = GaussianNB()
model.fit(X_train, y_train)

# Making predictions
y_pred = model.predict(X_test)

# Evaluating the model
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
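
GaussianNB literally applies Bayes' theorem with a per-class Gaussian likelihood for each feature, and the estimates it fits are exposed as attributes. A quick look (attribute names as in current scikit-learn releases, so worth checking against your installed version):

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)

model = GaussianNB().fit(X_train, y_train)
print(model.class_prior_)  # estimated P(class), one entry per class
print(model.theta_)        # per-class mean of each feature, shape (3, 4)
```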

Decision Trees

Decision trees are a non-parametric supervised learning method used for classification and regression.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
from sklearn import tree

# Loading the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Creating and training the model
model = DecisionTreeClassifier(random_state=42)  # fix the seed so tie-breaking between equally good splits is reproducible
model.fit(X_train, y_train)

# Making predictions
y_pred = model.predict(X_test)

# Evaluating the model
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

# Plotting the decision tree
plt.figure(figsize=(20,10))
tree.plot_tree(model, filled=True, feature_names=iris.feature_names, class_names=iris.target_names)
plt.show()
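
Because an unconstrained tree can grow until it memorizes the training data, limiting its depth is one common way to regularize it; the fitted tree also reports how much each feature contributed to its splits. A minimal sketch (max_depth=3 is an arbitrary choice here, not a recommended value):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)

# Limiting depth is a simple guard against overfitting
model = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)
print(dict(zip(iris.feature_names, model.feature_importances_)))  # importances sum to 1
print(model.score(X_test, y_test))  # accuracy on the held-out set
```

In practice, max_depth (along with min_samples_leaf and similar parameters) is usually tuned with cross-validation rather than fixed by hand.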

Conclusion

By understanding and implementing these regression and classification techniques with Scikit-learn, you can build powerful predictive models to analyze and interpret data. These foundational skills are crucial for tackling a wide range of data science problems.

Mind Map: Regression and Classification using Scikit-learn

Creating mind maps for key concepts in regression and classification using Scikit-learn can help you quickly recall the material. Here are some keywords and short sentences you can use for your mind maps:

Regression

  Linear Regression
    - Definition: linear relationship between variables.
    - Formula: y = mx + c
    - Library class: LinearRegression
    - Example: predict house prices.
    - Key functions: fit(), predict()
    - Metrics: Mean Squared Error (MSE), R-squared (R^2)

  Multiple Linear Regression