Text classification, a fundamental task in natural language processing (NLP), automatically assigns predefined categories or labels to free-text documents. In this blog post, we'll explore how to perform text classification with scikit-learn, a popular machine learning library for Python.
Introduction to Text Classification
Text classification, also known as text categorization or document classification, is a supervised learning task: we train a model on labeled examples so that it can assign new text documents to one or more predefined categories. Applications include spam detection, sentiment analysis, topic labeling, and more.
The Dataset: 20 Newsgroups
For this tutorial, we'll use the 20 Newsgroups dataset, a classic benchmark dataset widely used for text classification tasks. It consists of approximately 20,000 newsgroup documents across 20 different topics. Each document belongs to one of the predefined categories.
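As a quick sanity check on that description, the per-category document counts can be tallied from the integer labels. The sketch below is only an illustration: `summarize_labels` is a hypothetical helper (not part of scikit-learn), and the toy labels stand in for the `target` and `target_names` attributes of the object returned by scikit-learn's `fetch_20newsgroups`, so it runs without downloading anything.

```python
from collections import Counter


def summarize_labels(targets, target_names):
    """Map each category name to the number of documents carrying its label.

    `targets` holds integer class ids; `target_names[i]` names class id i.
    """
    counts = Counter(int(t) for t in targets)
    return {name: counts.get(i, 0) for i, name in enumerate(target_names)}


# Toy stand-in using three newsgroup names (the real dataset has 20).
toy_names = ["alt.atheism", "comp.graphics", "sci.space"]
toy_targets = [0, 1, 1, 2, 1, 0]

label_counts = summarize_labels(toy_targets, toy_names)
for name, count in label_counts.items():
    print(f"{name}: {count}")
```

With the real data loaded into a variable named `dataset`, the equivalent call would be `summarize_labels(dataset.target, dataset.target_names)`.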
Code Implementation
Let's dive into the Python code to perform text classification using scikit-learn:
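# Workaround: disable HTTPS certificate verification in case the dataset download fails with an SSL error.
# This is insecure; remove these two lines if the download works without them.
import ssl
ssl._create_default_https_context = ssl._create_unverified_context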
# Importing necessary libraries
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
# Load dataset (the default is subset="train" only; "all" loads both the train and test portions)
dataset = fetch_20newsgroups(subset="all")
X, y = dataset.data, dataset.target
# Splitting dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=45)
# Convert dataset into feature vectors using TF-IDF Vectorizer
vectorizer = TfidfVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)
# Train classifier (Logistic Regression); max_iter is raised from the default of 100 to give the solver room to converge
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
# Making predictions
pred = clf.predict(X_test)
# Evaluating the model
print(metrics.classification_report(y_test, pred))
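The report printed above lists precision, recall, and F1-score for each class. As a rough illustration of what those columns measure, the sketch below computes them for a single class from scratch. `precision_recall_f1` is a hypothetical helper written for this example, not a scikit-learn function, and the labels are made up.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Compute precision, recall and F1 for one class treated as 'positive'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Made-up true and predicted labels for six documents and three classes.
y_true = [0, 0, 1, 1, 1, 2]
y_pred = [0, 1, 1, 1, 0, 2]

p, r, f = precision_recall_f1(y_true, y_pred, positive=1)
print(f"class 1: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Precision answers "of the documents predicted as this class, how many were right?", while recall answers "of the documents that truly belong to this class, how many were found?"; F1 is their harmonic mean.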
Understanding the Code
1. Importing Libraries: We import the scikit-learn modules for dataset loading, feature extraction (`TfidfVectorizer`), model training (`LogisticRegression`), data splitting, and evaluation metrics.
2. Loading Dataset: We load the 20 Newsgroups dataset with the `fetch_20newsgroups()` function provided by scikit-learn, which downloads the data on first use.
3. Splitting Dataset: The `train_test_split()` function holds out 25% of the documents for testing; the fixed `random_state` makes the split reproducible.
4. Feature Extraction: `TfidfVectorizer` converts the text documents into numerical feature vectors, dropping English stop words. It is fitted on the training set only and then applied unchanged to the test set, so no information from the test documents leaks into the features.
5. Training Classifier: We train a Logistic Regression classifier on the training vectors.
6. Making Predictions: We use the trained classifier to predict labels for the test documents.
7. Model Evaluation: Finally, `classification_report()` prints precision, recall, and F1-score for each category.
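To make step 4 more concrete, here is a from-scratch sketch of TF-IDF weighting on a three-document toy corpus. It assumes the smoothed IDF formula that the scikit-learn documentation gives for the default settings, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each document vector. The whitespace tokenizer and the `tfidf` helper are simplifications invented for illustration, so the numbers will not generally match `TfidfVectorizer` on real text, which also handles punctuation and stop words.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, rows); each row is an L2-normalized TF-IDF vector."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)

    # Document frequency: the number of documents each term appears in.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed IDF (assumed to mirror scikit-learn's default smooth_idf=True).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    rows = []
    for toks in tokenized:
        tf = Counter(toks)  # raw term counts within this document
        raw = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocab, rows


docs = [
    "the rocket reached orbit",
    "the shader renders the image",
    "orbit of the moon",
]
vocab, rows = tfidf(docs)

# Show the non-zero weights of the first document.
first = dict(zip(vocab, rows[0]))
print({t: round(w, 3) for t, w in first.items() if w > 0})
```

In the first document, "the" appears in every document and so receives the lowest weight, "orbit" appears in two documents and sits in the middle, and "rocket" appears in only one and is weighted highest.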
Conclusion
In this blog post, we've demonstrated how to perform text classification using the scikit-learn library in Python. We've covered loading the dataset, splitting it into training and testing sets, extracting TF-IDF features, training a model, making predictions, and evaluating the model's performance. Text classification is a powerful technique with numerous real-world applications, and scikit-learn provides a user-friendly interface for implementing it efficiently.
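As a compact recap, the vectorizer and classifier can also be chained with scikit-learn's `make_pipeline`, so that fitting and predicting each take a single call and the vectorizer is only ever fitted on training text. This is a sketch on a made-up eight-document corpus rather than the code from this post; the texts, labels, and `max_iter=1000` are illustrative choices.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up training texts: label 0 = space, label 1 = graphics.
train_texts = [
    "rocket launch reached orbit",
    "nasa satellite in orbit",
    "moon mission rocket",
    "astronaut aboard the space station",
    "render the 3d image",
    "pixel shader for the image",
    "graphics card render speed",
    "texture and polygon shader",
]
train_labels = [0, 0, 0, 0, 1, 1, 1, 1]

# The pipeline applies TF-IDF first, then feeds the vectors to the classifier.
model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LogisticRegression(max_iter=1000),
)
model.fit(train_texts, train_labels)

# Raw strings go straight in; the pipeline vectorizes them before predicting.
new_texts = ["rocket in orbit", "shader renders an image"]
predictions = model.predict(new_texts)
print(list(predictions))
```

Because the pipeline accepts raw text, it can be passed as a single estimator to utilities such as `train_test_split`-based evaluation or cross-validation without vectorizing by hand.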
The code for this post is available on GitHub.