GitHub - Felixburton7/Machine-Learning-Unsupervised-Learning-Implementing-K-Means-clustering: ML Unsupervised learning

Unsupervised Learning: Implementing K-means for Text Data Clustering and Evaluation with Wikipedia Articles

Overview

This project implements the k-means algorithm, a core unsupervised learning technique, and applies it to clustering text documents from Wikipedia. Unlike supervised learning, which depends on labeled data, unsupervised learning discovers hidden patterns or intrinsic structure in unlabeled data. K-means groups data points into k clusters by assigning each point to its nearest centroid, so that similar points end up in the same cluster.

The project involved several key components:

Text Preprocessing: Converting text documents into numerical feature vectors using techniques like TF-IDF.
Algorithm Implementation: Developing the k-means algorithm from scratch, covering initialization, assignment, and update steps.
Experimental Analysis: Exploring the effects of random initialization, varying the number of clusters, and evaluating clustering results both quantitatively and qualitatively.

Technology Used

To accomplish this project, I used the following technologies:

Python 3.x: The primary programming language for the implementation.
NumPy: For numerical computations and array manipulations.
Pandas: For data loading and manipulation.
Matplotlib: For visualizing data and clustering results.
scikit-learn: For utilities such as TfidfVectorizer and pairwise_distances.

Steps to Implement k-means

Setup and Requirements
- Ensure you have Python 3.x installed.
- Install the required Python libraries using pip:
```
pip install numpy pandas matplotlib scikit-learn
```
Running the Notebook
- Clone the repository:
```
git clone <repository-url>
cd <repository-directory>
```
- Start Jupyter Notebook:
```
jupyter notebook
```
- Open the Assignment6_KMeans_Text_Data.ipynb file and start working on the cells.

Notebook Structure

Initial Setup: Import the necessary libraries such as numpy, pandas, matplotlib, and scikit-learn.

```
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

%matplotlib inline
```

Load Data and Extract Features: Load the Wikipedia dataset and extract TF-IDF features from the text data.

```
wiki = pd.read_csv('people_wiki.csv')
wiki.head(20)

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_df=0.95)
tf_idf = vectorizer.fit_transform(wiki['text'])
words = vectorizer.get_feature_names_out()
```

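Normalize Vectors: Normalize all vectors to unit length so that Euclidean distance ranks neighbors the same way cosine distance does. TfidfVectorizer already applies L2 normalization by default, so this step mainly makes that assumption explicit.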
```
from sklearn.preprocessing import normalize
tf_idf = normalize(tf_idf)
```
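
The link between the two distances can be checked numerically. For unit-length vectors a and b, the squared Euclidean distance equals 2 * (1 - cos(a, b)), so the nearest centroid under one measure is also the nearest under the other. The sketch below uses two random vectors as stand-ins for TF-IDF rows.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.random(5), rng.random(5)   # arbitrary stand-ins for two TF-IDF rows

# Scale both vectors to unit length, which is what normalize() does per row.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

squared_euclidean = np.sum((a_unit - b_unit) ** 2)
cosine_distance = 1.0 - a_unit @ b_unit

# For unit vectors: ||a - b||^2 = 2 * (1 - cos(a, b))
print(squared_euclidean, 2.0 * cosine_distance)
```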

Implement k-means: Implement the k-means algorithm by defining functions for initialization, cluster assignment, and centroid revision.

Initial Centroids: Randomly choose initial centroids from the data points.

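```
def get_initial_centroids(data, k, seed=None):
    if seed is not None:
        np.random.seed(seed)
    n = data.shape[0]
    # Sample k distinct rows; sampling with replacement (the default) can
    # pick the same row twice and produce duplicate centroids.
    rand_indices = np.random.choice(n, k, replace=False)
    # Centroids are stored dense, since cluster means are generally dense.
    centroids = data[rand_indices, :].toarray()
    return centroids
```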

Assign Clusters: Assign each data point to the closest centroid.

```
from sklearn.metrics import pairwise_distances

def assign_clusters(data, centroids):
    distances = pairwise_distances(data, centroids, metric='euclidean')
    return np.argmin(distances, axis=1)
```
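
As an illustration of the assignment step, the sketch below computes the same point-to-centroid distance matrix with plain NumPy broadcasting on a few hypothetical 2-D points. It is a dense stand-in for the pairwise_distances call, not a replacement for it on sparse TF-IDF data.

```python
import numpy as np

# Hypothetical 2-D points and centroids, dense for readability.
points = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])

# distances[i, j] is the Euclidean distance from point i to centroid j.
diffs = points[:, None, :] - centroids[None, :, :]   # shape (4, 2, 2)
distances = np.sqrt((diffs ** 2).sum(axis=2))        # shape (4, 2)

# Each point takes the index of its nearest centroid.
labels = np.argmin(distances, axis=1)
print(labels)  # [0 0 1 1]
```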

Revise Centroids: Update the centroids to be the mean of the assigned data points.

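```
def revise_centroids(data, k, cluster_assignment):
    new_centroids = []
    for i in range(k):
        # Rows of the data matrix assigned to cluster i
        assigned_data = data[cluster_assignment == i]
        # mean(axis=0) on a sparse matrix returns a 1 x d matrix; flatten it
        # so the stacked result has shape (k, d).
        # This assumes every cluster has at least one member.
        centroid = assigned_data.mean(axis=0)
        new_centroids.append(np.asarray(centroid).ravel())
    return np.array(new_centroids)
```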
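
The update step can leave a cluster with no members, in which case its mean is undefined. The sketch below shows one possible policy on dense toy data: an empty cluster keeps its previous centroid. The function name `revise_centroids_dense` and this policy are assumptions for illustration, not part of the original notebook.

```python
import numpy as np

def revise_centroids_dense(data, k, cluster_assignment, old_centroids):
    """Mean of each cluster; an empty cluster keeps its previous centroid."""
    new_centroids = old_centroids.copy()
    for i in range(k):
        members = data[cluster_assignment == i]
        if len(members) > 0:
            new_centroids[i] = members.mean(axis=0)
    return new_centroids

# Hypothetical points and labels; cluster 1 has no members.
points = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0]])
labels = np.array([0, 0, 2])
old = np.array([[1.0, 1.0], [4.0, 4.0], [9.0, 9.0]])

updated = revise_centroids_dense(points, 3, labels, old)
print(updated)
```

Cluster 0 moves to the mean of its two points, cluster 2 moves onto its single point, and cluster 1 stays where it was.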

Iterate: Combine the steps in an iterative process until convergence.

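```
def kmeans(data, k, max_iters=100, seed=None):
    centroids = get_initial_centroids(data, k, seed)
    for i in range(max_iters):
        # Assignment step: label each point with its nearest centroid
        cluster_assignment = assign_clusters(data, centroids)
        # Update step: move each centroid to the mean of its points
        new_centroids = revise_centroids(data, k, cluster_assignment)
        # Converged once the centroids stop moving
        if np.all(centroids == new_centroids):
            break
        centroids = new_centroids
    return centroids, cluster_assignment
```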
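
To see how the loop behaves under different random initializations, here is a self-contained sketch on synthetic dense data. The function `kmeans_dense`, the three Gaussian blobs, and the choice of five seeds are all assumptions made for this example. It reports the inertia (sum of squared distances to the assigned centroid) per seed and keeps the run with the lowest value.

```python
import numpy as np

def kmeans_dense(data, k, seed, max_iters=100):
    """Plain k-means on a dense array; returns centroids, labels, inertia."""
    rng = np.random.default_rng(seed)
    # Distinct data points as initial centroids (sampling without replacement).
    centroids = data[rng.choice(len(data), size=k, replace=False)].copy()
    labels = None
    for _ in range(max_iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        sq_dists = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = sq_dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments stopped changing, so centroids would too
        labels = new_labels
        # Update step: mean of each non-empty cluster.
        for i in range(k):
            members = data[labels == i]
            if len(members) > 0:
                centroids[i] = members.mean(axis=0)
    # Inertia: squared distance of every point to its assigned centroid.
    sq_dists = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    inertia = sq_dists[np.arange(len(data)), labels].sum()
    return centroids, labels, inertia

# Synthetic data: three Gaussian blobs of 30 points each.
data_rng = np.random.default_rng(42)
data = np.vstack([
    data_rng.normal(loc=(0.0, 0.0), scale=0.5, size=(30, 2)),
    data_rng.normal(loc=(5.0, 5.0), scale=0.5, size=(30, 2)),
    data_rng.normal(loc=(0.0, 5.0), scale=0.5, size=(30, 2)),
])

results = {seed: kmeans_dense(data, k=3, seed=seed) for seed in range(5)}
for seed, (_, _, inertia) in results.items():
    print(f"seed={seed} inertia={inertia:.2f}")

# Keep the run with the lowest inertia.
best_seed = min(results, key=lambda s: results[s][2])
print("best seed:", best_seed)
```

Because k-means only finds a local optimum, different seeds can end with different inertia values, which is why several restarts are compared.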

Evaluation

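Evaluate the clustering results both quantitatively (e.g., using inertia, the sum of squared distances from each point to its assigned centroid) and qualitatively (e.g., by examining which documents and words dominate each cluster).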
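
For the qualitative side, one common approach is to list the highest-weighted words of each centroid. The sketch below assumes a small made-up vocabulary and centroid matrix; in the notebook these would correspond to the `words` array and the centroids returned by `kmeans`. The helper name `top_words` is hypothetical.

```python
import numpy as np

# Hypothetical vocabulary and centroid matrix (rows = clusters, columns = words).
words = np.array(["goal", "league", "senate", "election", "album", "band"])
centroids = np.array([
    [0.60, 0.50, 0.02, 0.01, 0.00, 0.03],
    [0.01, 0.02, 0.55, 0.62, 0.00, 0.01],
    [0.02, 0.00, 0.01, 0.00, 0.58, 0.57],
])

def top_words(centroid, words, n=2):
    """Return the n words with the largest weights in a centroid."""
    top_idx = np.argsort(centroid)[::-1][:n]
    return [str(words[i]) for i in top_idx]

for cluster_id, centroid in enumerate(centroids):
    print(cluster_id, top_words(centroid, words))
```

A cluster whose top words share a theme (sports, politics, music in this toy case) suggests that the grouping is meaningful.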

By following these steps, I successfully implemented and evaluated the k-means algorithm for clustering text data from Wikipedia. The project provided insights into the clustering process and the impact of various factors such as initialization and the number of clusters.

