K-means cluster analysis is an unsupervised machine learning algorithm that partitions a dataset into distinct groups, or clusters, based on the similarity of data points. It aims to find K cluster centers such that the within-cluster sum of squares is minimized.
Here's how k-means cluster analysis works:
- Initial Setup: The algorithm begins by randomly selecting K initial cluster centers from the dataset. K is a user-defined parameter that specifies the desired number of clusters.
- Assignment Step: Each data point in the dataset is assigned to the nearest cluster center based on a distance metric, typically Euclidean distance. This step forms the initial clustering of the data.
- Update Step: After the assignment step, the cluster centers are updated by computing the mean of the data points assigned to each cluster. The updated cluster centers become the new centroids.
- Iterative Refinement: The assignment and update steps are repeated until convergence, which occurs when the cluster assignments no longer change or a specified maximum number of iterations is reached.
- Final Clustering: Once the algorithm converges, the data points are assigned to their respective final clusters based on the updated cluster centers.
- Evaluation: The quality of the resulting clustering can be assessed using various metrics, such as the within-cluster sum of squares (WCSS) or silhouette score, which measure the compactness of clusters and the separation between clusters.
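To make these steps concrete, here is a minimal sketch of the k-means loop in plain NumPy. The kmeans function name, the toy two-blob data, and the choice of k=2 are illustrative assumptions, not any library's API.

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initial Setup: pick k distinct data points as the starting centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment Step: label each point with its nearest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update Step: move each center to the mean of its assigned points
        # (a production implementation would also handle empty clusters)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Convergence: stop once the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# toy example: two well-separated Gaussian blobs
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centers, labels = kmeans(X, k=2)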
Key Considerations and Limitations:
- Choice of K: Selecting an appropriate value for K is crucial and often requires domain knowledge or experimentation. Techniques such as the elbow method or silhouette analysis can assist in determining an optimal K value (a short sketch follows this list).
- Initialization: The initial random selection of cluster centers can affect the final clustering result. Multiple runs with different initializations can be performed to mitigate this issue, and the best result can be chosen.
- Sensitivity to Outliers: K-means is sensitive to outliers, as they can significantly impact the calculation of cluster centers. Preprocessing or outlier detection techniques may be necessary to handle outliers effectively.
- Non-Convex Clusters: K-means assumes that clusters are convex and isotropic, which means they have a roughly spherical shape and similar variances. If the clusters in the data are non-convex, K-means may struggle to capture complex cluster structures.
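To illustrate the first two points above, here is a brief, hedged sketch of silhouette analysis for choosing K, combined with multiple random initializations (scikit-learn's n_init parameter). The make_blobs data is a stand-in assumption; any standardized feature matrix would do.

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)  # stand-in data
for k in range(2, 8):
    # n_init=10 reruns k-means from 10 random starts and keeps the best result
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    # higher mean silhouette is better; a peak suggests a good K
    print(k, silhouette_score(X, km.labels_))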
K-means cluster analysis is widely used in various domains, such as customer segmentation, image analysis, document clustering, and anomaly detection. It provides a simple and efficient approach for partitioning data into meaningful groups based on similarity, although it does have certain assumptions and limitations that should be considered in its application.
Here is example code for running a K-means cluster analysis on the tree_addhealth.csv dataset.
# -*- coding: utf-8 -*-
"""
Created on Mon Jan 18 19:51:29 2016
@author: jrose01
"""
from pandas import DataFrame
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# sklearn.cross_validation was removed in modern scikit-learn; use model_selection instead
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.cluster import KMeans
"""
Data Management
"""
data = pd.read_csv("tree_addhealth.csv")
# upper-case all DataFrame column names
data.columns = map(str.upper, data.columns)
# Data Management
data_clean = data.dropna()
# subset clustering variables
cluster = data_clean[['ALCEVR1', 'MAREVER1', 'ALCPROBS1', 'DEVIANT1', 'VIOL1',
'DEP1', 'ESTEEM1', 'SCHCONN1', 'PARACTV', 'PARPRES', 'FAMCONCT']]
print(cluster.describe())
# standardize clustering variables to have mean=0 and sd=1
clustervar = cluster.copy()
for col in clustervar.columns:
    clustervar[col] = preprocessing.scale(clustervar[col].astype('float64'))
# split data into train and test sets
clus_train, clus_test = train_test_split(clustervar, test_size=.3, random_state=123)
# k-means cluster analysis for 1-9 clusters
from scipy.spatial.distance import cdist
clusters = range(1, 10)
meandist = []
for k in clusters:
    model = KMeans(n_clusters=k)
    model.fit(clus_train)
    clusassign = model.predict(clus_train)
    # average distance of each observation to its nearest cluster center
    meandist.append(sum(np.min(cdist(clus_train, model.cluster_centers_, 'euclidean'), axis=1))
                    / clus_train.shape[0])
"""
Plot average distance from observations from the cluster centroid
to use the Elbow Method to identify number of clusters to choose
"""
plt.plot(clusters, meandist)
plt.xlabel('Number of clusters')
plt.ylabel('Average distance')
plt.title('Selecting k with the Elbow Method')
plt.show()
# Interpret 3 cluster solution
model3 = KMeans(n_clusters=3)
model3.fit(clus_train)
clusassign = model3.predict(clus_train)
# plot clusters
from sklearn.decomposition import PCA
pca_2 = PCA(2)
plot_columns = pca_2.fit_transform(clus_train)
plt.scatter(x=plot_columns[:, 0], y=plot_columns[:, 1], c=model3.labels_)
plt.xlabel('First principal component')
plt.ylabel('Second principal component')
plt.title('Scatterplot of the First Two Principal Components for 3 Clusters')
plt.show()
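# (Added illustration, not in the original script) Optional sanity check:
# the mean silhouette score summarizes how compact and well separated the
# 3-cluster solution is (values closer to 1 are better)
from sklearn.metrics import silhouette_score
print('Silhouette score, 3 clusters:', silhouette_score(clus_train, model3.labels_))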
"""
BEGIN multiple steps to merge cluster assignment with clustering variables to examine
cluster variable means by cluster
"""
# create a unique identifier variable from the index for the
# cluster training data to merge with the cluster assignment variable
clus_train.reset_index(level=0, inplace=True)
# create a list that has the new index variable
cluslist = list(clus_train['index'])
# create a list of cluster assignments
labels = list(model3.labels_)
# combine index variable list with cluster assignment list into a dictionary
newlist = dict(zip(cluslist, labels))
# convert newlist dictionary to a dataframe
newclus = DataFrame.from_dict(newlist, orient='index')
# rename the cluster assignment column
newclus.columns = ['cluster']
# now do the same for the cluster assignment variable
# create a unique identifier variable from the index for the
# cluster assignment dataframe
# to merge with cluster training data
newclus.reset_index(level=0, inplace=True)
# merge the cluster assignment dataframe with the cluster training variable dataframe
# by the index variable
merged_train = pd.merge(clus_train, newclus, on='index')
print(merged_train.head(n=100))
# cluster frequencies
print(merged_train.cluster.value_counts())
"""
END multiple steps to merge cluster assignment with clustering variables to examine
cluster variable means by cluster
"""
# FINALLY calculate clustering variable means by cluster
clustergrp = merged_train.groupby('cluster').mean()
print("Clustering variable means by cluster")
print(clustergrp)
# validate clusters in training data by examining cluster differences in GPA using ANOVA
# first have to merge GPA with clustering variables and cluster assignment data
gpa_data = data_clean['GPA1']
# split GPA data into train and test sets
gpa_train, gpa_test = train_test_split(gpa_data, test_size=.3, random_state=123)
gpa_train1 = pd.DataFrame(gpa_train)
gpa_train1.reset_index(level=0, inplace=True)
merged_train_all = pd.merge(gpa_train1, merged_train, on='index')
sub1 = merged_train_all[['GPA1', 'cluster']].dropna()
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi
gpamod = smf.ols(formula='GPA1 ~ C(cluster)', data=sub1).fit()
print(gpamod.summary())
print('means for GPA by cluster')
m1 = sub1.groupby('cluster').mean()
print(m1)
print('standard deviations for GPA by cluster')
m2 = sub1.groupby('cluster').std()
print(m2)
mc1 = multi.MultiComparison(sub1['GPA1'], sub1['cluster'])
res1 = mc1.tukeyhsd()
print(res1.summary())
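Note that the script splits the standardized data into clus_train and clus_test but never touches the test set. As a hedged follow-up sketch, the fitted 3-cluster model can be applied to the held-out observations to check whether the cluster frequencies and variable means look similar out of sample:

# assign held-out observations to the clusters learned on the training data
test_labels = model3.predict(clus_test)
clus_test_check = clus_test.copy()
clus_test_check['cluster'] = test_labels
print('Test-set cluster frequencies')
print(clus_test_check['cluster'].value_counts())
print('Test-set clustering variable means by cluster')
print(clus_test_check.groupby('cluster').mean())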