本文共 5535 字,大约阅读时间需要 18 分钟。
聚类是一种无监督学习方法,无监督学习方法没有训练过程。所谓无监督学习是指事先并不知道要寻找的内容,即没有目标变量。聚类将数据点归到多个族中,其中相似数据点处于同一族,而不相似数据点处在不同族。聚类可以使用多种不同的方法来计算相似度。
一种广泛使用的聚类算法是K-均值算法,其中k是用户指定的要创建的簇的数目。 K-均值聚类算法以k个随机质心开始。算法会计算每个点到质心的距离。每个点会被分配到距其最近的簇 质心,然后紧接着基于新分配到簇的点更新簇质心。以上过程重复数次,直到簇质心不再改变。这个简单的算法非常有效但是也容易受到初始簇质心的影响。为了获得更好的聚类效果,可以使用另一种称为二分K-均值的聚类算法。二分K-均值算法首先将所有点作为一个簇,然后使用K-均值算法(k = 2)对其划分。下一次迭代时,选择有最大误差的簇进行划分。该过程重复直到k个簇创建成功为止。二分K-均值的聚类效果要好于K-均值算法。K-均值算法以及变形的K-均值算法并非仅有的聚类算法,另外称为层次聚类的方法也被广泛使用。
附上二分k均值的算法实现(python代码)
from numpy import *#加载数据集def loadDataSet(fileName): #general function to parse tab -delimited floats dataMat = [] #assume last column is target value fr = open(fileName) for line in fr.readlines(): curLine = line.strip().split('\t') dataMat.append(curLine) return dataMat#计算欧拉距离,两个向量之间的距离def distEclud(vecA, vecB): return sqrt(sum(power(vecA - vecB, 2))) #la.norm(vecA-vecB)#随机选择K个聚类中心def randCent(dataSet, k): n = shape(dataSet)[1] centroids = mat(zeros((k,n))) #创建k个聚类中心 for j in range(n):#create random cluster centers, within bounds of each dimension minJ = min(dataSet[:,j]) rangeJ = float(max(dataSet[:,j]) - minJ) centroids[:,j] = mat(minJ + rangeJ * random.rand(k,1)) return centroids#k均值算法def kMeans(dataSet, k, distMeas=distEclud, createCent=randCent): m = shape(dataSet)[0] clusterAssment = mat(zeros((m,2)))#create mat to assign data points #to a centroid, also holds SE(误差平方和) of each point centroids = createCent(dataSet, k) #创建k个聚类中心 clusterChanged = True while clusterChanged: clusterChanged = False for i in range(m): #将数据点i归类到与它最近的集群 minDist = inf; minIndex = -1 for j in range(k): distJI = distMeas(centroids[j,:],dataSet[i,:]) if distJI < minDist: minDist = distJI; minIndex = j if clusterAssment[i,0] != minIndex: clusterChanged = True clusterAssment[i,:] = minIndex,minDist**2 print (centroids) for cent in range(k):#recalculate centroids ptsInClust = dataSet[nonzero(clusterAssment[:,0].A==cent)[0]]#get all the point in this cluster centroids[cent,:] = mean(ptsInClust, axis=0) #assign centroid to mean return centroids, clusterAssment#二分k均值聚类算法def biKmeans(dataSet, k, distMeas=distEclud): m = shape(dataSet)[0] #得到多少行 clusterAssment = mat(zeros((m,2))) centroid0 = mean(dataSet, axis=0).tolist()[0] #初始化一个聚类中心 centList =[centroid0] #create a list with one centroid for j in range(m): #计算初始化的误差平方和 clusterAssment[j,1] = distMeas(mat(centroid0), dataSet[j,:])**2 while (len(centList) < k): #每次增加一个聚类中心,直到达到k个聚类中心为止 lowestSSE = inf #初始化为无穷大 for i in range(len(centList)): ptsInCurrCluster = dataSet[nonzero(clusterAssment[:,0].A==i)[0],:] #得到聚类中心i中的所有数据点 centroidMat, splitClustAss = kMeans(ptsInCurrCluster, 2, distMeas) sseSplit = sum(splitClustAss[:,1])#compare the SSE to the currrent minimum sseNotSplit = sum(clusterAssment[nonzero(clusterAssment[:,0].A!=i)[0],1]) #得到不属于聚类中心i的所有数据点的误差平方和 print ("sseSplit, and notSplit: ",sseSplit,sseNotSplit) if (sseSplit + sseNotSplit) < lowestSSE: bestCentToSplit = i bestNewCents = centroidMat bestClustAss = splitClustAss.copy() lowestSSE = sseSplit + sseNotSplit bestClustAss[nonzero(bestClustAss[:,0].A == 1)[0],0] = len(centList) #重新调整聚类数据(注意是数据) bestClustAss[nonzero(bestClustAss[:,0].A == 0)[0],0] = bestCentToSplit print ('the bestCentToSplit is: ',bestCentToSplit) print ('the len of bestClustAss is: ', len(bestClustAss)) centList[bestCentToSplit] = bestNewCents[0,:].tolist()[0] #重新调整聚类中心(注意是中心) centList.append(bestNewCents[1,:].tolist()[0]) clusterAssment[nonzero(clusterAssment[:,0].A == bestCentToSplit)[0],:]= bestClustAss#reassign new clusters, and SSE return mat(centList), clusterAssmentdata=loadDataSet('testSet2.txt')print(data)dealData=[]print('___________')for i in range(len(data)): subData=[] subData.append(float(data[i][0])) subData.append(float(data[i][1])) dealData.append(subData)data=mat(dealData)print(data)centList,clusterAssment=biKmeans(data,3)print(centList)
testSet2.txt中的数据
3.275154 2.957587-3.344465 2.6035130.355083 -3.3765851.852435 3.547351-2.078973 2.552013-0.993756 -0.8844332.682252 4.007573-3.087776 2.878713-1.565978 -1.2569852.441611 0.444826-0.659487 3.111284-0.459601 -2.6180052.177680 2.387793-2.920969 2.917485-0.028814 -4.1680783.625746 2.119041-3.912363 1.325108-0.551694 -2.8142232.855808 3.483301-3.594448 2.8566510.421993 -2.3726461.650821 3.407572-2.082902 3.384412-0.718809 -2.4925144.513623 3.841029-4.822011 4.607049-0.656297 -1.4498721.919901 4.439368-3.287749 3.918836-1.576936 -2.9776223.598143 1.975970-3.977329 4.900932-1.791080 -2.1845173.914654 3.559303-1.910108 4.166946-1.226597 -3.3178891.148946 3.345138-2.113864 3.5481720.845762 -3.5897882.629062 3.535831-1.640717 2.990517-1.881012 -2.4854054.606999 3.510312-4.366462 4.0233160.765015 -3.0012703.121904 2.173988-4.025139 4.652310-0.559558 -3.8405394.376754 4.863579-1.874308 4.032237-0.089337 -3.0268093.997787 2.518662-3.082978 2.8848220.845235 -3.4544651.327224 3.358778-2.889949 3.596178-0.966018 -2.8398272.960769 3.079555-3.275518 1.5770680.639276 -3.412840
转载地址:http://nukmi.baihongyu.com/