聚类总结 - ZhengYang's Blog

距离

原型聚类

k-means

LVQ

DBSCAN

层次聚类

距离

1-范数距离(曼哈顿距离)
$$||x_i-y_i||_1 = \sum_{i=1}^n|x_i-y_i|$$
2-范数距离(欧式距离)
$$||x_i-y_i||_2 = (\sum_{i=1}^n|x_i-y_i|^2)^{1/2}$$
p-范数距离(闵可夫斯基距离)
$$||x_i-y_i||_p = (\sum_{i=1}^n|x_i-y_i|^p)^{1/p}$$
无限范数距离(切比雪夫距离)
$$||x_i-y_i||_\infty = (\sum_{i=1}^n|x_i-y_i|^\infty)^{1/\infty}$$

原型聚类

Prototype methods represent the training data by a set of points in feature space. These prototypes are typically not examples from the training sample, except int the case of 1-nearest-neighbor classification. Each prototype has an associated class label, the classification of a query point x is made to the class of the closest prototype.

原型：样本空间中具有代表性的点，不一定是训练集上的点，也可能不存在训练集上。
原型方法对于不同密度，不同形状的簇的聚类，效果不好。

k-means

在样本中，随机选择K个点作为初始质心
将样本中的每个点指派到最近的质心，形成K个簇
重复计算每个簇的质心
如果质心改变大于阈值，返回步骤2；
如果质心改变小于阈值，则停止，得到K个质心和簇

学习向量量化 LVQ(Learning Vector Quantization)

初始化R个原型，每个原型是一个带类别信息的向量
即$(p_j,t_j)$，其中，类别 $t_j\in \{1,2,…,K\}$，
R一定要大于等于K，且必须包含所有的类别
例如，轮流在每个类上抽取一个向量，共抽取R个向量做原型
重复随机抽取训练样例$(x_i,y_i)$，计算$d_{ij}=||x_i-p_j||_2$，
找到与$x_i$最近距离的原型$p_{j^*}$，
$$j^*=argmax_{j=1,2,…,R}d_{ij}$$
该原型的类别为$t_{j^*}$
(a) 如果$y_i=t_{j^*}$，说明是同类，那么
$$p_{j^*}=p_{j^*}+\lambda(x_j-p_{j^*})$$
(a) 如果$y_i\neq t_{j^*}$，说明是不同类，那么
$$p_{j^*}=p_{j^*}-\lambda(x_j-p_{j^*})$$
$\lambda$是学习率
重复步骤2，逐渐减低学习率，直到原型的变化小于阈值，停止

DBSCAN (Density-Based Spatial Clustering of Application with Noise)

DBSCAN assumes clusters of similar density, and may have problems separating nearby clusters.
OPTICS is a DBSCAN variant that handles different densities much better.
OPTICS (Ordering points to identify the clustering structure)

三种类型的点

核心点 (core point)：在$\epsilon$ 距离的范围内至少包含minPts个点(包括自己)，则该点为核心点，这个$\epsilon$的范围称为该核心点的邻域。核心点可能落在多个核心点的邻域内。
边界点 (border point 或 reachable point)：边界点不是核心点，因为$\epsilon$ 距离范围内，点的个数少于minPts。但是它落在某个核心点的邻域内。同样，边界点也可能落在 [1,minPts) 个核心点的邻域内。
异常点 (outlier point 或 noise point)：对于任何点都不可达，则为outlier。
三种关系
(密度)直达 (directly density-reachable)：核心点直达自己邻域内的所有点(minPts个)。
(密度)可达 (density-reachable)：对于点x和点y，若存在一条路径p1, p2, …, pn，其中p1=x，pn=y，且pi直达pi+1，则称x可达y。这条路径上的所有点(除了终点pn)，都是核心点，终点pn是核心点或可达点。不满足对称性
(密度)相连 (density-connectedness)：对于点x和点y，存在点o，且o可达x，同时可达y，则称x和y密度相连。满足对称性。

关系的说明

每个簇至少包含一个core point
一个簇中的所有点是相连的。
如果一个点被一个簇中任意一点可达，那么该点也属于这个簇。
可达关系不是对称的，因为只有核心点才能作为始点，而边界点只能作终点(被可达)。
直达一定是可达，但可达不一定是直达，同样，直达也不是对称的(对于边界点)。

DBSCAN的簇定义

由密度可达关系导出的最大的密度相连的样例集合。

算法伪码

//// 返回聚类结果 C
// D 数据集
// eps 距离阈值
// minPts 邻域阈值
DBSCAN(D, eps, minPts) {
   C // 是一个动态的二维数组，(K, 样例集合)，第一维动态保存簇数，第二类保存簇中的样例点集合
   for each point p in dataset D {
      if p is visited
         continue
      mark p as visited
      neighborPts = regionQuery(p, eps)  // 一个集合，查询点p在eps范围内的样例
      if sizeof(neighborPts) < minPts
         mark p as NOISE // !!!
      else {
         expandCluster(p, neighborPts, C[k], eps, minPts)
         k = k + 1;
      }
   }
   return C
}
//// 计算p为核心的簇
// p 核心点
// neighborPts 核心点的邻域
// Ck 最终形成的簇的集合
// eps 距离阈值
// minPts 邻域阈值
expandCluster(p, neighborPts, Ck, eps, minPts) {
   add p to cluster Ck
   for each point p1 in neighborPts {   // neighborPts 用队列
      if p1 is not visited {
         mark p1 as visited
         neighborPts1 = regionQuery(p1, eps)
         if sizeof(neighborPts1) >= minPts
            neighborPts = neighborPts joined with neighborPts1  // 队列加到尾部
      }
      // 本函数里的点全部都属于簇Ck，只要不重复添加就好
      if p1 is not yet member of any cluster
         add p1 to cluster Ck
   }
}
// 返回点p为中心，在eps距离内的所有样例
regionQuery(p, eps){
  return all points within p's eps-neighborhood (including p)
}

例子

In this diagram, minPts = 4.
Point A and the other red points are core points, because the area surrounding these points in an ε radius contain at least 4 points (including the point itself). Because they are all reachable from one another, they form a single cluster.
Points B and C are not core points, but are reachable from A (via other core points) and thus belong to the cluster as well.
Point N is a noise point that is neither a core point nor density-reachable.

层次聚类

层次聚类可以自底向上，也可以自顶向下。

集合间的距离

层次聚类需要计算集合之间的距离，计算的方式有如下几种：

最小距离(Maxinum or single-linkage)：两集合的最小距离。
$$dist_{min}(X,Z)=min_{x\in Xz\in Z}dist(x,z)$$
最大距离(Minimum or complete-linkage)：两集合的最大距离。
$$dist_{max}(X,Z)=max_{x\in Xz\in Z}dist(x,z)$$
平均距离(Mean or average-linkage or UPGMA)：两集合所有距离的平均。
$$dist_{avg}(X,Z)=\frac{1}{|X||Z|}\sum_{x\in X}\sum_{z\in Z}dist(x,z)$$
中心距离(Centroid linkage or UPGMC)：两集合中心点的距离。
$$dist_{cen}(X,Z)=dist(\frac{1}{|X|}\sum_{x\in X}x,\frac{1}{|Z|}\sum_{z\in Z}z)$$
Hausdorff距离：两集合的最大最小距离。
$$dist_{hau}(X,Z)=max_{x\in X}min_{z\in Z}dist(x,z)$$

自底向上的层次聚类

将每个样例看成一个簇，然后找到距离最近的两个簇进行合并，不断重复，知道合并到规定的簇个数。

算法

输入：样本D={x1,x2,…,xN}，距离函数dist，聚类簇数K
输出：K个簇的聚类结果

初始化每个样例为一个簇$C_i=\{x_i\}$
计算簇之间的距离矩阵$M$，其中$M(i,j)=dist(C_i,C_j), i,j=1,2,…,N$
设置当前聚类簇的个数$q=N$
在距离矩阵中，找到最小距离对应的两个簇$C_{i^*},C_{j^*}$，合并$C_{i^*}=C_{i^*}\bigcup C_{j^*}$
删除簇$C_{j^*}$
修改距离矩阵M中与簇$C_{i^*}$有关的值，此时M的有效规模为N-1*N-1
q = q-1
如果q=K，停止，得到K个簇的聚类结果；
如果q>K，返回步骤4

异常点检测

In data mining, anomaly detection (also outlier detection) is the identification of items, events or observations which do not conform to an expected pattern or other items in a dataset.

从统计上来说

分位点划分，0.05到0.95之外的为异常值
高cook距离（D统计量）、高杠杆值

从模型上来说

异常点可以理解为那些模型预测效果不佳的观测点

小结

其实，异常点或异常趋势都会有导致的原因，只是很多相关的事件未知。所以，如果能找到导致异常的原因最好，找不到只能当做噪声，在预处理时去除。

Reference

https://en.wikipedia.org/wiki/Cluster_analysis
The Elements of Statistical Learning
https://en.wikipedia.org/wiki/DBSCAN

距离