DBSCAN的小试

2015-01-03 R

DBSCAN 是由一群统计学家在1996年的Second International Conference on Knowledge Discovery and Data Mining (KDD-96)提出并完善的一套算法, 算法本身是在讲一个基于密度的聚类,可以看做把相似密度的部分归于一类的无监督学习,可以参考将聚类聚类查看我总结的其他聚类方法.

DBSCAN

DBSCAN的定义是基于density reachability(密度可达性)实现的.p和q两个点称为直接密度可达是指p和q之间的距离小于一个给定的距离(\epsilon).称p,q是密度可达是指:存在一个序列 (p_1,\ldots,p_n)使得(p_1 = p)并且(p_n = q) 其中每一个 (p_{i+1})都和(p_i)相连通.

比如在下图中,B和C是密度可达的,但A和C不是密度可达的:

思路很像基于连通的聚类(层次聚类)是不是?

不同的聚类方法对于聚类的认识是有区别的,比如,kmeans认为每个类是有一个中心的,基于分布的聚类会认为每个类的数据是服从某种分布的(GMM),而对于DBSCAN,这个算法对于聚类的看法是这样的.对于一个聚类,具有以下两个性质(DBSCAN的游戏规则):

一个聚类内的所有点都是密度可达的.

如果一个点对于聚类中每个点都是密度可达的,那这个点就是这个类中的一个.

DBSCAN的优势在于可以识别出非线性的聚类.而这是不能被kmeans或者GMM识别出来的. 先来show一下DBSCAN的聚类效果.

首先是demo codes.一些信噪比很高的数据,基本上没什么好介绍的.

library(fpc)
  set.seed(123)
  n <- 600
  x1 <- cbind(runif(10, 0, 10)+rnorm(n, sd=0.2), runif(10, 0, 10)+rnorm(n,
    sd=0.2))
  ds <- dbscan(x1, 0.2)
# run with showplot=1 to see how dbscan works.
  ds

## dbscan Pts=600 MinPts=5 eps=0.2
##         0  1  2  3  4  5  6  7  8  9 10
## border 16  6  9  6  3  5  4  4  6  7  7
## seed    0 52 51 53 57 53 54 54 51 53 49
## total  16 58 60 59 60 58 58 58 57 60 56

  plot(ds, x1)

plot of chunk unnamed-chunk-1

然后是一个基于wiki的数据,数据是类似这样的,据wiki上的资料,说这样的数据不能被kmeans或者是GMM来识别出来,用R生成类似的数据来试一试这个说法的真实性吧:

library(ggplot2)
library(mclust)

## Warning: package ‘mclust’ was built under R version 3.1.2

## Package ‘mclust’ version 4.4
## Type ‘citation("mclust")’ for citing this R package in publications.

library(gridExtra)

## Loading required package: grid

library(fpc)

simulate_plot <- function(epsilon=0.08,eps=0.09){
  set.seed(123)
y1 = runif(100)
x1 = -(y1-0.3)^2-1+runif(100,0,0.3)
y2 = rnorm(100,0.4,epsilon)
x2 = rnorm(100,-1.3,epsilon)
true = as.factor(c(rep(1,100),rep(2,100)))
data = data.frame(x=c(x1,x2),y=c(y1,y2))
## kmeans
cl = kmeans(data,centers = 2)
## GMM
cl2 = clustCombi(data,)
##DBSCAN
cl3 = dbscan(data, eps)

data1=cbind(data,true)

p1<-ggplot(data=data1)+
  geom_point(aes(x,y,color=true))+
      labs(title="True clustering")+
      theme(legend.position = "none")


data2=cbind(data,col=cl$cluster)
p2<-ggplot(data=data2)+
      geom_point(aes(x,y,color=as.factor(col)))+
      labs(title="Kmeans clustering")+
      theme(legend.position = "none")

data3=cbind(data,col=cl2$classification[[2]])
p3<-ggplot(data=data3)+
      geom_point(aes(x,y,color=as.factor(col)))+
      labs(title="GMM clustering")+
      theme(legend.position = "none")

data4=cbind(data,col=cl3$cluster)
p4<-ggplot(data=data4)+
      geom_point(aes(x,y,color=as.factor(col)))+
      labs(title="DBSCAN clustering")+
      theme(legend.position = "none")

grid.arrange(p1,p2,p3,p4,nrow=2)
}


simulate_plot()

plot of chunk unnamed-chunk-2

simulate_plot(0.1,eps=0.082)

plot of chunk unnamed-chunk-2

DBSCAN需要两个参数来开始执行运算,一个是eps代表(\epsilon) 最短距离,MinPts代表一个聚类中最小点数.伪代码算法如下:

## pseudocode
DBSCAN(D, eps, MinPts)
   C = 0
   for each unvisited point P in dataset D
      mark P as visited
      NeighborPts = regionQuery(P, eps)
      if sizeof(NeighborPts) < MinPts
         mark P as NOISE
      else
         C = next cluster
         expandCluster(P, NeighborPts, C, eps, MinPts)

expandCluster(P, NeighborPts, C, eps, MinPts)
   add P to cluster C
   for each point P’ in NeighborPts
      if P’ is not visited
         mark P’ as visited
         NeighborPts’ = regionQuery(P’, eps)
         if sizeof(NeighborPts’) >= MinPts
            NeighborPts = NeighborPts joined with NeighborPts’
      if P’ is not yet member of any cluster
         add P’ to cluster C

regionQuery(P, eps)
   return all points within P’s eps-neighborhood (including P)

DBSCAN的小试

Saturday, January 03, 2015

DBSCAN