Appendix for online supplement: Representing d2_cluster as a minimal linkage algorithm.
D2_cluster as described above can be mapped to the minimal linkage algorithm commonly seen in the statistics and engineering texts. Define a discrete distance, d0, on sequences to be:
d0(A,B) = 0, if d2(A,B) < THRESHHOLD, and
d0(A,B) = 1, if d2(A,B) >= THRESHHOLD.
Linkage methods are usually presented in terms of distance matrices. Since there are initially N sequences/clusters, let D={d0(i,j)} be the N by N matrix of discrete distances. In agglomerative clustering the dimensionality of D reduces as sequences are clustered and in minimal linkage the distance between two clusters, X and Y is:
d0(X,Y) = min{ d0(Si,Sj) : all Si in X, all Sj in Y}.
D2_cluster described in terms of minimal linkage is now:
|
|
|
|
When the clustering is completed, the matrix D will have as many rows and columns as there are distinct clusters.