d2_cluster: A Validated Method for Clustering EST and Full-Length cDNA Sequences

Appendix for online supplement: Representing d2_cluster as a minimal linkage algorithm.

D2_cluster as described above can be mapped to the minimal linkage algorithm commonly seen in the statistics and engineering texts. Define a discrete distance, d0, on sequences to be:

d0(A,B) = 0, if d2(A,B) < THRESHHOLD, and

d0(A,B) = 1, if d2(A,B) >= THRESHHOLD.

Linkage methods are usually presented in terms of distance matrices. Since there are initially N sequences/clusters, let D={d0(i,j)} be the N by N matrix of discrete distances. In agglomerative clustering the dimensionality of D reduces as sequences are clustered and in minimal linkage the distance between two clusters, X and Y is:

d0(X,Y) = min{ d0(Si,Sj) : all Si in X, all Sj in Y}.

D2_cluster described in terms of minimal linkage is now:

  1. Initially all N clusters contain one sequence. D is an N by N matrix.
  • Search distance matrix for smallest distances between clusters (or, equivalently, take the first zero distance; search in any order).
  • When a zero distance is found at, say, row i column j (i<j), delete rows i and j and columns I and j. Replace them with a new row and column specifying the distance, d0 (again 1 or 0 as defined above), from the merged cluster to other clusters.
  • Classical minimal linkage specifies that steps 2 and 3 are repeated (N-1) times until joining information exists for all clusters. We stop when there exist no more zero entries in the distance matrix.

When the clustering is completed, the matrix D will have as many rows and columns as there are distinct clusters.

This Article

  1. doi: 10.1101/gr.9.11.1135 Genome Res. November 1, 1999 vol. 9 no. 11 1135-1142

Preprint Server