d2_cluster: A Validated Method for Clustering EST and Full-Length cDNA Sequences

Appendix for online supplement: Representing d2_cluster as a minimal linkage algorithm.

D2_cluster as described above can be mapped to the minimal linkage algorithm commonly seen in the statistics and engineering texts. Define a discrete distance, d0, on sequences to be:

d0(A,B) = 0, if d2(A,B) < THRESHHOLD, and

d0(A,B) = 1, if d2(A,B) >= THRESHHOLD.

Linkage methods are usually presented in terms of distance matrices. Since there are initially N sequences/clusters, let D={d0(i,j)} be the N by N matrix of discrete distances. In agglomerative clustering the dimensionality of D reduces as sequences are clustered and in minimal linkage the distance between two clusters, X and Y is:

d0(X,Y) = min{ d0(Si,Sj) : all Si in X, all Sj in Y}.

D2_cluster described in terms of minimal linkage is now:

Initially all N clusters contain one sequence. D is an N by N matrix.

Search distance matrix for smallest distances between clusters (or, equivalently, take the first zero distance; search in any order).

When a zero distance is found at, say, row i column j (i<j), delete rows i and j and columns I and j. Replace them with a new row and column specifying the distance, d0 (again 1 or 0 as defined above), from the merged cluster to other clusters.

Classical minimal linkage specifies that steps 2 and 3 are repeated (N-1) times until joining information exists for all clusters. We stop when there exist no more zero entries in the distance matrix.

When the clustering is completed, the matrix D will have as many rows and columns as there are distinct clusters.

d2_cluster: A Validated Method for Clustering EST and Full-Length cDNA Sequences

This Article

Preprint Server

Current Issue

In This Issue