CLUSTER is a sublibrary of Fortran subroutines for cluster analysis and related line printer graphics. It includes routines for clustering variables and/or observations using algorithms such as direct joining and splitting, Fisher's exact optimization, single-link, K-means, and minimum mutation, as well as routines for estimating missing values. The subroutines in CLUSTER are described in the book "Clustering Algorithms" by J. A. Hartigan (see the references at the end of this document).

Major modifications that have been made to the original code found in the book include:

 - The elimination of Hartigan's "bordered matrix". The data matrix has
   been placed in a separate matrix, the borders have been placed in
   their own character vectors, and the title has been placed in its
   own character variable.
 - The amalgamation of extra vectors and matrices into work vectors and
   matrices for shorter call sequences.
 - The incorporation of setup and plotting subroutines into the primary
   subroutines, rather than having them explicitly called from the
   user's main program. Results are automatically produced on a
   user-specified unit.

A cluster is a set of similar objects. One principal function of clustering is to summarize data by referring to properties of clusters rather than to properties of individual objects: if some objects in a cluster have a certain property, the other objects in the cluster are expected to have the same property.

The standard data structure of the input is a set of observations, on each of which a number of variables is measured. Some routines require the values to be categorical; others allow continuous data.

Two clustering structures are used: the partition and the tree. A partition is a family of clusters such that each observation lies in exactly one member of the partition. A tree is a family of clusters, including the set of all observations, in which any two clusters are either disjoint or one includes the other. This structure can be drawn as a tree with clusters as the nodes, where the direct ancestor of a node is the smallest cluster that contains it and the cluster of all observations is at the root.

The types of clustering algorithms used are:

 Sorting   -- An important variable is chosen somehow, and the
              observations are partitioned according to the values
              taken by this variable. Within each cluster of the
              partition, further partitioning takes place according to
              further important variables. Example: SPARSE
 Switching -- An initial partition of the observations is given, and
              new partitions are obtained by switching an observation
              from one cluster to another, with the algorithm
              terminating when no further switch improves some
              criterion (a minimal sketch of this idea follows the
              list). Examples: BUILD, MIX, and DITTO
 Joining   -- Initially each observation is a cluster; the closest pair
              of clusters is joined to form a new cluster, and this
              step is repeated until a single cluster containing all
              the original observations is obtained. Examples: SLINK,
              JOIN, JOIN2, and DOT
 Splitting -- Initially all observations form one cluster; a cluster is
              then chosen according to some criterion and split into
              smaller clusters. Examples: SPLIT1, SPLIT2, and MIXIND
 Adding    -- A clustering structure (partition or tree) already
              exists, and each object in turn is added to the closest
              cluster by some criterion. Examples: QUICK, LETREE,
              SEARCH, and LINK
 Searching -- Search over a subset of all possible clusters for the
              optimal one. Example: FISH
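To make the switching family concrete, here is a minimal sketch of a K-means style iteration in free-form Fortran: each observation is switched to its nearest center, then each center is relocated to its cluster mean (a fixed pass count stands in for a convergence test). This is illustrative code written for this note, with toy data and seeding chosen arbitrarily; it is not the library's BUILD routine, whose call sequence is documented separately.

   ! Illustrative sketch only -- not the library's BUILD routine.
   program kmeans_sketch
      implicit none
      integer, parameter :: n = 6, p = 2, k = 2
      real    :: x(n,p), c(k,p), d, best
      integer :: member(n), nmem(k)
      integer :: i, j, m, pass

      ! Toy data (an assumption for this example): two obvious groups.
      x = reshape((/ 0.0, 0.1, 0.2, 5.0, 5.1, 5.2,            &
                     0.0, 0.2, 0.1, 5.0, 5.2, 5.1 /), (/ n, p /))

      c(1,:) = x(1,:)      ! seed the centers with two observations
      c(2,:) = x(4,:)

      do pass = 1, 10
         ! Switching step: move each observation to its closest center.
         do i = 1, n
            best = huge(1.0)
            do j = 1, k
               d = sum((x(i,:) - c(j,:))**2)
               if (d < best) then
                  best = d
                  member(i) = j
               end if
            end do
         end do
         ! Relocation step: set each center to its cluster mean.
         nmem = 0
         c = 0.0
         do i = 1, n
            m = member(i)
            nmem(m) = nmem(m) + 1
            c(m,:) = c(m,:) + x(i,:)
         end do
         do j = 1, k
            if (nmem(j) > 0) c(j,:) = c(j,:) / real(nmem(j))
         end do
      end do

      print *, 'memberships:', member
   end program kmeans_sketch

Within the library, the two steps of this loop correspond to what ASSIGN and RELOC do (see MISCELLANEOUS ROUTINES below): one assigns an observation to its closest cluster, the other sets each cluster center equal to the cluster mean.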
CLUSTERING ROUTINES
-------------------

Unconstrained Partitions (non-nested)

  QUICK   produces a quick partition based on a threshold
  BUILD   creates clusters by the K-means algorithm
  MIX     creates clusters by fitting a maximum likelihood criterion to
          the mixture model under various constraints on the
          within-cluster covariances
  MIXIND  creates clusters by fitting the mixture model where the
          variables have constant variances and zero within-cluster
          covariances
  DITTO   clusters category data to maximize the matches between
          observations in a cluster and the cluster mode
  SPLIT1  clusters observations within each variable by splitting the
          cluster with the largest variance at each iteration
  SPLIT2  clusters the data matrix by splitting both observations and
          variables in the cluster with the largest variance at each
          iteration; contains user-controlled constraint options
  JOIN2   clusters the data matrix by joining either the observations
          or the variables of the two closest clusters at each
          iteration
  SPARSE  clusters variables to approximate the loading matrix for
          factor analysis

Trees (nested)

  LETREE  creates a tree of clusters based on a set of threshold
          variances
  SEARCH  creates clusters by finding the tree which predicts the most
          correct triads (expensive to run)
  SLINK   creates a tree of clusters by the single-linkage algorithm
  JOIN    creates a tree of clusters of observations by joining the two
          closest clusters at each iteration
  DOT     creates a tree of clusters of category data using minimum
          mutation fits to the variables
  LINK    creates a tree by adding observations which minimize the sum
          of the link distances

Constrained

  FISH    partitions a sequence of observations into subsequences by
          Fisher's method of exact optimization (a sketch of the
          underlying dynamic program appears after this section)
  SPLIT2  clusters the data matrix by splitting both observations and
          variables in the cluster with the largest variance at each
          iteration; contains user-controlled constraint options

The cluster approach to prediction is to partition the data into clusters according to the variable to be predicted, assign a new observation to one of the clusters, and use the value associated with that cluster as the predicted value of the variable for that observation.

PREDICTION ROUTINES
-------------------

  AID     uses the automatic interaction detection technique to split
          the data so as to best predict a variable for a new
          observation
  VARCO   applies the variance components algorithm to a set of
          clusters to assign values for the mean and variance of the
          predicted variable in each cluster

Profiles can be used as an inexpensive method of suggesting possible clusters, or they may suggest reasonable weights for variables. Profiles are best described as histograms on each variable, connected between variables by identifying observations.

One model for clusters in multivariate data is that the data are sampled from a density with many modes, one mode for each cluster. Methods of estimating multivariate densities may therefore be converted to clustering techniques. Univariate, bivariate, and multivariate histograms are useful in suggesting modes. Another approach is to represent each observation as a three-dimensional box; if the observations are ordered by some other clustering technique, the similarity of neighboring boxes suggests or denies clusters.
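The constrained routine FISH is based on Fisher's method of exact optimization, which finds the partition of an ordered sequence into a given number of contiguous subsequences by dynamic programming. The free-form Fortran sketch below, written for this note, illustrates that dynamic program on toy data; it is not the library's FISH routine, and the within-group sum of squares cost and the data are assumptions for the example.

   ! Illustrative sketch only -- not the library's FISH routine.
   program fisher_sketch
      implicit none
      integer, parameter :: n = 8, k = 3
      real    :: x(n), cost(n,n), f(n,k), s, ss, trial
      integer :: cut(n,k), i, j, g

      ! Toy ordered sequence (an assumption for this example).
      x = (/ 1.0, 1.2, 0.9, 5.0, 5.3, 9.8, 10.1, 10.0 /)

      ! cost(i,j): within-group sum of squares of x(i..j).
      do i = 1, n
         s  = 0.0
         ss = 0.0
         do j = i, n
            s  = s + x(j)
            ss = ss + x(j)**2
            cost(i,j) = ss - s*s / real(j - i + 1)
         end do
      end do

      ! f(j,g): least total cost of cutting x(1..j) into g subsequences.
      do j = 1, n
         f(j,1)   = cost(1,j)
         cut(j,1) = 1
      end do
      do g = 2, k
         do j = g, n
            f(j,g) = huge(1.0)
            do i = g, j
               trial = f(i-1,g-1) + cost(i,j)
               if (trial < f(j,g)) then
                  f(j,g)   = trial
                  cut(j,g) = i        ! last subsequence starts at i
               end if
            end do
         end do
      end do

      ! Recover the optimal boundaries by walking the cuts backwards.
      j = n
      do g = k, 1, -1
         i = cut(j,g)
         print *, 'group', g, ': observations', i, 'through', j
         j = i - 1
      end do
   end program fisher_sketch

Because the observations must stay in sequence, only the cut points need to be searched; that is what makes exact optimization affordable for constrained partitions while it is impractical for unordered data.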
STATISTICAL PLOTTING ROUTINES
-----------------------------

  PROF    constructs linearly optimal profiles of the variables
  RANK    constructs rank profiles of the variables
  LINE    constructs line profiles of the cases
  HIST    constructs a univariate histogram
  MODAL   constructs a bivariate block histogram for a pair of
          variables
  MHIST   constructs multivariate block histograms
  BOXES   constructs multivariate plots of the observations in the form
          of three-dimensional boxes
  PLOT    constructs a scatter plot of a sequence of paired variables
  PMANY   constructs a matrix of pairwise plots of the variables

MISCELLANEOUS ROUTINES
----------------------

  TWO     computes overall means and covariances, replacing missing
          values by the overall mean
  WCOV    computes within-cluster covariances
  DIST    computes the Euclidean distance between two observations or
          variables (see the sketch at the end of this document)
  WDIST   computes the Mahalanobis distance between two observations
  MISS    estimates missing values by the cluster mean
  STAND   standardizes each variable to have zero mean and unit
          variance (see the sketch at the end of this document)
  SCALE   discretizes the data into classes
  ASSIGN  assigns an observation to its closest cluster
  RELOC   sets each cluster center equal to the cluster mean
  INVERT  computes the inverse of a covariance matrix and its
          determinant
  IN      reads a data matrix with appropriate labels
  OUT     outputs a data matrix with appropriate labels

NOTE: "The subroutines in 'Clustering Algorithms' were written as research tools. Speed of execution and storage requirements are not necessarily optimized. These subroutines should NOT be used routinely to analyze large data sets."

Programs giving examples of each of the above routines (except IN, which every sample program uses) exist in the file SAMPLE under the CMLIB account. The commands

   ATTACH,SAMPLE/UN=CMLIB.
   GTR,SAMPLE,<lfn>.<subroutine>.

will place a program which calls <subroutine> into the local file <lfn>. All sample programs use the CLUSTER subroutine IN, which expects input data on Fortran unit 5 (see the documentation of IN for the format of the input data file) and places output on a user-specified unit. The sample data set used in each sample program can be obtained with the commands:

   ATTACH,SMPIN/UN=CMLIB.
   GTR,SMPIN,TAPE5.<subroutine>.

All tests were run on the data set in the SMPIN file. The corresponding output can be found in the SMPOUT file. The commands

   ATTACH,SMPOUT/UN=CMLIB.
   GTR,SMPOUT,<lfn>.<subroutine>.

will place the output from the sample program which calls <subroutine> into the local file <lfn>.

Parts of the above text were taken from the following references (the last two are available from Greg Rhoads of the Scientific Computing Division, FTS 921-3395):

   Hartigan, J. A. (1975). Clustering Algorithms. John Wiley & Sons,
   New York.

   Hartigan, J. A. (1975). Printer Graphics for Clustering. Journal of
   Statistical Computation and Simulation, 4, 187-213.

   Dallal, Gerard E. A User's Guide to J. A. Hartigan's "Clustering
   Algorithms". Yale University.
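To make the simplest of the miscellaneous routines concrete, the free-form Fortran sketch below, written for this note, shows the kind of computation STAND and DIST perform: each variable is standardized to zero mean and unit variance, and the Euclidean distance between two observations is then computed. The toy data and the sample-variance divisor n-1 are assumptions; consult the individual routine documentation for the library's actual conventions and call sequences.

   ! Illustrative sketch only -- not the library's STAND or DIST.
   program stand_dist_sketch
      implicit none
      integer, parameter :: n = 4, p = 3
      real    :: x(n,p), xmean, sd, d
      integer :: j

      ! Toy data matrix, observations in rows (an assumption).
      x = reshape((/  1.0,  2.0,  3.0,  4.0,                  &
                     10.0, 20.0, 30.0, 40.0,                  &
                      0.1,  0.2,  0.3,  0.4 /), (/ n, p /))

      ! STAND-like step: give each variable zero mean, unit variance
      ! (the sample-variance divisor n-1 is an assumption here).
      do j = 1, p
         xmean = sum(x(:,j)) / real(n)
         sd    = sqrt(sum((x(:,j) - xmean)**2) / real(n - 1))
         if (sd > 0.0) x(:,j) = (x(:,j) - xmean) / sd
      end do

      ! DIST-like step: Euclidean distance between observations 1 and 2.
      d = sqrt(sum((x(1,:) - x(2,:))**2))
      print *, 'distance between observations 1 and 2:', d
   end program stand_dist_sketch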

Last updated: August 18, 1996