Multivariate data - classification

Approaches to the Analysis and Display of Multivariate Data

Ordination	*vs.*	Classification
(placing samples relative to continuous scales)	*vs.*	(placing samples into discontinuous categories)

Classification

There is a large number of contrasting algorithms available for the classification of samples.

One contrast is between hierarchical and reticulate classification.

A hierarchical classification is one that can be represented by means of a dendrogram: placing a sample within a class at a low level of the dendrogram automatically places it within the higher-level classes that contain that class. Reticulate classifications do not have this property. Hierarchical classifications are the more informative and the more often used; if you have a biological background you should be used to them! If you understand the relationships between the files making up the pages of a web site, that is essentially a reticulate classification, each unit being identified by its closest links.
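The nesting property can be illustrated with a short Python sketch. The class names and the `parent` table are hypothetical, invented for illustration only; the point is that naming the lowest-level class is enough to recover every higher-level class.

```python
# Hypothetical dendrogram, stored as child -> parent links.
parent = {
    "class 1a": "class 1",
    "class 1b": "class 1",
    "class 2a": "class 2",
    "class 1": "all samples",
    "class 2": "all samples",
}

def lineage(cls, parent):
    """Return the class followed by every higher-level class that contains it."""
    path = [cls]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path

# A sample placed in "class 1b" is automatically in "class 1" as well.
membership = lineage("class 1b", parent)
```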

A monothetic classification allocates items into classes according to their values for a single variable, in contrast to polythetic classifications which use many (usually all) variables.
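As a rough illustration of the difference, the Python sketch below uses a small, made-up presence/absence table (the sample names and values are assumptions, not real data). The monothetic function looks at one variable only; the polythetic measure compares samples on all variables at once.

```python
# Hypothetical presence/absence data: one row per sample, one column per species.
samples = {
    "s1": [1, 1, 0, 0],
    "s2": [1, 0, 1, 0],
    "s3": [0, 1, 1, 1],
    "s4": [0, 0, 1, 1],
}

def monothetic_split(data, variable):
    """Divide the samples using the value of a single variable."""
    present = sorted(name for name, row in data.items() if row[variable] == 1)
    absent = sorted(name for name, row in data.items() if row[variable] == 0)
    return present, absent

def polythetic_distance(a, b):
    """Count the variables on which two samples differ (all variables are used)."""
    return sum(x != y for x, y in zip(a, b))

with_sp0, without_sp0 = monothetic_split(samples, 0)
d_s1_s2 = polythetic_distance(samples["s1"], samples["s2"])
d_s1_s4 = polythetic_distance(samples["s1"], samples["s4"])
```

A polythetic method would then group samples by distances of this kind rather than by the state of any one species.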

There is also a contrast between agglomerative (i.e. lumping) and divisive (i.e. splitting) approaches. Agglomerative methods start with individual items and group them together in a series of steps; divisive methods start with the whole data set and progressively split it up to form the groups at lower levels of the dendrogram. The divisive approach is the preferred one, as it uses more of the information in the data set.
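The two directions of working can be sketched in Python on a single made-up variable. Both functions are deliberately simple assumptions chosen for illustration: the agglomerative one uses single-linkage merging, and the divisive one simply cuts the ordered data at its largest gap. Neither is the method of any particular package.

```python
def agglomerate(points, n_clusters):
    """Lumping: start with one cluster per item and merge the closest pair each step."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return sorted(sorted(c) for c in clusters)

def divide(points):
    """Splitting: start with the whole set and cut it once, at the largest gap."""
    ordered = sorted(points)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    cut = gaps.index(max(gaps)) + 1
    return ordered[:cut], ordered[cut:]

values = [1.0, 2.0, 6.0, 7.0, 15.0]  # hypothetical measurements
lumped = agglomerate(values, 2)
split = divide(values)
```

Note that the divisive function looks at the whole data set before making its first decision, whereas the first agglomerative merge depends only on the two closest items.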

Examples of different classification algorithms:

	Monothetic	Polythetic
Agglomerative		Cluster analysis
Divisive	Association Analysis	Indicator Species Analysis (TWINSPAN)

TWINSPAN

TWINSPAN stands for Two-Way INdicator SPecies ANalysis.

It is based on Reciprocal Averaging ordination (RA) and is best envisaged in terms of samples characterised by species' abundances.

RA can be summarised thus: samples are placed in order according to the abundances of the various species; the species are then assigned weights to correspond with the relative sample positions, and the sample scores are re-calculated. The samples are then placed in order according to the re-calculated scores, the species weights are re-calculated, the sample scores are re-calculated again, and so on in an iterative process. Eventually this settles down, with the samples in the best order according to their species composition and the species in the best order according to their occurrence in the samples.
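The alternating re-calculation can be sketched in a few lines of Python. This is a minimal illustration, not the code of any particular program: the abundance table is made up, the starting weights are arbitrary, the rescaling of sample scores to 0-100 at each pass is an assumed convention, and every sample and species is assumed to have a non-zero total.

```python
def reciprocal_averaging(abundance, n_iter=200, tol=1e-10):
    """Alternate between sample scores and species weights until they settle."""
    n_samples = len(abundance)
    n_species = len(abundance[0])
    species_scores = [float(j) for j in range(n_species)]  # arbitrary starting weights
    sample_scores = [0.0] * n_samples
    for _ in range(n_iter):
        # Sample score = abundance-weighted average of the species weights.
        new_scores = [
            sum(a * w for a, w in zip(row, species_scores)) / sum(row)
            for row in abundance
        ]
        # Rescale to 0-100 so the scores do not shrink towards a single value.
        lo, hi = min(new_scores), max(new_scores)
        new_scores = [100.0 * (s - lo) / (hi - lo) for s in new_scores]
        # Species weight = abundance-weighted average of the sample scores.
        species_scores = [
            sum(abundance[i][j] * new_scores[i] for i in range(n_samples))
            / sum(abundance[i][j] for i in range(n_samples))
            for j in range(n_species)
        ]
        change = max(abs(a - b) for a, b in zip(new_scores, sample_scores))
        sample_scores = new_scores
        if change < tol:
            break
    return sample_scores, species_scores

# Hypothetical abundances: rows are samples, columns are species.
abundance = [
    [5, 3, 0, 0],
    [3, 5, 1, 0],
    [0, 2, 5, 2],
    [0, 0, 3, 5],
]
sample_scores, species_scores = reciprocal_averaging(abundance)
```

With data of this kind, where species replace one another along a gradient, the final sample scores should place the samples in sequence along that gradient.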

Steps in TWINSPAN

1. Ordinate the samples by RA.
2. Find the best place ("centre of gravity") at which to split the data set into two.
3. Identify the species showing most difference in occurrence on the two sides (+ve and -ve) of the split - these are termed Indicator Species.
4. Use these species to do a "refined ordination" and verify the best split.
5. Calculate indicator scores for the samples (adding +1 for each +ve indicator species present and -1 for each -ve indicator species).

Repeat steps 1 - 5 for each of the sub-groups.

This process can then be repeated going down the dendrogram until the required number of classes is obtained.
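A single division can be sketched in Python as follows. This is a heavily simplified illustration under stated assumptions: the ordination scores are supplied ready-made rather than computed, the data are presence/absence only, the split is taken at the mean score, indicator species are chosen by difference in frequency of occurrence, and the refined ordination of step 4 is left out. The function names and data are hypothetical.

```python
def split_at_centroid(scores):
    """Step 2: assign each sample to the -ve or +ve side of the mean score."""
    centre = sum(scores) / len(scores)
    return [1 if s > centre else -1 for s in scores]

def indicator_species(presence, sides, n_indicators=2):
    """Step 3: pick the species whose frequency differs most between the sides."""
    neg = [row for row, side in zip(presence, sides) if side == -1]
    pos = [row for row, side in zip(presence, sides) if side == 1]
    diffs = []
    for j in range(len(presence[0])):
        freq_pos = sum(r[j] for r in pos) / len(pos)
        freq_neg = sum(r[j] for r in neg) / len(neg)
        diffs.append((freq_pos - freq_neg, j))
    ranked = sorted(diffs, key=lambda d: abs(d[0]), reverse=True)[:n_indicators]
    # Map species index -> +1 (positive indicator) or -1 (negative indicator).
    return {j: (1 if diff > 0 else -1) for diff, j in ranked}

def indicator_score(row, indicators):
    """Step 5: add +1 or -1 for each indicator species present in the sample."""
    return sum(sign for j, sign in indicators.items() if row[j])

# Hypothetical first-axis scores and presence/absence data for four samples.
scores = [0.0, 12.0, 71.0, 100.0]
presence = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 1],
    [0, 0, 1, 1],
]
sides = split_at_centroid(scores)
indicators = indicator_species(presence, sides)
sample_indicator_scores = [indicator_score(row, indicators) for row in presence]
```

Applying the same three functions to the samples on each side in turn would give the next level of the dendrogram.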

The splits between classes can be described in terms of (a) how "good" they are, i.e. how different are the resultant groups, and (b) indicator species.

Contrast with Ordination

A. The samples are placed into discrete categories (i.e. the classes or end-groups) rather than placed in sequence along a continuous axis. Note, however, that TWINSPAN produces an ordered classification, with a clear sequence to the classes usually evident (relative to the first axis of reciprocal averaging).

B. An ordination is restricted to the data set on which it was performed. In contrast, to some extent a classification can be applied to new samples. A classification such as TWINSPAN usually generates a key which can be applied to additional samples to place them within the defined classes (as long as their species composition is not too different from the analysed data set).
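The idea of applying a key to new samples can be sketched in Python. Everything in this example is hypothetical: the species names, the class labels, the thresholds and the layout of the key are invented for illustration and do not reproduce the output format of any program.

```python
# Hypothetical two-level key: each division lists its indicator species
# (+1 or -1) and the indicator score a sample must exceed to go to the +ve side.
key = {
    "indicators": {"sp_a": -1, "sp_d": 1},
    "threshold": 0,
    "negative": "class 1",
    "positive": {
        "indicators": {"sp_c": -1, "sp_e": 1},
        "threshold": 0,
        "negative": "class 2",
        "positive": "class 3",
    },
}

def classify(species_present, node):
    """Follow the key down the dendrogram until an end-group is reached."""
    while isinstance(node, dict):
        score = sum(
            sign for sp, sign in node["indicators"].items() if sp in species_present
        )
        node = node["positive"] if score > node["threshold"] else node["negative"]
    return node

new_sample_1 = {"sp_a", "sp_b"}
new_sample_2 = {"sp_d", "sp_e"}
class_1 = classify(new_sample_1, key)
class_2 = classify(new_sample_2, key)
```

A sample containing none of the indicator species scores zero at every division, which is one reason such a key is only reliable for samples similar in composition to those originally analysed.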
