SIMSTAT: Multivariate Statistics

Correlations
There are two sub-options: Pearson for normally distributed data, or Kendall tau together with Spearman rank otherwise. It is assumed that only modest numbers of samples are analysed at a time (say 10), otherwise the output matrices of r-values and p-values may overflow.
Use the Pearson method only if you are sure that the columns of data are normally distributed. Note that when plotting pairwise correlations after using this option, Simfit gives you the two regression lines of y on x and x on y, since the choice of axes is arbitrary and neither line alone is representative of the correlation. Similar regressions of y on x and x on y (lines almost parallel) indicate strong correlation, but orthogonal lines (almost at right angles) indicate the absence of any (parametric, linear, normal) correlation. If you just want one line, it should be the reduced major axis or possibly the orthogonal line, both of which are also available, since these allow for variation in both x and y.
For nonparametric correlations you should use the Kendall tau and Spearman rank options, which check for monotonicity, i.e. monotonic (possibly nonlinear) association rather than just linear correlation.
Note that for the matrix or library file provided, r-values and two-tail p-values are calculated for all possible pairwise correlations. Try the test library file npcorr.tfl.
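As an illustration of the three correlation measures described above, here is a minimal sketch using SciPy rather than Simfit itself (the data are artificial, generated for the example):

```python
import numpy as np
from scipy import stats

# Artificial data: y is linearly related to x plus noise.
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 0.8 * x + rng.normal(scale=0.5, size=50)

r_pearson, p_pearson = stats.pearsonr(x, y)   # assumes normality
rho, p_spearman = stats.spearmanr(x, y)       # rank-based, monotonic
tau, p_kendall = stats.kendalltau(x, y)       # rank-based, monotonic
```

The two-tail p-values test the null hypothesis of no association; the rank-based measures remain valid when the data are not normally distributed.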

Canonical correlation
This technique is used when columns of a data matrix fall naturally into one of two groups, say X and Y. Canonical coordinates are calculated to maximise the correlation between the two groups and the minimum number of components needed to represent the data can be decided from a scree diagram (or a chi-square test if a multivariate normal distribution is assumed). The loadings can be used to plot a scatter diagram for the two groups projected onto chosen canonical variate axes, but it should be remembered that the transformation is not orthogonal, i.e. it is not a rotation so distances are not preserved. Try test file matrix.tf5, taking columns 1 and 2 as X and columns 3 and 4 as Y.
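The canonical correlations themselves can be computed by whitening each centred block and taking singular values, as in this NumPy sketch (one standard method, not necessarily the algorithm Simfit uses):

```python
import numpy as np

def cca(X, Y):
    # Canonical correlations via QR whitening of each centred block,
    # then the singular values of the cross-product of the Q factors.
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Qx, _ = np.linalg.qr(Xc)
    Qy, _ = np.linalg.qr(Yc)
    return np.linalg.svd(Qx.T @ Qy, compute_uv=False)

# Example: the first column of Y is a copy of the first column of X,
# so the leading canonical correlation should be numerically 1.
rng = np.random.default_rng(3)
X = rng.normal(size=(30, 2))
Y = np.column_stack([X[:, 0], rng.normal(size=30)])
canon = cca(X, Y)
```

All canonical correlations lie between 0 and 1, and are returned in decreasing order.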

Partial Correlations
For data sets with more than two variables, partial correlation coefficients along with significance tests and confidence limits can be calculated either following Pearson analysis, by reading in a data matrix, or by reading in a correlation matrix directly.
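The partial correlations controlling for all remaining variables can be obtained directly from the inverse of the correlation matrix, as in this NumPy sketch (an illustration of the standard formula, not Simfit's own code):

```python
import numpy as np

def partial_corr(R):
    # Pairwise partial correlations, each controlling for all the
    # remaining variables, from the inverse of the correlation matrix.
    P = np.linalg.inv(R)
    d = np.sqrt(np.diag(P))
    pc = -P / np.outer(d, d)
    np.fill_diagonal(pc, 1.0)
    return pc

# Example: r12 = r13 * r23, so the correlation between variables 1 and 2
# vanishes once variable 3 is controlled for.
R = np.array([[1.0, 0.49, 0.7],
              [0.49, 1.0, 0.7],
              [0.7, 0.7, 1.0]])
pc = partial_corr(R)
```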

Multivariate cluster analysis
You input a data matrix with n rows (cases) and m columns (variables), then the program calculates generalised distances, i.e. a dissimilarity matrix, which is output in strict lower triangular form. The pre-analysis transformation, variables to be excluded, distance model, scaling, and link algorithm are selected, and a cluster diagram or dendrogram can be plotted to illustrate the clusters. Dendrograms use labels which you can paste as a column at the end of the data file, as will be clear from the test file cluster.tf1.
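The sequence of steps (dissimilarity matrix, link algorithm, cluster assignment) can be sketched with SciPy as follows, using artificial data in place of a Simfit test file:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Two well-separated groups of five cases each (artificial data).
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0.0, 0.3, size=(5, 2)),
                  rng.normal(5.0, 0.3, size=(5, 2))])

d = pdist(data, metric='euclidean')   # condensed dissimilarity matrix
Z = linkage(d, method='average')      # the link algorithm
labels = fcluster(Z, t=2, criterion='maxclust')
```

The linkage matrix Z encodes the dendrogram, and scipy.cluster.hierarchy.dendrogram(Z) would plot it with case labels.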

Multivariate K-means cluster analysis
First you input a data matrix with n rows (cases) and m columns (variables), then you input a matrix with KMAX rows (KMAX < n) and m columns which are the starting estimates, i.e. the cluster centres from which to start to cluster the data. The algorithm is iterative and proceeds by moving data points between clusters to minimise the sum of squared distances from the points to the cluster centres. After each iteration the cluster centroids are re-calculated, and the algorithm will stop if an empty cluster is created, or if the maximum number of iterations is exceeded. The starting clusters and K, the number of clusters (2 <= K <= KMAX < n), can be altered interactively, and weights to account for differing numbers of replicates can also be supplied.
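The iteration just described can be sketched in NumPy (a minimal unweighted version, assuming Euclidean distances, not Simfit's own implementation):

```python
import numpy as np

def kmeans(X, centres, max_iter=100):
    # Lloyd-type iteration: assign each case to its nearest centre,
    # recompute the centroids, and stop when nothing moves (or when
    # an empty cluster is created, as in the description above).
    for _ in range(max_iter):
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        if len(set(assign)) < len(centres):
            raise RuntimeError("empty cluster created")
        new = np.array([X[assign == k].mean(axis=0)
                        for k in range(len(centres))])
        if np.allclose(new, centres):
            return new, assign
        centres = new
    return centres, assign

# Two obvious clusters, with rough starting centres.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [9, 9], [9, 10], [10, 9], [10, 10]], dtype=float)
centres, assign = kmeans(X, np.array([[2.0, 2.0], [8.0, 8.0]]))
```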

Classical metric and non-metric scaling
When a distance matrix has been calculated, as for dendrogram creation, several techniques are provided to view the results. For instance, classical metric scaling (i.e. principal coordinate analysis) can be done. This finds a set of coordinates that best represents the differences in a Euclidean space of arbitrary dimensions so that, if such a representation exists, the difference between the cases can be visualised. If the distance matrix was calculated for ordinal data for which only ranks are important, not Euclidean distances, non-metric scaling can be done to calculate the STRESS or SSTRESS functions. This requires an optimisation technique, so options to use random starting estimates are provided, to ensure that a global minimum has been located. The coordinates resulting from application of these methods can be plotted in two or three dimensions, with or without labels, so that the differences between multivariate cases can be assessed.
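Classical metric scaling (principal coordinate analysis) amounts to double-centring the squared distance matrix and taking the leading eigenvectors, as in this NumPy sketch (non-metric scaling, which needs an optimiser for STRESS/SSTRESS, is not shown):

```python
import numpy as np

def classical_mds(D, k=2):
    # Principal coordinate analysis: double-centre the squared
    # distance matrix, then scale the top-k eigenvectors.
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:k]
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))

# If the distances really are Euclidean distances between points in
# 2 dimensions, a 2-dimensional representation recovers them exactly.
rng = np.random.default_rng(4)
pts = rng.normal(size=(6, 2))
D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
Y = classical_mds(D, k=2)
```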

Principal components analysis
You input a data matrix with n rows (cases) and m columns (variables) and the program performs a principal components analysis according to several options that are provided to control the type of results required. The eigenvalues, loadings and scores are then tabulated, statistics are calculated and scree diagrams can be plotted to assist in the choice of the minimum number of components required to adequately represent the data, and selected scores can be plotted as scatter diagrams.
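The eigenvalues, loadings and scores mentioned above can be obtained from the singular value decomposition of the centred data matrix, as in this NumPy sketch (covariance-based PCA; Simfit's options also allow other variants):

```python
import numpy as np

def pca(X):
    # Principal components from the SVD of the centred data matrix.
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    eigenvalues = s ** 2 / (len(X) - 1)  # component variances
    scores = U * s                       # cases in component space
    loadings = Vt.T                      # variable loadings
    return eigenvalues, loadings, scores

rng = np.random.default_rng(5)
X = rng.normal(size=(20, 4))
ev, loadings, scores = pca(X)
```

A scree diagram is simply a plot of ev against component number; the components are returned in decreasing order of variance.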

Procrustes analysis
Two loading matrices X and Y from principal components analysis, or any two arbitrary matrices with the same dimensions, can be translated, rotated, and scaled in order to estimate how close matrix Y is to matrix X.
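SciPy provides an ordinary Procrustes analysis that illustrates the idea (the disparity is the residual sum of squares after the optimal translation, rotation and scaling; this is not Simfit's own routine):

```python
import numpy as np
from scipy.spatial import procrustes

# Y is an exact rotated, scaled and translated copy of X, so the
# Procrustes disparity should be numerically zero.
rng = np.random.default_rng(2)
X = rng.normal(size=(10, 3))
theta = 0.4
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0, 0.0, 1.0]])
Y = 2.0 * X @ R + 1.5

mtx1, mtx2, disparity = procrustes(X, Y)
```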

Varimax/Quartimax rotation
A loading matrix, e.g. from factor or canonical variates analysis, is input and can then be rotated according to the Varimax or Quartimax procedures.
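A common way to implement Varimax is by iterated SVD updates of the rotation matrix, as in this NumPy sketch (one standard algorithm, not necessarily the one Simfit uses; Quartimax differs only in the simplicity criterion):

```python
import numpy as np

def varimax(L, max_iter=100, tol=1e-8):
    # Varimax rotation of a loading matrix by iterated SVD updates.
    p, k = L.shape
    R = np.eye(k)
    var = 0.0
    for _ in range(max_iter):
        B = L @ R
        U, s, Vt = np.linalg.svd(
            L.T @ (B ** 3 - B @ np.diag((B ** 2).sum(axis=0)) / p))
        R = U @ Vt
        if s.sum() < var * (1.0 + tol):
            break
        var = s.sum()
    return L @ R, R

rng = np.random.default_rng(6)
L = rng.normal(size=(8, 3))
Lr, R = varimax(L)
```

Because R is orthogonal, the rotation leaves the communalities (row sums of squared loadings) unchanged.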

Multivariate analysis of variance (MANOVA)
Groups and subgroups of multivariate normal data can be analysed for equality of covariance matrices (Box's test), for equality of mean vectors (Wilks' lambda, Roy's largest root, Lawley-Hotelling trace, and Pillai trace), and profile analysis for repeated measurements can be done on selected groups by using a transformation matrix followed by a Hotelling's T-squared test.

Canonical variates analysis
Multivariate groups can be compared using canonical variates or principal components, and extra data such as other group means for comparison can be added to the data file (see manova1.tf4), for plotting with group means and confidence regions to assign the extra data to existing groups.

Distances between groups
The squared Mahalanobis distances between group means and between group means and individual samples can be calculated using pooled or individual covariance matrix estimates. Allocation of samples to groups based on these distances and using known groups as a training set can be performed, using estimative or predictive Bayesian methods.
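The squared Mahalanobis distance and the resulting allocation rule can be sketched in NumPy (using a single pooled covariance matrix for illustration; the example values are artificial):

```python
import numpy as np

def mahalanobis_sq(x, mean, cov):
    # Squared Mahalanobis distance of a sample from a group mean.
    diff = x - mean
    return float(diff @ np.linalg.solve(cov, diff))

# Allocate a sample to the group whose mean is nearest in the
# Mahalanobis metric, with a pooled covariance matrix assumed.
pooled = np.array([[1.0, 0.3], [0.3, 1.0]])
means = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]
sample = np.array([0.5, 0.2])
d2 = [mahalanobis_sq(sample, m, pooled) for m in means]
group = int(np.argmin(d2))
```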

Factor analysis
Either a multivariate data matrix, or else a covariance or correlation matrix calculated from such data can be input. The number of factors is selected, and the sample size is provided if a covariance or correlation matrix has been supplied. Then the loadings and communalities are calculated, together with the score coefficients after varimax or quartimax rotation if required.

Biplots
A singular value decomposition is performed on any arbitrary matrix and, if the rank is at least three, a residual matrix is calculated. Biplots can then be displayed for either general, row emphasized, or column emphasized cases and, in addition, arbitrary row or column scaling factors can be used to stretch or reflect the vectors in order to improve interpretation.
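The row and column markers of a rank-2 biplot come directly from the singular value decomposition, as in this NumPy sketch (the alpha parameter is the usual device for moving the emphasis between rows and columns):

```python
import numpy as np

def biplot_coords(X, alpha=1.0):
    # Rank-2 biplot markers from the SVD: alpha = 1 emphasises rows,
    # alpha = 0 emphasises columns, intermediate values share the
    # singular values between the two sets of markers.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    row_pts = U[:, :2] * s[:2] ** alpha
    col_pts = Vt.T[:, :2] * s[:2] ** (1.0 - alpha)
    return row_pts, col_pts

rng = np.random.default_rng(7)
X = rng.normal(size=(10, 4))
rows, cols = biplot_coords(X, alpha=0.5)
```

For any alpha, the inner products of the row and column markers reproduce the best rank-2 approximation to X.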

Back to Help Menu or End Help