SIMSTAT: Multivariate Statistics
Correlations
There are two sub-options: Pearson for normally distributed data, or
Kendall tau together with Spearman rank otherwise. It is assumed that a
modest number of samples (columns) is being analysed at a time (say
about 10), otherwise the output matrices of r-values and p-values may
be too large to display conveniently.
Use the Pearson method only if you are sure that the columns
of data are normally distributed. Note that when plotting pairwise
correlations after using this option, Simfit gives you the two
regression lines of
y on x and x on y, since the choice of axes is arbitrary and
neither line alone is representative of the correlation. Similar
regressions of y on x and x on y (lines almost parallel) indicate strong
correlation but orthogonal lines (almost at right angles) indicate
the absence of any (parametric, linear, normal) correlation.
If you just want one line, then it should be the
reduced major axis or possibly the orthogonal line, which are also
available, since these allow for variation in both x and y.
For nonparametric correlations you should use the Kendall tau and
Spearman rank options, which test for monotonic (possibly nonlinear)
association rather than just linear correlation.
Note that for the
matrix or library file provided, r-values and two-tail p-values are
calculated for all possible pairwise correlations.
Try the library test file npcorr.tfl.
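As an illustrative sketch (not Simfit itself), the same pairwise statistics
can be computed in Python with scipy; the data matrix, sample size and
variable names here are assumptions made only for the example.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    data = rng.normal(size=(30, 4))       # hypothetical n x m data matrix

    m = data.shape[1]
    for i in range(m):
        for j in range(i + 1, m):
            r, p_r = stats.pearsonr(data[:, i], data[:, j])        # parametric
            tau, p_tau = stats.kendalltau(data[:, i], data[:, j])  # nonparametric
            rho, p_rho = stats.spearmanr(data[:, i], data[:, j])
            # reduced major axis slope, allowing for variation in both x and y
            b_rma = np.sign(r) * np.std(data[:, j]) / np.std(data[:, i])
            print(i + 1, j + 1, round(r, 3), round(tau, 3), round(rho, 3))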
Canonical correlation
This technique is used when columns of a data matrix fall naturally into one
of two groups, say X and Y. Canonical coordinates are calculated to maximise
the correlation between the two groups and the minimum number of components
needed to represent the data can be decided from a scree diagram
(or a chi-square test if a multivariate normal
distribution is assumed). The loadings can be used to plot a scatter diagram
for the two groups projected onto chosen canonical variate axes, but it should
be remembered that the transformation is not orthogonal, i.e. it is not
a rotation so distances are not preserved. Try test file matrix.tf5, taking
columns 1 and 2 as X and columns 3 and 4 as Y.
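A minimal sketch of the same idea using scikit-learn's CCA class (an
assumption about tooling, not the routine Simfit uses); X and Y stand for
the two column groups and are invented for the example.

    import numpy as np
    from sklearn.cross_decomposition import CCA

    rng = np.random.default_rng(1)
    X = rng.normal(size=(50, 2))                      # e.g. columns 1 and 2
    Y = X @ rng.normal(size=(2, 2)) + 0.5 * rng.normal(size=(50, 2))  # e.g. columns 3 and 4

    cca = CCA(n_components=2).fit(X, Y)
    Xc, Yc = cca.transform(X, Y)                      # canonical variate scores
    for k in range(2):
        print("canonical correlation", k + 1,
              np.corrcoef(Xc[:, k], Yc[:, k])[0, 1])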
Partial correlations
For data sets with more than two variables, partial correlation
coefficients, along with significance tests and confidence limits,
can be calculated either following Pearson analysis of a data
matrix, or by reading in a correlation matrix directly.
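As a sketch of the underlying calculation, partial correlations can be
obtained from the inverse of the correlation matrix; the matrix R below is
a made-up example.

    import numpy as np

    R = np.array([[1.0, 0.6, 0.3],
                  [0.6, 1.0, 0.5],
                  [0.3, 0.5, 1.0]])       # hypothetical correlation matrix
    Q = np.linalg.inv(R)
    # partial correlation of i and j given all other variables:
    # r_ij.rest = -q_ij / sqrt(q_ii * q_jj)
    partial = -Q / np.sqrt(np.outer(np.diag(Q), np.diag(Q)))
    np.fill_diagonal(partial, 1.0)
    print(partial)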
Multivariate cluster analysis
You input a data matrix with n rows (cases) and m columns (variables)
then the program calculates generalised distances,
i.e. a dissimilarity matrix which is output in strict lower
triangular form.
The pre-analysis transformation, variables to be excluded, distance
model, scaling and link algorithm are selected and a cluster
diagram or dendrogram can be plotted to illustrate the clusters.
Dendrograms use labels, which you can paste as a column at the end of the
data file, as will be clear from the test file cluster.tf1.
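An illustrative sketch of the same workflow with scipy (a dissimilarity
matrix, a chosen link algorithm, then a dendrogram); the data, labels and
the Euclidean/group-average choices are assumptions for the example.

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.spatial.distance import pdist, squareform
    from scipy.cluster.hierarchy import linkage, dendrogram

    rng = np.random.default_rng(2)
    data = rng.normal(size=(12, 5))               # hypothetical n x m data matrix
    labels = ["case%d" % (i + 1) for i in range(12)]

    d = pdist(data, metric="euclidean")           # dissimilarities (condensed form)
    print(squareform(d))                          # full matrix for inspection
    Z = linkage(d, method="average")              # group-average link algorithm
    dendrogram(Z, labels=labels)
    plt.show()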
Multivariate K-means cluster analysis
First you input a data matrix with n rows (cases) and m columns (variables),
then you input a matrix with KMAX rows (KMAX < n) and m columns which are
the starting estimates, i.e. the cluster centres from which to start to
cluster the data. The algorithm is iterative and proceeds by moving data
points between clusters to minimise the sum of squared distances from the
points to the cluster centres. After each iteration the cluster centroids
are re-calculated, and the algorithm will stop if an empty cluster is
created, or if the maximum number of iterations is exceeded. The starting
clusters and K, the number of clusters (2 <= K <= KMAX < n), can be
altered interactively, and also weights to account for differing numbers
of replicates can be supplied.
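A minimal sketch of K-means started from supplied cluster centres, using
scipy's kmeans2 (illustrative only; the data and starting centres are
invented for the example).

    import numpy as np
    from scipy.cluster.vq import kmeans2

    rng = np.random.default_rng(3)
    data = np.vstack([rng.normal(0.0, 1.0, (20, 3)),
                      rng.normal(5.0, 1.0, (20, 3))])   # hypothetical n x m data
    centres = np.array([[0.0, 0.0, 0.0],
                        [5.0, 5.0, 5.0]])               # starting estimates (K x m)

    centroids, labels = kmeans2(data, centres, minit="matrix")
    print("final centroids:\n", centroids)
    print("cluster sizes:", np.bincount(labels))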
Classical metric and non-metric scaling
When a distance matrix has been calculated, as for dendrogram creation,
several techniques are provided to view the results. For instance,
classical metric scaling (i.e. principal coordinate analysis) can be done.
This finds a set of coordinates that best represents the differences in
a Euclidean space of arbitrary dimensions so that, if such a representation
exists, the differences between the cases can be visualised. If the
distance matrix was calculated for ordinal data for which only ranks
are important, not Euclidean distances, non-metric scaling can be
done to calculate the STRESS or SSTRESS functions. This requires an
optimisation technique, so options to use random starting estimates
are provided, to ensure that a global minimum has been located. The
coordinates resulting from application of these methods can be plotted in
two or three dimensions, with or without labels, so that
the differences between multivariate cases can be assessed.
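A sketch of both forms of scaling from a precomputed distance matrix,
using scikit-learn's MDS class (an assumption about tooling; the STRESS
definition may differ in detail from Simfit's).

    import numpy as np
    from scipy.spatial.distance import pdist, squareform
    from sklearn.manifold import MDS

    rng = np.random.default_rng(4)
    data = rng.normal(size=(15, 6))
    D = squareform(pdist(data))                   # square distance matrix

    metric = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
    coords = metric.fit_transform(D)              # metric scaling coordinates

    nonmetric = MDS(n_components=2, metric=False, dissimilarity="precomputed",
                    n_init=8, random_state=0)     # several random starts
    coords_nm = nonmetric.fit_transform(D)        # ordinal (rank-based) coordinates
    print("final stress:", nonmetric.stress_)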
Principal components analysis
You input a data matrix with n rows (cases) and m columns (variables)
and the program performs a principal components analysis according
to several options that are provided to control the type of results required.
The eigenvalues, loadings and scores are then tabulated, statistics are
calculated and scree diagrams can be plotted to assist in the choice of the
minimum number of components required to adequately represent the data, and
selected scores can be plotted as scatter diagrams.
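An illustrative sketch of the same quantities with scikit-learn;
standardising the columns here is an assumption, corresponding to analysis
of the correlation matrix.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import scale

    rng = np.random.default_rng(5)
    data = rng.normal(size=(25, 4))               # hypothetical n x m data matrix

    Z = scale(data)                               # standardised columns
    pca = PCA().fit(Z)
    print("eigenvalues:", pca.explained_variance_)
    print("proportions:", pca.explained_variance_ratio_)  # basis for a scree diagram
    loadings = pca.components_.T                  # variables x components
    scores = pca.transform(Z)                     # scores for scatter diagrams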
Procrustes analysis
Two loading matrices X and Y from principal components analysis, or any two
arbitrary matrices with the same dimensions, can be translated, rotated,
and scaled in order to estimate how close matrix Y is to matrix X.
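A minimal sketch with scipy's procrustes routine, which translates, scales
and rotates to match two matrices of the same dimensions and reports the
residual disparity; the matrices below are invented.

    import numpy as np
    from scipy.spatial import procrustes

    rng = np.random.default_rng(6)
    X = rng.normal(size=(10, 2))                  # e.g. one loading matrix
    theta = 0.7
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    Y = 2.0 * X @ R + 1.0                         # rotated, scaled, shifted copy

    X_std, Y_fitted, disparity = procrustes(X, Y)
    print("disparity after fitting:", disparity)  # near zero: Y is close to X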
Varimax/Quartimax rotation
A loading matrix, e.g. from factor or canonical variates
analysis, is input and can then be rotated according to the Varimax
or Quartimax procedures.
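A sketch of the standard orthomax rotation in plain numpy, where gamma = 1
gives varimax and gamma = 0 gives quartimax (an illustration of the
technique, not Simfit's own code).

    import numpy as np

    def orthomax(L, gamma=1.0, max_iter=100, tol=1e-6):
        # rotate loading matrix L (variables x factors); gamma=1 varimax, 0 quartimax
        p, k = L.shape
        R = np.eye(k)
        d_old = 0.0
        for _ in range(max_iter):
            Lr = L @ R
            u, s, vt = np.linalg.svd(
                L.T @ (Lr ** 3 - (gamma / p) * Lr @ np.diag(np.sum(Lr ** 2, axis=0))))
            R = u @ vt
            d = np.sum(s)
            if d_old != 0.0 and d / d_old < 1.0 + tol:
                break
            d_old = d
        return L @ R

    loadings = np.random.default_rng(7).normal(size=(6, 2))   # hypothetical loadings
    print(orthomax(loadings, gamma=1.0))                       # varimax-rotated loadings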
Multivariate analysis of variance (MANOVA)
Groups and subgroups of multivariate normal data can be analysed
for equality of covariance matrices (Box's test), for equality of
mean vectors (Wilks' lambda, Roy's largest root, Lawley-Hotelling trace,
and Pillai trace), and profile analysis for repeated measurements can be
done on selected groups by using a transformation matrix followed by a
Hotelling's T-squared test.
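A minimal sketch of a one-way MANOVA with statsmodels, which reports
Wilks' lambda, Pillai's trace, the Lawley-Hotelling trace and Roy's largest
root (illustrative only; Box's test and the profile analysis are not shown,
and the data frame is invented).

    import numpy as np
    import pandas as pd
    from statsmodels.multivariate.manova import MANOVA

    rng = np.random.default_rng(8)
    df = pd.DataFrame({"y1": rng.normal(size=30),
                       "y2": rng.normal(size=30),
                       "group": np.repeat(["A", "B", "C"], 10)})
    df.loc[df.group == "C", ["y1", "y2"]] += 1.0      # shift one group's mean vector

    fit = MANOVA.from_formula("y1 + y2 ~ group", data=df)
    print(fit.mv_test())                              # the four multivariate tests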
Canonical variates analysis
Multivariate groups can be compared using canonical variates or
principal components, and extra data such as other group means for
comparison can be added to the data file (see manova1.tf4), for
plotting with group means and confidence regions to assign the
extra data to existing groups.
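As a sketch, canonical variate scores can be obtained via linear
discriminant analysis in scikit-learn (equivalent up to scaling
conventions, and an assumption about tooling rather than Simfit's own
routine); the groups and extra data are invented.

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    rng = np.random.default_rng(9)
    X = np.vstack([rng.normal(0.0, 1.0, (15, 3)),
                   rng.normal(2.0, 1.0, (15, 3)),
                   rng.normal(4.0, 1.0, (15, 3))])    # three hypothetical groups
    groups = np.repeat([1, 2, 3], 15)

    cva = LinearDiscriminantAnalysis(n_components=2).fit(X, groups)
    scores = cva.transform(X)                         # canonical variate coordinates
    extra = cva.transform(rng.normal(2.0, 1.0, (2, 3)))  # project extra data for comparison
    print(scores[:3], "\n", extra)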
Distances between groups
The squared Mahalanobis distances between group means and between group
means and individual samples can be calculated using pooled or individual
covariance matrix estimates. Allocation of samples to groups
based on these distances and using known groups as a training set
can be performed, using estimative or predictive Bayesian methods.
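A sketch of squared Mahalanobis distances and a simple allocation rule
using a pooled covariance estimate; the training groups and the sample to
allocate are invented, and the estimative/predictive Bayesian refinements
are not shown.

    import numpy as np
    from scipy.spatial.distance import mahalanobis

    rng = np.random.default_rng(10)
    A = rng.normal(0.0, 1.0, (20, 3))                 # training group A
    B = rng.normal(1.5, 1.0, (20, 3))                 # training group B

    pooled = ((len(A) - 1) * np.cov(A, rowvar=False) +
              (len(B) - 1) * np.cov(B, rowvar=False)) / (len(A) + len(B) - 2)
    VI = np.linalg.inv(pooled)                        # inverse pooled covariance

    x = rng.normal(1.5, 1.0, 3)                       # a sample to allocate
    dA = mahalanobis(x, A.mean(axis=0), VI) ** 2
    dB = mahalanobis(x, B.mean(axis=0), VI) ** 2
    print("squared distances:", dA, dB, "-> allocate to", "A" if dA < dB else "B")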
Factor analysis
Either a multivariate data matrix, or else a covariance or correlation
matrix calculated from such data can be input. The number of factors
is selected, and the sample size is provided if a covariance or
correlation matrix has been supplied. Then the loadings and communalities
are calculated, together with the score coefficients after varimax or
quartimax rotation if required.
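An illustrative sketch with scikit-learn's FactorAnalysis on a standardised
data matrix; the loadings and communalities are reconstructed from the
fitted components, and conventions may differ in detail from Simfit's
output.

    import numpy as np
    from sklearn.decomposition import FactorAnalysis
    from sklearn.preprocessing import scale

    rng = np.random.default_rng(11)
    data = rng.normal(size=(40, 5))                   # hypothetical data matrix

    Z = scale(data)
    fa = FactorAnalysis(n_components=2).fit(Z)
    loadings = fa.components_.T                       # variables x factors
    communalities = np.sum(loadings ** 2, axis=1)     # variance explained per variable
    scores = fa.transform(Z)                          # factor scores
    print(loadings, "\n", communalities)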
Biplots
A singular value decomposition is performed on any arbitrary matrix and,
if the rank is at least three, a residual matrix is calculated. Biplots
can then be displayed for either general, row-emphasised, or column-emphasised
cases and, in addition, arbitrary row or column scaling
factors can be used to stretch or reflect the vectors in order to
improve interpretation.
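A minimal sketch of a biplot built directly from a singular value
decomposition, with a scaling exponent alpha that shifts emphasis between
rows and columns; alpha and the matrix are assumptions for the example.

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(12)
    A = rng.normal(size=(12, 5))
    A = A - A.mean(axis=0)                            # column-centred matrix

    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    alpha = 0.5                                       # 1 = row emphasis, 0 = column emphasis
    rows = U[:, :2] * s[:2] ** alpha                  # row markers
    cols = Vt[:2].T * s[:2] ** (1.0 - alpha)          # column markers (vectors)

    plt.scatter(rows[:, 0], rows[:, 1])
    for v in cols:
        plt.arrow(0.0, 0.0, v[0], v[1], head_width=0.05)
    plt.show()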