CSAFIT: help and advice


Consult the reference manual for further details and worked examples.
W.G.Bardsley, University of Manchester, U.K.

References to the nonparametric analysis conducted by this program

1) A statistical theory for the interpretation of altered flow cytometry profiles in terms of the binding of ligands to cell surface receptors and changes in gene expression.
Bardsley,W.G. and Kyprianou,E.K. (1996) J. Math. Biol. 34, 271-296

2) A statistical model and computer program to estimate association constants for the binding of fluorescent-labelled monoclonal antibodies to cell surface antigens and to interpret shifts in flow cytometry data resulting from alterations in gene expression.
Bardsley,W.G., Ross Wilson,A., Kyprianou,E.K. and Melikhova,E.M. (1992) J. Immunol. Methods 153, 235-247

3) Analysis of gene dosage effects on the expression of CD18 by trisomy 21 lymphoblastoid cell lines using a statistical model to fit flow cytometry profiles.
Bardsley,W.G., McMurray,B.P., Robson,A., D'Souza,S.D. and Taylor,G.M. (1990) Hum. Genet. 86, 181-186

Comparing histograms

Sometimes observations are made over a range of values, and frequencies are recorded within partitions of the range, for instance with equally spaced bin limits in order to construct a histogram. Often one set of observations will be for a control group, and another for a treatment group, where it is suspected that the treatment group reflects a shift and/or a stretch, brought about by alterations in the parameters of the underlying (but unknown) probability density function. So the problem is: given two histograms over the same range and with the same bin limits, but no assumed probability density function, can the treatment be interpreted by nonparametric analysis as a shift and/or stretch of the control group? For the technique used by program CSAFIT to be useful, the following points must be observed.

  1. There must be a reasonably large number of bins, say >> 10.
  2. The sample sizes must be reasonably large, say >> 100.
  3. The histogram bins must all be the same width.
  4. Control and treatment frequencies must refer to the same bin limits.
  5. The data file must be formatted in three columns as follows.
    Column 1 must contain the histogram bin mid-points in increasing order.
    Column 2 must contain the control frequencies.
    Column 3 must contain the treatment frequencies.
    Frequencies must be positive.
  6. Test files for practice are csafit.tf1 (stretch), csafit.tf2 (shift), and csafit.tf3 (stretch and shift).
  7. The range and frequencies supplied in the data file are normalised by program CSAFIT so that the range is 0 to 1, and area under the histogram is 1.
  8. Best-fit spline curves used to approximate the probability density functions are normalised to have area 1.
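
The normalisation in point 7 can be illustrated with a minimal Python sketch. It is not CSAFIT's own code: the function name `normalise_histogram` is hypothetical, and it assumes the range runs from the lower limit of the first bin to the upper limit of the last bin, which may differ from how CSAFIT predicts the origin.

```python
def normalise_histogram(midpoints, counts):
    """Rescale bin mid-points to [0, 1] and frequencies to unit area.

    Assumption: the range is taken from the lower edge of the first bin
    to the upper edge of the last bin.
    """
    n = len(midpoints)
    if n < 2 or n != len(counts):
        raise ValueError("need matching mid-points and counts, at least 2 bins")
    width = midpoints[1] - midpoints[0]
    if width <= 0:
        raise ValueError("mid-points must be in increasing order")
    # Point 3: all bins must have the same width.
    for i in range(1, n):
        if abs((midpoints[i] - midpoints[i - 1]) - width) > 1e-9 * width:
            raise ValueError("bins are not equally spaced")
    lo = midpoints[0] - width / 2       # lower limit of the first bin
    hi = midpoints[-1] + width / 2      # upper limit of the last bin
    span = hi - lo
    scaled_mid = [(m - lo) / span for m in midpoints]
    scaled_width = width / span         # equals 1/n after rescaling
    total = sum(counts)
    # Bar heights chosen so that sum(height * scaled_width) = 1.
    heights = [c / (total * scaled_width) for c in counts]
    return scaled_mid, heights


mid, heights = normalise_histogram([1, 2, 3, 4], [10, 20, 30, 40])
print(mid)       # rescaled mid-points inside (0, 1)
print(heights)   # bar heights of a histogram with area 1
```

Dividing by the total count and the rescaled bin width together is what makes the bar areas sum to 1, so control and treatment histograms with different sample sizes become directly comparable.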
A good example where this approach can be applied would be flow cytometry, which will now be described in more detail to illustrate how this can be done.

Flow cytometry histograms

Suppose a fluorescence intensity X with histogram Phi(X) has been obtained with some reference cells, and intensity Y has been measured, giving histogram Psi(Y) with comparison cells using the same instrument settings. The bin values of X and Y will, of course, be the same, since both histograms share the same partitions, and the partitions will be in arithmetical progression with common difference D. The data would be:

0 =< A, X1 = [A + D], X2 = [X1 + D],..., XN = [X(N-1) + D] = B
0 =< A, Y1 = [A + D], Y2 = [Y1 + D],..., YN = [Y(N-1) + D] = B
Phi(X1), Phi(X2), ..., Phi(XN), and
Psi(Y1), Psi(Y2), ..., Psi(YN)
where
sum of Phi(Xi) = total number of reference cells, and
sum of Psi(Yi) = total number of comparison cells.
If the number of cells is large, we can view X and Y as random variables, and Phi(X) and Psi(Y) as approximations to density functions fX(x) and fY(y) (after normalising so area = 1). This program assumes the transformation model
    Y = alpha*X + beta,
so fY(y) is a shift and/or stretch of fX(x). Spline smoothing approximates fX(x), then alpha and beta are found by constrained nonlinear regression. See

Bardsley et al. (1992) J. Immunol. Methods 153, 235-247 and

Bardsley and Kyprianou (1996) J. Math. Biol. 34, 271-296.
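
Under the model Y = alpha*X + beta, the change of variables gives fY(y) = fX((y - beta)/alpha)/alpha, and any part of the curve pushed outside the observed range is lost, so the curve has to be renormalised. The following Python sketch shows this on the normalised range 0 to 1. It is only an illustration: `triangular_density` is a stand-in for the fitted spline, and the function names and the midpoint-rule integration are assumptions, not the program's own method.

```python
def triangular_density(x):
    """Stand-in for the spline estimate of fX: a triangle on [0, 1], area 1."""
    if x < 0.0 or x > 1.0:
        return 0.0
    return 4.0 * x if x <= 0.5 else 4.0 * (1.0 - x)


def transformed_density(f_x, alpha, beta, n_grid=2000):
    """Return fY on [0, 1] for Y = alpha*X + beta, renormalised to area 1."""
    if alpha <= 0.0:
        raise ValueError("alpha must be positive")

    def raw(y):
        # Change of variables: fY(y) = fX((y - beta)/alpha) / alpha
        return f_x((y - beta) / alpha) / alpha

    # Midpoint-rule estimate of the area that remains inside [0, 1].
    h = 1.0 / n_grid
    area = sum(raw((i + 0.5) * h) for i in range(n_grid)) * h
    if area <= 0.0:
        raise ValueError("transformed density lies entirely outside [0, 1]")
    gamma = 1.0 / area                  # normalising factor

    def f_y(y):
        return gamma * raw(y) if 0.0 <= y <= 1.0 else 0.0

    return f_y


squeezed = transformed_density(triangular_density, alpha=0.5, beta=0.0)
shifted = transformed_density(triangular_density, alpha=1.0, beta=0.5)
print(squeezed(0.25))   # peak is twice as high when the width is halved
print(shifted(0.75))    # half the curve is off scale, so the rest is doubled
```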

Modifications required for log10-spaced data

Data histograms are often generated with histogram bins in a geometric instead of a linear progression, in order to cover ranges of several orders of magnitude. If such a log spacing is used, the previous analysis needs modification. In fact, the density for log(Y) is then given in terms of the density for log(X) by

     f_logY(log(Y)) = [10^log(Y)*f_logX(log(Z))/Z]/(integral),
where the integral is for normalising, and Z is defined as
     Z = [10^log(Y) - beta]/alpha.
The data must be equally spaced in log10(fluorescence intensity), but note that serious numerical problems may arise when trying to estimate beta if the data are log-spaced. This program normalises so that the estimated stretch and shift parameters will always refer to linear fluorescences X and Y, irrespective of a linear or logarithmic data spacing.
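
The log-spaced formula can be evaluated numerically as in the Python sketch below. The names `hat` and `log_density_after_transform` are hypothetical, `hat` merely stands in for a fitted density of log(X), and the normalising integral is estimated by a simple midpoint rule over an assumed range of log(Y). Points where Z would be zero or negative are given density zero.

```python
import math


def hat(u):
    """Stand-in density for log10(X): a triangle on [0, 1] with area 1."""
    if u < 0.0 or u > 1.0:
        return 0.0
    return 4.0 * u if u <= 0.5 else 4.0 * (1.0 - u)


def log_density_after_transform(f_log_x, alpha, beta, w_lo, w_hi, n_grid=2000):
    """Density of log10(Y) on [w_lo, w_hi] when Y = alpha*X + beta."""

    def raw(w):
        y = 10.0 ** w
        z = (y - beta) / alpha          # Z = [10^log(Y) - beta]/alpha
        if z <= 0.0:
            return 0.0                  # log10(Z) undefined: no density here
        return y * f_log_x(math.log10(z)) / z

    # Midpoint rule for the normalising integral.
    h = (w_hi - w_lo) / n_grid
    integral = sum(raw(w_lo + (i + 0.5) * h) for i in range(n_grid)) * h
    if integral <= 0.0:
        raise ValueError("no density inside the requested range")

    def f_log_y(w):
        return raw(w) / integral if w_lo <= w <= w_hi else 0.0

    return f_log_y


# A pure stretch by alpha = 10 moves the log10 density one unit to the right.
f = log_density_after_transform(hat, alpha=10.0, beta=0.0, w_lo=0.0, w_hi=3.0)
print(f(1.5), f(0.5))
```

With beta = 0 the stretch is a pure translation on the log scale, which is one way to see why alpha is well determined by log-spaced data while beta is not.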

Program operation

  1. Read N values V (= X or logX), Phi(V), Psi(V) from a file
  2. Predict the origin (A) and check the data for consistency
  3. Normalise (if linear) so 0 =< V =< 1, and area under the histograms is 1 (to facilitate computation)
  4. Generate V-mid values between the V values (calculate the mid-points of the re-scaled Phi(V) and Psi(V) histograms)
  5. Fit a smooth curve to serve as a representation of Phi(V)
  6. Normalise until the area under the best-fit curve to the Phi(V) histogram between 0 and 1 equals 1 (so best-fit curve is a probability density function fV(v))
  7. Fit the function Gamma*fV(v = X(y, alpha, beta)) to the Psi(V) histogram by least squares. Gamma is a normalising factor calculated by numerical integration. It allows for truncation of the range of X when alpha is not 1 or beta is not 0, so the best-fit curve will have area 1.
  8. Calculate moments of data and best-fit curves and also statistics to estimate goodness of fit. Finally display plots and display/file tables of the statistics, residuals and parameter estimates as required.
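
Step 7 can be sketched in Python as follows. This is an illustration under stated assumptions, not the program's algorithm: a plain grid search replaces the constrained nonlinear regression, the analytic function `hat` replaces the fitted spline, and the names `model_heights` and `fit_stretch_shift` are hypothetical.

```python
def hat(x):
    """Stand-in for the fitted reference density fV on [0, 1]."""
    if x < 0.0 or x > 1.0:
        return 0.0
    return 4.0 * x if x <= 0.5 else 4.0 * (1.0 - x)


def model_heights(f_v, alpha, beta, mids, n_grid=400):
    """Gamma * fV((y - beta)/alpha) at each bin mid-point.

    The 1/alpha Jacobian cancels in the normalisation, so it is omitted.
    """
    h = 1.0 / n_grid
    area = sum(f_v(((i + 0.5) * h - beta) / alpha) for i in range(n_grid)) * h
    if area <= 0.0:
        return None                     # curve pushed entirely off scale
    return [f_v((y - beta) / alpha) / area for y in mids]


def fit_stretch_shift(f_v, mids, psi, alphas, betas):
    """Grid search for the (alpha, beta) minimising the sum of squares."""
    best = None
    for a in alphas:
        for b in betas:
            pred = model_heights(f_v, a, b, mids)
            if pred is None:
                continue
            sse = sum((p - o) ** 2 for p, o in zip(pred, psi))
            if best is None or sse < best[0]:
                best = (sse, a, b)
    return best


mids = [(i + 0.5) / 20 for i in range(20)]
psi = model_heights(hat, 1.2, 0.05, mids)       # exact "treatment" data
alphas = [0.5 + 0.05 * i for i in range(31)]    # 0.5 to 2.0
betas = [-0.2 + 0.05 * i for i in range(9)]     # -0.2 to 0.2
sse, alpha_hat, beta_hat = fit_stretch_shift(hat, mids, psi, alphas, betas)
print(alpha_hat, beta_hat)
```

A grid search is slow and coarse compared with a proper optimiser, but it makes the role of Gamma visible: every trial pair (alpha, beta) needs its own normalising integral before the residuals can be computed.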

Interpretation of alpha

If Psi(V) is stretched to the right of Phi(V) then test cells are more highly labelled than reference cells and alpha will be > 1. 100(alpha-1)% is simply the percentage stretch of X necessary to produce Y. The best-fit alpha value measures the extent to which this can be explained by increase in gene expression proportional to that in a reference population or by increased binding as ligand concentration increases.

Interpretation of beta

This is > 0 for a shift to the right, and 100[beta/(range in external coordinates)]% is the percentage translation, i.e. the extent to which X is shifted to give Y due to an increased gene expression unrelated to the reference population or from increased saturation with ligand.
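
As a small worked example of the two percentages, the Python sketch below assumes that beta is expressed in the same external units as the data range. The function names are hypothetical.

```python
def percent_stretch(alpha):
    """100(alpha - 1)%: percentage stretch of X needed to produce Y."""
    return 100.0 * (alpha - 1.0)


def percent_shift(beta, v_min, v_max):
    """100[beta/range]%: translation as a percentage of the data range.

    Assumption: beta, v_min and v_max are all in external coordinates.
    """
    return 100.0 * beta / (v_max - v_min)


print(percent_stretch(1.25))            # alpha = 1.25 is a 25% stretch
print(percent_shift(50.0, 0.0, 1000.0)) # beta = 50 on a 0 to 1000 scale is 5%
```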

Goodness of fit

The program calculates observed and expected probabilities, performs chi-square and run tests, and calculates moments. However, the best test for goodness of fit may just be the % relative difference between data and best-fit curves.
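
The three kinds of statistic mentioned here can be sketched in Python as below. The exact definitions used by CSAFIT are not given in this document, so these are assumptions: in particular the % relative difference is taken as 100 times the summed absolute differences divided by the summed observations, and the run count is the number of sign runs in the residuals.

```python
def chi_square(observed, expected):
    """Pearson chi-square statistic, skipping bins with zero expectation."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


def count_runs(observed, expected):
    """Number of runs of like-signed residuals (zero residuals ignored)."""
    signs = [o > e for o, e in zip(observed, expected) if o != e]
    if not signs:
        return 0
    return 1 + sum(1 for i in range(1, len(signs)) if signs[i] != signs[i - 1])


def percent_relative_difference(observed, expected):
    """Assumed definition: 100 * sum|obs - exp| / sum(obs)."""
    return 100.0 * sum(abs(o - e) for o, e in zip(observed, expected)) / sum(observed)


obs = [12, 18, 31, 39]
exp = [10, 20, 30, 40]
print(chi_square(obs, exp))
print(count_runs(obs, exp))
print(percent_relative_difference(obs, exp))
```

Few runs relative to the number of bins indicate residuals that stay on one side of the curve for long stretches, which points to systematic misfit rather than random scatter.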

Cubic splines

This version uses a least-squares cubic spline function for data smoothing. The best-fit curve depends on the data and on the selection of interior knot positions. If just a few knots are used, the resulting curve can be calculated quickly and will be smooth, but it may not fit well. With many knots the calculations will be slower, and over-fitting may result with noisy data, leading to spline curves that oscillate in the attempt to fit fluctuations due to error in the data. The spline is set to zero wherever a negative best-fit curve results from noisy data, then it is re-normalised to area 1.
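
The final clipping and re-normalising step can be sketched in Python for a curve tabulated on an equally spaced grid. This is only an illustration: the function name is hypothetical and the trapezoidal rule is an assumed choice of integration method.

```python
def clip_and_renormalise(values, step):
    """Set negative curve values to zero, then rescale to unit area.

    values: curve ordinates on an equally spaced grid with spacing `step`.
    The area is estimated with the trapezoidal rule (an assumption).
    """
    clipped = [max(v, 0.0) for v in values]
    area = step * (sum(clipped) - 0.5 * (clipped[0] + clipped[-1]))
    if area <= 0.0:
        raise ValueError("curve has no positive area")
    return [v / area for v in clipped]


# A curve that dips below zero at one grid point.
curve = [0.0, -0.2, 1.0, 2.0, 1.0, 0.0]
print(clip_and_renormalise(curve, step=0.2))
```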

EXPERT mode

You can choose sparse, medium or dense interior spline-knots in order to fit your data but to optimise the program for your particular use you can use an EXPERT mode of operation. This allows you to input the interior spline knot positions. Choose knot positions K1, K2,..., KL with more knots where Phi(V) histogram changes most dramatically. Add these at the end of the data file (see csafit.tf1) and use EXPERT mode.

Format for the data files


     Title                    ( =< 80 characters, describing data)
     N    3                   (Data in the form of N by 3 matrix)
     V1, Phi(V1), Psi(V1)     (First line of data)
     V2, Phi(V2), Psi(V2)     (Second line of data)
     ...
     VN, Phi(VN), Psi(VN)     (Last line of data)
     M                        (M lines if required)
     L                        (L knots if required)
     K1, K2, ..., KL          (The knot positions)
     Text                     (Last of extra lines)
V1 to VN can be either fluorescence intensities or log10 of fluorescence intensities, but they must be equally spaced. If the EXPERT mode is required, the integers M and L and the knots must be as indicated above. Position the knots thus: V(1) < K(1) < K(2) < ... < K(L) < V(N) (see csafit.tf1), with the knots clustered where the data change most rapidly. Use your software (e.g. DATAMATE) to convert machine data to ASCII data files, then make Simfit files using MAKSIM then EDITMT.
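
To check a file against this layout before running the program, a reader along the following lines could be used. It is a hedged Python sketch, not part of Simfit: the function name is hypothetical, and it assumes that values may be separated by commas or spaces, that the explanatory notes in parentheses are not present in the file, and that all L knots sit on the single line after L.

```python
def parse_csafit_text(text):
    """Parse title, N-by-3 data and optional knots from CSAFIT-style text."""

    def numbers(line):
        return [float(tok) for tok in line.replace(",", " ").split()]

    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    title = lines[0]
    n_rows, n_cols = (int(tok) for tok in lines[1].split()[:2])
    if n_cols != 3:
        raise ValueError("expected an N by 3 matrix")
    rows = [numbers(ln) for ln in lines[2:2 + n_rows]]
    if len(rows) != n_rows or any(len(r) != 3 for r in rows):
        raise ValueError("data block is incomplete")
    v, phi, psi = (list(col) for col in zip(*rows))

    knots = []
    extra = lines[2 + n_rows:]
    if len(extra) >= 3:                 # assumed order: M, L, knot line
        n_knots = int(extra[1].split()[0])
        knots = numbers(extra[2])[:n_knots]
        bounds = [v[0]] + knots + [v[-1]]
        increasing = all(a < b for a, b in zip(bounds, bounds[1:]))
        if len(knots) != n_knots or not increasing:
            raise ValueError("need V(1) < K(1) < ... < K(L) < V(N)")
    return title, v, phi, psi, knots


sample = """Example data
4 3
1.0, 10, 5
2.0, 20, 15
3.0, 30, 35
4.0, 40, 45
3
2
1.5, 3.5
end of extra lines
"""
title, v, phi, psi, knots = parse_csafit_text(sample)
print(title, v, knots)
```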

Practising with test files

Before using this program you must familiarise yourself with what it does by running the test files. These have exact data and have a control line with spline knots if you want to use expert mode. This is only necessary with very spiky data and several modes. The test files are in linear not log spacing.
csafit.tf1: The geometric model
csafit.tf2: The arithmetic model
csafit.tf3: The full linear model

Simulating data to explore parameter estimates

Program MAKCSA should be used to simulate flow cytometry type data with added random error to explore the robustness of the parameter estimates. Investigation suggests that the standard errors calculated in the normal way from the Jacobian/Hessian are not reliable with this model so the parameter covariance matrix is not printed. See J. Math. Biol. (1996) 34: 271-296 for more details about parameter interpretation.
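
For readers who want to see what such simulated data look like, here is a minimal Python sketch. It does not reproduce MAKCSA: the function name, the triangular reference distribution and the use of pure sampling error as the only noise source are all assumptions made for illustration.

```python
import random


def simulate_histograms(alpha, beta, n_cells=5000, n_bins=20, seed=1):
    """Simulate control and treatment frequencies on the range 0 to 1.

    Reference intensities X are drawn from a triangular distribution
    (an arbitrary stand-in); treatment intensities are alpha*X + beta.
    """
    rng = random.Random(seed)
    control = [0] * n_bins
    treatment = [0] * n_bins
    for _ in range(n_cells):
        x = rng.triangular(0.0, 1.0, 0.5)
        control[min(int(x * n_bins), n_bins - 1)] += 1
    for _ in range(n_cells):
        y = alpha * rng.triangular(0.0, 1.0, 0.5) + beta
        if 0.0 <= y < 1.0:              # off-scale cells are not recorded
            treatment[int(y * n_bins)] += 1
    mids = [(i + 0.5) / n_bins for i in range(n_bins)]
    return mids, control, treatment


mids, control, treatment = simulate_histograms(alpha=1.2, beta=0.0)
for m, c, t in zip(mids, control, treatment):
    print(f"{m:.3f} {c:5d} {t:5d}")     # three columns, as in the data file
```

Repeating the simulation with different seeds and fitting each data set gives an empirical spread for alpha and beta, which is a more direct guide to their reliability than standard errors from the covariance matrix.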

Summary of CSAFIT operations

  1. Fit a spline curve to the X data, that is, the reference data histogram.
  2. Output goodness of fit data and plot the best-fit spline with X data.
  3. This is then taken as an approximation to the probability density function for both X and Y, differing only in the parameter values.
  4. Fit the Y data by constrained nonlinear regression using the spline curve allowing for a translation and/or stretching.
  5. Output the best-fit translation and/or stretching parameters.
  6. Plot the best-fit modified spline curve with the Y data and output goodness of fit.
  7. For completeness do a Kolmogorov-Smirnov two-sample test on X and Y and plot X and Y with best-fit curves.
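
The Kolmogorov-Smirnov statistic in step 7 can be approximated directly from the two columns of frequencies, as in this Python sketch. How CSAFIT computes the test is not described here, so this is an assumption: the cumulative fractions are compared only at the bin limits, and no significance level is calculated. The function name is hypothetical.

```python
def ks_statistic_binned(counts_x, counts_y):
    """Largest difference between the two cumulative fractions.

    Evaluated at the upper limit of each bin only, so it can slightly
    understate the statistic that raw (unbinned) data would give.
    """
    n_x, n_y = sum(counts_x), sum(counts_y)
    if n_x <= 0 or n_y <= 0:
        raise ValueError("both samples must contain observations")
    cum_x = cum_y = 0
    d_max = 0.0
    for c_x, c_y in zip(counts_x, counts_y):
        cum_x += c_x
        cum_y += c_y
        d_max = max(d_max, abs(cum_x / n_x - cum_y / n_y))
    return d_max


print(ks_statistic_binned([10, 20, 30, 40], [40, 30, 20, 10]))
```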