SIMSTAT: Regression and Calibration

Fit a line
This is when you just want to fit a simple straight line
y = p(0) + p(1)x
to some (x,y) or (x,y,s) data in the least squares sense. It assumes that y is subject to experimental uncertainty but x is known exactly. If weighting factors (i.e. standard errors s) are provided they will be used, so set all s = 1 in the data file to perform unweighted regression. Note that you can also include a goodness of fit analysis together with tables of residuals on request.
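For reference, the weighted fit described above can be sketched outside Simfit in a few lines of Python (numpy assumed; the data values are invented, not a Simfit test file):

```python
import numpy as np

def fit_line(x, y, s):
    """Least squares fit of y = p(0) + p(1)x with weights w = 1/s^2.
    Setting all s = 1 gives unweighted regression, as in the text above."""
    w = 1.0 / np.asarray(s, dtype=float) ** 2
    X = np.column_stack([np.ones_like(x, dtype=float), x])
    # Solve the weighted normal equations (X^T W X) p = X^T W y
    XtW = X.T * w
    p = np.linalg.solve(XtW @ X, XtW @ y)
    return p                          # p[0] = intercept, p[1] = slope

x = np.array([1.0, 2.0, 3.0, 4.0])    # hypothetical (x, y, s) data
y = np.array([2.1, 3.9, 6.2, 7.8])
s = np.ones_like(x)                   # all s = 1 -> unweighted regression
p = fit_line(x, y, s)
```

With genuine standard errors in s, points with small s dominate the fit; with all s = 1 this reduces to ordinary least squares.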

Fit a reduced major axis line
This is used when there is random error in the measurement of both x and y, as in allometric analysis. It minimises the sum of the areas of the triangles formed between the data points and the best fit line. Set all s = 1 when using this option.
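The reduced major axis (geometric mean) slope works out to sd(y)/sd(x) with the sign of the correlation coefficient, which can be sketched as follows (numpy assumed; invented data):

```python
import numpy as np

def fit_rma(x, y):
    """Reduced major axis line: |slope| = sd(y)/sd(x), sign from r."""
    slope = np.sign(np.corrcoef(x, y)[0, 1]) * (np.std(y) / np.std(x))
    intercept = np.mean(y) - slope * np.mean(x)
    return intercept, slope

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 1.0 + 2.0 * x                     # exactly collinear example data
b0, b1 = fit_rma(x, y)
```

Unlike ordinary least squares, this treatment is symmetric in x and y, which is why it suits allometric data where neither variable is error-free.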

Fit a major axis line
This minimises the sum of squares of orthogonal distances from the data points to the best fit line and is often used when there is random variation in both x and y, as in correlation. Set all s = 1 when using this option.
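The major axis line is the first principal axis of the (x, y) scatter, so one way to sketch the calculation is via the dominant eigenvector of the sample covariance matrix (numpy assumed; invented data):

```python
import numpy as np

def fit_major_axis(x, y):
    """Major axis line: direction of the largest eigenvalue of cov(x, y)."""
    cov = np.cov(x, y)                  # 2x2 sample covariance matrix
    evals, evecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    v = evecs[:, -1]                    # principal (major) axis direction
    slope = v[1] / v[0]
    intercept = np.mean(y) - slope * np.mean(x)
    return intercept, slope

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x                       # exactly collinear example data
b0, b1 = fit_major_axis(x, y)
```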

Fit a line and calibrate (simple)
This is a very simple interface for first-time users who want to fit a line with minimum fuss and then use it as a standard curve to predict x given y. Try the test file polnom.tf1 for a standard curve and read in prediction data from polnom.tf3. Note that you can supply columns of x and y for unweighted regression, but a third column of s-values must be supplied if weighting is required. Also note that Simfit curve fitting files must have the first column (x-values, i.e. the independent variable) in nondecreasing order. The utility EDITFL can correctly re-order, format and edit curve fitting files.
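The calibration step is just the fitted line inverted, x = (y - p(0))/p(1). A minimal sketch (numpy assumed; the standards below are made up, not the contents of polnom.tf1):

```python
import numpy as np

# Hypothetical standards: fit y = p0 + p1*x, then invert to predict x from y.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.2, 1.1, 2.1, 2.9, 4.1])
p1, p0 = np.polyfit(x, y, 1)          # polyfit returns highest power first
y_new = np.array([1.5, 3.0])          # unknowns read in for prediction
x_pred = (y_new - p0) / p1            # inverse prediction from the line
```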

Fit a line and calibrate (advanced)
This gives more flexibility for the experienced user, including correlation analysis, but it is rather more complicated to use.

Fit a polynomial and calibrate
This fits all polynomials up to degree six, i.e.
y = p(0) + p(1)x + p(2)x^2 + p(3)x^3 + ... + p(6)x^6
and provides the same functionality as program POLNOM, which should be consulted for more details. Try polnom.tf1 again and select a polynomial of degree 2 (a quadratic). Note how the standard curve is now much better than when a line is fitted and the prediction of x given y is more meaningful due to the allowance made for curvature in the calibration data. For very complicated calibration curves program QNFIT should be used to fit user-selected models or program CALCURVE should be used to fit weighted least squares cubic splines.
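For a curved standard curve, prediction of x given y means solving the fitted polynomial for x and keeping the root inside the calibration range. A sketch with a quadratic (numpy assumed; synthetic data, not polnom.tf1):

```python
import numpy as np

# Synthetic quadratic standard curve y = 0.5 + x + 0.25x^2
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 0.5 + 1.0 * x + 0.25 * x**2
coeffs = np.polyfit(x, y, 2)              # [p2, p1, p0], degree 2 fit

def predict_x(y_obs, coeffs, lo, hi):
    """Solve p(x) = y_obs for x; keep the real root inside [lo, hi]."""
    c = coeffs.copy()
    c[-1] -= y_obs                        # p(x) - y_obs = 0
    roots = np.roots(c)
    real = roots[np.isclose(roots.imag, 0.0)].real
    inside = real[(real >= lo) & (real <= hi)]
    return inside[0]

x_hat = predict_x(3.25, coeffs, 0.0, 4.0)
```

Restricting to the calibration range matters because a quadratic has two roots and only one is physically meaningful.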

Fit a transformed polynomial by least squares and calibrate
This routine replaces y by some chosen Y = f(y) and fits a polynomial in some chosen X = g(x). For instance, choosing f(y) = log[y/(1 - y)] and g(x) = x gives a simplified type of logistic regression, where the original measured responses are proportions with 0 < y < 1, or percentages with 0 < y < 100, and a line, quadratic, or higher degree polynomial is to be fitted to the transformed data by least squares, e.g. for data smoothing or calibration. In this case the sequence would be
Y = log[y/(1 - y)] or log[y/(100 - y)]
X = x or log(x)
followed by fitting the model
Y = p(0) + p(1)X + p(2)X^2 + p(3)X^3 + ... + p(6)X^6.
It can be useful when the y-values do not correspond to a binomial distribution, e.g. due to overdispersion, so that generalized linear model fitting does not correspond to maximum likelihood and may be no more justified than least squares fitting. The idea is that you read in a two column (x,y) data sample and the program first transforms the y data to Y = f(y), e.g. to log[y/(1 - y)], and possibly also the x values to X = g(x), e.g. to log[x], before fitting a sequence of polynomials up to degree 6. This is often done when the y values are proportions (between 0 and 1) and a smooth S-shaped curve is required. It is possible to normalise the y-values supplied if they are percentages rather than proportions by dividing by 100, and either natural or base ten logarithms can be used. Whatever weights are supplied, the program will always set all s = 1 and do unweighted regression, since transforming weights can lead to problems with these types of transformed regressions. Try fitting logistic.tf1, using log[y/(1 - y)] as a function of log[x], to get the idea. However note that, whatever transformation is chosen, the symbols x and y after the fitting will always represent the transformed variables, not the original variables, in subsequent tables and plots. Also note that, for true logistic regression with possibly several variables, you should select the generalized linear model option (GLM) and choose binomial errors with a logistic link function. Polynomials can also be fitted by GLM if the independent variable x is used to form other independent variables, e.g. x_1 = x, x_2 = x^2, x_3 = x^3, and so on.
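The transform-then-fit sequence described above can be sketched as follows (numpy assumed; the proportions are invented, not the contents of logistic.tf1):

```python
import numpy as np

# Transformed polynomial: Y = log[y/(1 - y)] fitted as a line in X = log(x).
x = np.array([0.5, 1.0, 2.0, 4.0, 8.0])
y = np.array([0.12, 0.26, 0.50, 0.73, 0.88])    # proportions, 0 < y < 1
Y = np.log(y / (1.0 - y))                       # logit transform of y
X = np.log(x)                                   # log transform of x
p1, p0 = np.polyfit(X, Y, 1)                    # degree 1 in transformed space
# Back-transform to a smooth S-shaped curve on the original scale
x_grid = np.linspace(0.5, 8.0, 50)
y_fit = 1.0 / (1.0 + np.exp(-(p0 + p1 * np.log(x_grid))))
```

Note that the fitting itself is ordinary unweighted least squares on (X, Y); only the back-transformed curve is S-shaped.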

Multilinear regression
This technique is used when a dependent variable (y) depends on several independent variables (x_1, x_2, ..., x_n) in a linear manner, as in
y = p(0) + p(1)x_1 + p(2)x_2 + ... + p(n)x_n
and the experimental errors are independent and normally distributed. Often the error variance is not constant, so weighting factors (s) can be supplied to calculate weights for fitting according to w = 1/s^2. Try the test file linfit.tf1 (which is singular) or linfit.tf2 (which has full rank) to see how SVD is used for problems with reduced rank. For full details of how to use this option or how to do GLM or robust fitting consult program LINFIT. Note that the order of the data columns is x_1, x_2, x_3, ..., x_n, y, s but you can de-select columns or weights if required. To include categorical variables, i.e. qualitative variables or factors with several levels, dummy indicator variables can be defined, e.g. to fit allowing for male or female as a covariate you could set x_1 = 1, x_2 = 0 for male and x_1 = 0, x_2 = 1 for female, etc. Of course x_1 and x_2 should not both be included in the regression at the same time, as the design matrix would be rank-deficient if a constant term were fitted as well as x_1 and x_2. It does not matter which of x_1 or x_2 is included as long as they are not both included at the same time. In general, a factor with m levels can be included in the regression by defining m (0,1) dummy indicator variables, but then suppressing one, so that only m - 1 linearly independent indicator variables are included. Otherwise this approach leads to rank-deficient systems, where only certain functions of the parameters can be estimated but not all the parameters individually, i.e. aliasing.
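The dummy-variable point can be sketched numerically: with a constant term, one dummy for a two-level factor keeps the design matrix at full rank, while including both dummies loses a rank (numpy assumed; simulated data):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40
x1 = rng.normal(size=n)                 # continuous covariate
sex = rng.integers(0, 2, size=n)        # 0 = male, 1 = female (one dummy only)
y = 1.0 + 2.0 * x1 + 0.5 * sex + 0.1 * rng.normal(size=n)

# Full-rank design: constant, x1, and a single dummy (m - 1 = 1 for m = 2)
A = np.column_stack([np.ones(n), x1, sex])
coef, residuals, rank, sv = np.linalg.lstsq(A, y, rcond=None)

# Aliased design: constant plus BOTH dummies -> columns sum to the constant
A2 = np.column_stack([np.ones(n), x1, sex, 1 - sex])
rank2 = np.linalg.matrix_rank(A2)       # 3, not 4: one rank is lost
```

This is exactly the aliasing described above: in A2 only certain linear combinations of the four parameters are estimable.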

Dose response curves, EC50 and LD50
Where dichotomous data sets have to be analysed as a function of some variable in order to construct dose response curves and then estimate appropriate percentiles, such as the median effective dose, a simple version of the Simfit GLM interface is provided to do probit analysis, logistic, or log-log regression using generalized linear models. The procedure requires values for the number of successes (y) in a number of independent Bernoulli trials (N) at settings of an independent variable (x). The data format can have columns in the order (y, N, x), as in analysis of proportions and test file ld50.tf1, or in the order (x, y, N, s), i.e. the format for generalized linear models, as in test file ld50.tf2.
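A minimal sketch of the logistic variant (scipy assumed, fitted by direct binomial maximum likelihood rather than Simfit's GLM machinery; the counts below are invented, not ld50.tf1):

```python
import numpy as np
from scipy.optimize import minimize

# Successes y out of N trials at each dose; fit p = 1/(1 + exp(-(a + b log d)))
dose = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
N    = np.array([50, 50, 50, 50, 50])
yobs = np.array([4, 12, 26, 40, 47])

def nll(params):
    """Negative binomial log-likelihood for the logistic dose response."""
    a, b = params
    eta = a + b * np.log(dose)
    pr = np.clip(1.0 / (1.0 + np.exp(-eta)), 1e-12, 1.0 - 1e-12)
    return -np.sum(yobs * np.log(pr) + (N - yobs) * np.log(1.0 - pr))

res = minimize(nll, x0=[0.0, 1.0], method="Nelder-Mead")
a, b = res.x
ld50 = np.exp(-a / b)     # dose where a + b*log(d) = 0, i.e. p = 0.5
```

The median effective dose is the point where the linear predictor crosses zero; probit or log-log fits differ only in the link function used for pr.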

Logistic regression
This option provides access to a subset of the Simfit GLM techniques that are used for various types of logistic regression.

Cox regression
This fits the Cox proportional hazards model to survival analysis data of the form
x1, x2,..., xm, y, t, s
where x1 to xm are covariates, y is 0 for failure or 1 for right censoring, t is the observed failure time, and s is the stratum indicator. Test files are cox.tf1, cox.tf2 and cox.tf3, which have only one stratum, and cox.tf4, which has data for several strata.
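The quantity maximised by this option is the Cox partial likelihood, which can be sketched for a single covariate and one stratum (scipy assumed; invented data with no tied failure times; note that here d = 1 marks a failure, the reverse of the Simfit file convention above, purely for readability):

```python
import numpy as np
from scipy.optimize import minimize_scalar

t = np.array([2.0, 3.0, 5.0, 7.0, 8.0, 11.0])   # observed times
d = np.array([1,   1,   0,   1,   1,   1])      # 1 = failure, 0 = censored
x = np.array([0.5, 1.2, -0.3, 0.8, -1.0, -0.7]) # covariate

def neg_log_partial_likelihood(beta):
    """Breslow-form Cox partial likelihood for one covariate."""
    eta = beta * x
    ll = 0.0
    for i in range(len(t)):
        if d[i] == 1:
            risk = t >= t[i]                    # risk set at this failure time
            ll += eta[i] - np.log(np.sum(np.exp(eta[risk])))
    return -ll

res = minimize_scalar(neg_log_partial_likelihood, bounds=(-5, 5),
                      method="bounded")
beta_hat = res.x
```

Censored subjects (d = 0) contribute only through the risk sets, never their own term, which is the defining feature of the partial likelihood.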

Comparing two parameter estimates
This technique is used to compare parameters that have been estimated by regression, or have been derived from such parameters. Examples would be two LD50 or EC50 estimates from dose response curves, or two AUC estimates from exponential fitting. The routine accepts two means and standard errors of means, then does a t test corrected for unequal variances.
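The t test corrected for unequal variances can be sketched as follows (scipy assumed; the two estimates, standard errors, and the sample sizes used for the Welch-Satterthwaite degrees of freedom are all hypothetical):

```python
import numpy as np
from scipy import stats

def compare_estimates(m1, se1, n1, m2, se2, n2):
    """t test for two estimates given their standard errors, with
    approximate (Welch-Satterthwaite) degrees of freedom."""
    tstat = (m1 - m2) / np.sqrt(se1**2 + se2**2)
    df = (se1**2 + se2**2) ** 2 / (se1**4 / (n1 - 1) + se2**4 / (n2 - 1))
    pval = 2.0 * stats.t.sf(abs(tstat), df)     # two-sided p value
    return tstat, df, pval

# e.g. two EC50 estimates with their standard errors and sample sizes
tstat, df, pval = compare_estimates(4.2, 0.3, 10, 5.1, 0.4, 12)
```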

Comparing two sets of regression parameters
After fitting models, Simfit offers the opportunity to store vectors of parameters and covariance matrices to your archive project file c_recent.cfg. These can be used to calculate the Mahalanobis distance between two parameter vectors, so that sets of regression parameters can be conveniently tested for equality. Examples would be fitting the same growth model to different cell cultures to quantify differences in growth profiles, or comparing pharmacokinetic profiles for the same exponential model fitted to different populations.
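A sketch of the distance calculation (numpy/scipy assumed; the parameter vectors and covariance matrices are invented, and the chi-square reference distribution is the usual large-sample approximation for such a test):

```python
import numpy as np
from scipy import stats

# Two stored parameter vectors with their covariance matrices
pa = np.array([1.02, 0.48])
Ca = np.array([[0.010, 0.002],
               [0.002, 0.008]])
pb = np.array([1.25, 0.40])
Cb = np.array([[0.012, 0.001],
               [0.001, 0.009]])

diff = pa - pb
# Squared Mahalanobis distance; under H0 (equal parameters) it is
# approximately chi-square with k = len(diff) degrees of freedom
D2 = diff @ np.linalg.solve(Ca + Cb, diff)
pval = stats.chi2.sf(D2, df=len(diff))
```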

Back to Help Menu or End Help