SIMSTAT: Regression and Calibration
Fit a line
This is when you just want to fit a simple straight line
y = p(0) + p(1)x
to some
(x,y) or (x,y,s) data in
the least squares sense. It assumes that y is subject to experimental
uncertainty but x is known exactly. If weighting factors
(i.e. standard errors s) are provided they will be used,
so set all s = 1 in the data file to perform unweighted regression.
Note that you can also include a goodness of fit analysis together
with tables of residuals on request.
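The weighted least-squares calculation just described can be sketched as follows. This is a minimal illustration, not Simfit's own code, and the function name fit_line is hypothetical:

```python
def fit_line(x, y, s=None):
    """Fit y = p0 + p1*x by weighted least squares, minimising
    sum(w*(y - p0 - p1*x)^2) with weights w = 1/s^2.
    Returns (p0, p1)."""
    if s is None:
        s = [1.0] * len(x)          # all s = 1 gives unweighted regression
    w = [1.0 / si ** 2 for si in s]
    Sw = sum(w)
    Sx = sum(wi * xi for wi, xi in zip(w, x))
    Sy = sum(wi * yi for wi, yi in zip(w, y))
    Sxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    Sxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    d = Sw * Sxx - Sx * Sx          # determinant of the normal equations
    p1 = (Sw * Sxy - Sx * Sy) / d
    p0 = (Sy - p1 * Sx) / Sw
    return p0, p1
```

With exact data on the line y = 1 + 2x, fit_line recovers p0 = 1 and p1 = 2 whatever the weights.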
Fit a reduced major axis line
This is used when there is random variation in the measurement of both
x and y, as in allometric analysis. It minimises the sum of the areas
of the triangles formed between the data points and the best-fit line.
Set all s = 1 when using this option.
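The reduced major axis line has a simple closed form: the slope is the ratio of the standard deviations of y and x, with the sign of the correlation. A minimal sketch (not Simfit's code; the function name is hypothetical):

```python
import math

def fit_rma(x, y):
    """Reduced major axis (geometric mean) regression:
    slope = sign(correlation) * sd(y)/sd(x).
    Returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = math.copysign(math.sqrt(syy / sxx), sxy)
    return my - slope * mx, slope
```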
Fit a major axis line
This minimises the sum of squares of orthogonal distances from the data
points to the best fit line and is often used when there is random
variation in both x and y, as in correlation. Set all s = 1 when
using this option.
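The major axis line is the first principal axis of the (x, y) scatter, so its slope follows from the sample variances and covariance. A minimal sketch under the assumption that the covariance is nonzero (not Simfit's code):

```python
import math

def fit_major_axis(x, y):
    """Major axis regression: the line along the leading eigenvector
    of the sample covariance matrix of (x, y). Assumes cov(x, y) != 0.
    Returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    # Slope of the leading principal axis of the 2x2 covariance matrix.
    slope = (syy - sxx + math.sqrt((syy - sxx) ** 2 + 4 * sxy ** 2)) / (2 * sxy)
    return my - slope * mx, slope
```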
Fit a line and calibrate (simple)
This is a very simple interface for first time users who simply
want to fit a line with minimum fuss and then use it as a
standard curve to predict x given y. Try the test file polnom.tf1
for a standard curve and read in prediction data from polnom.tf3.
Note that you can supply a column of x and y for unweighted
regression, but a third column of s-values must be supplied if
weighting is required. Also note that Simfit curve fitting files
must have the
first column (x-values, i.e. independent variable) in nondecreasing
order. There is a utility called EDITFL which can correctly
re-order, format and edit curve fitting files.
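Once the line has been fitted, prediction of x given y is just inversion of the standard curve y = p(0) + p(1)x. For illustration only (the function name is hypothetical, and p1 must be nonzero):

```python
def predict_x(y, p0, p1):
    """Invert the fitted standard curve y = p0 + p1*x
    to predict x from an observed y. Requires p1 != 0."""
    return (y - p0) / p1
```

For instance, with a fitted line y = 1 + 2x, an observed y = 5 predicts x = 2.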
Fit a line and calibrate (advanced)
This gives more flexibility for the experienced user, including
correlation analysis, but it is rather more complicated to use.
Fit a polynomial and calibrate
This fits all polynomials up to degree six, i.e.
y = p(0) + p(1)x + p(2)x^2 + p(3)x^3 + ... + p(6)x^6
and provides the same
functionality as program POLNOM, which should
be consulted for more details. Try polnom.tf1 again and select a
polynomial of degree 2 (a quadratic). Note how the standard curve
is now much better than when a line is fitted and the prediction of
x given y is more meaningful due to the allowance made for curvature
in the calibration data. For very complicated calibration curves program
QNFIT should be used to fit user-selected models or program
CALCURVE should be used to fit weighted least squares cubic splines.
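Fitting a polynomial of modest degree by least squares amounts to solving the normal equations. A minimal self-contained sketch (not the POLNOM algorithm, which will use a numerically better method such as orthogonal polynomials or QR factorisation):

```python
def fit_polynomial(x, y, degree):
    """Least-squares polynomial fit via the normal equations.
    Returns [p0, p1, ..., p_degree] for y = p0 + p1*x + ... .
    Simple Gaussian elimination; fine for low degree, but prone to
    ill-conditioning as the degree grows."""
    m = degree + 1
    # Normal equations A p = b: A[i][j] = sum(x^(i+j)), b[i] = sum(y*x^i).
    A = [[sum(xi ** (i + j) for xi in x) for j in range(m)] for i in range(m)]
    b = [sum(yi * xi ** i for xi, yi in zip(x, y)) for i in range(m)]
    # Gaussian elimination with partial pivoting.
    for k in range(m):
        piv = max(range(k, m), key=lambda r: abs(A[r][k]))
        A[k], A[piv] = A[piv], A[k]
        b[k], b[piv] = b[piv], b[k]
        for r in range(k + 1, m):
            f = A[r][k] / A[k][k]
            for c in range(k, m):
                A[r][c] -= f * A[k][c]
            b[r] -= f * b[k]
    # Back substitution.
    p = [0.0] * m
    for k in range(m - 1, -1, -1):
        p[k] = (b[k] - sum(A[k][c] * p[c] for c in range(k + 1, m))) / A[k][k]
    return p
```

Exact quadratic data y = 1 + 2x + 3x^2 is recovered to rounding error with degree = 2.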
Fit a transformed polynomial by least squares and calibrate
This routine replaces y by some chosen Y = f(y) and fits a polynomial in some
chosen X = g(x).
For instance, choosing f(y) = log[y/(1 - y)] and g(x) = x gives a
simplified type of logistic regression, where the original
measured responses are proportions with 0 < y < 1, or percentages
with 0 < y < 100, and it is wished to fit a line, quadratic or
higher degree polynomial to transformed data by least squares, e.g. for data
smoothing or calibration. In this case the sequence would be
Y = log[y/(1 - y)] or log[y/(100 - y)]
X = x or log(x)
followed by fitting the model
Y = p(0) + p(1)X + p(2)X^2 + p(3)X^3 + ... + p(6)X^6.
It can be useful when the y-values do not correspond
to a binomial distribution, e.g. due to overdispersion, so that generalized
linear model fitting does not correspond to maximum likelihood and may be
no more justified than least squares fitting.
The idea is that
you read in a two column (x,y) data sample and the program first transforms
the y data to Y = f(y), e.g. to log[y/(1 - y)], and possibly also the
x values to X = g(x),
e.g. to log[x], before fitting a sequence of polynomials up to degree 6.
This is often done when the y values are proportions (between 0 and 1)
and it is wished to fit a smooth S-shaped curve.
If the y-values supplied are percentages rather than proportions, they
can be normalised by dividing by 100, and either natural or base ten
logarithms can be used.
Whatever weights are supplied, the program will always set all s = 1
and do unweighted regression, since transforming weights can lead
to problems with these types of transformed regressions.
Try fitting logistic.tf1, using log[y/(1-y)] as a function of log[x],
to get the idea. However note that, whatever transformation is chosen, the
symbols x and y after the fitting will always represent the transformed variables and
not the original variables in subsequent tables and plots.
Also note that, for true logistic regression with possibly
several variables, you should select the generalized linear model
option (GLM) and choose binomial errors with a logistic link function.
Polynomials can also be fitted by GLM if the independent variable x is used
to form other independent variables, e.g. x_1 = x, x_2 = x^2, x_3 = x^3,
and so on.
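The transformation step described above can be sketched as follows; this is an illustration of the Y = log[y/(1 - y)], X = log(x) recipe, not Simfit's code, and the function name and keyword arguments are hypothetical:

```python
import math

def logit_transform(x, y, percentages=False, log_x=False, base10=False):
    """Transform (x, y) data before polynomial fitting:
    Y = log[y/(1 - y)] for proportions 0 < y < 1 (percentages are
    first divided by 100), and optionally X = log(x).
    Natural or base ten logarithms can be selected."""
    log = math.log10 if base10 else math.log
    if percentages:
        y = [yi / 100.0 for yi in y]        # normalise 0 < y < 100 to 0 < y < 1
    Y = [log(yi / (1.0 - yi)) for yi in y]
    X = [log(xi) for xi in x] if log_x else list(x)
    return X, Y
```

The transformed (X, Y) data can then be fitted with any of the unweighted line or polynomial options above.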
Multilinear regression
This technique is used when a dependent variable (y) depends on several
independent variables (x_1, x_2, ..., x_n) in a linear manner, as in
y = p(0) + p(1)x_1 + p(2)x_2 + ... + p(n)x_n
and the experimental
errors are independent and normally distributed. Often the error variance
is not constant, so weighting factors (s) can be supplied to calculate
weights for fitting according to w = 1/s^2.
Try the test file linfit.tf1 (which is singular) or linfit.tf2 (which
has full rank) to see how SVD is used for problems with reduced
rank. For full details of how to use this option or how to do
GLM or robust fitting consult program LINFIT. Note that the order
of the data columns is x_1, x_2, x_3, ..., x_n, y, s but you can de-select
columns or weights if required. To include categorical variables, i.e.
qualitative variables or factors with several levels,
dummy indicator variables can be defined, e.g. to fit
allowing for male or female as a covariate you could set x_1 = 1,
x_2 = 0 for male and x_1 = 0, x_2 = 1 for female, etc. Of course
x_1 and x_2 should not both be included in the regression at the
same time, as
the model would be overparameterised if a constant term were to be
fitted as well as x_1 and x_2. It does not matter which
of x_1 or x_2 is included as long as they are not both included at
the same time.
In general, a factor with m levels can be included in the regression
by defining m (0,1) dummy indicator variables and then suppressing
one, so that only m - 1 independent indicator variables are included.
Otherwise this approach leads to rank-deficient systems where only
certain functions of the parameters can be estimated but not all the
parameters individually, i.e. aliasing.
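The dummy-variable coding just described can be sketched as follows; this is illustrative only (the function name is hypothetical):

```python
def dummy_variables(levels, drop_first=True):
    """Code a factor as (0,1) indicator columns for use as extra
    independent variables in multilinear regression.
    With drop_first=True one level is suppressed, so a factor with
    m levels yields m - 1 columns and aliasing with the constant
    term is avoided."""
    cats = sorted(set(levels))
    kept = cats[1:] if drop_first else cats
    return [[1 if lv == c else 0 for c in kept] for lv in levels]
```

For example, a male/female covariate becomes a single (0,1) column rather than two mutually exclusive columns.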
Dose response curves, EC50 and LD50
Where dichotomous data sets have to be analysed as a function of
some variable in order to construct dose response curves and then
estimate appropriate percentiles, like the median
effective dose, a simple version of the Simfit GLM interface
is provided to do probit
analysis, logistic, or log-log regression using generalized linear models.
The procedure requires values for the number of successes (y) in a
number of independent Bernoulli trials (N) at settings for an
independent variable (x).
The data format can have columns in the order (y, N, x) as in
analysis of proportions and test file ld50.tf1, or
in the order (x, y, N, s) i.e. the format for generalized linear models,
and test file ld50.tf2.
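As an illustration of how such a percentile follows from the fitted model: if logistic regression on log-dose gives logit(p) = a + b*log(x), the median effective dose is the x where p = 0.5, i.e. where the linear predictor is zero. This assumes that particular link and covariate choice, and is not Simfit's code:

```python
import math

def ld50_from_logit(a, b):
    """For a fitted model logit(p) = a + b*log(x), the dose giving
    p = 0.5 satisfies a + b*log(x) = 0, so LD50 = exp(-a/b).
    Assumes b != 0."""
    return math.exp(-a / b)
```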
Logistic regression
This option provides access to a sub-set of the Simfit GLM techniques
that are used for various types of logistic regression.
Cox regression
This fits the Cox proportional hazards model to survival analysis
data of the form
x1, x2,..., xm, y, t, s
where x1 to xm are covariates, y is 0 for failure or 1 for
right censoring, t is the observed failure time, and s is the stratum
indicator. Test files are cox.tf1, cox.tf2, cox.tf3, which
have only one stratum, and cox.tf4 which has data for several strata.
Comparing two parameter estimates
This technique is used to compare parameters that have been estimated
by regression,
or have been derived from such parameters. Examples would be two LD50
or EC50 estimates from dose response curves, or two AUC estimates
from exponential fitting. The routine accepts two means and standard
errors of means, then does a t test corrected for unequal variances.
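The t test corrected for unequal variances is Welch's test, whose statistic and approximate degrees of freedom can be sketched as below. Sample sizes are needed for the degrees of freedom; this is an illustration, not Simfit's code:

```python
import math

def welch_t(m1, sem1, n1, m2, sem2, n2):
    """Welch's t statistic and Welch-Satterthwaite approximate
    degrees of freedom for comparing two estimates, given the
    means, standard errors of the means, and sample sizes."""
    v1, v2 = sem1 ** 2, sem2 ** 2          # squared standard errors
    t = (m1 - m2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df
```

The p-value then comes from the t distribution with df degrees of freedom.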
Comparing two sets of regression parameters
After fitting models, Simfit offers the opportunity to store vectors
of parameters and covariance matrices to your archive project file
c_recent.cfg. These can be used to calculate the Mahalanobis distance
between two parameter vectors, so that sets of regression parameters
can be conveniently tested for equality. Examples would be fitting
the same growth model to different cell cultures to quantify
differences in growth profiles, or comparing pharmacokinetic
profiles for the same exponential model fitted to different populations.
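The core quantity here is the squared Mahalanobis distance d^T C^{-1} d between the two parameter vectors, where d is their difference and C a pooled covariance matrix. A minimal sketch of that calculation (illustrative only; the exact statistic Simfit reports may differ):

```python
def mahalanobis_sq(d, C):
    """Squared Mahalanobis distance d^T C^{-1} d, computed by solving
    C z = d with Gaussian elimination and returning d . z.
    C is assumed to be a nonsingular (pooled) covariance matrix."""
    n = len(d)
    A = [row[:] + [di] for row, di in zip(C, d)]   # augmented matrix [C | d]
    for k in range(n):
        piv = max(range(k, n), key=lambda r: abs(A[r][k]))
        A[k], A[piv] = A[piv], A[k]
        for r in range(k + 1, n):
            f = A[r][k] / A[k][k]
            for c in range(k, n + 1):
                A[r][c] -= f * A[k][c]
    z = [0.0] * n
    for k in range(n - 1, -1, -1):
        z[k] = (A[k][n] - sum(A[k][c] * z[c] for c in range(k + 1, n))) / A[k][k]
    return sum(di * zi for di, zi in zip(d, z))
```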
Back to Help Menu or End Help