LINFIT: help and advice

Consult the reference manual for further details and worked examples.
W.G.Bardsley, University of Manchester, U.K.
Introduction

Simple linear regression, i.e. fitting the model

     y = Constant + Bx
by weighted least squares can be explored using the test file line.tf1, while simple weighted m-dimensional multiple linear regression using the model
     y = Constant + B(1)x(1) + B(2)x(2) + ... + B(m)x(m)
can be explored using the test file linfit.tf2.
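
As an illustration only (this is not LINFIT itself), a minimal Python sketch of fitting y = Constant + Bx by weighted least squares, using weights w = 1/s^2 as described under Multiple Linear Regression below, is given here. The data values are hypothetical.

    # Minimal sketch: weighted least squares fit of y = Constant + B*x
    # with weights w = 1/s^2. Data values are hypothetical.
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
    s = np.array([0.1, 0.1, 0.2, 0.2, 0.3])    # standard deviations of y

    w = 1.0 / s**2                              # least-squares weights
    X = np.column_stack([np.ones_like(x), x])   # design matrix [1, x]

    # Solve the weighted problem by scaling rows with sqrt(w)
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    constant, slope = coef
    print(f"Constant = {constant:.4f}, B = {slope:.4f}")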

However, experienced users can request more advanced options, including reduced major axis and major axis regression (e.g. for use in allometry) and correlation analysis. Also, where the data are noisier than would be expected from normally distributed errors, several techniques for robust regression are available. With all multiple regressions, users can decide interactively which variables to include in and which to exclude from the regression.

Generalized linear models (GLM) are also provided (e.g. logistic regression for binomial errors, as with proportions, or log-linear models for Poisson errors, as with counts), where an error type is specified and an appropriate link function selected.
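
For orientation only, here is a sketch of a GLM of this kind (binomial error with the default logit link, i.e. logistic regression on proportions) using the statsmodels package rather than LINFIT; the counts are hypothetical.

    # Illustrative sketch (statsmodels, not LINFIT): GLM with binomial
    # error and logit link, fitted to hypothetical proportion data.
    import numpy as np
    import statsmodels.api as sm

    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    successes = np.array([1, 3, 6, 8, 9])
    trials = np.array([10, 10, 10, 10, 10])

    X = sm.add_constant(x)                          # Constant + B*x on the link scale
    y = np.column_stack([successes, trials - successes])

    model = sm.GLM(y, X, family=sm.families.Binomial())  # logit link by default
    result = model.fit()
    print(result.params)                            # estimated Constant and B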

The models and functions available in this program

  1. Weighted or unweighted least squares straight line
  2. Weighted or unweighted reduced major axis line (nonlinear regression)
  3. Weighted or unweighted major axis line (nonlinear regression)
  4. Weighted or unweighted lines and simple calibration
  5. Weighted or unweighted lines, correlation analysis and advanced calibration options
  6. Weighted or unweighted polynomial fitting and calibration
  7. Weighted or unweighted transformed polynomial F(y) = P(G(x)) fitting and calibration
  8. Multilinear L_1 norm (iterative, weights not used)
  9. Multilinear L_2 norm (SVD if singular, weights can be used)
  10. Multilinear L_infinity norm (iterative, weights not used)
  11. Multilinear robust regression (M-estimates, weights calculated)
  12. Generalized Linear Models in addition to transformations
  13. Transformation of any data column: x(1), x(2), ..., x(m), y or s
  14. Selecting or suppressing any sub-set of independent variables in the regression equation
  15. Partial least squares (PLS)

Multiple Linear Regression

In this program the model is formulated as follows

     y = Constant + B(1)x(1) + B(2)x(2) +  ... + B(m)x(m),
where you have the option to include or exclude the constant term, or any subset of the variables x(i). Data must be prepared as in the files linfit.tf? with columns x(1), x(2), ..., x(m), y, s [where y is the measured response at settings x(1) through x(m), and s is the standard deviation of y, or all s = 1]. If all the s values are set equal to 1 then regression will be unweighted but, if any s is too small to use as a weight (i.e. < 1.0E-20), that y-measurement will not be used. Otherwise least squares regressions use weights w = 1/s^2 (but not the L_1, L_infinity, or M-estimate techniques).

If the problem is singular, then only certain combinations of parameters can be estimated, and singular value decomposition is used. You will then be warned (as with the test file linfit.tf1) that the problem does not have full rank, and you must be careful not to overinterpret the results, as the parameters will not be uniquely defined. In the reduced rank case, you should suppress the least important variables until a full rank solution with unambiguous parameter estimates can be obtained.

If you want to fit lines or polynomials for data smoothing, graph plotting or calibration, you should use programs POLNOM, CALCURVE or SPLINE, not LINFIT. Another way to fit polynomials is to use tricks, such as
     x(2) = x(1)^2
     x(3) = x(1)^3, etc.
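
A minimal sketch of the weighted multilinear fit just described (weights w = 1/s^2, rows with s < 1.0E-20 dropped, an SVD-based solver whose reported rank reveals a singular problem) is given below for illustration; the data array is hypothetical and merely stands in for a linfit.tf? style file.

    # Sketch: weighted multilinear least squares with rank diagnosis.
    # Columns are x(1), x(2), y, s; the numbers are hypothetical.
    import numpy as np

    data = np.array([
        [1.0, 2.0,  5.1, 0.1],
        [2.0, 1.0,  4.9, 0.1],
        [3.0, 4.0, 11.2, 0.2],
        [4.0, 3.0, 10.8, 0.2],
        [5.0, 5.0, 15.1, 0.3],
    ])
    X_raw, y, s = data[:, :-2], data[:, -2], data[:, -1]

    keep = s >= 1.0e-20                              # discard unusably small s values
    X_raw, y, s = X_raw[keep], y[keep], s[keep]
    w = 1.0 / s**2                                   # all s = 1 gives an unweighted fit

    X = np.column_stack([np.ones(len(y)), X_raw])    # include the constant term
    sw = np.sqrt(w)
    coef, _, rank, _ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)

    if rank < X.shape[1]:
        print("Warning: problem does not have full rank; parameters not unique")
    print("Estimates (Constant, B(1), ..., B(m)):", coef)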

Prepare and edit your data files using programs MAKMAT and EDITMT.

Choosing the best model for y = f(x(1),x(2),...,x(m))

Weighted least squares linear regression is one technique where the results are unambiguous (in the full rank case) and many options exist to assess goodness of fit. Sadly, it is almost never valid in science, where the models are invariably nonlinear. Nevertheless, you may be in situations where you want to assess the relative importance of numerous effects which are approximately linear over a restricted range, or where you have arbitrarily assigned values (e.g. x = 0 for absence, x = 1 for presence of an effector). You can examine such relative effects by including or excluding variables interactively and observing the resulting changes in goodness of fit.

If you find yourself having to resort to all-subsets regression, it probably indicates desperation and/or a badly designed experiment. If you try to use variables like growth against time (usually sigmoidal?), or density of a species against distance from a source (perhaps an inverse 1/x relationship?), you should probably be using data transformation or nonlinear regression, not simple linear regression. Unfortunately, goodness of fit is not easily visualised with several variables, and many succumb to the temptation to accept ill-fitting or over-determined linear models.

Robust fitting: Using the L_p norm and M-estimates

Outliers are rogue points with rather more extreme errors than would be expected from a normal distribution, and they can lead to biased fitting if least squares regression is used uncritically. Robust fitting attempts to minimise the effect of outliers by special weighting techniques, or by using alternatives to least squares. The p-norm of a vector v = (v(1), v(2), ..., v(m)) is defined as

 L_p = {sum |v(i)|^p}^(1/p) for i = 1, 2, ..., m and p = 1, 2, ..., infinity,
and so least squares corresponds to minimising in the L_2 norm, while minimising the sum of absolute values is known as L_1 norm fitting, and minimising the largest absolute residual is minimising in the L_infinity norm. The following should be noted.
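
To make the distinction concrete, here is a sketch (using scipy's Nelder-Mead simplex, not the iterative algorithms used by LINFIT) of fitting the same hypothetical straight-line data, containing one deliberate outlier, in each of the three norms.

    # Sketch: fit y = Constant + B*x by minimising the residuals in the
    # L_1, L_2 and L_infinity norms. Data are hypothetical; the last
    # point is a deliberate outlier.
    import numpy as np
    from scipy.optimize import minimize

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.0, 4.1, 5.9, 8.2, 9.8, 20.0])    # last point is an outlier

    def residuals(theta):
        constant, slope = theta
        return y - (constant + slope * x)

    def fit(norm):
        return minimize(lambda t: norm(residuals(t)), x0=[0.0, 1.0],
                        method="Nelder-Mead").x

    l1   = fit(lambda r: np.sum(np.abs(r)))     # L_1: sum of absolute residuals
    l2   = fit(lambda r: np.sum(r**2))          # L_2: ordinary least squares
    linf = fit(lambda r: np.max(np.abs(r)))     # L_infinity: largest absolute residual

    print("L_1   fit:", l1)    # least affected by the outlier
    print("L_2   fit:", l2)
    print("L_inf fit:", linf)  # most affected by the outlier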

Using the L_1 norm

With normal errors, weighted least squares is equivalent to maximum likelihood, so parameter standard errors can be estimated and F tests used. However, since errors are unlikely to be bi-exponential, parameter standard errors are not estimated in this program after L_1 norm fitting.

Using the L_infinity norm

In certain extreme cases it is advantageous to fit using the infinity norm, i.e. minimising the largest absolute residual. This program will minimise in this norm in exactly the same way as with the L_1 norm but, again, standard errors are not given, since errors do not normally follow a uniform distribution.

Using M-estimates

This technique proceeds by down-weighting extreme observations in an attempt to minimise the contribution of outliers to the regression. Unfortunately there is no unique way to do this and there are many alternative ways to define the weightings and to formulate a satisfactory objective function. Simfit provides all the most useful options but, to understand them, you must consult the theory and worked example in the Simfit reference manual.
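
As one illustration of the general idea of down-weighting extreme residuals (and only that: this is not a description of the specific Simfit options, which are covered in the reference manual), here is a sketch of a common M-estimate scheme, Huber weights fitted by iteratively reweighted least squares.

    # Sketch of a Huber M-estimate via iteratively reweighted least squares.
    # Illustrative only; not the Simfit implementation.
    import numpy as np

    def huber_irls(X, y, k=1.345, n_iter=25):
        """Fit y ~ X by IRLS with Huber weights; X includes a constant column."""
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)          # least-squares start
        for _ in range(n_iter):
            r = y - X @ beta
            scale = np.median(np.abs(r - np.median(r))) / 0.6745   # robust scale (MAD)
            u = np.abs(r) / max(scale, 1e-12)
            w = np.minimum(1.0, k / np.maximum(u, 1e-12))     # Huber weighting function
            sw = np.sqrt(w)
            beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
        return beta

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.0, 4.1, 5.9, 8.2, 9.8, 20.0])             # hypothetical, with an outlier
    X = np.column_stack([np.ones_like(x), x])
    print("Huber M-estimate:", huber_irls(X, y))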

Which norm should be used?

Probably you should not be doing multilinear regression at all, since scientific experiments are almost invariably nonlinear. Multilinear regression is only useful with very noisy data of limited range, or where experimentalists are not able to develop a meaningful nonlinear mathematical model. If the data do not seem to have extreme values and error appears to be normally distributed, use the L_2 norm (least squares). With more noisy data try the L_1 norm (minimise the sum of absolute deviations) or the L_infinity norm (Chebyshev norm or minimising the largest absolute deviation) to minimise the effects of outliers. Robust regression leading to M-estimates requires many decisions as to control parameter settings and should only be used if you know exactly which techniques to use for your particular problem.

Partial least squares (PLS)

The relationship between an observed matrix Y (n by p) and a training set matrix X (n by q) is approximated by projection onto a sub-space of dimension less than q. Various techniques are provided for deciding on goodness of fit and a reasonable sub-space dimension, then a new test matrix Z (n by q) is supplied and used to predict a supposed Y matrix (n by p) of hypothetical responses.
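
For illustration only (using scikit-learn rather than LINFIT), the following sketch projects the relationship between Y (n by p) and X (n by q) onto a sub-space of dimension less than q, reports one measure of goodness of fit, and then predicts responses for a new test matrix Z (n by q); all the matrices are random, purely to show the shapes involved.

    # Sketch of partial least squares (PLS) with scikit-learn.
    # Matrices are random and hypothetical, to illustrate dimensions only.
    import numpy as np
    from sklearn.cross_decomposition import PLSRegression

    rng = np.random.default_rng(0)
    n, q, p = 20, 6, 2
    X = rng.normal(size=(n, q))                      # training matrix (n by q)
    Y = X[:, :2] @ rng.normal(size=(2, p)) + 0.1 * rng.normal(size=(n, p))

    pls = PLSRegression(n_components=3)              # chosen sub-space dimension < q
    pls.fit(X, Y)
    print("R^2 on training data:", pls.score(X, Y))  # one measure of goodness of fit

    Z = rng.normal(size=(n, q))                      # new test matrix (n by q)
    Y_pred = pls.predict(Z)                          # predicted responses (n by p)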