It is frequently necessary to fit a smooth curve to x,y data in order to plot a graph, or to estimate the derivatives of or the area under a best-fit curve. It is often also useful to fit two related sets of data to see if there is a statistically significant change in response to an alteration in experimental conditions, for instance before and after some treatment. If there is good reason to believe that some deterministic equation can be fitted to the data then there is a preferred method of analysis: the equation should be fitted by weighted least squares, and the best-fit parameters examined for a statistically significant change.
For instance, compartmental elimination data can be fitted by EXFIT, enzyme kinetic data by RFFIT or MMFIT, ligand binding data by SFFIT or HLFIT, growth data by GCFIT, and so on.
This program estimates derivatives and areas by constrained weighted least squares spline fitting when there are no simple equations to fit, or when the data are too sparse or too noisy to justify equation-fitting.
If your data are fairly dense and accurate, the best type of smoothing functions are weighted least squares cubic splines under tension, where the fit is controlled by varying the knot positions, and the compromise between underfitting and overfitting is struck by varying the tension. The difficulty is that, if your data are sparse, noisy, or have horizontal asymptotes, splines are too flexible and may lead to undulating best-fit curves. This program first fits a cubic (an underfit) and then allows you to choose the degree of smoothness required by varying the weighted sum of squares WSSQ from the value obtained for this cubic down to a value of zero (an overfit). It is up to you to vary this smoothing factor (i.e. WSSQ) until the displayed fit and the analysis of residuals are satisfactory. If the errors are normally distributed about zero and the weighting values s are accurate, then WSSQ is approximately chi-square distributed, so WSSQ should be of the order of the number of observations, n say. Under these circumstances, values of WSSQ much less than n suggest overfitting, while values much greater than n suggest underfitting.
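This control of the fit by a target WSSQ can be sketched with scipy's smoothing splines, where the smoothing factor s is an upper bound on the weighted sum of squares; the data below are hypothetical and this is only an illustration, not the program's own algorithm.

    import numpy as np
    from scipy.interpolate import UnivariateSpline

    # Hypothetical data: x increasing, y noisy, sd = estimated std. dev. of y
    x = np.linspace(0.0, 10.0, 25)
    sd = np.full_like(x, 0.2)
    y = np.tanh(x - 5.0) + np.random.default_rng(0).normal(0.0, sd)

    n = len(x)
    # With weights w = 1/sd, FITPACK adds knots until the weighted sum of
    # squared residuals falls below the smoothing factor s, so s plays the
    # role of the target WSSQ: s of order n balances the fit, s >> n
    # underfits, and s << n overfits.
    for wssq in (5.0 * n, 1.0 * n, n / 5.0):
        spline = UnivariateSpline(x, y, w=1.0 / sd, s=wssq)
        print(f"target WSSQ = {wssq:6.1f}, achieved WSSQ = {spline.get_residual():6.2f}")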
This program will estimate the area under the best-fit spline and compare it with the area obtained by joining up the y-values (i.e. the trapezoidal rule). If you want to compare two best-fit curves, the program finds a range of overlap for the x-values and a baseline such that all best-fit y-values are positive. The best-fit curves are then compared in this window, using a variety of parameters to estimate the differences.
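The area comparison can be sketched in the same spirit (again with hypothetical data; np.trapezoid is np.trapz in older NumPy):

    import numpy as np
    from scipy.interpolate import UnivariateSpline

    x = np.linspace(0.0, 10.0, 25)
    sd = np.full_like(x, 0.2)
    y = np.tanh(x - 5.0) + np.random.default_rng(0).normal(0.0, sd)

    spline = UnivariateSpline(x, y, w=1.0 / sd, s=len(x))  # target WSSQ ~ n
    area_spline = spline.integral(x[0], x[-1])  # area under best-fit spline
    area_trapz = np.trapezoid(y, x)             # joining up the y-values
    print(f"spline area = {area_spline:.4f}, trapezoidal = {area_trapz:.4f}")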
x is the independent variable, y is the dependent variable, and s is the estimated standard deviation of y (or 1). The possible situations are as follows.
You have at least two replicates at each fixed x-value.
Enter all data with x in increasing order, individual
y-values, and s = 1. The program calculates mean-y with 95%
confidence limits from the replicates and a t-distribution.
This is the recommended way to use the program.
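For this first situation, the calculation of mean-y with 95% confidence limits can be sketched as follows (the replicate values are illustrative, and this is not the program's own code):

    import numpy as np
    from scipy import stats

    reps = np.array([4.8, 5.1, 5.0, 4.7, 5.3])  # replicates at one x-value
    m = len(reps)
    mean_y = reps.mean()
    sem = reps.std(ddof=1) / np.sqrt(m)          # std. err. of mean-y
    tcrit = stats.t.ppf(0.975, df=m - 1)         # two-sided 95% t-value
    print(f"mean-y = {mean_y:.3f}, 95% limits = "
          f"{mean_y - tcrit*sem:.3f} to {mean_y + tcrit*sem:.3f}")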
You have single y-values at some or all fixed x-values.
Input all x, y, and s = est. std. dev. of y (or s = 1); you
can then use an independent estimate of the coefficient of
variation [cv% = 100(sample std. dev.)/|mean|] to calculate
weights and error bars from within the program.
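For this second situation, weights and error bars could be formed from a cv% as in the following sketch (illustrative values and names):

    import numpy as np

    y = np.array([0.9, 2.1, 3.8, 7.2, 11.0])   # single y-values
    cv_percent = 8.0                            # independent cv% estimate
    s = (cv_percent / 100.0) * np.abs(y)        # implied std. deviations
    w = 1.0 / s**2                              # weighted least squares weights
    print("error bars (+/- s):", s)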
You have mean-y and standard errors of mean-y at each x.
The program will accept mean-y with s = 1, s = std. err. of
mean-y, or s = sample standard deviation, but you will be
asked for the number of replicates used to calculate mean-y.
The program works internally with mean-y, weights = 1/s^2
(where s = est. std. err. mean-y), but there are several
options for plotting error bars.
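For this third situation, the internal weighting can be sketched as follows, assuming sample standard deviations and replicate counts are supplied (illustrative values):

    import numpy as np

    mean_y = np.array([1.2, 3.4, 5.9, 8.1])        # mean-y at each x
    sample_sd = np.array([0.15, 0.20, 0.25, 0.30]) # sample std. deviations
    n_rep = np.array([4, 4, 6, 5])                 # replicates per mean

    sem = sample_sd / np.sqrt(n_rep)  # est. std. err. of mean-y
    w = 1.0 / sem**2                  # internal weights = 1/s^2
    print("weights:", w)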
When fitting a known deterministic equation it is sometimes best to have the minimum number of distinct x-values and the maximum number of y-replicates, in order to minimise the variance of the parameter estimates. With data smoothing, it is always best to have the maximum number of x-values, especially at horizontal asymptotes and at points where y is changing rapidly, to anchor down the best-fit curve. Unfortunately, if the number of replicates is less than 5, this leads to very noisy sample estimates of variance. With data smoothing this is not so important, but it does lead to uneven weighting and a large variation in error bars.
For more even weighting, you can get the program to calculate weights from the cv%. This may give better curves, but it could lead to dishonest error bars. You can display error bars for 95% confidence limits on means, or for 2 or more sample standard deviations, which gives an approximate 95% confidence limit on the range of y-values. If you have widely differing numbers of y-replicates, you can input mean-y and the standard error of mean-y, and supply the average number of replicates when asked.
Suppose you have two curves f(x) and g(x) defined on a range
X_min <= x <= X_max, with the following three properties:
(a) f(x) >= 0,
(b) g(x) >= 0, and
(c) f(x) never equals g(x).
Then one useful measure of the percentage difference between
the two curves, PD% say, could be
PD% = 100*|area_f(x) - area_g(x)|/(area_f(x) + area_g(x)).
This becomes misleading under the following circumstances:
1) Curves f(x) and g(x) are not defined over the same range.
2) There are points in the range where f(x) < 0 or g(x) < 0.
3) There are points in the range where f(x) intersects g(x).
4) The curves are disjoint over some of the range.
For example, f(x) and g(x) defined as
f(x) = 1 - x, for 0 <= x <= 1, = 0 otherwise, and
g(x) = x - 1, for 1 <= x <= 2, = 0 otherwise,
could not be more dissimilar over 0 <= x <= 2, yet PD% = 0.
COMPARE does estimate PD% by integrating the best-fit spline curves over the individual ranges of x defined by the data. However, to accommodate such singular cases it also creates a window with an x-range equal to the range of overlap in x and with a zero baseline, then estimates the absolute difference between the curves as the integral of |f(x) - g(x)| over the common range of x. Finally, another window is created such that all function values are nonnegative. Percentage differences in such windows may be found to be more useful and robust than PD%.
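The triangle example above can be checked numerically; the sketch below (not COMPARE's own code) contrasts PD% with the integral of |f(x) - g(x)| over the common range:

    from scipy.integrate import quad

    def f(x):  # triangle on [0, 1]
        return 1.0 - x if 0.0 <= x <= 1.0 else 0.0

    def g(x):  # triangle on [1, 2]
        return x - 1.0 if 1.0 <= x <= 2.0 else 0.0

    area_f, _ = quad(f, 0.0, 2.0, points=[1.0])
    area_g, _ = quad(g, 0.0, 2.0, points=[1.0])
    pd = 100.0 * abs(area_f - area_g) / (area_f + area_g)
    abs_diff, _ = quad(lambda x: abs(f(x) - g(x)), 0.0, 2.0, points=[1.0])
    print(f"PD% = {pd:.1f}")                        # 0.0, despite disjoint curves
    print(f"integral of |f - g| = {abs_diff:.3f}")  # 1.0, exposing the difference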