Cross-validatory Selection of Test and Validation Sets in Multivariate Calibration and Neural Networks as Applied to Spectroscopy

 

Authors: Frank R. Burden, Richard G. Brereton and Peter T. Walsh

 

Journal: Analyst (RSC, available online 1997)
Volume/Issue: Volume 122, Issue 10

Pages: 1015-1022

 

ISSN: 0003-2654

 

Year: 1997

 

DOI: 10.1039/a703565i

 

Publisher: RSC

 

Data source: RSC

 


Frank R. Burden,a,b Richard G. Breretonb and Peter T. Walshc

a Chemistry Department, Monash University, Clayton, Victoria, Australia 3168
b School of Chemistry, University of Bristol, Cantock's Close, Bristol, UK BS8 1TS
c Health and Safety Laboratory, Health and Safety Executive, Broad Lane, Sheffield, UK S3 7HQ

Cross-validated and non-cross-validated regression models using principal component regression (PCR), partial least squares (PLS) and artificial neural networks (ANN) have been used to relate the concentrations of polycyclic aromatic hydrocarbon (PAH) pollutants to the electronic absorption spectra of coal tar pitch volatiles. The different trends in the cross-validated and non-cross-validated results are discussed, as well as a method for the production of a truly cross-validated neural network regression model. It is shown that the methods must be compared through the errors produced in the validation sets as well as those given for the final model. Various methods for the calculation of errors are described and compared. The separation of training, validation and test sets into fully independent groups is emphasized. PLS outperforms PCR on all indicators. ANNs are inferior to the multivariate techniques for individual compounds but are reasonably effective in predicting the sum of PAHs in the mixture set.

Keywords: Polycyclic aromatic hydrocarbons; chemometrics; neural networks; regression

In a previous paper the concentrations of polycyclic aromatic hydrocarbons (PAHs) in coal tar pitch volatile sources obtained by the Health and Safety Executive (HSE)1 were reported. The choice of PAHs in this paper is for illustration only: it does not imply any HSE acceptance or approval of the list or method. It was shown that multivariate methods for regression, such as partial least squares (PLS), were superior to univariate single-wavelength calibration for the prediction of concentrations. In this paper we apply principal components regression (PCR), PLS2-8 and artificial neural networks (ANNs)9-13 to the estimation of the concentrations of PAHs in an experimental dataset, by calibrating known concentrations estimated by GC-MS to electronic absorption spectra (EAS).

This paper discusses cross-validation.14-18 There is surprisingly limited discussion of cross-validation in the chemometrics literature, except in the form of algorithmic descriptions, of which there are numerous. There are two different reasons for cross-validation, which are often confused. The first is to find out how many components (or, in the case of neural networks, iterations) are necessary for an adequate model. The second is to compare the effectiveness of different models. The use of test sets is very common in classification studies, for example, where a model is formed on one group and the predictions tested on another group. A good way of testing the effectiveness of a method is to leave each possible group out in turn, until all groups have been left out once, and average the quality of the predictions. Some investigators use methods that leave one sample out at a time, but for large datasets this can be prohibitive. This paper concentrates on methods that leave a group of samples out at a time.
There is much literature on neural networks in chemistry, and many investigators want to compare these with conventional chemometric approaches, e.g., for classification or calibration. However, a common approach must be found for method comparison, and there are substantial problems here. Most common methods for neural networks remove a set of samples to test the convergence of a model (called the test set in this paper). It is then unfair to use this set of samples for validation, as the network is trained to minimise the error on these samples. A third, independent group must therefore be selected for validating the model. If a 'leave one sample out at a time' approach is adopted there will be difficulties, for the following reasons. First, the model obtained on a single test sample will depend on that sample not being an outlier. A minimum test set size of around four is recommended. Second, the number of calculations will be large: for 100 samples, 9 900 computations of the network would be needed if every possible combination of single samples were tested. Using too few test sets can result in quite unrepresentative models and analyses of errors. Hence, methods for cross-validation that include several samples in each group must be employed. In this paper the previously published PAH dataset1 is used to demonstrate a number of methods for cross-validation and error analysis.

Method

The PAH data set consisted of the EAS of 32 samples taken at 181 wavelengths at 1 nm intervals from 220 to 400 nm inclusive, forming the X matrix. The concentrations of 13 PAHs in each of the samples had previously been measured by GC-MS, and constitute the y data. In this paper, the estimated concentrations of (a) anthracene, (b) fluoranthene and (c) the sum of the concentrations of the 13 PAHs were analysed, each in turn forming a y vector of length 32. Similar conclusions can be obtained for the other PAHs in the mixture, but the dataset described in this paper has been reduced for the sake of brevity. A concentration summary is given in Table 1. Further experimental details are reported elsewhere,1 including more information on the overall dataset.

Table 1 Concentration summary (concentration/mg ml⁻¹)

          Anthracene   Fluoranthene   Total detectable PAHs
Mean      19.8         160.2          903.8
s         10.8         64.4           375.6

All of the calculations were carried out using MATLAB routines written by the authors, but making use of the well-tested toolbox published by B. Wise and the Mathworks Neural Network Toolbox.19 All the algorithms were validated against other sources, including in-house software in C and Visual Basic and the Propagator neural network package.

Multivariate Methods

Principal component regression

Since the number of wavelengths far exceeds the number of samples, it is normal to perform data reduction prior to regression, both to reduce the size of the problem and to remove collinearities in the dataset, so PCR is preferred to multiple linear regression, which is not discussed in this paper. Principal components analysis was performed on the centred but unstandardized data. After six components have been computed, 99.9997% of the variance has been explained, so it was decided to keep only the first six components. Similar conclusions are reached for anthracene and fluoranthene.
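The paper's own calculations were carried out in MATLAB with the toolboxes cited above; purely as an illustration of the data-reduction step just described, a minimal NumPy sketch might look as follows. The function name is ours and the data matrix is a randomly generated stand-in with the paper's dimensions.

```python
import numpy as np

def pca_scores(X, n_components):
    """PCA on centred (but unstandardized) data via SVD, returning the
    scores T, loadings P and the cumulative fraction of variance
    explained, the quantity used above to decide to keep six components."""
    Xc = X - X.mean(axis=0)                      # centre each wavelength
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = np.cumsum(s ** 2) / np.sum(s ** 2)
    T = U[:, :n_components] * s[:n_components]   # scores matrix
    P = Vt[:n_components]                        # loadings matrix
    return T, P, explained

# Hypothetical stand-in: 32 samples x 181 wavelengths, as in the paper.
X = np.random.rand(32, 181)
T, P, explained = pca_scores(X, n_components=6)
print(f"variance explained by six PCs: {explained[5]:.4%}")
```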
Note that the main aim of this paper is to compare various approaches to cross-validation and the calculation of errors, and not how cross-validation can be employed to determine the optimum number of components. If T is the matrix consisting of the scores of the first six principal components, then the regression coefficients are produced simply via the pseudoinverse as follows

\mathbf{b} = (\mathbf{T}'\mathbf{T})^{-1}\mathbf{T}'\mathbf{y}

so that

\hat{y}_{ij} = \sum_{a=1}^{A} t_{ija} b_{ia} + \bar{y}

where \hat{y}_{ij} is the estimated concentration of the jth sample using the ith method for calculating the concentrations (see below for an extended discussion of this), there are A significant components, and the scores matrix is denoted by T.

Partial least squares

Partial least squares regression can also be employed as an alternative to PCR. In the case discussed here, PLS1, regressing one y variable at a time, was felt to be more suitable than PLS2, which involves regressing all y variables at once, especially when direct comparisons between methods are required. The optimum number of PLS components can be established as follows. The criterion employed is the variance in the y (concentration) predictions rather than in the X or spectral direction. In this way, PLS differs from PCR, in which only the X variance can be used. For the purpose of determining the number of components, the 'leave one out' method of cross-validation is employed, whereby the prediction error is estimated as each sample in turn is removed. The prediction error on the training set will be reduced as more components are computed, whereas the prediction error on the test set shows a minimum and then increases as too many components are employed. It is emphasized that this is a fairly crude approach, and depends on the structure of the data, but it is sufficiently good for determining how many components are useful to describe the data. From the values obtained using the sum of the concentrations of the 13 detected PAHs, a good number of PLS components to choose is six. It is emphasized that the main objective of this paper is the strategy for cross-validation and comparison of models, and details of the dataset analysed in this paper have already been published, so the selection of the number of significant components is not discussed in detail below, for reasons of brevity.

Cross validation and calculation of errors

In the cases described here a 'leave n out' technique for cross-validation is employed. A number of samples (n) are removed from the dataset and a model is estimated from the remaining data; the removed samples are then used to validate the model. Each possible group of samples, with any particular sample included only once, is removed in turn, to provide average estimates of models and errors over the entire data set. It is important that the validation set has no influence on the regression equations. For PCR and PLS, if there are N samples in total and N_v samples are to be left out and used as a validation set, the training set of N_t = N - N_v samples must have any pre-processing, such as mean centring or standardization, applied after the validation samples have been removed; these preprocessing parameters, such as the means and standard deviations of the training set, are then applied to the validation set. Note that this implies that means and standard deviations are calculated each time a validation set is removed from the full data, and that these do not necessarily correspond to the overall statistics; in this way the approach in this paper differs from that employed by some other authors.
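As a concrete illustration of this 'leave n out' scheme with training-set-only preprocessing (including re-doing the PCA on each training set, as noted next), here is a minimal NumPy sketch; the PCR step uses the pseudoinverse exactly as in the equation above, but the function and variable names are our own.

```python
import numpy as np

def leave_n_out_pcr(X, y, n_leave=4, A=6):
    """'Leave n out' PCR: the mean centring and the PCA itself are redone
    on each training set, never on the full data, and the training-set
    statistics are then applied to the left-out samples. Returns a
    runs-by-samples matrix of predictions yhat[i, j]; the sample order
    is assumed to be pre-randomized."""
    N = len(y)
    M = N // n_leave                               # number of validation runs
    yhat = np.zeros((M, N))
    for i in range(M):
        val = np.arange(i * n_leave, (i + 1) * n_leave)
        tr = np.setdiff1d(np.arange(N), val)
        xm, ym = X[tr].mean(axis=0), y[tr].mean()  # TRAINING statistics only
        U, s, Vt = np.linalg.svd(X[tr] - xm, full_matrices=False)
        P = Vt[:A]                                 # loadings from training set
        T = (X[tr] - xm) @ P.T                     # training-set scores
        b = np.linalg.pinv(T) @ (y[tr] - ym)       # b = (T'T)^-1 T'(y - ybar)
        yhat[i] = (X - xm) @ P.T @ b + ym          # predict all N samples
    return yhat
```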
Note that when PCA or PLS is performed, this must also be repeated for each training set in turn and not on the entire dataset. The case of ANNs is more complex and is discussed in more detail below.

The first step is to randomize the order of the data. In many experimental situations, data are presented to an investigator in some form of experimental sequence. Without this first stage, a cross-validation may be performed in a biased manner. Once randomized, the new order of the data is maintained. We assume, in this paper, that the samples are reordered in a random manner, so sample j = 7 is the seventh randomized sample. Note that the random seed in all computations in this paper is identical, so that the order in which the samples are arranged is constant for comparison; obviously the calculation of errors depends in part on the method of ordering, the most important aspect being that the original experimental order is not followed.

In order to include every sample in one validation set, M combinations of training and validation sets were produced, where M = N/N_v and N_v is the number of samples in each of the validation sets. Ideally, N is a multiple of N_v, although other regimes, not reported in this paper, are possible. In the case discussed below, four samples are removed in turn each time cross-validation is performed, implying eight calculations, as illustrated in Fig. 1. Note that removing two samples results in validation sets that are too small, and removing eight samples reduces the number of validation sets to four. For validation run number i (where i varies from 1 to 8), samples j = (i - 1)N_v + 1 to iN_v are removed. The training set consists of the remaining N_t = 28 samples.

For each of the M training/validation sets, a model is produced that predicts the concentration \hat{y}_{ij} for the ith set (i being a number between 1 and 8 in this case) and the jth sample (varying between 1 and 32). Three possible predictions can be made. (i) The predictions for the samples in the validation sets. Note that there will be one prediction per sample, as each sample is a member of only one validation set. (ii) The predictions for the samples in the training sets. For each sample there will be seven predictions. (iii) Predictions for the overall model, taking into account both training and validation samples, resulting in eight predictions per sample. It is common to transform these predictions into errors for the overall dataset, and five possible errors can be calculated.

Fig. 1 Summary of cross-validation for multivariate methods; validation set is shaded.

1. The standard error of prediction for the validation set, given by

\mathrm{SEP_V} = \sqrt{ \sum_{i=1}^{M} \sum_{j=(i-1)N_v+1}^{iN_v} (\hat{y}_{ij} - y_j)^2 \,/\, N }

which, in the case of this paper, is simply the root mean square error of prediction of the validation samples.

2. The standard error for the training set, given by

\mathrm{SEP_{TR}} = \sqrt{ \sum_{i=1}^{M} \left[ \sum_{j=1}^{(i-1)N_v} (\hat{y}_{ij} - y_j)^2 + \sum_{j=iN_v+1}^{N} (\hat{y}_{ij} - y_j)^2 \right] / \left[ N(M-1) \right] }

This is the root mean square error of prediction of the training set samples, each counted seven times, once for each of the M - 1 validation runs in which it belongs to the training set.

3. The overall model error, given by

\mathrm{SEP_M} = \sqrt{ \sum_{i=1}^{M} \sum_{j=1}^{N} (\hat{y}_{ij} - y_j)^2 \,/\, (NM) }

which includes both the training and validation samples. Note that this is not the same as the error calculated by performing regression on the entire dataset (see below).
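Given a runs-by-samples prediction matrix of the kind produced by the sketch above, these three errors might be computed as follows. This is again only a sketch: the mask layout assumes the contiguous validation blocks of Fig. 1.

```python
import numpy as np

def sep_errors(yhat, y, n_leave=4):
    """SEPV, SEPTR and SEPM from the runs-by-samples prediction matrix,
    assuming run i's validation block is samples i*n_leave..(i+1)*n_leave-1
    as in Fig. 1."""
    M, N = yhat.shape
    resid2 = (yhat - y) ** 2                    # squared residuals, all runs
    val = np.zeros((M, N), dtype=bool)
    for i in range(M):
        val[i, i * n_leave:(i + 1) * n_leave] = True
    sepv = np.sqrt(resid2[val].sum() / N)                # each sample once
    septr = np.sqrt(resid2[~val].sum() / (N * (M - 1)))  # M-1 times each
    sepm = np.sqrt(resid2.sum() / (N * M))               # all M predictions
    return sepv, septr, sepm
```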
An additional possibility is to average the values of the predicted data over all validation runs. The average prediction is given by

\bar{\hat{y}}_j = \sum_{i=1}^{M} \hat{y}_{ij} \,/\, M

A variant on this is simply to average the predictions for the training sets, i.e. over the M - 1 runs in which sample j is not in the validation set, to give

\hat{y}^{\,t}_j = \sum_{i \ne i^*(j)} \hat{y}_{ij} \,/\, (M-1)

where i*(j) is the run in which sample j belongs to the validation set. These averaged predictions will result in lower errors, which can be defined as follows.

4. The standard error for the averaged prediction of the training set

\mathrm{SEP_{TRav}} = \sqrt{ \sum_{j=1}^{N} (\hat{y}^{\,t}_j - y_j)^2 \,/\, N }

5. The standard error for the overall averaged model, given by

\mathrm{SEP_{Mav}} = \sqrt{ \sum_{j=1}^{N} (\bar{\hat{y}}_j - y_j)^2 \,/\, N }

Note that, since each validation sample is removed only once, there is no corresponding average error for the validation set.

It is not, of course, necessary to model the overall data using an average of cross-validated models: PCR or PLS can be performed on the entire dataset to give predictions of the form \hat{y}_j, where

\hat{y}_j = \sum_{a=1}^{A} b_a t_{ja} + \bar{y} = (\mathbf{x}_j - \bar{\mathbf{x}}) \mathbf{P}' (\mathbf{P}\mathbf{P}')^{-1} \mathbf{b} + \bar{y}

where \bar{y} is the mean concentration (or y value), \bar{\mathbf{x}} is a vector of the mean absorbances over all wavelengths, and P is the loadings matrix, as is normal for centred data. This error (called the prediction error) is then given by

\mathrm{SEP_P} = \sqrt{ \sum_{j=1}^{N} (\hat{y}_j - y_j)^2 \,/\, N }

It is important to recognise that there are differences between this error and SEP_Mav, as will be evident below.

Artificial Neural Networks

An alternative approach is to use neural networks for calibration. The example in this paper is relatively simple, so only a fairly basic approach is employed. The methods for the calculation of errors can be applied to neural networks of any level of sophistication, but it is not the primary purpose of this paper to optimize the network. The first step is to define the inputs, outputs and number of hidden nodes (the architecture) of the network. A back-propagation9 feed-forward network with one hidden layer was employed, which made use of sigmoidal transfer functions of the form

f(\mathrm{net}) = \frac{1}{1 + e^{-\mathrm{net}}} \quad \text{where} \quad \mathrm{net} = \sum_{p=1}^{P} w_p z_p

and z_p is the input to the given node, with w_p the weights. The neural network program uses the back-propagation algorithm to find the weights. The output is simply the estimated concentration. In order to reduce the size of the problem each concentration was estimated separately, so that all calculations had only one output, to be comparable to the PLS and PCR results, which were performed separately for each compound. For the given problem, it was found that one hidden layer with a single node was optimum. The input layer of principal component scores, together with a bias node, was connected to the hidden node. The hidden node, together with another bias node, was connected to the output node. The bias nodes had no input and delivered an output of 1.

The question of the input to the network is an important one. Using all 181 wavelengths would result in a large number (184) of weights when the bias and hidden nodes are included, which is clearly unjustified by the present data set, consisting of only 32 samples. Hence, the data were first reduced using PCA. As above, only six (linear) principal components were kept as the input to the network. A seventh, bias, node, representing an intercept term and equalling one, was also added to the input, resulting in seven inputs. Hence, there are 7 (input/hidden) + 2 (hidden/output) = 9 weights. This is illustrated in Fig. 2.
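The forward pass of this nine-weight network is small enough to write out in full. A sketch with placeholder weights follows; in the paper the actual weights come from back-propagation training, and the names here are ours.

```python
import numpy as np

def ann_predict(scores, w_in, w_out):
    """Forward pass of the 6-input, single-hidden-node network of Fig. 2:
    six PC scores plus a bias input of 1 feed one sigmoidal hidden node
    (7 weights); the hidden node plus a second bias feed the output node
    (2 more weights), 9 weights in all."""
    z = np.hstack([scores, np.ones((scores.shape[0], 1))])  # append bias node
    net = z @ w_in                        # net = sum_p w_p z_p
    h = 1.0 / (1.0 + np.exp(-net))        # sigmoidal transfer function
    return w_out[0] * h + w_out[1]        # output node with its bias weight

# Placeholder (randomized) weights and hypothetical PC scores.
rng = np.random.default_rng(0)
w_in, w_out = rng.normal(size=7), rng.normal(size=2)
concs = ann_predict(np.zeros((32, 6)), w_in, w_out)
```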
It is common practice when training a neural network to randomize the initial set of weights and then allow the back-propagation algorithm to continuously refine their values in order to reduce the SEP_TR of the network training set. This process is stopped when the error of an independent test set starts to rise (signifying entry into a memorizing domain), and the weights at this test set minimum are retained. If the neural network is run again on the same set of data it is likely (if the input data are not fully independent) to arrive at similar errors with a different, though just as valid, set of weights.

In order to start the neural network, a set of randomized weights is required. The initial randomized set of weights can produce a bad starting point for the back-propagation algorithm to seek a global minimum of the test set error, so it is essential for the cross-validation procedure that a good initial set be found. In the present work this was accomplished by producing many (≈100) initial sets and choosing the best of these. In order to ensure that any repeat calculation found the same starting point, the same seed for the MATLAB random number generator was always used for the first random choice of weights.

Cross validation

The issue of cross-validation is much more complex in the case of neural networks than for normal regression. In the application reported below, the data were divided into three sets. A training set consisting of 24 samples is used to obtain a model. It is important to understand that all preprocessing, such as PCA, is performed on this subset of the data, and not on the overall 32 samples. The remaining samples are then divided into two sets of N_vt = 4 samples each. One, referred to below as the test set, is used to determine when the network has converged. (Note that the use of the term validation to describe the dataset used to determine when the training of a neural network should be terminated has here been replaced by test set, so that the term validation set can be used as in the PLS and PCR calculations. Thus the terms validation set and test set are used in an inverse manner to much of the literature on neural networks.) The error in the training set decreases as the weights are improved. The error in the test set converges to a minimum and then increases again. The network is judged to have converged when the test set error is lowest. However, unlike normal regression, it is not correct to compare the mean error of the test set for the purposes of validation. The reason for this is that the test set, although not directly responsible for a model, has an influence on when the network is judged to have been optimized, and, hence, this error will be low. A small error in the test set is not necessarily an indication that the network can successfully predict unknown samples. The third, validation, set consists of four samples that are totally left out of the initial computations. The validation error is the error arising from these remaining four samples.

The selection of the test and validation sets is illustrated in Fig. 3. The samples are first randomized, as in the case of PCR and PLS. Subsequently, the first four samples are removed to act as a validation set. Then, in sequence, samples 5 to 8, 9 to 12, and so on up to 29 to 32 are removed in turn as test sets. The procedure is repeated, removing samples 5 to 8 as a validation set, and then, successively, samples 1 to 4, 9 to 12, 13 to 16, etc., as test sets; this enumeration is sketched below.
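A sketch of that enumeration, assuming the group size and sample count of this paper (the generator and its names are ours); in a full implementation each yielded split would then be trained with early stopping monitored on its test group.

```python
import numpy as np

def validation_test_splits(N=32, n_vt=4):
    """Enumerate the Q = M(M-1) combinations described in the text: each
    group of four serves as the validation set while every other group
    serves, in turn, as the test set, leaving 24 samples for training."""
    M = N // n_vt
    groups = [np.arange(g * n_vt, (g + 1) * n_vt) for g in range(M)]
    for v in range(M):                    # validation group
        for t in range(M):                # test group
            if t == v:
                continue
            train = np.setdiff1d(np.arange(N),
                                 np.concatenate([groups[v], groups[t]]))
            yield groups[v], groups[t], train

splits = list(validation_test_splits())
assert len(splits) == 56                  # 8 x 7 runs, as in the text
```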
If the numbers in the validation and test sets are equal to one another, N_vt, then the number of calculations, Q, in which N_vt samples are extracted from the total randomized set for the validation set together with N_vt different samples for the test set, is Q = M(M - 1), where M = N/N_vt = 8; this is the number of runs necessary to ensure that each sample is included in every validation and test set combination. In the case reported in this paper, 56 (= 8 × 7) computations are required. Note that the mean of the training set is subtracted from the corresponding validation and test sets, and the loadings of the PCs computed from the training set are used to calculate the inputs for these sets, which are then weighted by the appropriate numbers obtained from the training set to give the predicted outputs.

Non-cross validated neural networks

ANNs can also be applied to non-cross-validated data. In this case, the calculation is somewhat simpler. A set of four samples is removed in turn for the test set. These are used to determine when the network converges, and are removed in a similar fashion to the cross-validated PCR or PLS. Eight computations are performed in total, with each group of samples being removed in turn. Note, however, that the test set has a different purpose from the validation set in PCR or PLS. The error in estimating these samples is minimised during the ANN calculation. These samples cannot strictly be used in cross-validation, as they have been used in assessing the performance of the model.

Calculation of errors

The calculation of errors is more involved than in the case of straight PLS and PCR, but must be performed correctly for comparability. If done in the wrong way, neural networks might appear to work spuriously well. It is essential to recognise that the comparison of methods depends critically on how the ability to predict data is measured, and that there is a fundamental difference between how this prediction ability can be estimated using neural networks as described in this paper and using standard regression. There are frequent claims in the literature about one method being more effective than another; however, these claims are, in part, a function of how the quality of the predictions is calculated.

In the cross-validated method proposed in this paper, there are Q (= 56) validation/test runs. Every (M - 1) (= 7) runs, the validation set changes. For the first seven runs it consists of samples 1 to 4, for runs 8 to 14 it consists of samples 5 to 8, and so on. A variable p = ⌊(i - 1)/(M - 1)⌋ + 1 can be calculated, where i is the run number; p equals 1 for runs 1 to 7, 2 for runs 8 to 14, etc. The validation set consists of samples 4(p - 1) + 1 to 4p. The test set cycles through the remaining seven possible groups of four samples. Several errors may be computed.

Fig. 2 Summary of neural network.

Fig. 3 Summary of cross-validation and testing cycles for the neural network; validation set is shaded vertically, test set horizontally.

1. The standard error for the validation samples, which is given by

\mathrm{SEP_V} = \sqrt{ \sum_{m=1}^{M} \sum_{i=7(m-1)+1}^{7m} \sum_{j=4(m-1)+1}^{4m} (\hat{y}_{ij} - y_j)^2 \,/\, \left[ N(M-1) \right] }

which, in the case of this paper, is the root mean square error of the validation samples. Note that each sample is repeated 7 (= M - 1) times, hence the more complex equation.
2. The overall error of prediction for all samples across all validation/test runs, given by

\mathrm{SEP_M} = \sqrt{ \sum_{i=1}^{Q} \sum_{j=1}^{N} (\hat{y}_{ij} - y_j)^2 \,/\, (NQ) }

each sample being estimated 56 (= Q) times.

3. The standard error for the training set, SEP_TR, is calculated from the 42 = (M - 1)(M - 2) estimates of each training set sample, and is defined as

\mathrm{SEP_{TR}} = \sqrt{ \sum_{j=1}^{N} \sum_{i \in j_{tr}} (\hat{y}_{ij} - y_j)^2 \,/\, \left[ N(M-1)(M-2) \right] }

where j_tr are the training runs for sample j. For example, for sample 9, these are runs 1, 3-8, 10-14, 22-23, 25-30, 32-37, 39-44, 46-51 and 53-56.

4. A fourth error is of interest. It is debatable whether the four test samples should be used in the overall error, because they have been used to determine the minimum model for cross-validation. An alternative overall error, excluding these four test samples each time, can be calculated as follows

\mathrm{SEP_{MA}} = \sqrt{ \sum_{j=1}^{N} \sum_{i \notin j_{ts}} (\hat{y}_{ij} - y_j)^2 \,/\, \left[ N(M-1)^2 \right] }

where j_ts denotes the runs in which sample j belongs to the test set. For each sample, seven runs will be excluded, leaving 49 runs in total.

As in the case of the straight multivariate methods, it is often useful to average the estimates over several runs. In many cases this procedure is important, as it is the only way to obtain an overall model. In PCR, cross-validation is often a separate step from producing a full predictive model: first the number of components or the effectiveness of the model is determined, and then the calculation, using the optimum number of components, is repeated on the entire dataset. This is not possible for ANNs, because the test set critically determines when the model converges. Removing a different test set results in a different optimum model. The algebraic definition of an overall model is extremely fraught using the methods described in this paper, because the principal components will differ according to which samples are removed for the test set. The PCs of a subset of 28 samples differ for each subset. Interesting features such as swapping over of PCs, changing signs of scores and often completely different values for the later PCs are encountered. Hence, the average estimate over several validation/test runs is of some significance. Unlike the multivariate methods, each training and test set sample is removed seven times, not once, so all four of the errors above have corresponding, and differing, average estimates.

1. The standard error for the average estimate for the validation samples, which is given by

\mathrm{SEP_{Vav}} = \sqrt{ \sum_{j=1}^{N} (\hat{y}^{\,v}_j - y_j)^2 \,/\, N }

where

\hat{y}^{\,v}_j = \sum_{i=(s-1)(M-1)+1}^{s(M-1)} \hat{y}_{ij} \,/\, (M-1)

and s = ⌊(j - 1)/N_vt⌋ + 1; e.g., for sample 9, s equals 3, the validation set being represented in runs 15 to 21, as this sample belongs to the third group of four.

2. The overall error of the average prediction for all samples across all validation/test runs, given by

\mathrm{SEP_{Mav}} = \sqrt{ \sum_{j=1}^{N} (\bar{\hat{y}}_j - y_j)^2 \,/\, N }

where

\bar{\hat{y}}_j = \sum_{i=1}^{Q} \hat{y}_{ij} \,/\, Q

3. The error of prediction for the averaged training set results, SEP_TRav, can likewise be calculated, using

\hat{y}^{\,t}_j = \sum_{i \in j_{tr}} \hat{y}_{ij} \,/\, \left[ (M-1)(M-2) \right]

4. The equivalent error, SEP_MAav, can be calculated by removing the test samples.

For the non-cross-validated data, only two errors are strictly of interest.

1. The standard error for the training set, given by

\mathrm{SEP_{TR}} = \sqrt{ \sum_{i=1}^{M} \left[ \sum_{j=1}^{(i-1)N_{vt}} (\hat{y}_{ij} - y_j)^2 + \sum_{j=iN_{vt}+1}^{N} (\hat{y}_{ij} - y_j)^2 \right] / \left[ N(M-1) \right] }

This is the root mean square error of prediction of the training set samples, each counted seven times, once for each of the M - 1 runs in which it belongs to the training set.
2. The overall model error, given by

\mathrm{SEP_M} = \sqrt{ \sum_{i=1}^{M} \sum_{j=1}^{N} (\hat{y}_{ij} - y_j)^2 \,/\, (NM) }

which includes both the training and test samples. The two equivalent errors on the averaged sample estimates can also be calculated.

Results

Analysis of Errors

The results for the various errors are given in Table 2. Graphs of predicted versus observed concentrations for PCR and ANN for anthracene are given in Fig. 4; only certain graphs are shown, for brevity. A substantial number of conclusions are possible.

Fig. 4 Graphs of predicted versus observed concentrations for anthracene.

Table 2 Summary of the RMS errors (mg ml⁻¹)

Multivariate methods

No cross-validation:
            Anthracene        Fluoranthene      Total detectable PAHs
            PCR      PLS      PCR      PLS      PCR       PLS
SEPP        1.582    1.379    7.436    5.150    56.604    47.938

Cross-validation (non-averaged/averaged):
            Anthracene                   Fluoranthene                 Total detectable PAHs
            PCR           PLS            PCR           PLS            PCR           PLS
SEPM        1.645/1.586   1.436/1.378    7.553/7.321   5.506/5.156    59.33/56.41   50.92/48.02
SEPTR       1.569/1.523   1.333/1.300    7.261/7.084   4.996/4.828    54.96/53.08   46.41/44.92
SEPV        2.101/2.101   2.010/2.010    9.343/9.343   8.231/8.231    83.76/83.76   75.29/75.29

Artificial neural networks

No cross-validation (non-averaged/averaged):
            Anthracene    Fluoranthene   Total detectable PAHs
SEPM        2.342/1.864   12.319/9.853   103.78/74.22
SEPTR       2.283/1.811   11.590/9.177   106.14/75.27

Cross-validation (non-averaged/averaged):
SEPM        2.953/2.062   16.08/10.81    105.63/72.86
SEPMA       2.817/1.923   15.14/9.98     104.37/70.82
SEPTR       2.731/1.803   14.64/9.59     97.84/70.27
SEPV        3.286/2.894   17.86/13.91    105.45/79.91

For the multivariate methods, in all cases SEP_V > SEP_M > SEP_P > SEP_TR. This is expected for normal datasets. The validation error should be highest, as the validation data were not used to form the model, and the training error lowest. SEP_M should be close to SEP_P. In all cases it is slightly higher, reflecting the fact that four samples are not included in computing each overall model, so their inclusion increases this error by a small amount.

Averaging the estimates over all cross-validation runs is useful, and has an important influence on the error estimates. In order to obtain an overall averaged model from cross-validation it is useful to perform this operation, and the residuals for the averaged model over all cross-validated runs and for the non-cross-validated data can then be compared directly. Since each sample is a member of only one validation set, SEP_Vav = SEP_V. However, in all other cases averaging reduces the error, as expected, and as is clearly seen in the corresponding graphs. The averaged SEP_M is now very close to SEP_P in all cases. The amount of reduction in the error estimate on averaging reflects the underlying quality of the model. If the true model were completely linear, with all deviations from linearity normally distributed with a mean of 0, the error would be reduced by √7 = 2.646 for the training set, reflecting the fact that each sample is included in seven training sets, and by √8 = 2.828 for the overall model error; this is clearly not the case. The reason is that the underlying model is not exactly linear, indicating a small lack-of-fit.
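The √7 and √8 factors quoted above follow from a standard argument; as a brief sketch, if for a given sample j the residuals ε_ij over the runs were zero-mean, uncorrelated and of equal variance σ², then

\operatorname{Var}\left( \frac{1}{M} \sum_{i=1}^{M} \varepsilon_{ij} \right) = \frac{\sigma^{2}}{M}

so the RMS error of an averaged prediction would fall by a factor of √M: √7 ≈ 2.646 over the seven training predictions per sample, and √8 ≈ 2.828 over all eight. Correlated residuals, i.e. a systematic lack-of-fit, shrink this reduction, which is what is observed here.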
The reduction in error as sample estimates are averaged over cross-validation runs therefore represents a valuable diagnostic tool. It is interesting to note in these results that SEP_TR is proportionally reduced by less than SEP_M in all cases, as predicted, the average reduction in SEP_TR being 3.0% and that in SEP_M 4.6%, again suggesting that, although there is a small but significant lack-of-fit, the dataset is reasonable.

The results using ANNs are quite interesting. Without cross-validation, the modelling error is only slightly higher than the training error. It is debatable which statistic best represents the true error. In this case, averaging the results of the eight runs has quite a significant influence on the size of the errors, reducing them by 20 to 30%. Normally distributed errors should reduce by 100 × [1 - (1/√8)], or around 65%, on averaging. This indicates that the model improves considerably when performing repeat calculations using ANNs, as expected, but the amount by which the error reduces suggests that a perfect model would not be achieved even after averaging a larger number of runs (which could be done by randomizing the order of the original data again). It is debatable what should be used as the predictive model for a neural network: whether to average the models from several runs or to keep the model of a single test run. Owing to the need to use a test set to check for convergence, four samples are left out of the computation each time. Developing a model by removing just one set of samples will be unrepresentative of the dataset. The values for the non-cross-validated ANN errors in Table 2 are all higher than those for normal regression, suggesting that the averaged model obtained here is not as good as the models obtained by PLS and PCR.

For the cross-validated ANNs, SEP_V > SEP_M > SEP_MA > SEP_TR, as expected in all cases. Averaging the sample estimates maintains this order, which is interesting given the different numbers of samples in the training and validation datasets. Note that averaging has a greater influence on the errors than whether a sample is a member of a particular group (training, validation, etc.). The errors for the averaged cross-validated models are about comparable in size to the corresponding averaged non-cross-validated errors. However, the non-averaged cross-validated errors for anthracene and fluoranthene are significantly higher than the corresponding non-cross-validated errors. A possible reason is that only 24 samples, or 3/4 of the original data, are used for determining the model. This leads to a small number of highly outlying predictions, as can be seen graphically. Because a root mean square error criterion is calculated, these outlying predictions have a major influence on the size of the error. In practical terms, this suggests that performing an ANN calculation using one group of 24 samples has a chance of producing a very poor model. When 28 samples are used, this probability decreases.
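The shortfall of the observed reduction from the ideal √M factor can be checked with a one-line diagnostic; a sketch follows (the function name is ours, and the example values are the non-averaged and averaged ANN SEP_M entries for anthracene from Table 2).

```python
import numpy as np

def averaging_reduction(sep_single, sep_averaged, M=8):
    """Compare the observed percentage error reduction on averaging M runs
    with that expected for purely random, uncorrelated residuals."""
    observed = 100.0 * (1.0 - sep_averaged / sep_single)
    expected = 100.0 * (1.0 - 1.0 / np.sqrt(M))
    return observed, expected

obs, exp = averaging_reduction(2.342, 1.864)
print(f"observed {obs:.0f}% reduction vs {exp:.0f}% for ideal averaging")
# prints: observed 20% reduction vs 65% for ideal averaging
```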
Application to Estimation of PAHs

The methods in this paper can be employed to compare methods for the estimation of PAHs using various chemometric techniques. In all cases, PLS outperforms PCR. It is important to calculate a number of indicators to ensure that similar trends are obeyed no matter which error is calculated. Had PCR proved superior on one or more indicators, the conclusion would have been more ambiguous.

On the whole PLS is expected to outperform PCR, provided that the experimental dataset is well designed, as PLS takes into account variance in both the concentration and spectral dimensions. If PLS performs worse than PCR, this might suggest that there are some outliers or unusual measurements, which could influence the statistics if one of them is removed to the validation set. Also, the number of samples has to be much greater than the number of components, and the variability between samples sufficient, to allow sensible calibration models; a series of samples that are effectively replicates may not exhibit this trend. One experimental danger with this conclusion, however, is that a great deal of reliance is placed on the independent concentration estimate, in the case of this paper performed by GC-MS. The conclusions of Table 2 state only that, using the PLS algorithm, a better mathematical model can be developed to predict the GC-MS concentration estimates. If there are large errors in the GC-MS measurements, PLS may not necessarily be a superior approach for concentration estimation, as it is influenced by the quality of the independent measurement. It is outside the scope of this paper to discuss the nature of GC-MS measurements.

ANNs are harder to compare directly with the multivariate methods. A sensible model requires several test and validation set combinations. As can be seen from the graphs, a single ANN run will probably result in a number of very poor predictions. Hence it is strongly recommended that the results of all these runs are averaged. The averaged estimates, both for the non-cross-validated and the cross-validated models, result in poorer predictions than PLS and PCR in most cases. The single exception is the averaged PCR validation error for the total PAHs. A possible reason is that pure PAHs can be predicted quite well. In some cases a pure PAH has several characteristic wavelengths, and it is even possible to produce quite accurate linear calibrations at such wavelengths. The quality of a univariate model is primarily related to spectral overlap. In the absence of noise, it is always possible to obtain accurate calibration models using a limited number of wavelengths. For example, if there are only two components in a mixture, the ratio of the absorbances at two wavelengths can be employed to determine the relative amounts of each component in the mixture; the distribution of concentrations in the mixture set is not relevant. However, for the total PAHs, linear models are less easy to construct, and a more empirical approach such as ANNs may function better, so that ANNs perform comparatively well in this case.

It is recommended that, for the calibration of concentrations of single PAHs, PLS or possibly PCR is employed; ANNs, being non-linear methods, exhibit few advantages here. However, ANNs may perform reasonably well when predicting parameters such as the sum of the total concentrations of a set of compounds, where a linear model may be less appropriate. For more complex mixtures, for example of 50 to 100 compounds, PLS or PCR may break down, and it is worth exploring ANNs under such circumstances.

Conclusion

This paper has highlighted the importance of a properly thought out scheme for cross-validation and the calculation of the associated errors. The particular dataset is predicted well by PLS and PCR, but neural networks might appear to work anomalously well if the wrong statistics are calculated.
A great deal more information can be obtained using the type of error analysis proposed in this paper, including whether there truly is an underlying linear model. There is not a great deal of literature on confidence in, and estimates of, lack-of-fit for multivariate calibration, in contrast to the very substantial corresponding literature on univariate calibration.

Monash University, Australia, is thanked for funding sabbatical leave for F.R.B. to visit Bristol.

References

1 Cirovic, D. A., Brereton, R. G., Walsh, P. T., Ellwood, J. A., and Scobbie, E., Analyst, 1996, 121, 575.
2 Martens, H., and Naes, T., Multivariate Calibration, Wiley, New York, 1989.
3 Höskuldsson, A., J. Chemom., 1988, 2, 211.
4 Wold, S., Geladi, P., Esbensen, K., and Ohman, J., J. Chemom., 1987, 1, 41.
5 Kowalski, B. R., and Seasholtz, M. B., J. Chemom., 1991, 5, 129.
6 Demir, C., and Brereton, R. G., Analyst, 1997, 122, 631.
7 Geladi, P., and Kowalski, B. R., Anal. Chim. Acta, 1986, 185, 1.
8 Brown, P. J., J. R. Stat. Soc. Ser. B, 1982, 44, 287.
9 Rumelhart, D. E., and McClelland, J. L., Parallel Distributed Processing, MIT Press, Cambridge, MA, 1986, vol. I.
10 Blank, T. B., and Brown, S. D., Anal. Chim. Acta, 1993, 277, 273.
11 Walczak, B., and Wegscheider, W., Anal. Chim. Acta, 1993, 283, 508.
12 Blank, T. B., and Brown, S. D., Anal. Chem., 1993, 65, 3081.
13 Burden, F. R., J. Chem. Inf. Comput. Sci., 1994, 34, 1229.
14 Deane, J. M., in Multivariate Pattern Recognition in Chemometrics, Illustrated by Case Studies, ed. Brereton, R. G., Elsevier, Amsterdam, 1992, ch. 5.
15 Stone, M. J., J. R. Stat. Soc. Ser. B, 1974, 36, 111.
16 Wold, S., Technometrics, 1978, 20, 397.
17 Krzanowski, W. J., Biometrics, 1987, 44, 575.
18 Gemperline, P. J., J. Chemom., 1989, 3, 549.
19 The MathWorks Inc., MA, USA.

Paper 7/03565I
Received May 22, 1997
Accepted July 28, 1997
Keywords: Polycyclic aromatic hydrocarbons; chemometric; neural networks; regression In a previous paper the concentrations of polycyclic aromatic hydrocarbons (PAHs) in coal tar pitch volatile sources obtained by the Health and Safety Executive (HSE)1 have been reported.The choice of PAHs in this paper is for illustration only : it does not imply any HSE acceptance or approval of the list or method. It was shown that multivariate methods for regression such as partial least squares (PLS) were superior to univariate single wavelength calibration for prediction of concentrations. In this paper we apply both principal components regression (PCR) and PLS2–8 and artificial neural networks (ANNs)9–13 to the estimation of concentrations of PAHs an experimental dataset by calibrating known concentrations estimated by GC–MS to electronic absorption spectra (EAS).This paper discusses cross-validation.14–18 There is surprisingly limited discussion of cross-validation in the chemometrics literature, except in the form of algorithmic descriptions of which there are numerous. There are two different reasons for cross-validation, often confused. The first is to find out how many components/iterations (in the case of neural networks) are necessary for an adequate model.The second is to compare the effectiveness of different models. The use of test sets is very common in classification studies, for example, where a model is formed on one group and the predictions tested on another group. A good way of testing the effectiveness of a method is to leave each possible group out at a time, until all groups are left out once, and average the quality of predictions.Some investigators use methods for leaving one sample out at a time, but in cases of large datasets, this can be prohibitive. This paper concentrates on methods for leaving a group of samples out at a time. There is much literature on neural networks in chemistry, and many investigators want to compare these to conventional chemometric approaches, e.g., for classification or calibration. However, a common approach must be found for method comparison and there are substantial problems here.Most common methods for neural networks remove a set of samples to test convergence of a model (called the test set in this paper). It is unfair then to use this set of samples for validation, as the network is trained to minimise the error on these samples. A third and independent group must then be selected for validating the model. If a ‘leave one sample out at a time’ approach is adopted there will be difficulties for the following reasons. First, the model obtained on a single test sample will be dependent on that sample not being an outlier.A minimum size of test set of around four is recommended. Second, the number of calculations will be large: for 100 samples, there will need to be 9 900 computations of the network if every possible combination of single samples is tested. Using too few test sets can result in quite unrepresentative models and analyses of errors. Hence, methods for cross validation that include several samples in each group must be employed.In this paper the previously published PAH dataset1 is used to demonstrate a number of methods for cross validation and error analysis. Method The PAH data set consisted of the EAS of 32 samples taken at 181 wavelengths at 1 nm intervals from 220 to 400 nm inclusive, the X matrix. The concentrations of 13 PAHs in each of the samples had previously been measured by GC–MS, and consists of the y data. 
In this paper, the estimated concentrations of (a) anthracene, (b) fluoranthene and (c) the sum of concentrations of 13 PAHs were analysed, each, in turn, forming a y vector, of length 32.Similar conclusions can be obtained from the other PAHs in the mixture, but the dataset described in this paper has been reduced for sake of brevity. A concentration summary is given in Table 1. Further experimental details are reported elsewhere,1 including more information on the overall dataset. All of the calculations were carried out using MATLAB routines written by the authors but making use of the well-tested toolbox published by B.Wise and the Mathworks Neural Network Toolbox.19 All the algorithms were validated against Table 1 Concentration summary Concentration/mg ml21 Total detectable Anthracene Fluoranthene PAHs Mean 19.8 160.2 903.8 s 10.8 64.4 375.6 Analyst, October 1997, Vol. 122 (1015–1022) 1015other sources, including in-house software in C and Visual Basic and the Propagator neural network package.Multivariate Methods Principal component regression Since the number of wavelengths far exceeds the number of samples, it is normal to perform data reduction prior to regression, both in order to reduce the size of the problem and to remove colinearities in the dataset, so PCR is the technique of preference to multiple linear regression, which is not discussed in this paper. Principal components analysis was performed on the centred but unstandardized data.After six components have been computed 99.9997% of the variance has been explained, so it was decided to keep only the first six components. Similar conclusions are reached for anthracene and fluoranthene. Note that the main aim of this paper is to compare various approaches for cross-validation and calculation of errors and not how crossvalidation can be employed to determine the optimum number of components. If T is the matrix consisting of the scores of the first six principal components then the regression coefficients are produced simply via the pseudoinverse as follows b = (TAT)21TAy so that �y t b y ij ija ja j a A = + = å1 where �yij is the estimated concentration of the jth sample, using the ith method for calculating the concentrations (see below for extended discussion of this), there are A significant components and the scores matrix is denoted by T.Partial least squares Partial least squares regression can also be employed, as an alternative to PCR.In the case discussed here, PLS1, regressing one y variable at a time, was felt to be more suitable than PLS2, involving regressing all y variables at a time, especially when direct comparisons between methods are required. The optimum number of PLS components can be established as follows. The criterion employed is the variance in the y (concentration) predictions rather than the X or spectral direction. In this way, PLS differs from PCR in which only the X variance can be used.For the purpose of determining the number of components, the ‘leave one out’ method for cross validation is employed, whereby the prediction error as each sample in turn is removed is estimated. The prediction on the training set will be reduced as more components are computed, whereas the prediction on the test set shows a minimum and then increases as too many components are employed. 
It is emphasized that this method is a fairly crude approach, and depends on the structure of the data, but is sufficiently good for determining how many components are useful to describe the data.From the values obtained using the sum of the concentration of 13 detected PAHs, a good number of PLS components to choose is six. It is emphasized that the main objective of this paper is strategy for cross-validation and comparison of models, and details of the dataset analyzed in this paper have already been published, so selection of the number of significant components is not discussed in detail below reason of brevity.Cross validation and calculation of errors In the cases described here a ‘leave n out’ technique for crossvalidation is employed. A number of samples (n) are removed from the dataset and a model estimated from the remaining data, which is used to validate the model. In turn, each possible group of samples, where any particular sample is included once, is removed, in turn, to provide average estimates of models and errors over the entire data set. It is important that the validation set has no influence on the regression equations. For PCR, and PLS, if there are N samples in total, and Nv samples are to be left out and used as a validation set, the training set of Nt = N 2Nv, must have any pre-processing, such as mean centering or standardization, applied after the validation samples have been removed; these preprocessing parameters such as means and standard deviations of the training set are then applied to the validation set.Note that this implies that means and standard deviations are calculated each time a validation set is removed from the full data, and that these do not necessarily correspond to the overall statistics; in this way the approach in this paper differs from that employed by some other authors. Note that when PCA or PLS is performed, this must also be repeated for each training set in turn and not on the entire dataset.The case for the ANNs is more complex and is discussed in more detail below. The first step is to randomize the order of the data. In many key experimental situations, data are often presented to an investigator in some form of experimental sequence. Without this first stage, a cross-validation may be performed in a biased manner. Once randomized, the new order of the data is maintained. We assume, in this paper, that the samples are reordered in a random manner, so sample j = 7 is the seventh randomized sample.Note that the random seed in all computations in this paper is identical, so that the order with which the samples are arranged is constant for comparison; obviously the calculation of errors depends in part on the method of ordering, the most important aspect being that the original experimental order is not followed. In order to include all possible samples in one validation set, M combinations of training and validation sets were produced where M = N/Nv and Nv is the number of samples in each of the validation sets.Ideally, N is a multiple of Nv, although other regimes, not reported in this paper, are possible. In the case discussed below, four samples are removed, in turn, each time cross-validation is performed, implying eight calculations as illustrated in Fig. 1. Note that removing two samples results in validation sets that are too small, and removing eight samples reduces the number of validation sets to four.For validation run number i (where i varies from 1 to 8), samples j = (i 2 1) 3 Nv + 1 to i 3 Nv are removed. 
The training set consists of the remaining Nt = 28 samples. For each of the M training/validation sets, a model is produced that predicts the concentration �yij for the ith set (i being a number between 1 and 8 in this case) and the jth sample (varying between 1 and 32). Three possible predictions can be made. (i) The predictions for the samples in the validation sets.Note that there will be one prediction per sample, as each sample is a member of only one validation set. (ii) The predictions for the samples in the training sets. For each sample there will be seven predictions. (iii) Predictions for the overall model, taking into account both validation and test samples, resulting in eight predictions per sample. It is common to transform these predictions into errors for the overall dataset, and five possible errors can be calculated.Fig. 1 Summary of cross-validation for multivariate methods; validation set is shaded. 1016 Analyst, October 1997, Vol. 1221. The standard error for prediction for the validation set given by SEPV = v v ( � ) ( ) y y N j ij j i N iN i M - = - + = å å 2 1 1 1 which in the case of this paper, is simply the root mean square error of prediction of the validation samples. 2. The standard error for the training set given by SEPTR = v v ( � ) ( � ) ( ) ( ) y y y y N M j ij j ij j iN N j i N i M - + - é ë êêê ù û úúú � - = + = - = å å å 2 2 1 1 1 1 1 This is the root mean square of prediction of the training set samples counted seven times, for each of the M 2 1 validation runs. 3. The overall model error given by SEPM = ( � ) y y N M j ij j N i M - � = = å å 2 1 1 which includes the both validation and test samples. Note that this is not the same as the error calculated by performing regression on the entire dataset (see below). An additional possibility is to average the values of the predicted data over all validation runs.The average prediction is given by � � / y y M j ij i M = = å1 A variance on this is to simply average the predictions for the training sets to give t j ij ij i jN M i j N y y y M � � � ( ) ( ) + - = + - å å v v =1 1 1 1 These averaged predictions will result in lower errors, which can be defined as follows. 4. The standard error for the averaged prediction of the training set SEPTRav = - = å ( � ) t j j j N y y N 2 1 5.The standard error for the overall averaged model given by SEPMav = - = å ( � ) y y N j j j N 2 1 Note that since each validation sample is removed only once, there is no corresponding average error for the validation set. It is not, of course, necessary to model the overall data using an average of cross-validated models, but PCR or PLS can beperformed on the entire dataset to give predictions of the form �yj where � [( ] y b t y b y j a ja a A a j a A = + = ¢ ¢ + = - = å å 1 1 x x P PP ) ( )-1 where øy is the mean concentration (or y value), øx is a vector of mean absorbances over all wavelengths, and P is a loadings matrix, as is normal for centred data.This error (called prediction error) is then given by SEPP = ( � ) y y N j j j N - = å 2 1 It is important to recognise that there are differences between this error and SEPMav as will be evident below. Artificial Neural Networks An alternative approach is to use neural networks for calibration. The example in this paper is relatively simple, so only a fairly basic approach is employed.The methods for calculations of errors can be applied to neural networks of any level of sophistication, but it is not the primary purpose of this paper to optimize the network. 
The first step is to define the inputs, outputs and number of hidden nodes (architecture) of the network. A back-propagation9 feed-forward network using one hidden layer was employed which made use of sigmoidal transfer functions of the form 1 1- where net = net e z w p p p P - = å1 and zp is the input to the given node, with wp the weights. The neural network program uses the back-propagation algorithm to find the weights. The output is simply the estimated concentration.In order to reduce the size of the problem each concentration was estimated separately, so that all calculations only had one output, to be comparable to the PLS and PCR results, which were performed separately for each compound.For the given problem, it was found that one hidden layer with a single node was optimum. The input layer of principal component scores, together with a bias node, was connected to the hidden node. The hidden node, together with another bias node was connected to the output node. The bias nodes had no input and delivered an output of 1. The question of the input to the network is an important one.Using all 181 wavelengths would result in a large number (184) of weights when bias and hidden nodes are included, which is clearly unjustified by the present data set consisting only of 32 samples. Hence, the data was reduced first using PCA. As above, only six (linear) principal components were kept as the input to the network. A seventh, bias nodequaling one was also added to the input, resulting in seven inputs. Hence, there are 7 (input/hidden) + 2 (hidden/output) = 9 weights.This is illustrated in Fig. 2. It is the common practice when training a neural network to randomize the initial set of weights and then allow the backpropagation algorithm to continuously refine their values in order to reduce the SEPTR of the network training set. This Analyst, October 1997, Vol. 122 1017process is stopped when the error of an independent test set starts to rise (signifying entry into a memorizing domain) and the weights at this test set minimum are retained.If the neural network is run again on the same set of data it is likely (if the input data is not fully independent) to arrive at similar errors with a different, though just as valid, set of weights. In order to start the neural network, a set of randomized weights is required. The initial randomized set of weights can produce a bad starting point for the back-propagation algorithm to seek a global minimum of the test set error so that it is essential for the cross-validation procedure that a good initial set be found.In the present work this was accomplished by producing many ( Å 100) initial sets and choosing the best of these. In order to ensure that any repeat calculation found the same starting point the same seed for the MATLAB random number generator was always used for the first random choice of weights. Cross validation The issue of cross validation is much more complex in the case of neural networks as compared with normal regression. In the application reported below, the data was divided into three sets. A training set consisting of 24 samples is used to obtain a model.It is important to understand that all preprocessing such as PCA is performed on this subset of data, and not on the overall 32 samples. The remaining samples are then divided into two sets of Nvt = 4 samples each. 
Cross validation

The issue of cross-validation is much more complex in the case of neural networks than in normal regression. In the application reported below, the data were divided into three sets. A training set consisting of 24 samples is used to obtain a model. It is important to understand that all preprocessing, such as PCA, is performed on this subset of the data, and not on the overall 32 samples. The remaining samples are then divided into two sets of Nvt = 4 samples each.

One, referred to below as the test set, is used to determine when the network has converged. (Note that the term validation, often used in the neural network literature to describe the dataset that determines when training should be terminated, has here been replaced by test set, so that the term validation set can serve the same purpose as in the PLS and PCR calculations. Thus the terms validation set and test set are used in an inverse manner to much of the literature on neural networks.) The error in the training set decreases as the weights are improved. The error in the test set converges to a minimum and then increases again. The network is judged to have converged when the test set error is lowest. However, unlike normal regression, it is not correct to compare the mean error of the test set for the purposes of validation. The reason is that the test set, although not directly responsible for the model, influences when the network is judged to have been optimized, and hence this error will be low. A small error in the test set is not necessarily an indication that the network can successfully predict unknown samples.

The third, validation, set consists of four samples that are left out of the initial computations entirely. The validation error is the error arising from these remaining four samples. The selection of the test and validation sets is illustrated in Fig. 3. The samples are first randomized, as in the case of PCR and PLS. Subsequently, the first four samples are removed to act as a validation set. Then, in sequence, samples 5 to 8, 9 to 12, up to 29 to 32 are removed in turn as test sets. The procedure is repeated, removing samples 5 to 8 as a validation set and then, successively, samples 1 to 4, 9 to 12, 13 to 16, etc., as test sets. If the numbers in the validation and test sets are equal to one another, Nvt, then the number of calculations, Q, in which Nvt samples are extracted from the total randomized set for the validation set together with Nvt different samples for the test set, is

$$Q = M(M - 1)$$

where M = N/Nvt = 8; this is the number of runs necessary to ensure that each sample is included in each validation and test set. In the case reported in this paper, 56 (= 8 × 7) computations are required. Note that the mean of the training set is subtracted from the corresponding validation and test sets, and the loadings of the PCs computed from the training set are used to calculate the inputs to these sets, which are then weighted by the appropriate numbers obtained from the training set to give predicted outputs.

Fig. 3 Summary of cross-validation and testing cycles for the neural network; validation set is shaded vertically, test set horizontally.

Non-cross-validated neural networks

ANNs can also be run on non-cross-validated data. In this case, the calculation is somewhat simpler. A set of four samples is removed in turn for the test set. These are used to determine when the network converges, and are removed in a similar fashion to the validation sets in cross-validated PCR or PLS. Eight computations are performed in total, with each group of samples being removed in turn. Note, however, that the test set has a different purpose from the validation set in PCR or PLS: the error in estimating these samples is minimized during the ANN calculation. These samples cannot strictly be used in cross-validation, as they have been used in assessing the performance of the model.
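The enumeration of the Q = M(M − 1) = 56 validation/test splits can be sketched in a few lines. Again this is illustrative, not the authors' code; the variable names and use of NumPy are our assumptions.

```python
import numpy as np

# The validation group is held fixed for M - 1 consecutive runs while the
# test group cycles through the remaining groups, as in Fig. 3.
N, M, Nvt = 32, 8, 4
order = np.random.default_rng(0).permutation(N)   # randomize the samples first
groups = order.reshape(M, Nvt)                    # M groups of Nvt samples

runs = []
for v in range(M):                                # index of the validation group
    for t in range(M):                            # index of the test group
        if t == v:
            continue
        train = np.concatenate([groups[g] for g in range(M) if g not in (v, t)])
        # Centring and PCA would be computed on `train` only, then applied
        # to the validation and test samples, as the text emphasizes.
        runs.append((groups[v], groups[t], train))

assert len(runs) == M * (M - 1)                   # 56 runs in total
```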
Calculation of errors

The calculation of errors is more involved than in the case of straight PLS and PCR, but it must be performed correctly if the methods are to be compared; if done in the wrong way, neural networks might appear to work spuriously well. It is essential to recognise that the comparison of methods depends critically on how the ability to predict data is measured, and that there is a fundamental difference between how this ability can be estimated using neural networks, as described in this paper, and using standard regression. There are frequent claims in the literature of one method being more effective than another; these claims are, in part, a function of how the quality of the predictions is calculated.

In the cross-validated method proposed in this paper, there are Q (= 56) validation/test runs. Every M − 1 (= 7) runs the validation set changes: for the first seven runs it consists of samples 1 to 4, for runs 8 to 14 it consists of samples 5 to 8, and so on. A variable

$$p = \left\lfloor \frac{i-1}{M-1} \right\rfloor + 1$$

can be computed, where i is the run number; p equals 1 for runs 1 to 7, 2 for runs 8 to 14, etc. The validation set consists of samples 4(p − 1) + 1 to 4p. The test set consists of the remaining seven possible combinations of four samples. Several errors may be computed.

1. The standard error for the validation samples, given by

$$\mathrm{SEP_V} = \sqrt{\frac{\sum_{m=1}^{M}\;\sum_{j=4(m-1)+1}^{4m}\;\sum_{i=7(m-1)+1}^{7m}\left(y_j - \hat{y}_{ij}\right)^2}{N(M-1)}}$$

which, in the case of this paper, is the root mean square error of the validation samples. Note that each sample is estimated 7 (= M − 1) times, hence the more complex equation.

2. The overall error of prediction for all samples across all validation/test runs, given by

$$\mathrm{SEP_M} = \sqrt{\frac{\sum_{i=1}^{Q}\sum_{j=1}^{N}\left(y_j - \hat{y}_{ij}\right)^2}{NQ}}$$

each sample being estimated 56 (= Q) times.

3. The standard error for the training set, SEPTR, calculated from the 42 = (M − 1) × (M − 2) estimates of each training set sample, defined as

$$\mathrm{SEP_{TR}} = \sqrt{\frac{\sum_{j=1}^{N}\;\sum_{i \in j_{tr}}\left(y_j - \hat{y}_{ij}\right)^2}{N(M-1)(M-2)}}$$

where jtr denotes the runs in which sample j belongs to the training set. For example, for sample 9, these are runs 1, 3–8, 10–14, 22–23, 25–30, 32–37, 39–44, 46–51 and 53–56.

4. A fourth error is of interest. It is debatable whether the four test samples should be included in the overall error, because they have been used to determine the minimum model for cross-validation. An alternative overall error, SEPMA, excluding these four test samples each time, can be calculated as

$$\mathrm{SEP_{MA}} = \sqrt{\frac{\sum_{j=1}^{N}\;\sum_{i \notin j_{ts}}\left(y_j - \hat{y}_{ij}\right)^2}{N(Q - M + 1)}}$$

where jts is the group of runs in which sample j belongs to the test set. For each sample, seven runs are excluded, leaving 49 runs in total.
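Given a Q × N matrix of run-by-run predictions and Boolean masks recording validation and test membership, these four errors reduce to a few lines. A sketch under the same illustrative assumptions as before, with placeholder data standing in for real network output:

```python
import numpy as np

# Yhat[i, j] is assumed to hold the prediction of sample j in run i; val and
# test are membership masks built with the same run ordering as above.
N, M, Nvt = 32, 8, 4
Q = M * (M - 1)
rng = np.random.default_rng(0)
y = rng.uniform(0.0, 10.0, N)               # placeholder reference values
Yhat = y + rng.normal(0.0, 0.5, (Q, N))     # placeholder predictions

val = np.zeros((Q, N), dtype=bool)
test = np.zeros((Q, N), dtype=bool)
run = 0
for v in range(M):
    for t in range(M):
        if t == v:
            continue
        val[run, Nvt * v:Nvt * (v + 1)] = True
        test[run, Nvt * t:Nvt * (t + 1)] = True
        run += 1
train = ~(val | test)

sq = (y - Yhat) ** 2
SEPV  = np.sqrt(sq[val].sum() / (N * (M - 1)))              # 7 estimates/sample
SEPM  = np.sqrt(sq.sum() / (N * Q))                         # 56 estimates/sample
SEPTR = np.sqrt(sq[train].sum() / (N * (M - 1) * (M - 2)))  # 42 estimates/sample
SEPMA = np.sqrt(sq[~test].sum() / (N * (Q - M + 1)))        # 49 estimates/sample
```

Each column of `val` and of `test` contains exactly M − 1 = 7 True entries, so the per-sample counts of 7, 56, 42 and 49 estimates fall out of the masks directly.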
As in the case of the straight multivariate methods, it is often useful to average the estimates over several runs. In many cases this procedure is important, as it is the only way to obtain an overall model. In PCR, cross-validation is often a step separate from producing a full predictive model: first the number of components, or the effectiveness of the model, is determined, and then the calculation, using the optimum number of components, is repeated on the entire dataset. This is not possible for ANNs, because the test set critically determines when the model converges; removing a different test set results in a different optimum model. The algebraic definition of an overall model is extremely fraught using the methods described in this paper, because the principal components differ according to which samples are removed for the test set. The PCs of each subset of 28 samples differ, and features such as swapping over of PCs, changing signs of scores and often completely different values for the later PCs are encountered. Hence, the average estimate over several validation/test runs is of some significance. Unlike the multivariate methods, each training and test set sample is removed seven times, not once, so all four of the errors above have corresponding, and differing, averaged counterparts.

1. The standard error for the average estimate of the validation samples, given by

$$\mathrm{SEP_{Vav}} = \sqrt{\frac{\sum_{j=1}^{N}\left(y_j - {}^{v}\bar{\hat{y}}_j\right)^2}{N}}$$

where

$${}^{v}\bar{\hat{y}}_j = \frac{1}{M-1}\sum_{i=(s-1)(M-1)+1}^{s(M-1)}\hat{y}_{ij}$$

and $s = \lfloor (j-1)/N_{vt} \rfloor + 1$ is the group to which sample j belongs; e.g., for sample 9, s equals 3, the validation set being represented in runs 15 to 21, as this sample belongs to the third group of four.

2. The overall error of average prediction for all samples across all validation/test runs, given by

$$\mathrm{SEP_{Mav}} = \sqrt{\frac{\sum_{j=1}^{N}\left(y_j - \bar{\hat{y}}_j\right)^2}{N}} \quad \text{where} \quad \bar{\hat{y}}_j = \frac{1}{Q}\sum_{i=1}^{Q}\hat{y}_{ij}$$

3. The error of prediction for the averaged training set results, SEPTRav, can likewise be calculated, using

$${}^{t}\bar{\hat{y}}_j = \frac{1}{(M-1)(M-2)}\sum_{i \in j_{tr}}\hat{y}_{ij}$$

4. The equivalent error, SEPMAav, can be calculated by removing the test samples.

For the non-cross-validated data, only two errors are strictly of interest.

1. The standard error for the training set, given by

$$\mathrm{SEP_{TR}} = \sqrt{\frac{\sum_{i=1}^{M}\left[\sum_{j=1}^{(i-1)N_{vt}}\left(y_j - \hat{y}_{ij}\right)^2 + \sum_{j=iN_{vt}+1}^{N}\left(y_j - \hat{y}_{ij}\right)^2\right]}{N(M-1)}}$$

This is the root mean square error of prediction of the training set samples, each sample being counted seven times, once for each of the M − 1 runs in which it belongs to the training set.

2. The overall model error, given by

$$\mathrm{SEP_M} = \sqrt{\frac{\sum_{i=1}^{M}\sum_{j=1}^{N}\left(y_j - \hat{y}_{ij}\right)^2}{NM}}$$

which includes both the training and test samples. The two equivalent errors on the averaged sample estimates can also be calculated.
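Continuing the sketch given after the previous subsection (and assuming y, Yhat, val, test, train, M and Q as defined there), the averaged estimates and their errors follow directly:

```python
import numpy as np

# Per-sample averaged estimates over the Q = 56 runs
y_val_avg    = np.where(val,   Yhat, 0.0).sum(axis=0) / (M - 1)              # 7 runs
y_all_avg    = Yhat.mean(axis=0)                                             # 56 runs
y_train_avg  = np.where(train, Yhat, 0.0).sum(axis=0) / ((M - 1) * (M - 2))  # 42 runs
y_notest_avg = np.where(~test, Yhat, 0.0).sum(axis=0) / (Q - M + 1)          # 49 runs

def sep(estimate):
    """Root mean square deviation of an averaged estimate from the reference."""
    return np.sqrt(((y - estimate) ** 2).sum() / N)

SEPVav, SEPMav   = sep(y_val_avg),   sep(y_all_avg)
SEPTRav, SEPMAav = sep(y_train_avg), sep(y_notest_avg)
```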
Results

Analysis of Errors

The results of the various error calculations are given in Table 2, and graphs of predicted versus observed concentrations for PCR and ANN for anthracene are given in Fig. 4. Only certain graphs are selected, for brevity. A substantial number of conclusions are possible.

Fig. 4 Graphs of predicted versus observed for anthracene.

Table 2 Summary of the RMS errors (mg ml−1)

Multivariate methods

No cross-validation:

           Anthracene           Fluoranthene         Total detectable PAHs
           PCR       PLS        PCR       PLS        PCR        PLS
SEPP       1.582     1.379      7.436     5.150      56.604     47.938

Cross-validation (non-averaged/averaged):

           Anthracene                   Fluoranthene                 Total detectable PAHs
           PCR           PLS            PCR            PLS           PCR            PLS
SEPM       1.645/1.586   1.436/1.378    7.553/7.321    5.506/5.156   59.33/56.41    50.92/48.02
SEPTR      1.569/1.523   1.333/1.300    7.261/7.084    4.996/4.828   54.96/53.08    46.41/44.92
SEPV       2.101/2.101   2.010/2.010    9.343/9.343    8.231/8.231   83.76/83.76    75.29/75.29

Artificial neural networks

No cross-validation (non-averaged/averaged):

           Anthracene    Fluoranthene   Total detectable PAHs
SEPM       2.342/1.864   12.319/9.853   103.78/74.22
SEPTR      2.283/1.811   11.590/9.177   106.14/75.27

Cross-validation (non-averaged/averaged):

SEPM       2.953/2.062   16.08/10.81    105.63/72.86
SEPMA      2.817/1.923   15.14/9.98     104.37/70.82
SEPTR      2.731/1.803   14.64/9.59     97.84/70.27
SEPV       3.286/2.894   17.86/13.91    105.45/79.91

For the multivariate methods, in all cases SEPV > SEPM > SEPP > SEPTR. This is expected for normal datasets: the validation error should be highest, as the validation data were not used to form the model, and the training error lowest. SEPM should be close to SEPP; in all cases it is slightly higher, reflecting the fact that four samples are not included in computing each model, so their inclusion in the error increases it by a small amount.

Averaging the estimates over all cross-validation runs is useful and has an important influence on the error estimates. In order to obtain an overall averaged model from cross-validation it is useful to perform this operation, and the residuals for the averaged model over all cross-validated runs and for the non-cross-validated data can then be compared directly. Since each sample is a member of only one validation set, SEPVav = SEPV. In all other cases, however, averaging reduces the error, as expected and as is clearly seen in the corresponding graphs. The averaged SEPM is now very close to SEPP in all cases.

The amount by which the error estimate is reduced on averaging reflects the underlying quality of the model. If the true model were completely linear, with all deviations from linearity normally distributed with a mean of 0, the error would be reduced by a factor of √7 = 2.646 for the training set, reflecting the fact that each sample is included in seven training sets, and by √8 = 2.828 for the overall model error; this is clearly not the case. The reason is that the underlying model is not exactly linear, indicating a small lack-of-fit. The reduction in error as sample estimates are averaged over cross-validation runs therefore represents a valuable diagnostic tool. It is interesting to note that SEPTR is reduced proportionally less than SEPM in all cases, as predicted, the average reduction in SEPTR being 3.0% and in SEPM 4.6%, again suggesting that, although there is a small but significant lack-of-fit, the dataset is reasonable.

The results using ANNs are quite interesting. Without cross-validation, the modelling error is only slightly higher than the training error, and it is debatable which statistic best represents the true error. In this case, averaging the results of the eight runs has quite a significant influence on the size of the errors, reducing them by 20 to 30%. Normally distributed errors should reduce by 100 × [1 − (1/√8)], or around 65%, on averaging. This indicates that the model improves considerably when repeat ANN calculations are performed, as expected, but the amount by which the error reduces suggests that a perfect model will not be achieved even after averaging a larger number of runs (which could be done by randomizing the order of the original data again).

It is debatable what should be used as the predictive model for a neural network: whether to average the models from several runs, or to keep the model from a single test run. Owing to the need to use a test set to check for convergence, some samples must be left out of the computation each time, and a model developed by removing just one set of samples will be unrepresentative of the dataset. The values for the non-cross-validated ANN errors in Table 2 are all higher than those for normal regression, suggesting that the averaged model obtained here is not as good as the models obtained with PLS and PCR.
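As a worked check of the averaging diagnostic discussed above, the ideal √n reduction can be compared with the reductions observed in Table 2. A sketch, assuming independent normally distributed residuals for the ideal case and using the anthracene PCR values as the example:

```python
import numpy as np

# Ideal reduction if run-to-run errors were independent: averaging n
# estimates shrinks the RMS error by a factor of sqrt(n).
ideal_tr = 100 * (1 - 1 / np.sqrt(7))    # ~62% expected for the training set
ideal_m  = 100 * (1 - 1 / np.sqrt(8))    # ~65% expected for the overall model

# Observed reductions (anthracene, PCR, from Table 2) fall far short,
# pointing to a lack-of-fit that averaging cannot remove.
obs_tr = 100 * (1.569 - 1.523) / 1.569   # ~2.9% observed for SEPTR
obs_m  = 100 * (1.645 - 1.586) / 1.645   # ~3.6% observed for SEPM
print(ideal_tr, ideal_m, obs_tr, obs_m)
```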
For cross-validated ANNs, SEPV > SEPM > SEPMA > SEPTR, as expected, in all cases. Averaging the sample estimates maintains this order, which is interesting given the different numbers of samples in the training and validation datasets. Note that averaging has a greater influence on the errors than whether a sample is a member of a particular group (training, validation, etc.). The errors of the averaged cross-validated models are roughly comparable in size to the corresponding averaged non-cross-validated errors. However, the non-averaged cross-validated errors for anthracene and fluoranthene are significantly higher than the corresponding non-cross-validated errors. A possible reason is that only 24 samples, or 3/4 of the original data, are used to determine the model. This leads to a small number of highly outlying predictions, as can be seen graphically, and because a root mean square criterion is used, these outlying predictions have a major influence on the size of the error. In practical terms, this suggests that performing ANN calculations using one group of 24 samples carries a chance of producing a very poor model; when 28 samples are used, this probability decreases.

Application to Estimation of PAHs

The methods in this paper can be employed to compare chemometric techniques for the estimation of PAHs. In all cases, PLS outperforms PCR. It is important to calculate a number of indicators to ensure that similar trends are obeyed no matter which error is calculated; had PCR proved superior on one or more indicators, the conclusion would have been more ambiguous. On the whole, PLS is expected to outperform PCR provided that the experimental dataset is well designed, as PLS takes into account variance in both the concentration and spectral dimensions. If PLS performs worse than PCR, this might suggest that there are outliers or unusual measurements, which could influence the statistics if one of them is removed to the validation set. Also, the number of samples has to be much greater than the number of components, and the variability between samples sufficient, to allow sensible calibration models; a series of samples that are effectively replicates may not exhibit this trend.

One experimental caveat to this conclusion, however, is that a great deal of reliance is placed on the independent concentration estimates, in the case of this paper obtained by GC–MS. The conclusions of Table 2 state only that, using the PLS algorithm, a better mathematical model can be developed to predict the GC–MS concentration estimates. If there are large errors in the GC–MS measurements, PLS may not necessarily be a superior approach for concentration estimation, as it is influenced by the quality of the independent measurement. It is outside the scope of this paper to discuss the nature of GC–MS measurements.

ANNs are harder to compare directly with the multivariate methods. A sensible model requires several test and validation set combinations. As can be seen from the graphs, a single ANN run will probably result in a number of very poor predictions; hence it is strongly recommended that the results of all the runs are averaged. The averaged estimates for both the non-cross-validated and cross-validated models result in poorer predictions than PLS and PCR in most cases. The single exception is the averaged PCR validation error for total PAHs. A possible reason is that pure PAHs can be predicted quite well.
In some cases a pure PAH has several characteristic wavelengths, and it is even possible to produce quite accurate linear calibrations at such wavelengths. The quality of a univariate model is primarily related to spectral overlap. In the absence of noise, it is always possible to obtain accurate calibration models using a limited number of wavelengths; for example, if there are only two components in a mixture, the ratio of absorbances at two wavelengths can be employed to determine the relative amounts of each component, and the distribution of concentrations in the mixture set is not relevant. However, for the total PAHs, linear models are less easy to construct, and a more empirical approach such as ANNs may function better, so that ANNs perform comparatively well in this case.

It is recommended that, for calibration of the concentrations of single PAHs, PLS or possibly PCR be employed; ANNs, being non-linear, exhibit few advantages here. However, ANNs may perform reasonably well when predicting parameters such as the sum of the total concentrations of a set of compounds, where a linear model may be less appropriate. For more complex mixtures, for example of 50 to 100 compounds, PLS or PCR may break down, and it is worth exploring ANNs under such circumstances.

Conclusion

This paper has highlighted the importance of a properly thought out scheme for cross-validation, and of the calculation of the associated errors. The particular dataset is predicted well by PLS and PCR, but neural networks might appear to work anomalously well if the wrong statistics are calculated. A great deal more information can be obtained using the type of error analysis proposed in this paper, including whether there truly is an underlying linear model. There is not a great deal of literature on confidence in, and estimates of, lack-of-fit for multivariate calibration, in contrast to the very substantial corresponding literature on univariate calibration.

Monash University, Australia, is thanked for funding sabbatical leave for F.R.B. to visit Bristol.

References

1 Cirovic, D. A., Brereton, R. G., Walsh, P. T., Ellwood, J. A., and Scobbie, E., Analyst, 1996, 121, 575.
2 Martens, H., and Naes, T., Multivariate Calibration, Wiley, New York, 1989.
3 Höskuldsson, A., J. Chemom., 1988, 2, 211.
4 Wold, S., Geladi, P., Esbensen, K., and Ohman, J., J. Chemom., 1987, 1, 41.
5 Kowalski, B. R., and Seasholtz, M. B., J. Chemom., 1991, 5, 129.
6 Demir, C., and Brereton, R. G., Analyst, 1997, 122, 631.
7 Geladi, P., and Kowalski, B. R., Anal. Chim. Acta, 1986, 185, 1.
8 Brown, P. J., J. R. Stat. Soc., Ser. B, 1982, 44, 287.
9 Rumelhart, D. E., and McClelland, J. L., Parallel Distributed Processing, MIT Press, Cambridge, MA, 1986, vol. I.
10 Blank, T. B., and Brown, S. D., Anal. Chim. Acta, 1993, 277, 273.
11 Walczak, B., and Wegscheider, W., Anal. Chim. Acta, 1993, 283, 508.
12 Blank, T. B., and Brown, S. D., Anal. Chem., 1993, 65, 3081.
13 Burden, F. R., J. Chem. Inf. Comput. Sci., 1994, 34, 1229.
14 Deane, J. M., in Multivariate Pattern Recognition in Chemometrics, Illustrated by Case Studies, ed. Brereton, R. G., Elsevier, Amsterdam, 1992, ch. 5.
15 Stone, M., J. R. Stat. Soc., Ser. B, 1974, 36, 111.
16 Wold, S., Technometrics, 1978, 20, 397.
17 Krzanowski, W. J., Biometrics, 1987, 44, 575.
18 Gemperline, P. J., J. Chemom., 1989, 3, 549.
19 The MathWorks Inc., MA, USA.

Paper 7/03565I
Received May 22, 1997
Accepted July 28, 1997

 


