Tutorial review—Outliers in experimental data and their treatment

 

Author: James N. Miller
Journal: Analyst (RSC, available online 1993)
Volume/Issue: Volume 118, issue 5
Pages: 455-461
ISSN: 0003-2654
Year: 1993
DOI: 10.1039/AN9931800455
Publisher: RSC
Data source: RSC

Abstract:

James N. Miller, Department of Chemistry, Loughborough University of Technology, Loughborough, Leicestershire, UK LE11 3TU

This review summarizes critically the approaches available to the treatment of suspect outlying results in sets of experimental measurements. It covers the use of parametric methods such as the Dixon test (with comments on the problems of multiple outliers); the application of non-parametric statistics based on the median to by-pass outlier problems; and the application of robust statistical methods, which down-weight the importance of outliers. The extension of these approaches to outliers occurring in regression problems is also surveyed.

Keywords: Outliers; exploratory data analysis; significance tests; non-parametric and robust statistics; regression methods

Introduction

The quantitative measurements which dominate modern analytical chemistry are inevitably subject to random variation. As a result such measurements are almost invariably replicated. The number of replicates performed may be limited by factors such as the availability of the test material, reagents or time, but in every case the aims are the same. Replicates allow the magnitudes of the random variations to be estimated, for example by calculating standard deviations or confidence limits, and the means of the replicates are expected (if systematic errors are absent or have been corrected) to be closer to the true value of the measurand under study than the individual readings. In practice, when a series of replicate measurements is obtained, it is often found that one or more of the values seem to be substantially different from the others. The practical question then posed is clear: should such outlying results be rejected or not before the mean, standard deviation, etc., of the data are calculated?

The dilemma involved is equally clear. The replicate measurements are regarded as a small statistical sample from a potentially infinite population: the latter is generally assumed to have a Gaussian or 'normal' distribution (see below for further comments on this assumption), in which readings near the mean value are much more likely than readings distant from the mean. Nonetheless, there is some probability of obtaining a value well removed from the mean, and such a reading may appear in the small sample of measurements made in practice. If this is so, what justification can there be for rejecting the suspect data point?

These issues have caused concern and controversy among experimental scientists (not just analytical scientists) for many years, and indeed continue to generate new research, and further controversy! Three separate statistical approaches to the problem can be identified: (1) use of statistical significance tests that assume a Gaussian (or some other defined) error distribution for the population; (2) use of non-parametric statistical methods, which make no such assumptions; and (3) use of robust statistical methods, the underlying principles of which are fully discussed below. Each of these approaches can be criticized, and rather than recommend any one of them, this review seeks to summarize their principles, strengths and weaknesses. What is certain is that all analytical scientists must make a considered decision as to how they are going to treat potential outliers: equally, that decision must be a clearly stated part of reports, published papers, etc.
One further obvious approach to the problem of outliers is to seek more data! If we have four replicates, one of which is suspect, our treatment of the latter may have a crucial effect on our final results. If we could make four further measurements (and assuming that none of these is suspect) our concerns are much reduced, as any statistical test applied will have more power, and in any case the decision as to whether to eliminate the outlier or not will have a smaller effect on the calculated mean, standard deviation, etc. Of course, it may not be possible to obtain more data if the material under study has been exhausted or irreversibly changed by previous measurements, but if more material is available then the first expedient the analytical scientist should consider when outliers arise is to obtain further readings.

Before statistical approaches to outliers are considered in more detail, it is worth noting that the established methods of exploratory data analysis are often valuable. The simplest statistical packages available for personal computers will rapidly present (for example) dot plots (in which the data are effectively plotted as one-dimensional graphs) or box and whisker plots (which show the range, interquartile range, mean, confidence intervals, etc., and often highlight outliers immediately). These elementary visual data displays readily show up suspect values, and also give an immediate overview of the data distribution, thus helping a judgement on whether (for example) the assumption of normality is valid.

It is worth noting that, in some cases, the suspect measurement can readily be explained. A check on the instrumentation used may, for example, reveal an intermittent fault. In other cases, such as in the data set 12.11, 12.27, 12.19, 21.21, 12.18, we can be fairly certain that a transcription error has occurred in the fourth result, which should really be 12.21. In such cases outliers can be omitted or corrected with a relatively easy conscience. The statistical methods now to be summarized need only to be used in those all too frequent cases where outliers occur without obvious explanation.

Parametric Tests for Outliers

The methods described in this section are probably still the most commonly used solutions for outlier detection.1-5 They assume that the data are a sample taken from a population with a Gaussian error distribution, and apply conventional significance test approaches to decisions on (in the simplest cases) the rejection of single outliers. The probability level used in the significance test determines, as usual, the likelihood of a type I error, i.e., the probability that a suspect value will be rejected when in fact it should be retained. As in other significance tests applied in analytical science, the p = 0.05 level is most frequently used.

The principal points of interest in such tests are the test statistics to be used. Their selection is not as easy as it seems. It would clearly be illogical to take the complete data set, i.e., including the suspect results, use it to calculate, for example, 95% confidence limits, and then reject any measurements falling outside such limits. Even less defensible would be outlier rejection on the basis of confidence limits determined with the suspect values excluded!
In practice, the test statistics usually applied are those described by Dixon over 40 years ago, and often described as 'Dixon's Q'. They compare the difference between the suspect value(s) and the values closest to them with the overall range of the results. For n = 3-7 (as always, n is used to describe the total number of measurements), with the ordered sample values labelled x_1, x_2, etc., the test statistic for a single outlier is:

Q10 = (x_n - x_{n-1})/(x_n - x_1)  or  (x_2 - x_1)/(x_n - x_1)   (1a)

according to whether the suspect value is at the high or low end, respectively, of the data set. These equations can be replaced by:

Q10 = |suspect value - nearest value|/range   (1b)

A simple example of this approach is instructive. Suppose that we obtain the results 9.97, 10.02, 10.05, 10.07 and 10.27 cm3 in a titrimetric analysis. Given the precision with which titrimetry can be performed in expert hands, the last result, 10.27 cm3, must be regarded as suspect. The test statistic, Q10, is given in this case by (10.27 - 10.07)/(10.27 - 9.97) = 0.667. The critical value for n = 5 and p = 0.05 is 0.710 (tables of such values are given in several of the books listed at the end of this review). In our example, the test statistic is less than the critical value, so the null hypothesis underlying the test (i.e., that all the results could come from the same population) has to be accepted: the doubtful result 10.27 cannot be rejected. This example is a good illustration of an important principle, viz., that when n is small, as it so often is in analytical work, one result has to be very different from the others before it can be rejected by criteria such as the Dixon test. In this case the suspect titre would have to be as high as 10.32 cm3 (when Q10 is 0.25/0.35 = 0.714) before it could (just) be rejected at the p = 0.05 level. Such findings emphasize the need to obtain extra data wherever possible.

The same set of results can be used to emphasize a point made earlier. When n is small, acceptance or rejection of an outlier makes a large difference to the results of mean and standard deviation calculations. For the five results 9.97, 10.02, 10.05, 10.07 and 10.32 cm3, the mean and standard deviation if the last result is retained are 10.086 and 0.136 cm3, respectively. If the figure of 10.32 cm3 is rejected then the mean and standard deviation fall to 10.0275 and 0.043 cm3, respectively. Note that the estimated standard deviation, s, is reduced by more than two-thirds as a result of rejecting the outlier. As s is routinely used to estimate confidence limits, and in significance tests which compare means, variances, etc., the importance of making a sensible judgement about the suspect result is very clear.

It is at this point, however, that we have to confront some of the difficulties of the Dixon method and other parametric significance tests. One immediate problem is that, as n increases, there is a growing likelihood of obtaining two or more suspect outliers in the same data sample. The identification of multiple outliers is a serious problem (to which we shall return briefly below), but here we are interested only in the effects of increasing n on the identification of a single outlier. Most authorities recommend that modified forms of eqn. (1) are used: one such statistic [eqn. (2)] for n = 8-12, and for n ≥ 13:

Q22 = (x_3 - x_1)/(x_{n-2} - x_1)  or  (x_n - x_{n-2})/(x_n - x_3)   (3)

according to whether x_1 or x_n, respectively, is the suspect value.
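The arithmetic of eqn. (1b) is easily mechanized. A minimal Python sketch of the titration example above follows (the function name is illustrative; the critical value 0.710 for n = 5 at p = 0.05 is taken from the text):

```python
def dixon_q10(values):
    """Dixon's Q10 statistic, eqn. (1b): gap to the nearest value divided by the range.
    The larger of the two end gaps is used, i.e. whichever end looks more suspect."""
    x = sorted(values)
    gap_high = x[-1] - x[-2]   # suspect value at the high end
    gap_low = x[1] - x[0]      # suspect value at the low end
    return max(gap_high, gap_low) / (x[-1] - x[0])

titres = [9.97, 10.02, 10.05, 10.07, 10.27]   # cm3
q = dixon_q10(titres)
print(round(q, 3))    # 0.667
print(q > 0.710)      # False: 10.27 cannot be rejected at p = 0.05 for n = 5
```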
Somewhat surprisingly, there is considerable controversy about the ranges of n values over which eqns. (1)-(3) should be used, and even over the critical values associated with these statistics. (This review follows the recommendations, although not the nomenclature, of the book by Wernimont and Spendley.5) Further difficulties in using this approach to outliers become apparent when we are suspicious about two (or more) of the measurements. Consider the situation where there are two values which (for example) are similar but suspiciously higher than the rest. In that case we clearly cannot simply take the highest result and determine Q10 from eqn. (1), because the numerator in the equation will be small, being given by the difference between the two suspicious values. This effect is known as masking. Two other approaches suggest themselves. One (the so-called block method) considers the two results as a pair, the test statistic being (for example) their mean divided by the mean of the whole set of measurements. The clear danger here is that we necessarily discard or retain both values, even in the situation where it might be right to retain one and reject the other. A consecutive procedure, on the other hand, takes the outliers one at a time: the one that is the nearer to the remainder of the data is tested first, and if it can be rejected, the more pronounced outlier is rejected together with it. If the nearer outlier is retained, the further outlier is tested separately. The relative merits of these block and consecutive procedures are still causing controversy. Still more complications arise when there are two suspect results at opposite ends of the data sample. Tests for all these situations are thoroughly summarized in the book by Barnett and Lewis,1 but there is a growing feeling that such methods should, if possible, be by-passed by using the alternative approaches discussed in later sections.

Before leaving the realm of parametric tests, however, we must re-emphasize that they do assume the presence of (in most cases) a Gaussian distribution of error. If this is not present, the tests are completely invalid. Take, for example, the following set of numbers: 1.26, 1.58, 2.29, 2.51, 3.02, 3.98, 7.94. The value 7.94 certainly seems suspicious, and the Dixon test shows that, at p = 0.05, it could be rejected if a Gaussian error distribution is assumed. However, suppose that, in reality, these numbers were a sample from a log-normal distribution, i.e., one in which the logarithms of the numbers are normally distributed. Their logarithms (base 10, but any base would give the same result) are: 0.100, 0.199, 0.360, 0.400, 0.480, 0.600, 0.900. Now there is no suggestion that the last result is suspicious: the appropriate Q10 value is 0.3/0.8 = 0.375, well below the critical value (p = 0.05) of 0.569. Hence a result that seems suspicious on the assumption of a particular error distribution may not be at all suspicious when the correct distribution is used.

In summary, parametric significance tests are widely used, no doubt because of their superficial simplicity. However, even in their most elementary forms, they are controversial and need to be used with care. In the presence of multiple outliers still further complexities arise, and it is not surprising that the present trend is for such tests to be replaced by other approaches.
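A quick numerical check of the log-normal example above is instructive; the sketch below assumes the suspect value is at the high end, and the critical value 0.569 for n = 7 at p = 0.05 is quoted from the text.

```python
import math

raw = [1.26, 1.58, 2.29, 2.51, 3.02, 3.98, 7.94]
logs = [math.log10(v) for v in raw]

def q10_high(values):
    """Q10 for a suspect value at the high end of an ordered sample."""
    x = sorted(values)
    return (x[-1] - x[-2]) / (x[-1] - x[0])

print(round(q10_high(raw), 3))    # 0.593 > 0.569: 7.94 is rejected if a Gaussian error is assumed
print(round(q10_high(logs), 3))   # 0.375 < 0.569: nothing suspicious on the logarithmic scale
```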
Non-parametric Statistical Methods

In the last section, it was noted that the use of significance tests which assume a particular distribution of errors is fraught with danger. The obvious question therefore arises: why not use methods which make no such assumptions? Such tests have been widely available for many years, although much of their use has in practice been in the social sciences. These distribution-free or non-parametric techniques6-8 are characterized by their use of the median rather than the arithmetic mean as the 'measure of central tendency'. The benefits of this approach are immediately apparent if we reconsider the data given in the previous section. In numerical order, the five results examined were 9.97, 10.02, 10.05, 10.07 and 10.27. The median of these results, i.e., the middle value when the numbers are ordered, is 10.05. (If the number of results is even, the median is the average of the two middle values.) It is clear that the median is entirely unaffected by the suspect value, and will remain 10.05 as long as the highest value is >10.07. This valuable behaviour contrasts with that of the mean, as already noted, and extends to the situation in which there are two outliers: if the first value had been mistakenly recorded as 9.77, or the fourth value as 10.70, the median of the five numbers would remain 10.05. These advantages extend to other methods which utilize medians directly, for example the Theil methods used in regression (see below). The use of the median therefore circumvents rather than confronts the problems of outlier identification, and is valuable when n is small.

The most common non-parametric measure of dispersion (i.e., the 'spread' of the results) is the inter-quartile range (IQR). If we imagine a set of ordered numbers to be divided into two groups, one above and one below the median, each of these groups can itself be divided into two by the quartiles: the difference between the two latter numbers is the IQR. As it does not involve the highest and lowest values, this statistic also has the valuable property of being unaffected by outliers. It may also be shown that, for a Gaussian distribution of error, the IQR is about 1.35σ, where σ is the population standard deviation: this relationship allows us to estimate the standard deviation of a set of data in a manner which is independent of outliers. Unfortunately, the IQR is a rather unrealistic concept for small samples, and there are a number of different conventions for calculating it, which give significantly different results when n is small. It has, therefore, seen relatively little use in analytical chemistry calculations.

The number of non-parametric significance tests is very large indeed, and space only permits a summary of their strengths and weaknesses. Their value in dealing with suspicious results is well illustrated by the following example. Suppose that the levels of a metal ion (in ng cm-3) in the water from 32 rivers in two areas are as follows. Area A: 29, 42, 60, 80, 83, 110, 130, 168, 194, 230, 260, 270, 275, 280, 350, 780; and area B: 122, 140, 160, 220, 245, 250, 260, 268, 348, 390, 420, 430, 445, 454, 482, 498. Is there any evidence that the metal ion levels in the rivers in the two areas differ?
The first point to note about these measurements is that they are not replicates: the metal ion concentration has been measured only once for each water sample from each area. There are, therefore, almost no circumstances in which the apparently anomalous result of 780 ng cm-3 from area A could even be considered as an outlier in the usual sense. It may be very different from all the other area A data, but there will normally be no reason for rejecting it. It is also important to note that there is no reason to think that these samples are necessarily drawn from populations with Gaussian error distributions, so conventional parametric tests such as the t-test may not be justifiable.

The best-known non-parametric approach to problems of this type is the Mann-Whitney U-test, sometimes known now as the Wilcoxon-Mann-Whitney test as it is closely related to the Wilcoxon Rank Sum test. The idea underlying the Mann-Whitney formulation is very simple. Inspection of the data shows that, in general, the metal ion levels from area A are lower (median = 181 ng cm-3) than those from area B (median = 308 ng cm-3). Hence the number of occasions on which individual area B results are exceeded by individual area A results should be fairly small. Performing the test simply involves counting the number of occasions on which this occurs. Hence the result 122 from area B is exceeded by 130 ... 780 from area A, i.e., by ten area A values; the result 140 from area B is exceeded by nine area A values, and so on. Allowing for one 'tie' (i.e., when the two equal results of 260 ng cm-3 are compared), which counts 0.5 in our tally, the total number of cases in which area B data are exceeded by area A data is 66.5. Reference to statistical tables shows that, for n1 = n2 = 16, this test statistic must be less than or equal to 75 (p = 0.05) if the null hypothesis (i.e., that the two sets of measurements have equal medians) is to be rejected. This is clearly the case in this example, so the Mann-Whitney method suggests that the metal ion levels in the two areas are probably different. It is of interest that, at the same probability level, the t-test (just) fails to detect this difference. As already noted, the t-test may not be appropriate anyway, not, at least, until we have checked for the normality of the two sets of data, but it is clear that the Mann-Whitney method has the advantage that it has effectively down-weighted the significance of the anomalous area A result of 780 ng cm-3. The same Mann-Whitney result would have been obtained if this reading had taken any value ≥499, i.e., any value higher than the highest area B value. By contrast, the reading of 780 ng cm-3 has greatly inflated both the mean and the standard deviation of the results for area A.

This property of the Mann-Whitney method (i.e., of being little affected by anomalous results) is known as robustness, and is more fully explored in the next section. It arises because the test really considers not the absolute values of the measurements but their ranks, i.e., their numerical positions if the data are arranged in order. This is not immediately apparent from the way the test has been described here, but is clarified in the next example.
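The counting procedure just described is easy to reproduce computationally. A minimal Python sketch of the river-water comparison follows (the critical value of 75 for n1 = n2 = 16 at p = 0.05 is taken from the text):

```python
area_a = [29, 42, 60, 80, 83, 110, 130, 168, 194, 230,
          260, 270, 275, 280, 350, 780]
area_b = [122, 140, 160, 220, 245, 250, 260, 268, 348, 390,
          420, 430, 445, 454, 482, 498]

# Mann-Whitney U: for every (B, A) pair, count 1 when the area A value
# exceeds the area B value and 0.5 for a tie.
u = sum(1.0 if a > b else 0.5 if a == b else 0.0
        for b in area_b for a in area_a)
print(u)           # 66.5
print(u <= 75)     # True: reject the null hypothesis of equal medians at p = 0.05
```

In practice a library routine such as scipy.stats.mannwhitneyu returns the same statistic (or its complement n1n2 - U, depending on convention) together with a p-value.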
Two different methods were utilized to determine the concentration of nitrate ion in a single sample of tap water. Each method was used six times, with the following results (in µg cm-3). Ion chromatography: 0.57, 0.60, 0.62, 0.66, 0.67, 0.68; and ion-selective electrode: 0.46, 0.50, 0.59, 0.69, 0.83, 0.85. In this case the question to be addressed is whether the two methods differ significantly in their precisions. (The median concentration is the same in each case, 0.64 µg cm-3.)

If we can assume a Gaussian error distribution for each data set, and if we set aside worries about the two possible outliers (0.83, 0.85) in the ion-selective electrode data, we could use the well-known F-test to make this comparison. The alternative is the Siegel-Tukey method, in which the results are first written down as one continuous and ordered list, with the ion chromatography results distinguished (here marked with an asterisk): 0.46, 0.50, 0.57*, 0.59, 0.60*, 0.62*, 0.66*, 0.67*, 0.68*, 0.69, 0.83, 0.85. A little consideration suggests that, if the spread of the two sets of results were roughly the same, the two kinds of result would appear roughly at random across the list. In practice, the ion chromatography results tend to be concentrated in the middle of the list. This is expressed numerically by the use of paired alternate ranking. The lowest result is ranked 1, the highest result is ranked 2, the second highest 3, the second lowest 4, the third lowest 5, etc. These rankings are written down with the marking retained: 1, 4, 5*, 8, 9*, 12*, 11*, 10*, 7*, 6, 3, 2.

The sum of the ion chromatography ranks is 5 + 9 + 12 + 11 + 10 + 7 = 54, and the sum of the remaining ranks is 1 + 4 + 8 + 6 + 3 + 2 = 24. (It is worth verifying that the two sums total 78, the sum of the first 12 natural numbers.) We must now subtract from each of these sums n(n + 1)/2 and m(m + 1)/2, where n and m are the numbers of measurements in the two data sets. This may seem unnecessary in this case, as n = m = 6, but the test is formulated in this way to allow for cases where n ≠ m. Hence we subtract 21 from each sum, giving us 33 and 3, respectively. We now take the lower of these two results, i.e., 3, as our test statistic. The critical value (which is derived from the same set of tables as are used in the Wilcoxon-Mann-Whitney test described above) for n = m = 6 at p = 0.05 is 5, i.e., our test statistic must be ≤5 for the null hypothesis of equal spreads to be rejected. (Note that this test and the closely related Wilcoxon-Mann-Whitney method are unusual in that the critical region, i.e., the region leading to rejection of the null hypothesis, occurs when the test statistic is less than or equal to the critical value in the tables.) In this example we have clearly demonstrated that the null hypothesis can be rejected.

The Siegel-Tukey method is also robust: the ion-selective electrode value of 0.46 could have been replaced by any value <0.50, and the values 0.69, 0.83 and 0.85 could have been replaced by any values >0.68, without disturbing the rank order calculated above. However, this test is open to criticism on the grounds that it lacks power, even when, as in our example, the medians of the two sets of results are equal or close to each other. The power of a test is its ability to reject correctly a false null hypothesis. Generally speaking, significance tests are most powerful if they use the greatest amount of available information. In the Wilcoxon-Mann-Whitney and Siegel-Tukey tests we gain both simplicity (examples with small n, m, can often be calculated mentally) and robustness by replacing the actual measured concentrations, etc., by ranks, but the inevitable price paid is some loss of power.
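The paired alternate ranking is fiddly by hand but simple to code. The Python sketch below reproduces the nitrate-ion example (function names are illustrative, ties are ignored since there are none in these data, and the critical value of 5 is taken from the text):

```python
def siegel_tukey_ranks(n_total):
    """Paired alternate ranks for a combined, sorted sample: 1 to the lowest,
    2 and 3 to the two highest, 4 and 5 to the next two lowest, and so on."""
    ranks = [0] * n_total
    lo, hi = 0, n_total - 1
    for rank in range(1, n_total + 1):
        if rank % 4 in (1, 0):   # ranks 1, 4, 5, 8, 9, ... go to the low end
            ranks[lo] = rank
            lo += 1
        else:                    # ranks 2, 3, 6, 7, 10, ... go to the high end
            ranks[hi] = rank
            hi -= 1
    return ranks

ic = [0.57, 0.60, 0.62, 0.66, 0.67, 0.68]    # ion chromatography
ise = [0.46, 0.50, 0.59, 0.69, 0.83, 0.85]   # ion-selective electrode

labelled = sorted([(v, "IC") for v in ic] + [(v, "ISE") for v in ise])
rank_sum = {"IC": 0, "ISE": 0}
for (value, group), r in zip(labelled, siegel_tukey_ranks(len(labelled))):
    rank_sum[group] += r

n, m = len(ic), len(ise)
stat = min(rank_sum["IC"] - n * (n + 1) // 2,
           rank_sum["ISE"] - m * (m + 1) // 2)
print(rank_sum, stat)   # {'IC': 54, 'ISE': 24} and 3; 3 <= 5, so equal spreads are rejected
```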
Many non-parametric methods are robust (although the opposite is not true), but some non-parametric methods even lack robustness. This is true of, for example, Tukey's quick test, which is often used as a rule-of-thumb for comparing two sets of results such as the river water data given above. Tukey's test declares that, for two statistical samples to be determined to come from different populations, the one with the higher median must have some data that are higher than all the measurements in the other sample, and the one with the lower median must have some results which are lower than all the data in the 'higher' sample. This is clearly not the case in our river water example, because of the anomalous result of 780 ng cm-3 for area A, so the Tukey test would suggest, because of its lack of robustness, that the two samples were not significantly different. By contrast, some non-parametric methods have been specifically designed to identify outliers. These techniques, summarized in the classic text by Barnett and Lewis,1 seem to have found little practical application.

In the context of analytical measurements, the greatest objection to the use of non-parametric methods is that they discard too much information. The common use of ranking methods, and the associated loss of power, has already been mentioned. More fundamental is the belief that in many analytical situations the data do come, at least in part, from populations with Gaussian error distributions, but with contamination by outliers arising from gross errors. There is also increasing evidence that, in practice, many data sets come from heavy-tailed distributions, i.e., with an excess of measurements distant from the mean. Such a situation might arise from the superposition of two or more normal distributions with similar means but with different standard deviations, an outcome that might occur if, for example, two or more individuals, pieces of equipment or sets of experimental conditions had been used in making the measurements. In these circumstances, the ideal way to treat suspect or outlying results would be to use methods which down-weight, without entirely rejecting, measurements which are far distant from the mean. Methods using this approach are summarized in the next section.

Robust Statistics

Robust statistical methods9-11 have been studied extensively over the last two decades or more, but have only recently become widely available to practising scientists. This is because, despite the underlying simplicity of their principles, they are frequently iterative methods, which cannot usually be undertaken without a personal computer. It is, therefore, the widespread availability of ample computing power that has brought these methods into sharper focus (although many otherwise excellent software packages still omit robust statistics), and they may in future become the methods of choice for dealing with data sets containing suspicious results. Numerous robust tests have been developed; again, only a summary can be attempted here.

Some very simple robust approaches do not require iterative computations. One fairly obvious calculation is that of the trimmed mean. This involves the deliberate omission of a certain percentage of the measurements (in practice 10-25% trimming is common: a '10% trimmed mean' is the mean determined after the top 10% and the bottom 10% of the results have been omitted). This procedure has the obvious drawback of being arbitrary.
How do we decide the extent of the trimming, why trim at both ends of the data sample when suspicious results may only occur at one end, and why remove such results altogether when, as already noted, it might be better simply to reduce their weights? Moreover, trimming is of little value in many cases where the number of measurements is small. With a sample of five measurements, the least trimming that can be carried out is to eliminate the top and bottom measurements (i.e., 20% trimming), leaving only three data points: this is clearly unacceptable. A variant on the trimming approach is Winsorization, in which an outlying result is not removed, but 'moved' so that its residual (i.e., the difference between the single result and the sample mean) is reduced so that it becomes the same as the second (or perhaps third) largest or smallest result. Although Winsorization is still sometimes used in regression problems (see below), these relatively crude methods have been largely supplanted by the more sophisticated iterative approaches.

At this stage, it is important to introduce a robust dispersion estimate known as the median absolute deviation (MAD). The MAD is given by:

MAD = median[|x_i - median(x_i)|]   (4)

If we apply this equation to the data given at the beginning of this review (9.97, 10.02, 10.05, 10.07, 10.27; median 10.05), the individual absolute deviations from the median are 0.08, 0.03, 0, 0.02 and 0.22, respectively. The MAD is, therefore, the median of the five latter numbers, i.e., 0.03. The MAD has several useful attributes. For example, it provides the basis of a crude outlier test, based on the ratio |x_0 - median(x_i)|/MAD. If this ratio exceeds 5 for any single outlier, x_0, then that outlier can be rejected. Note that this test suggests that, in the above data, the result 10.27 could be rejected, the ratio being 0.22/0.03: this conclusion disagrees with that of the Dixon method, rightly making us suspicious of both methods! A more important property of the MAD is that MAD/0.6745 can be shown to give a useful and robust estimate (σ̂) of the population standard deviation, σ. In our example, MAD/0.6745 = 0.03/0.6745 = 0.0445.

We can now make a robust estimate (µ̂) of the population mean, µ, using a distance function different from that usually used. In conventional statistics our estimate of the mean is obtained by minimizing a sum of squares (SS), Σ(|x_i - µ|)². In this case the distance function is (x_i - µ)²: as we have already seen, it is the use of the sum of such squared terms which makes this estimate of the mean so sensitive to large errors. We now use an alternative distance function, simply |x_i - µ|. Any measurement for which this function exceeds 1.5σ̂ is down-weighted. It should be noted that the value of the constant 1.5 is not obligatory, but is fairly general. In our example 1.5σ̂ = 1.5 x 0.0445 = 0.0667. The iterative process of calculating µ̂ needs an initial estimate, which we can conveniently take as the median of the measurements, 10.05. Deviations from this value exceeding 0.0667 therefore need down-weighting, a process achieved by replacing such results by µ̂ ± 0.0667 as appropriate. This process clearly moves the value 9.97 to a new value of 10.05 - 0.0667 = 9.9833, and the value 10.27 is reduced to 10.05 + 0.0667 = 10.1167.
We therefore have a new set of five values, usually called pseudo-values; three remain unchanged (10.02, 10.05 and 10.07) but the other two are altered from their original values to those just calculated. A new mean can, therefore, be determined as a second estimate of µ: note that although the initial µ̂ estimate may be the median, all subsequent estimates of µ are means. This second estimate is 10.048: further down-weighting may now be required if any of the pseudo-values lie outside the range 10.048 ± 0.0667, i.e., 9.9813-10.1147. The highest pseudo-value calculated in the first iteration (10.1167) lies just outside this range, and is, therefore, altered to 10.1147. The new set of pseudo-values (now 9.9833, 10.02, 10.05, 10.07 and 10.1147) gives us a third estimate of µ, i.e., 10.0476. This value is so close to the previous estimate that we can safely take a rounded value of 10.048 as a robust estimate for µ. Clearly, this calculation will be more complex with a larger set of measurements, and the µ estimates may converge more slowly, hence the need for a personal computer to expedite the calculations. The principles underlying this method ('Huber's M-estimator') are clear and simple, but other methods are available. In this case the use of MAD/0.6745 as a robust standard deviation estimate survived through each iteration, i.e., the assumed value of σ̂ was constant: in other methods a robust estimate of the mean is the starting point for iterative calculations of the standard deviation, and in still others both mean and standard deviation are calculated iteratively and simultaneously.

In summary, robust statistics provide a sound and logical approach to the problem of suspect results, by down-weighting them to an extent that depends on their degree of departure from the remaining data. Although the computations may be tedious to explain, they are simple to perform with the aid of suitable software. It must be added that several of the books on the theory of robust methods are forbidding in the extreme: preferable are two excellent summaries provided by the Analytical Methods Committee.9,10
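The iteration just described is easy to automate. A minimal Python sketch follows, assuming (as in the worked example) that the MAD-based σ estimate is held fixed throughout; the function name is illustrative. Run to convergence it settles at about 10.047, in agreement with the rounded hand calculation above to within a unit in the third decimal place.

```python
import statistics

def huber_mean(values, c=1.5, tol=1e-4, max_iter=100):
    """Huber-type M-estimate of the mean: observations further than c*sigma_hat
    from the current estimate are pulled in to that boundary, with sigma_hat
    fixed at MAD/0.6745 throughout."""
    med = statistics.median(values)
    mad = statistics.median(abs(x - med) for x in values)
    sigma_hat = mad / 0.6745
    mu = med                                   # initial estimate: the median
    for _ in range(max_iter):
        lo, hi = mu - c * sigma_hat, mu + c * sigma_hat
        pseudo = [min(max(x, lo), hi) for x in values]   # clip to mu +/- c*sigma_hat
        new_mu = sum(pseudo) / len(pseudo)
        if abs(new_mu - mu) < tol:
            return new_mu
        mu = new_mu
    return mu

data = [9.97, 10.02, 10.05, 10.07, 10.27]
print(round(huber_mean(data), 3))   # about 10.047 (cf. 10.048 after three manual iterations above)
```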
Outliers in Regression

The methods discussed in the previous sections are applicable to replicate measurements of the same signal, titre, intensity, etc. However, perhaps the commonest approach to the handling of analytical data is to use regression techniques.12,13 These find application in two areas in particular: in the plotting of calibration graphs in quantitative analysis, and in method comparison studies. In calibration graphs a series of specimens of known concentration is examined, the instrumental signals resulting are recorded, and the data plotted with the signals on the y-axis and the standard concentrations on the x-axis. An appropriate calibration line is drawn, and then used to estimate the concentrations of test specimens by interpolation. The calculations in this application usually assume that errors occur only in the y-direction, i.e., that the standard concentrations are error-free: this assumption may not always be valid. In method comparison studies, a number of specimens are examined by each of the two methods under study, and the two sets of results obtained are plotted on the x- and y-axes. Each point on the graph therefore represents a single specimen examined by the two methods. If the methods gave identical results in every case, the graph would be a straight line with zero intercept, unit slope and a correlation coefficient of 1: of course such results are never obtained in practice, but the closeness of these statistics to their ideal values is used as a measure of the agreement between the two methods. In this instance it is obvious that measurement errors must be expected in both the x- and y-directions.

In either of these applications of regression methods, it is clear that outliers or suspicious results may occur, especially because in many cases restrictions of time or material mean that only single measurements or very small numbers of measurements are made on each standard or test specimen. The identification and/or treatment of such results is, therefore, as important in regression calculations as it is in replicate measurements.

[Fig. 1 A, straight-line and B, quadratic least-squares fits to a set of data points (for data see text). The straight-line fit gives an adjusted R2 value of 99.4%, with the point (70, 64) marked by MINITAB as a possible outlier. The quadratic fit gives an adjusted R2 value of 99.6%, with no outliers identified.]

Moreover, it is apparent that in regression calculations outlier problems are rather more complex. There are several reasons for this. One is that, as already noted, outliers may occur in the x- and/or the y-directions. (It must be added that in multiple regression, e.g., with the form y = a + b1x1 + b2x2 + b3x3 + ..., outliers are even harder to identify, but this problem will not be treated here.) A second problem is that, to an even greater extent than in replicate measurements, the status of possible outliers depends crucially on the model assumed. This is exemplified in Fig. 1, which shows a fairly simple calibration graph. If the analyst was absolutely certain, on the basis of experience, underlying physico-chemical principles, etc., that the plot should be linear, then the last point on the graph might be regarded as an outlier. However, as Fig. 1 also shows, the points are excellently fitted by a quadratic model, in which case no question of outlier identification arises.

Lastly, we must observe that some apparently obvious approaches to outlier identification are not really admissible. It might be supposed, for example, that in an unweighted regression calculation (i.e., one in which the y-direction error is assumed to be the same at all values of x), outliers could simply be identified by treating the y-residuals as a set of replicate results, and applying the Dixon or other test methods. (The y-residuals are the (y - ŷ) values, i.e., the y-direction distances between the experimental points and the fitted line at any x-value.) This method is unsound, however, as the y-residuals are not independent measurements: they total zero.

Despite these concerns, it is found in practice that the approaches to outlier diagnosis in regression mirror those used in replicate measurements. There are outlier tests, some fairly simple and some less so; non-parametric methods are available; and robust regression methods have also found recent use in analytical work. However, an additional concept of particular importance to regression problems is worth introducing at this point, viz., the breakdown point (BDP).
Put simply (there is, of course, a mathematical definition also), the breakdown point is the percentage of outlier points on the graph that can be tolerated before the regression line determined is significantly altered. Common sense indicates that the maximum value of the BDP is 50% (beyond that level, who is to say which points are the genuine ones, and which are the outliers?). It is easy to show that, in conventional least-squares regression, even a single outlier can greatly change the estimates of the regression coefficients: the BDP of this approach is, therefore, 0%. In analytical science, we would hope that in most cases the number of outliers will not be large, so any method with a BDP of, say, ≥20% will be of importance.

The use of relatively simple test methods, based as so often in regression statistics on residual diagnostics, is exemplified in many elementary statistical packages. (As already noted, residuals are [y - ŷ], i.e., [experimental - fitted] values.) For example, MINITAB,14 widely used in both teaching and research, will list the residuals obtained in a least-squares regression calculation, and convert them into standardized residuals (i.e., with mean = 0 and standard deviation = 1). Points whose standardized residuals are >2 or <-2 are highlighted in the printout of the results. Note that, as with ordinary residuals, standardized residuals are not independent, so this method of highlighting outliers has to be treated with reserve, especially when n, the number of points on the graph, is small. However, two additional ways of manipulating residuals do not suffer from this disadvantage. Studentized residuals, r_i, are given by:

r_i = e_i/[s_(y/x) (1 - h_i)^(1/2)]   (5)

In this equation, e_i stands for the original residuals; s_(y/x) is the residual standard deviation, [Σe_i^2/(n - k - 1)]^(1/2), where k is the number of terms in x, x^2, etc., in the regression equation; and h_i is the leverage of the ith point, given by:

h_i = (1/n) + (x_i - x̄)^2/[(n - 1)s_x^2]   (6)

where s_x^2 = Σ(x_i - x̄)^2/(n - 1). In simple linear regression, therefore, the leverage of a point is a measure of its distance from the mean of the x_i values. It may be shown that, for a graph which is not forced through the origin, i.e., has a constant term, a, h_i lies between 1/n and 1. MINITAB and other statistics packages highlight x_i-values with high leverages, a facility of obvious value in identifying x-direction outliers when a regression line is used in method comparisons. Note that a point with a high leverage is not necessarily a y-direction outlier, and vice versa. A further modified form of residual is the jackknife residual, r_(-i), given by:

r_(-i) = e_i/[s_(-i) (1 - h_i)^(1/2)]   (7)

where s_(-i) is analogous to s_(y/x), except that the ith point is omitted from the calculation. Both the Studentized and jackknife residuals have the property that they approximately follow a t-distribution, with (n - k - 1) and (n - k - 2) degrees of freedom, respectively. Moreover, if n is greater than about 30, the distributions of these residuals are approximately standard normal distributions. These properties make the identification of outliers relatively simple at any chosen p-value.

Yet another and distinct approach is to use Cook's distance, d_i, a statistic that measures the extent to which the regression coefficients (a, b, etc.) are influenced by the omission of individual points. One common form of the equation, in the notation used above, is:

d_i = [r_i^2/(k + 1)][h_i/(1 - h_i)]   (8)

It can be shown that d_i is always ≥0, and should normally be less than 1; values >1 are regarded as worthy of further investigation as possible outliers.
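As a concrete illustration, the short Python sketch below computes the leverages of eqn. (6), the Studentized residuals of eqn. (5) and Cook's distances in the form of eqn. (8) for a straight-line fit to the calibration data of Fig. 1 (listed below); the last point emerges with a residual of about -2.14, the value MINITAB reports for it in the text.

```python
x = [0, 10, 20, 30, 40, 50, 60, 70]
y = [1, 9, 21, 30, 41, 49, 59, 64]
n, k = len(x), 1                      # k = number of terms in x (straight line)

x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
a = y_bar - b * x_bar                 # least-squares line: y = 1.50 + 0.936x

e = [yi - (a + b * xi) for xi, yi in zip(x, y)]          # ordinary residuals
s_yx = (sum(ei ** 2 for ei in e) / (n - k - 1)) ** 0.5   # residual standard deviation
h = [1 / n + (xi - x_bar) ** 2 / s_xx for xi in x]       # leverages, eqn. (6)
r = [ei / (s_yx * (1 - hi) ** 0.5) for ei, hi in zip(e, h)]       # eqn. (5)
d = [ri ** 2 / (k + 1) * hi / (1 - hi) for ri, hi in zip(r, h)]   # eqn. (8)

print(round(r[-1], 2), round(h[-1], 2), round(d[-1], 2))
# about -2.14, 0.42, 1.64: only the point (70, 64) has |r| > 2 and d > 1
```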
These and related methods are fully discussed by Kleinbaum et al.13

A previous section of this review showed that non-parametric methods based on the median were of value in by-passing, as it were, the problems associated with possible outliers. This advantage extends to non-parametric regression methods, of which the best known is that of Theil. His original ('complete') method involved taking all the possible pairs of points from the n points on the graph, and estimating the slopes of the lines joining them. The median of these n(n - 1)/2 estimates was taken as the best estimate of the slope, b, of the regression line. Use of this estimate together with the coordinates of the n points and the equation a = y - bx provided n estimates of the intercept, a, of the regression line, and again the median was chosen as the best estimate of a. This procedure is open to at least two objections, of which the simpler is the practical one, that the number of slope estimates is very large: even a graph with only eight points would provide 28 separate estimates of b. The second and more serious objection is that the method gives equal weight to the slope estimates from pairs of points that are close together and well separated on the graph; surely the latter estimates should carry more weight? Both problems are overcome by the simpler procedure used in Theil's 'incomplete' method, in which n/2 slope estimates are given by using (x_1, y_1) and the point immediately above the median value of x_i, (x_2, y_2) and the second point above the median, and so on. If, as is usually the case in calibration graphs, the x_i values are equally spaced, then the slope estimates obtained using this method will rightly be accorded equal weight. (If n is odd, then the point with the median value of x_i is omitted from the calculation.)

We can apply this method to the data used in Fig. 1, which were: x = 0, 10, 20, 30, 40, 50, 60, 70; y = 1, 9, 21, 30, 41, 49, 59, 64. The four slope estimates obtained from these measurements are clearly 40/40, 40/40, 38/40 and 34/40, so the median slope estimate is 39/40, i.e., 0.975. The eight individual intercept estimates, determined as described above, are then 1, -0.75, 1.5, 0.75, 2, 0.25, 0.5 and -4.25. The median of these values is 0.625. Hence the Theil method gives a straight line equation of y = 0.625 + 0.975x. The conventional least-squares method applied to the same data gives a result of y = 1.50 + 0.936x. Both these lines are plotted in Fig. 2: it is clear that the Theil method has effectively ignored the point (70, 64), which indeed does not figure directly in either the median estimate of b, or that of a. (This is not simply because it is the last point of the eight: the advantages of the Theil method apply to any of the points.) By contrast the least-squares line has given the point (70, 64) a weight equal to any other point, so the least-squares line passes closer to it than the non-parametric line. Again, therefore, the non-parametric method has effectively circumvented the outlier problem. [Note that the MINITAB output for this set of data highlights the point (70, 64) as a possible outlier, with a standardized residual of -2.14, and that, as already shown, a straight line plot may not be the most appropriate for this data set.]

[Fig. 2 A, least-squares and B, Theil incomplete methods applied to fit a straight line to the data plotted in Fig. 1 (for data see text).]
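A minimal Python sketch of Theil's incomplete method reproduces the fit just quoted; it assumes the x-values are already in ascending order and that n is even, as in this example, and the function name is illustrative.

```python
import statistics

def theil_incomplete(x, y):
    """Theil's 'incomplete' method: pair each point in the lower half of the
    x-range with the corresponding point in the upper half, take the median
    of those slopes, then the median of the resulting intercept estimates."""
    n = len(x)
    half = n // 2
    slopes = [(y[i + half] - y[i]) / (x[i + half] - x[i]) for i in range(half)]
    b = statistics.median(slopes)
    a = statistics.median(yi - b * xi for xi, yi in zip(x, y))
    return a, b

x = [0, 10, 20, 30, 40, 50, 60, 70]
y = [1, 9, 21, 30, 41, 49, 59, 64]
print(theil_incomplete(x, y))   # (0.625, 0.975), i.e. y = 0.625 + 0.975x as in the text
```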
The BDP of the Theil method is about 30%, more than adequate for most purposes, and it can be improved by using iterative methods to estimate the median slope and intercept. It can be freely used in cases where there are both x- and y-direction errors. Moreover, the principles underlying the method are fairly general, and can be extended to polynomial and multiple regression problems.

Lastly, we consider briefly the use of robust regression methods. By analogy with the robust techniques discussed earlier, we expect these to modify suspect or outlying results to an extent that depends on their departure from normal behaviour. One approach that has been shown to be powerful, leading in some circumstances to methods with BDPs of 50%, is that of trimming. The least trimmed squares and least Winsorized squares methods both utilize the least-squares concept, but the residuals concerned are modified by trimming or Winsorization, respectively. In this case, and possibly in contrast to repeated measurements, there have been claims that trimming is the superior approach. Several other robust methods have been developed. Over a century ago, the least absolute values method was proposed, i.e., the sum of the residuals without regard to their signs is minimized. This fairly simple method has a BDP of 50% with respect to y-direction outliers, but its BDP in the x-direction is zero. A much more recent and widely applied method is the least median of squares technique, in which an iterative process is used to minimize the median value of e_i^2. The main disadvantage with this method in practice is that the convergence of values can be fairly slow, but this is not a major problem for the small data sets encountered in analytical science. These and other robust regression methods are summarized in another established text, by Rousseeuw and Leroy.11 The importance of these methods to analysts is underlined by an increasing number of research papers in which robust regression is used, or the various robust approaches compared. This, in turn, should encourage the wider availability of appropriate software.

References

1 Barnett, V., and Lewis, T., Outliers in Statistical Data, Wiley, New York, 2nd edn., 1984.
2 Hawkins, D. M., Identification of Outliers, Chapman and Hall, London, 1980.
3 Miller, J. C., and Miller, J. N., Statistics for Analytical Chemistry, Ellis Horwood, Chichester, 3rd edn., 1993.
4 Anderson, R. L., Practical Statistics for Analytical Chemistry, Van Nostrand Reinhold, New York, 1987.
5 Use of Statistics to Develop and Evaluate Analytical Methods, eds. Wernimont, G. T., and Spendley, W., Association of Official Analytical Chemists, Arlington, VA, 1985.
6 Sprent, P., Quick Statistics, Penguin, Harmondsworth, 1981.
7 Sprent, P., Applied Non-Parametric Statistical Methods, Chapman and Hall, London, 1989.
8 Conover, W. J., Practical Non-Parametric Statistics, Wiley, New York, 2nd edn., 1980.
9 Analytical Methods Committee, Analyst, 1989, 114, 1489.
10 Analytical Methods Committee, Analyst, 1989, 114, 1497.
11 Rousseeuw, P. J., and Leroy, A. M., Robust Regression and Outlier Detection, Wiley, New York, 1987.
12 Draper, N. R., and Smith, H., Applied Regression Analysis, Wiley, New York, 2nd edn., 1981.
13 Kleinbaum, D. G., Kupper, L. L., and Muller, K. E., Applied Regression Analysis and Other Multivariable Methods, PWS-Kent, Boston, MA, 1988.
14 MINITAB (Minitab Inc.), User's Manual, Addison Wesley, London.
Paper 3/00356F
Received January 20, 1993
Accepted February 24, 1993

 



