Proficiency testing of analytical laboratories: organization and statistical assessment

 

Author: Analytical Methods Committee (Royal Society of Chemistry)

 

Journal: Analyst (RSC, available online 1992)
Volume/issue: Volume 117, issue 1
Pages: 97-104
ISSN: 0003-2654
Year: 1992
DOI: 10.1039/AN9921700097
Publisher: RSC
Data source: RSC

Abstract:

Proficiency Testing of Analytical Laboratories: Organization and Statistical Assessment

Analytical Methods Committee*
Royal Society of Chemistry, Burlington House, Piccadilly, London W1V 0BN, UK

* Correspondence should be addressed to the Secretary, Analytical Methods Committee, Analytical Division, Royal Society of Chemistry, Burlington House, Piccadilly, London W1V 0BN, UK.

Proficiency testing is becoming an integral feature of laboratory accreditation, which itself is now being advocated as a result of the development of the European Community Certification and Accreditation Policy. In analytical science, proficiency testing involves the regular circulation of test materials for analysis in the participating laboratories, and the subsequent assessment of the resulting data by the organizing body. It plays a vital role in the achievement and maintenance of appropriate data quality, in combination with the use of certified reference materials, validated methods and data quality control. In this report the technical background and organization of proficiency testing are presented, together with statistical methods for interpreting the results. Because of the rapid proliferation of proficiency testing schemes, it is recommended that analytical chemists move towards a unified approach to the presentation of results from trials, so that the meaning of the results of all schemes is immediately apparent. This is best achieved by the use of standard elementary statistics. Specific recommendations are given and a specimen protocol is appended.

Keywords: Proficiency testing; accreditation; data quality

The Analytical Methods Committee has received and has approved for publication the following report from its Statistical Sub-committee.

Report

The constitution of the Sub-committee responsible for the preparation of this report was: Dr. M. Thompson (Chairman), Dr. W. H. Evans, Mr. M. J. Gardner, Dr. E. J. Greenhow, Dr. R. Howarth, Dr. E. J. Newman, Professor B. D. Ripley, Mrs. K. Swan and Dr. R. Wood, with Mr. J. J. Wilson as Secretary. The Analytical Methods Committee acknowledges the financial support from the Ministry of Agriculture, Fisheries and Food. The views expressed and the recommendations made in this paper are those of the Analytical Methods Committee and not necessarily those of the Ministry of Agriculture, Fisheries and Food.

Introduction

Tests are carried out in analytical laboratories in order to make important decisions. It is essential, therefore, to operate a scheme to monitor the reliability of data originating from these laboratories. Such a scheme reinforces an interest in quality control and provides the basis for review and corrective action in those laboratories where data do not meet criteria of acceptability. The continuing assessment of laboratory competence provides a record of any improvements that have been achieved or, conversely, may alert a laboratory to declining performance and so prompt the introduction of remedial measures before the deterioration has become serious. This type of scheme is known as proficiency testing, and a number of such schemes already exist for particular areas of interest such as clinical biochemistry, food analysis and environmental monitoring. A proficiency scheme tests the competence of a group of participating laboratories by a statistical evaluation of the data they obtain on analysing distributed materials. Each laboratory is then provided with a numerical indicator of its performance, together with information on the performance of the group as a whole, enabling proficiency relative to the group to be compared and evaluated.
Proficiency testing is becoming an integral feature of laboratory accreditation, which itself is now being advocated as a result of the development of the European Community Certification and Accreditation Policy.

Proficiency testing schemes usually operate a few rounds of tests each year. They are managed by a central body which is responsible for the design of the scheme, the preparation and validation of test materials, the production and distribution of instructions and test materials to the participating laboratories, the collection and statistical analysis of the data obtained from the tests and the feedback of the results to the participants. The main aspects of proficiency testing are considered in this paper. Approaches to the analysis of test data are discussed and presented, and an example of an outline protocol for a proficiency test is given.

1. Role of Proficiency Schemes in Quality Assurance

1.1. General Context of Proficiency Testing

Proficiency testing is the use of results generated in interlaboratory test comparisons for the purpose of a continuing assessment of the technical competence of participating testing laboratories.1 Alternative terms used for proficiency testing are 'quality assessment' and 'external quality assessment'. Proficiency testing is distinct from other interlaboratory tests, such as collaborative trials (used for validating a standard method), certification trials (used to establish the true value of an analyte concentration in a reference material) or co-operative trials (used for laboratory assessment on a one-off basis).

Numerous schemes for proficiency testing are already in use in several types of analytical laboratory. In the current climate of intense and widespread interest in the improvement of data quality, it seems likely that proficiency test schemes will proliferate greatly. A defect already apparent in existing schemes as a whole is the diversity of performance indices in use. It is therefore difficult rapidly to appreciate the meaning of an index in an unfamiliar scheme. In order to avoid a further deterioration of this situation, the universal use of a harmonized method for the assessment of proficiency schemes is greatly to be desired. At the present time the International Union of Pure and Applied Chemistry (IUPAC), the International Organization for Standardization (ISO) and the Association of Official Analytical Chemists (AOAC) are jointly pursuing such a harmonization.2

Proficiency testing must be seen in the general context of accreditation and quality assurance. In order to gain accreditation, an analytical laboratory has to demonstrate an effective quality assurance system, which includes participation in relevant proficiency testing schemes and the routine use of a properly designed quality control system. Proficiency testing takes place periodically, and is organized by a central authority. It is distinct from data quality control, which is an activity organized on a routine basis within individual laboratories. At the present time, however, it is clear that not all laboratories employ adequate quality control.
One of the main purposes of proficiency testing is therefore strongly to encourage the proper use of quality control and to incorporate an external reference to guard against bias.

The type of proficiency test envisaged in this paper is one where samples of test materials are distributed, on a regular basis, to the participating laboratories for unsupervised analysis within a short period from receipt. This is appropriate for trial measurements of the concentration or amount of analyte on a quasi-continuous scale, with which this paper is solely concerned. Other types of proficiency test could be employed in this context and indeed may be useful in conjunction with the type of test described here.

1.2. Aims of Proficiency Testing

Two distinct aims of proficiency tests can be formulated: (a) to encourage good performance generally, and especially to encourage the use of proper routine quality control measures within individual laboratories, to provide feedback to the laboratories and to encourage remedial action where shortcomings in performance are detected; and (b) to provide a rational basis for the selection or licensing of laboratories for a specific task and likewise to disqualify laboratories from a specific task should their performance fall below a certain standard. These two aims are somewhat divergent, but the motivation is the same: the identification of laboratories that produce data of unacceptable quality.

Within the scope of either of these two aims, a successful proficiency test must provide certain types of information for the participants and for the organizers: (i) it must enable a laboratory to compare its performance at a particular time with an appropriate external standard of performance; (ii) it must enable a laboratory to compare its performance at a particular time with its performance in the past; (iii) it must enable a laboratory to compare its performance with that of other laboratories at a particular time; (iv) it must enable the organizers to identify participants whose performance is unsatisfactory; and (v) it must enable the organizers to see whether there is general improvement in performance with time.

It is also highly desirable that a uniform method of assessing the results of proficiency tests should be applied, so that the results from different schemes, involving various test materials, analytes and analyte concentration levels, are strictly comparable on an international basis and readily understood by analytical chemists everywhere. For the above-mentioned requirements to be achieved, it is clear that the assessment of proficiency must be expressed in terms of a score that can be readily interpreted in terms of well-known statistical measures. In order to create a scoring system that is harmonized from one proficiency scheme to another, it is essential that arbitrary scaling factors are not introduced and that the statistics are used in their standard form. An appropriate degree of training in statistics must be presumed in professional analytical chemists for a harmonized approach to be achieved.

1.3. Limitations of Proficiency Tests

Proficiency tests are not in themselves sufficient to ensure the production of high quality data. Firstly, it is clear that the interpretation of data from proficiency tests is subject to statistical uncertainty, and the criteria on which decisions will be based are to some degree arbitrary.
Secondly, there is the possibility that not all of the data will be valid. For example, laboratories faced with the prospect of exclusion from a commercial market might be tempted to improve their performance index in unprofessional ways, for example by treating the test samples with special care or by collusion with other laboratories. Some of these practices would be difficult to eliminate. Thirdly, the scope of proficiency testing is necessarily limited by costs. In most laboratories this will mean that only a small proportion of the many different determinations conducted can be subjected to a proficiency test. Obviously, a test method can be chosen so that it can be considered representative of a class of test materials or of analytical methods. Nevertheless, there is little alternative but to assume that the proficiency demonstrated in a relatively small number of tests will be representative of the behaviour of the laboratory in a much wider range of activities.

All of these circumstances lead to the same conclusion: proficiency tests, useful though they may be, cannot be the exclusive basis for action beyond exhortation to remedial activity. Decisions to disqualify laboratories should be based on further, perhaps more comprehensive, trials and on inspections of quality control records, analytical protocols and the laboratory environment.

In addition to the foregoing, it must be recognized that proficiency testing is only applicable to certain classes of analytical task. Broadly, it is restricted to analyses where a determination is carried out as a matter of routine, in a group of laboratories, and where comparability and/or trueness is important. Even where these conditions prevail, there may be technical difficulties (for instance related to the nature of the test materials) that prevent the execution of proficiency testing.

2. General Organization of Proficiency Tests

The organizing body is responsible for drawing up the protocol, operating the scheme, taking any appropriate action as an outcome of the scheme, reviewing on a regular basis the effectiveness of the scheme and, where necessary, amending the protocol. Advice on the drafting of the protocol and implementation of the scheme should be sought from a technical panel consisting of: (i) a manager, with responsibility to the organizing body for running the proficiency test, distribution of the results, any follow-up action that is required and record keeping; (ii) a statistical expert; (iii) representatives of government bodies, commercial firms or accreditation agencies with a legitimate interest in the conduct of the tests; and (iv) representatives of professional bodies. Members of the technical panel must be familiar with the methodology of proficiency testing and be suitably qualified. It is desirable that only a minority of members have a commercial interest in the outcome of the scheme. The manager must conduct the scheme in such a way that privileged information is not divulged to members of the technical panel. Information that should be so restricted includes: (a) the identities of the participating laboratories; and (b) the exact composition of samples distributed to participants in advance of the reporting deadline. Where the technical panel needs to review individual results, laboratories must be identified only by a code number.

2.1. Stages of a Proficiency Test

Proficiency tests are organized in a sequence of clearly defined stages.
(i) The organizing body lays down a protocol for the conduct of the tests, the interpretation of the data and any subsequent intervention.
(ii) The protocol is circulated to intending participants.
(iii) The organizers prepare and validate the test materials.
Then, for each round of the test:
(iv) The materials are distributed to the participating laboratories in accordance with the protocol.
(v) The laboratories analyse the materials by appropriate methods and return the data by the prescribed date.
(vi) The data are analysed by the specified statistical methods.
(vii) Each participant is informed of the outcome of the statistical analysis.
(viii) The organizing body takes any action required by the protocol.
(ix) The organizing body reviews its effectiveness and future strategy in the light of the results and comments from the participants.

2.2. Protocol

The protocol is the definitive statement of the aims of a proficiency test and the steps taken in its execution. The protocol must be sufficiently detailed to allow no alternative interpretations. Statistical methods to be used must be specified exactly, so that values of the calculated statistics from a data set can be reproduced exactly. Criteria for decisions must be stated exactly, together with the outcome of the decisions. Laboratories must be fully aware of the protocol, its aims and its consequences before they participate in the scheme. Provision must be made for participants to be briefed on the protocol initially and for the feedback of views during the operation of the scheme. An outline specimen protocol illustrating the requirements for an imaginary situation is given in Appendix 1. This is not intended as a model protocol for general use.

2.3. Preparation and Validation of Materials

2.3.1. Choice and preparation of materials

The prime consideration in the choice of material is that it should be as far as possible representative of the type of material that is normally analysed, in respect of the composition of the matrix and the concentration range or loading of the analyte. This is most easily accomplished with some manufactured materials (such as steels, for example), where a range of analyte concentrations in appropriate matrices can be readily obtained. Stability of the material between preparation and analysis must be ensured. Natural materials (e.g., flour) can usually be obtained in large bulk. However, a difficulty is liable to arise with natural materials in that they are easily obtainable at 'normal' concentrations of analyte, but much more difficult to obtain at elevated levels. It is often the latter situation that is the focus of attention of analytical methods. Where no satisfactory alternative exists, natural materials can be fortified by spiking with analyte. Fortification is relatively easy for some materials (for example, trace metals in water) but much more difficult to execute satisfactorily for others (trace additives in animal feedingstuffs) where the achievement of homogeneity is a problem. Some limitations of spiking are discussed subsequently (Section 3.1). Spiking is a valuable method where the analytical measure under consideration is the amount of analyte rather than its concentration. This is particularly true when the spike can be added to an inert and analyte-free matrix such as a filter. Certified reference materials will usually not be suitable for use in proficiency tests.
2.3.2. Quality of test material

Materials need to be tested before distribution for the mean level of analyte and for homogeneity. The mean level determination is merely a check that the material is appropriate for the needs of the proficiency test and is distinct from the establishment of the true value. However, the assumption of effective homogeneity underlies all interpretation of test data and so must be established for each separate batch of test material. These tests are most effectively carried out in a single competent laboratory. The experimental design for such a test is a replicated randomized trial. After a bulk material is subdivided for distribution, a random selection of 10-20 of the containers should be taken and the contents of each subjected to replicate analyses. This enables the between-sample variance to be estimated by analysis of variance. This variance ideally should be small in comparison with the magnitude of the target variance of reproducibility used in the subsequent proficiency test (Section 3.2). After validation, the materials should be stored and distributed under conditions that minimize the effects of any instability of the sample.
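To make the homogeneity check concrete, the sketch below estimates the between-sample (between-container) variance by one-way analysis of variance on duplicate analyses of ten containers. The data, the use of duplicates and the numerical acceptance criterion are illustrative assumptions only; the report prescribes no code and no criterion beyond the variance being small compared with the target variance.

```python
# Sketch (not part of the report): one-way ANOVA estimate of the
# between-container variance from n containers analysed in duplicate.
# All data are hypothetical.
import numpy as np

results = np.array([  # rows: containers, columns: duplicate analyses
    [10.2, 10.4], [10.1, 10.3], [10.6, 10.5], [10.3, 10.2],
    [10.4, 10.6], [10.2, 10.1], [10.5, 10.4], [10.3, 10.4],
    [10.1, 10.2], [10.4, 10.3],
])
n, r = results.shape                      # containers, replicates per container

grand_mean = results.mean()
container_means = results.mean(axis=1)

# Within- and between-container mean squares
msw = ((results - container_means[:, None]) ** 2).sum() / (n * (r - 1))
msb = r * ((container_means - grand_mean) ** 2).sum() / (n - 1)

# Between-sample variance component (taken as zero if MSB < MSW)
var_between = max(0.0, (msb - msw) / r)

target_sigma = 0.5   # hypothetical target SD for the subsequent test
print(f"analytical (within-container) variance: {msw:.4f}")
print(f"sampling (between-container) variance : {var_between:.4f}")
# Assumed rule of thumb, not from the report: accept if the sampling
# variance is below (0.3 * target sigma)**2.
print("homogeneity acceptable?", var_between < (0.3 * target_sigma) ** 2)
```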
2.4. Distribution of Samples

There is no experimentally established optimum frequency for the distribution of samples. Informed opinion is that the minimum frequency should be four rounds per year. Tests that were less frequent would probably be ineffective in reinforcing the perceived need for maintaining quality standards or for following up marginally poor performance. A frequency of one round per month for any particular type of analysis is the maximum that is likely to be effective. Postal circulation of samples and results would impose an absolute minimum of about 2 weeks for a round to be completed, and it seems unlikely that a laboratory would have time to respond to the results of one round if another followed within a 2 week period. Over-frequent rounds might have the counter-productive effect of discouraging laboratories from conducting independent routine quality control.

If the organizers have at their disposal just one large batch of a particular test material, the participants will become aware of the consensus value after the first round and the credibility of the results in successive rounds would be compromised. Therefore, the organizers should procure several batches of nominally similar materials, for example from different suppliers, and with small differences in the analyte level. These could be distributed in a random manner, so that the participants would have no advance information of the true concentration of the analyte. An extension of this idea would render very difficult any possible collusion between laboratories within a single round of the test. If laboratories were to receive either of two similar test materials, selected at random by the organizing body, they would have no logical basis for adjusting their results, because they would not know whether they had samples from a common source or not. The amount of material distributed to each laboratory must be sufficient for normal analytical practice.
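A randomized allocation of batches of the kind described could be sketched as follows; the batch labels, laboratory identifiers and seed are hypothetical, and a real scheme would of course keep the allocation confidential until after the reporting deadline.

```python
# Sketch: randomly assign one of several nominally similar batches to each
# participating laboratory, so that no laboratory knows whether its sample
# shares a source with any other laboratory's. Labels are hypothetical.
import random

batches = ["A", "B", "C"]                     # nominally similar materials
laboratories = [f"lab{i:02d}" for i in range(1, 21)]

rng = random.Random(14)                       # seeded for a reproducible round
assignment = {lab: rng.choice(batches) for lab in laboratories}

for lab, batch in sorted(assignment.items()):
    print(lab, "receives a sample from batch", batch)
```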
2.5. Collection and Analysis of Data

A detailed consideration of the statistical treatment of data is deferred until Section 3 of this paper. However, some preliminary considerations are recorded here. It is assumed that the data for statistical analysis are on a quasi-continuous scale, i.e., that the digital resolution of the data is at least one order of magnitude smaller than the inherent variability of the measurements. Hence range-data, such as might be obtained by visual comparison, are not considered in this paper. It is the responsibility of the organizers to instruct the participants how to report the data by the provision of results sheets.

It is considered that attempts to measure repeatability in proficiency tests are superfluous. Repeatability could be monitored by duplication within a laboratory, but the information gained would be suspect unless the duplication were blind and the pairing not obvious from the analytical results. This would require extra effort on the part of the laboratories and organizers, effort that might be better spent on a wider variety of analyses. Moreover, numerous trials have shown that repeatability is usually small compared with variation between laboratories. Even laboratories reporting wildly inaccurate results can usually get their blind duplicates to agree satisfactorily. Finally, it would be virtually impossible to prevent laboratories reporting the means of many individual results in an attempt to improve their repeatability. Replication is therefore not recommended in the general situation, but could be allowed if particular circumstances suggest it.

2.6. Feedback to Participants

Participants need to be informed of the outcome of a round of the trial as soon as possible after the closing date for the reporting of results. They need to have information on their performance in relation to performance targets and on the general performance of all other laboratories. This should be presented in the form of a pre-defined common scoring system that can be applied to any type of analysis and in the form of readily understandable graphical output. An output should show the participant's raw data as originally entered into a computer system by the organizers, so that the participant can check for transcription errors in data entry and, if necessary, ask for a revised assessment. Participants must be informed of the method used to estimate the true value. An example output that might be sent to a participant is shown in Appendix 2.

3. Approaches to Data Analysis in Proficiency Testing

3.1. Estimates of True Value

The first stage in producing a score from a result x (a single measurement of analyte concentration in a test material) is obtaining the estimate of the bias, thus: bias = x − X, where X is the true concentration or amount of analyte. The efficacy of any proficiency test depends on using a reliable value for X. Several methods are available for establishing a working estimate (X̂) of X for natural or artificial materials.

(i) The addition of a known amount or concentration of analyte to a base material containing none. This method is completely satisfactory in many instances, especially when it is the amount of analyte rather than the concentration that is subject to testing. However, problems arise in other situations, namely: (a) it is necessary that the base material is in fact effectively free from analyte; (b) it may be difficult to mix the analyte homogeneously into the base material where this is required; and (c) the speciation of the analyte added may be different to that found in actual test materials, where the analyte may be chemically bound to the matrix. Hence the use of a material containing the analyte in its natural (or normally occurring) form is preferred where this is possible.

(ii) The use of a consensus value produced by a group of expert or referee laboratories using the best possible methods. This is probably the closest approach to true values for representative materials under practical circumstances. There are obvious reasons for using such a value if it is available. There are also arguments against using it, namely: (a) it may be expensive to execute; and (b) there might be lingering doubts about the validity of the consensus value, especially among the participants.

(iii) The use of a consensus value, produced in each round of the proficiency test, and based on the results obtained by the participants. The consensus is usually estimated as the mean of the observations remaining after outliers have been detected and eliminated, but other possible estimators include the robust mean and the modal value. The consensus of participants is clearly the cheapest estimator to obtain. Objections that can be levelled against such a value are: (a) there may not be a real consensus among the participants; and (b) the consensus may be biased because of the general use of faulty methodology. Neither of these conditions is rare in the determination of trace constituents.

The choice between these methods of evaluating X̂ depends on circumstances. It is usually advisable to have one other estimate in addition to the consensus of participants. Any significant deviations observed between the estimates must be carefully considered by the technical panel. The choice between a consensus either from expert laboratories or from the participants depends, in part, on whether the aim of the proficiency test is to encourage the production of true results or merely to obtain conformity among the participants. In spite of the extra cost, it is felt that considerably more attention should be paid to trueness than has been hitherto.

In an empirical method, e.g., the determination of 'fat', the true result (within the limits of measurement uncertainty) is produced by a correct execution of the method. Empirical methods are used when the analyte is ill-defined chemically. It is clear that in these circumstances the analyte content is only defined if the method is simultaneously specified. Empirical methods can give rise to special problems in proficiency trials when a choice of such methods is available. If X̂ is obtained from expert laboratories and the participants use a different empirical method, a bias may be apparent in the results even when no fault in execution is present. Likewise, if participants are free to choose between empirical methods, no valid consensus may be evident among them. Several recourses are available to overcome this problem: (i) a separate value of X̂ is produced for each empirical method used; (ii) participants are instructed to use a prescribed method; or (iii) participants are warned that a bias may be the result of using a different empirical method.
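As an illustration of a consensus estimate that resists outliers, the following sketch uses the median and the scaled median absolute deviation. These generic robust estimators merely stand in for the robust method recommended by the Analytical Methods Committee;3 the data are hypothetical.

```python
# Sketch: robust consensus (median) and robust SD (scaled MAD) from one
# round's results. These generic estimators stand in for the AMC robust
# method cited in the text; the data are hypothetical.
import statistics

results = [9.42, 9.50, 9.55, 9.48, 9.61, 9.53, 9.47, 12.8, 9.51]  # one outlier

x_hat = statistics.median(results)
mad = statistics.median(abs(x - x_hat) for x in results)
s_robust = 1.4826 * mad            # MAD scaled to estimate a normal SD

print(f"plain mean      : {statistics.fmean(results):.3f}")  # pulled up by 12.8
print(f"robust consensus: {x_hat:.3f}")
print(f"robust SD       : {s_robust:.3f}")
```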
3.2. Formation of a z-Score

Most proficiency testing schemes proceed by comparing the estimate of the bias with a standard error. An obvious approach is to form the z-score given by

z = (x − X̂)/σ

where σ is a standard deviation. The value of σ could be chosen either as an estimate of the actual variation encountered in a particular round (ŝ) of a trial or as a target representing the maximum allowed variation consistent with valid data. In the former situation, ŝ should be estimated from the results of the laboratories after outlier elimination, or by robust methods,3 for each analyte/material/round combination. A value of ŝ will therefore vary from round to round (hopefully steadily decreasing). In consequence, the z-score for a laboratory could not be compared directly from round to round. However, the bias (x − X̂) for a single analyte/material combination could be usefully compared round by round for a laboratory, and the corresponding value of ŝ would indicate general improvement in 'reproducibility' round by round.

A fixed target value for σ is greatly preferable, however, and can be arrived at in several ways. (i) σ could be fixed arbitrarily, with a value based on a perception of how laboratories should perform. The problem with this criterion is that perceptions change with time, and laboratory performance may improve with advances in analytical technology. The value of σ may therefore need to be changed occasionally, disturbing the continuity of the scoring scheme. However, there is some evidence that laboratory performance responds favourably to a stepwise increase in performance standards. (ii) σ could be an estimate of the precision required for a specific task of data interpretation. This is the most satisfactory type of criterion, if it can be formulated, because it relates directly to the required information content of the data. (iii) Where a standard method is prescribed for the analysis, σ could be equated with σR, the standard deviation of reproducibility obtained during a collaborative trial. (iv) σ could be derived from a model of precision, such as the Horwitz curve.4 However, although this model provides a general picture of reproducibility, substantial deviation from it may be experienced for particular methods. It should be used only with considerable caution for the present purposes, if no other information is available. A fixed value for σ has the advantage that the z-scores derived from it can be compared from round to round to demonstrate general trends for a laboratory.
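The sketch below applies the z-score with a fixed target σ and, for comparison, evaluates a σ derived from the Horwitz model mentioned in option (iv). The Horwitz expression used (σ = 0.02c^0.8495, with the concentration c and σ both as mass fractions) is one common statement of that curve and is an assumption here; the numerical inputs are hypothetical.

```python
# Sketch (not part of the report): z-scores against a fixed target sigma,
# plus a Horwitz-model sigma for comparison. All numbers are hypothetical.

def z_score(x: float, x_hat: float, sigma: float) -> float:
    """z = (x - X^)/sigma, the recommended standardized score."""
    return (x - x_hat) / sigma

def horwitz_sigma(c: float) -> float:
    """Assumed form of the Horwitz curve: sigma = 0.02 * c**0.8495, with c
    and sigma both expressed as mass fractions. To be used with the
    caution noted in the text."""
    return 0.02 * c ** 0.8495

x_hat = 0.41    # assigned value, e.g. lead at 0.41 mg/kg (hypothetical)
sigma = 0.22    # fixed target standard deviation (hypothetical)

for x in (0.0, 0.38, 0.55, 1.20):
    print(f"x = {x:5.2f}   z = {z_score(x, x_hat, sigma):+6.2f}")

# Horwitz-based sigma at the same level (0.41 mg/kg = 0.41e-6 mass fraction)
print(f"Horwitz sigma: {horwitz_sigma(0.41e-6) * 1e6:.3f} mg/kg")
```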
3.3. Interpretation of z-Scores

If X̂ and σ are good estimates of the population mean and standard deviation (or are known to be true), then z will be approximately normally distributed with a mean of zero and unit standard deviation. Analytical results can be described as 'well behaved' when they comply with this condition. An absolute value of z (|z|) greater than three suggests poor performance in terms of accuracy. This judgement depends on the assumption of the normal distribution, which, outliers apart, seems to be justified in practice. Because z is standardized, it is comparable for all analytes, test materials and analytical methods. Because of this comparability, values of z obtained from diverse materials and concentration ranges can, with due caution (see below), be combined to give a composite score for a laboratory in one round of a proficiency test. Moreover, the meaning of z-scores can be immediately appreciated, i.e., values of |z| < 1 would be very common and values of |z| > 3 would be very rare in well behaved systems.

3.4. Alternative Score

An alternative type of scoring, which here is called Q-scoring, is based not on the standardized value but on the relative bias, namely

Q = (x − X̂)/X̂

where x and X̂ have their previous meaning. This type of score is used when the participants in a proficiency test have diverse standards of performance and there is no basis for a common value of σ. Q is centred on zero with a standard error of ŝ/X̂. Whereas the relative bias is an obvious measure, its statistical significance depends on the value of ŝ/X̂. Methods suitable for trace amounts of analyte are likely to show much larger standard errors than are the more precise methods for major constituents. Therefore, values of Q from different sources may not be comparable or capable of valid combination unless the assumption that the ŝ/X̂ values are comparable can be justified. This would be a reasonable assumption for a single analyte/material/method combination where the range of analyte concentrations in the different materials fell above about 20 times the system detection limit for the analyte.5 In favourable situations the assumption could be extended to include several analytes determined by a common method. Where it is possible, the use of z-scores is recommended in preference to the use of Q-scores.
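The following sketch makes the 'well behaved' benchmarks of Section 3.3 concrete by evaluating the two-sided normal tail probabilities behind the |z| = 1, 2 and 3 conventions, and includes the Q-score for completeness; the numerical input is hypothetical.

```python
# Sketch: tail probabilities for well-behaved z-scores, plus the Q-score.
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

for limit in (1.0, 2.0, 3.0):
    p_outside = 2 * (1 - std_normal.cdf(limit))   # P(|z| > limit)
    print(f"P(|z| > {limit:.0f}) = {p_outside:.4f}")
# Prints roughly 0.32, 0.0455 and 0.0027: |z| < 1 is common, |z| > 3 rare.

def q_score(x: float, x_hat: float) -> float:
    """Q = (x - X^)/X^, the relative-bias score used when no common
    target sigma can be justified."""
    return (x - x_hat) / x_hat

print("Q =", round(q_score(9.78, 9.50), 4))  # hypothetical result and estimate
```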
3.5. Combination of Several z-Scores

It is common for several different analyses to be required within each round of a proficiency test. Although each individual test furnishes useful information, many participants want a single figure of merit that will summarize the over-all performance of the laboratory within a round. This approach may be appropriate for the assessment of long-term trends. However, there is a danger that such a combination score will be misinterpreted by non-experts, especially outside the context of the individual scores. Therefore, the general use of combination scores is not recommended, but it is recognized that they may have specific applications if used with due caution.

It is especially emphasized that there are limitations and weaknesses in any scheme that combines z-scores from dissimilar analytical methods. If a single score out of several produced by a laboratory were significant, the combined score may well be not significant. In some respects this is a useful feature, in that an occasional lapse in a single method is downweighted in the combined score. However, there is a danger that a laboratory may be consistently at fault only in a particular method and frequently report an unacceptable value for that method in successive rounds of the trial. This fault may well be obscured by the combination of scores.

Of the various methods used to combine z-scores, the following are statistically well founded and can be used within the limitations discussed above: (i) the sum of scores, SZ = Σz; (ii) the sum of squared scores, SSZ = Σz²; and (iii) the sum of the absolute values of the scores, SAZ = Σ|z|. These statistics fall into two classes. The first class (containing only SZ) uses information about the signs of the z-scores, whereas the alternative class (SSZ and SAZ) provides information about only the size of the scores, i.e., the magnitude of the biases. Of the latter, the sum of the squares is more tractable mathematically and is therefore the preferred statistic, although it is rather sensitive to single outliers. The SAZ method may be especially useful if there are extreme outliers or many outlying laboratories, but its distribution is complicated and its use is not, therefore, recommended.

3.5.1. Sum of scores, SZ

The distribution of SZ is zero-centred with variance m, where m is the number of scores being combined. Hence SZ could not be interpreted on the same scale as the z-scores. However, a simple scaling restores the unit variance, giving a rescaled sum of scores RSZ = Σz/√m, which harmonizes the scaling, i.e., both z and RSZ can be interpreted as standard normal deviates. The SZ and RSZ methods have the advantage of using the information in the signs of the biases. Hence if a set of z-scores were (1.5, 1.5, 1.5, 1.5), the individual results would be regarded as non-significant positive scores. However, regarded as a group, the joint probability of observing four such deviations together would be small. This is reflected in the RSZ value of 3.0, which indicates a significant event. This information would be useful in detecting a small consistent bias in an analytical system, but would not be useful in combining results from several different systems, where a consistent bias would not be expected and is unlikely to be meaningful.

Another feature of the RSZ is the tendency for errors of opposite sign to cancel. In a well-behaved situation (i.e., when the laboratory is performing without bias according to the designated σ value) this causes no problems. If the laboratory were producing badly behaved results, however, the possibility arises of the fortuitous cancellation of significantly large z-values. Such an occurrence would be very rare by chance. These restrictions on the use of RSZ serve to emphasize the problems of using combination scores derived from various analytical tests. When such a score is used, it should be considered simultaneously with the individual scores.

3.5.2. Sum of squared scores, SSZ

This combination score has a chi-squared (χ²) distribution with m degrees of freedom for well-behaved results. Hence there is no simple possibility of interpreting the score on a common scale with the z-scores. However, the quantiles of the χ² distribution can be found in most compilations of statistical tables. The SSZ method takes no account of the signs of the z-values, because of the squared terms. Hence, in the example considered previously, where the z-scores are (1.5, 1.5, 1.5, 1.5), it is found that SSZ = 9.0, a value that is not significant at the 5% level and does not draw sufficient attention to the unusual nature of the results as a group. However, in proficiency tests, concern is directed much more towards the magnitude of deviations than their direction, hence SSZ seems appropriate for this use. Moreover, the problem of the chance cancellation of significant z-scores of opposite sign is eliminated. Hence the SSZ has advantages as a combination score for diverse analytical tests and is to an extent complementary to RSZ.
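A sketch of the combination scores defined above, applied to the four-score example used in the text; the χ² tail probability for SSZ is evaluated with SciPy. This is an illustration only, not part of the published protocol.

```python
# Sketch: combination scores for one laboratory in one round.
import math
from scipy.stats import chi2

z_scores = [1.5, 1.5, 1.5, 1.5]           # the example used in the text
m = len(z_scores)

sz  = sum(z_scores)                       # sum of scores (sign-sensitive)
rsz = sz / math.sqrt(m)                   # rescaled sum: a standard normal deviate
ssz = sum(z * z for z in z_scores)        # sum of squares: chi-squared, m d.f.
saz = sum(abs(z) for z in z_scores)       # sum of absolute values (not recommended)

print(f"SZ = {sz:.2f}  RSZ = {rsz:.2f}  SSZ = {ssz:.2f}  SAZ = {saz:.2f}")
# RSZ = 3.0 flags the joint event, while SSZ = 9.0 is not significant at 5%:
print(f"P(chi2_{m} > {ssz:.1f}) = {chi2.sf(ssz, df=m):.3f}")   # about 0.061
```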
3.6. Running Scores

Although the combination scores discussed above give a numerical account of the performance of a laboratory in a single round of the proficiency test, for some purposes it may be useful to have a more general indicator of the performance of a laboratory. Although the value of such indicators is questionable, they can be constructed simply and give a kind of average impression of the scores over several rounds of the test. For example, a running SSZ covering the current (nth) round and the previous k rounds could be constructed as

RSSZ = Σ(j = n − k to n) Σ(i = 1 to m) z(i,j)²

where z(i,j) is the z-score for the ith material in the jth round. In a well-behaved system, RSSZ would have the χ² distribution with m(k + 1) degrees of freedom, which could be used to set action limits on the experimental values. The running score has the alleged advantage that instances of poor performance restricted to one round are smoothed out somewhat, allowing an over-all appraisal of performance. On the other hand, an isolated serious deviation will have a 'memory effect' in a simple running score that will persist until (k + 1) more rounds of the trial have passed. This might have the effect of causing a laboratory persistently to fail a test on the basis of the running score, long after the problem has been rectified.

Two strategies for avoiding undue emphasis on an isolated bad round can be formulated. Firstly, individual or combined scores can be restrained within certain limits. For example, a rule such as: if |z| > 3 then z′ = ±3 could be applied, the sign being the same as that of z, where z is the raw value of a z-score and the modified value z′ is limited to the range ±3. The actual limit used could be set in such a way that an isolated event does not raise the running score above a critical decision level for an otherwise well-behaved system. As a second strategy for avoiding memory effects, the scores could be 'filtered' so that results from rounds further in the past would have a smaller effect on the running score. For example, exponential smoothing uses ẑ(n) calculated by

ẑ(n) = (1 − a)z(n) + aẑ(n − 1)

where a is a parameter between 0 and 1, controlling the degree of smoothing.
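The running score, the ±3 clipping rule and the exponential smoothing described above could be sketched as follows; the data, the value of k and the smoothing parameter are hypothetical assumptions.

```python
# Sketch: running SSZ over the current and previous k rounds, with the
# +/-3 clipping rule and exponential smoothing described in the text.

def clip_z(z: float, limit: float = 3.0) -> float:
    """If |z| > limit then z' = +/-limit, with the same sign as z."""
    return max(-limit, min(limit, z))

def running_ssz(z_by_round: list[list[float]], k: int) -> float:
    """Sum of squared (clipped) z-scores over the current and previous k
    rounds; z_by_round[j] holds the z-scores for the materials of round j."""
    recent = z_by_round[-(k + 1):]
    return sum(clip_z(z) ** 2 for round_scores in recent for z in round_scores)

def smoothed(z_series: list[float], a: float = 0.5) -> float:
    """Exponentially smoothed score: s(n) = (1 - a)*z(n) + a*s(n - 1)."""
    s = z_series[0]
    for z in z_series[1:]:
        s = (1 - a) * z + a * s
    return s

rounds = [[0.4, -1.1], [2.2, 0.3], [-0.7, 5.8], [0.9, -0.2]]  # hypothetical
print("RSSZ (k = 2):", round(running_ssz(rounds, k=2), 2))
print("smoothed series, first material:",
      round(smoothed([r[0] for r in rounds]), 2))
```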
3.7. Classification, Ranking and Other Assessment of Proficiency Data

3.7.1. Classification

If the frequency distribution of a proficiency score is known or can be assumed, significance can be attributed to results according to the quantiles of that distribution. For example, in a well-behaved analytical system, z-scores or values of RSZ would be expected to fall outside the range −2 < z < 2 only in about 5% of instances and outside the range −3 < z < 3 only in about 0.3%. In the latter situation the probability could be interpreted as so small for a 'well-behaved' system that it almost certainly represents badly behaved results. Hence a classification based on z-scores could be made:

|z| ≤ 2        satisfactory
2 < |z| < 3    questionable
|z| ≥ 3        unsatisfactory

The same classification could be applied to scores such as SSZ and RSSZ that have χ² distributions (Appendix 3). Care is required in practice because our knowledge of the relevant probabilities rests on two questionable assumptions: (i) that appropriate values of X̂ and σ are being used; and (ii) that the underlying distribution of analytical errors is normal, apart from outliers. In addition, the division of a continuous measure into a few named classes has little to commend it from the scientific point of view.

3.7.2. Ranking

Laboratories participating in a round of a proficiency trial can be ranked on their combined score for the round or on a running score. Such a ranked list could be used for encouraging better performance in poorly ranked laboratories by providing an invidious comparison among the participants. However, ranking is not recommended, as it is an inefficient use of the information available. A histogram is a more effective method of presenting the same data.
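The three-class scheme could be coded as below, using the thresholds given above; whether named classes should be used at all carries the caveat already noted.

```python
# Sketch: classification of individual z-scores on the
# |z| <= 2 / 2 < |z| < 3 / |z| >= 3 scheme given in the text.

def classify(z: float) -> str:
    if abs(z) <= 2:
        return "satisfactory"
    if abs(z) < 3:
        return "questionable"
    return "unsatisfactory"

# First four values taken from the Appendix 2 example; last two hypothetical.
for z in (-0.70, -1.86, 0.74, 1.57, 2.40, -3.60):
    print(f"z = {z:+5.2f}  ->  {classify(z)}")
```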
4. Recommendations

4.1. Proficiency tests make an important contribution to the accuracy of analytical data and should therefore be implemented wherever it is appropriate and technically feasible. Proficiency tests must be conducted by a properly constituted organizing body, advised by a technical panel, according to a written protocol. The protocol prescribes the conduct of the test and the consequences of participation and must provide for adequate feedback of information to the participants.

4.2. Proficiency tests must be designed to provide information on the performance of a laboratory in relation to that of other laboratories; in relation to prescribed standards of performance; and in relation to its own past performance. In addition, proficiency tests must provide information on changes in the level of performance of individual laboratories or groups of laboratories. Emphasis should rest on trueness as the main requirement in analysis rather than simple consistency amongst laboratories.

4.3. Proficiency tests should be conducted not less than quarterly.

4.4. The agreement and general use of a harmonized scoring system for the appraisal of the results of proficiency tests is an important target for the immediate future. To this end, the scores used must be standard statistics in their basic form, without arbitrary scaling factors. The z-score is recommended as the basic score for proficiency tests.

4.5. Care must be taken in the selection and interpretation of combination scores. Where applicable, the combination scores and running scores described in Sections 3.5 and 3.6 should be used.

4.6. The use of alternative batches of materials is recommended as a precaution against the reporting of falsified data.

4.7. Proficiency testing schemes have inherent limitations and their findings need to be considered alongside wider evidence for decisions relating to licensing and accreditation.

APPENDIX 1
Example of an Outline Protocol for a Proficiency Test

This is intended to be an example of a hypothetical scheme: numerical details have been specified for the purpose of illustration only. Real schemes will have to take account of factors specific to their area.

1. Name of the Scheme
The scheme will be called FLPTS (Food Laboratory Proficiency Testing Scheme).

2. Distribution of Materials and Return of Results
There will be four distributions of materials per year, dispatched by post on the Monday of the first full working week of January, April, July and October. Results must reach the organizers by the last day of the respective month. A statistical analysis of the results will be dispatched to participants within 2 weeks of the closing dates.

3. Analyses Required
The four analyses required in each round will be: (i) aflatoxin in peanut butter; (ii) lead in milk powder; (iii) fat in a dried meat product; and (iv) Kjeldahl nitrogen in a cereal product.

4. Methods of Analysis
Fat shall be determined by BS 4401: Part 4 (1970): Method B. No particular methods are prescribed for the other analytes, but they should be the methods used in routine analysis. Participants must provide an outline of the method actually used or give a reference to a documented method. Participants must report a single result, in the same form as provided for a client.

5. Number of Batches of Material
The organizers will employ several similar batches of each material simultaneously, so that: (i) within a round, laboratories do not all receive samples from the same batch; and (ii) a laboratory does not, except by chance, receive samples from the same batch in successive rounds.

6. Assignment of Reference Values
Estimates of the true analyte concentration X̂ will be arrived at for each batch of material as the robust mean (see below) of the results of six expert laboratories. Reference values for the standard deviation of reproducibility (σ) will be derived as follows: (i) σ1 = 1 + 0.38X̂1 µg kg−1; (ii) σ2 = 0.2 + 0.058X̂2 µg g−1; (iii) σ3 = 0.04X̂3 % m/m; and (iv) σ4 = 0.015X̂4 % m/m.

7. Statistical Analysis in the nth Round of the Trial
7.1. Robust means and standard deviations
For each of the materials calculate

x* = ŝ/σ and t = √L(x̄ − X̂)/ŝ

where x̄ and ŝ are the appropriate robust mean and standard deviation, respectively, calculated by the method recommended by the Analytical Methods Committee,3 X̂ and σ are the reference values defined in Section 6 above and L is the total number of laboratories returning results. The organizers will compare the values of x* and t with reference distributions to check that the values of X̂ and σ are sensible.

7.2. z-Scores
Each individual result (x) is standardized by conversion into a z-score, thus: z = (x − X̂)/σ.

7.3. Sum of squared z-scores (SSZ)
For each laboratory, the z-scores for the round are combined to give an over-all score for the round: SSZ = Σz². This score will be used only for the purpose of compiling long-term trends.

7.4. Rescaled sum of z-scores (RSZ)
For each laboratory, the z-score for an individual type of material is combined with the corresponding values for the previous three rounds to give a rescaled running score for that material. Hence RSZ = Σz/√4.

8. Decision Limits
Remedial action will be recommended when any of the z-scores or RSZ fall outside the range −3 to 3.

APPENDIX 2
Example Output That Might be Sent to a Participant Laboratory

FOOD LABORATORY PROFICIENCY TESTING SCHEME
ROUND NO. 14, JANUARY 1991; LABORATORY NO. 31

ANALYTE                 REPORTED RESULT   ASSIGNED VALUE   ASSIGNED SIGMA   Z-SCORE    RSZ
AFLATOXIN (TOT) (PPB)        10.7             14.4             5.32         -0.70    -2.14
LEAD (PPM)                    0.0             0.41             0.22         -1.86     0.03
FAT (% M/M)                   9.78            9.50             0.38          0.74     0.88
NITROGEN (% M/M)              2.61            2.55             0.038         1.57     1.30

SSZ FOR ROUND: 6.96

[Figures: histograms of the z-scores of all participants in FLPTS round no. 14 (January 1991) for aflatoxin, lead and fat, with laboratory no. 31 marked 'X' and the lower limit indicated.]

APPENDIX 3
Points of the χ² Distribution

Values of the critical points for scores resulting from combining n z-scores. A combined score is satisfactory if its magnitude is less than A, questionable between A and B, and unsatisfactory over B. The values are the upper 4.55% and 0.27% points of the χ² distribution, corresponding to two-sided Z values of 2 and 3.*

n    2      3      4      5      6      7      8      9      10
A    6.18   8.02   9.72   11.31  12.85  14.34  15.79  17.21  18.61
B    11.83  14.16  16.25  18.21  20.06  21.85  23.57  25.26  26.90

n    11     12     13     14     15     16     17     18     19     20
A    19.99  21.35  22.70  24.03  25.35  26.66  27.96  29.25  30.53  31.80
B    28.51  30.10  31.66  33.20  34.71  36.22  37.70  39.17  40.63  42.08

* The values were calculated by Professor B. D. Ripley.
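The critical points in Appendix 3 can be regenerated from the χ² distribution, as the following sketch shows; it reproduces the published table (upper 4.55% and 0.27% points) rather than adding to it.

```python
# Sketch: regenerate the Appendix 3 critical points A and B as the upper
# 4.55% and 0.27% points of chi-squared with n degrees of freedom
# (the two-sided tail areas of |Z| = 2 and |Z| = 3).
from scipy.stats import chi2, norm

p_a = 2 * norm.sf(2)    # about 0.0455
p_b = 2 * norm.sf(3)    # about 0.0027

print(" n       A       B")
for n in range(2, 21):
    a = chi2.isf(p_a, df=n)   # isf = inverse survival function (upper tail)
    b = chi2.isf(p_b, df=n)
    print(f"{n:2d}  {a:6.2f}  {b:6.2f}")   # n = 2 gives A = 6.18, B = 11.83
```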
References

1 ISO Guide 43-1984 (E), Development and Operation of Laboratory Proficiency Testing, ISO, 1984.
2 Proceedings of the Third International Symposium on the Harmonisation of Quality Assurance Systems in Chemical Analysis, Washington, DC, 1989 (ISO/REMCO 184).
3 Analytical Methods Committee, Analyst, 1989, 114, 1693.
4 Boyer, K. W., Horwitz, W., and Albert, R., Anal. Chem., 1985, 57, 454.
5 Analytical Methods Committee, Analyst, 1987, 112, 199.

Paper 1/045160
Received August 29, 1991

 
