Test results for EQAS samples reported by a laboratory are transformed into z-scores. Absolute z-scores below 2 are satisfactory, those between 2 and 3 are questionable, and those above 3 are unsatisfactory (see my blog on z-scores). Under normal circumstances we expect natural variability to keep the scores between -2 and +2, with a value below -2 or above +2 turning up only occasionally. However, even when all z-scores obtained over time remain within the interval [-2,+2], there may still be a problem with the laboratory.
Indeed, it is very unlikely that an unbiased laboratory would generate a series of z-scores that all have the same sign. Such a series indicates the presence of a laboratory (+method) bias. This can be checked with the RSZ statistic (Rescaled Sum of Z): RSZ = Sum(z)/sqrt(m), where m is the number of z-scores. Moreover, it is very unlikely that the z-scores are always found at one of the extremes of the [-2,+2] interval. This is evaluated with the RSSZ statistic (Reduced Sum of Squares of Z): RSSZ = Sum(z^2)/m. Both RSZ and RSSZ can be evaluated with the appropriate statistical test: if the z-scores are independent and standard normal, RSZ is itself standard normal and m x RSSZ follows a chi-square distribution with m degrees of freedom.
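To make the two statistics concrete, here is a minimal Python sketch of the formulas above. It assumes the z-scores are independent and standard normal when the laboratory has no bias, so that RSZ can be compared with a standard normal distribution. The function names and the `one_sided` series are made up for illustration; the `typical` series is the set quoted further down in this post.

```python
import math

def rsz(z_scores):
    """Rescaled sum of z-scores: Sum(z) / sqrt(m)."""
    m = len(z_scores)
    return sum(z_scores) / math.sqrt(m)

def rssz(z_scores):
    """Sum of squared z-scores divided by m, as defined in the text."""
    m = len(z_scores)
    return sum(z * z for z in z_scores) / m

def rsz_p_value(z_scores):
    """Two-sided p-value for RSZ, assuming RSZ is standard normal without bias."""
    return math.erfc(abs(rsz(z_scores)) / math.sqrt(2))

# Hypothetical series: every score is 'satisfactory', but all have the same sign.
one_sided = [1.2, 0.9, 1.5, 1.1, 0.8, 1.4]
# The 'typical' series quoted later in this post.
typical = [0.6, -0.8, 0.3, 1.7, 0.7, -0.1]

for name, series in [("one-sided", one_sided), ("typical", typical)]:
    print(f"{name}: RSZ = {rsz(series):.2f}, "
          f"RSSZ = {rssz(series):.2f}, "
          f"p(RSZ) = {rsz_p_value(series):.4f}")
```

Under these assumptions the one-sided series gives an RSZ of about 2.82, which points to a bias even though no single score exceeds 2, while the typical series gives an RSZ of about 0.98, well within what chance alone produces. A p-value for RSSZ would need the chi-square distribution with m degrees of freedom (applied to m x RSSZ), which is left out here to keep the sketch within the standard library.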
It is quite striking that several accreditation bodies (e.g. the Standards Council of Canada, the European Directorate for the Quality of Medicines of the Council of Europe and the CANMET-MMSL ISO document) incorporate this time aspect in their guidelines and WADA does not, even though the detection of systematic errors (or bias) is of great importance in doping testing.
We should nevertheless warn against over-interpretation of z-scores:
- Comparing z-scores between rounds or between laboratories has to be done with great caution. A single laboratory operating consistently in line with the fitness for purpose criterion would typically produce z-scores in successive rounds covering the range –2 to +2: the following set [0.6, -0.8, 0.3, 1.7, 0.7, -0.1] would be typical. The small ups and downs between the scores do not indicate a change in performance – they arise by chance. So 1.7 is not ‘worse’ than 0.3: it does not indicate a deterioration in performance!
- Because of this ‘natural variation’ it is not sensible to make a ‘league table’ of laboratories (or to attribute points) based on their z-scores in a round. It is not valid to claim that a laboratory scoring 0.3 in a round is better than another scoring 1.7.
- Judgments based on time-averaged z-scores require caution as well. Averages of z-scores obtained on a number of different analytes should not be used: they may well hide the fact that one of the analytes consistently gives a poor z-score. Averages of scores from the same analyte over several rounds need expert interpretation on a statistical basis, as we indicated above.
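The last point can be illustrated with a small Python sketch. The analyte names and all the z-scores below are invented for the illustration; the RSZ formula is the one given earlier, and an |RSZ| above roughly 2 or 3 is read as suspicious only under the assumption that unbiased z-scores are independent and standard normal.

```python
import math
from statistics import mean

# Hypothetical z-scores for three analytes over five rounds (invented data).
scores = {
    "analyte_A": [0.4, -0.3, 0.6, -0.5, 0.2],
    "analyte_B": [-0.6, 0.1, -0.4, -0.7, 0.3],
    "analyte_C": [1.8, 1.6, 1.9, 1.7, 1.9],   # always 'satisfactory', always high
}

def rsz(z_scores):
    """Rescaled sum of z-scores: Sum(z) / sqrt(m)."""
    return sum(z_scores) / math.sqrt(len(z_scores))

# Averaging over all analytes and rounds gives an unremarkable number...
pooled_mean = mean(z for series in scores.values() for z in series)

# ...while looking at each analyte separately exposes the consistent offset.
per_analyte = {name: rsz(series) for name, series in scores.items()}

print(f"pooled mean z = {pooled_mean:.2f}")
for name, value in per_analyte.items():
    print(f"{name}: RSZ = {value:.2f}")
```

With these invented numbers the pooled mean is about 0.5 and looks harmless, whereas the RSZ of the third analyte is close to 4. The per-analyte view is the one that reveals the problem.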