1. Introduction
Diagnostic and screening tests used in clinical medicine may be evaluated by many metrics based on the use of test thresholds and tabulated in a 2x2 contingency table comparing the index test to a reference standard (e.g. diagnostic criteria) [1,2]. All of these metrics derived from binary classification have both advantages and limitations. For example, sensitivity (Sens) and specificity (Spec), probably the most commonly used test metrics (e.g. in Cochrane systematic reviews), presuppose a known diagnosis, contrary to the typical clinical situation in which such tests are administered, and hence are difficult to apply directly to the particular doctor-patient encounter. Moreover, these values vary with the chosen test cut-off. Predictive values, positive (PPV) and negative (NPV), which are more patient-oriented measures than Sens and Spec, are dependent on the prevalence of the condition in the population under study. To address some of these shortcomings, a new metric called the Efficiency Index has been proposed and developed [3,4]. The basic concept is as follows.
The 2x2 contingency table (Figure 1) categorises all outcomes as per standard signal detection theory [5] as: true positives (TP) or hits; true negatives (TN) or correct rejections; false positives (FP) or false alarms; and false negatives (FN) or misses. From these four cells, two conditions or relations between the index test and the reference standard may be distinguished: consistency, or matching, of outcomes (+/+ or true positive, and -/- or true negative); and contradiction, or mismatching (+/- or false positive, and -/+ or false negative).
From these two conditions, the paired complementary parameters of accuracy (Acc) or fraction correct, and inaccuracy (Inacc) or fraction incorrect or error rate, may be derived, as:
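With N = TP + FP + FN + TN, these are:

$$\mathrm{Acc} = \frac{TP + TN}{N}, \qquad \mathrm{Inacc} = \frac{FP + FN}{N} = 1 - \mathrm{Acc}$$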
Kraemer previously denoted the sum of (TP + TN) by the term “efficiency” [6]; accordingly, the sum of (FP + FN) may be termed “inefficiency”. Hence the ratio of efficiency to inefficiency may be denoted as the “efficiency index” (EI) [3,4], as:
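That is:

$$\mathrm{EI} = \frac{\mathrm{Acc}}{\mathrm{Inacc}} = \frac{TP + TN}{FP + FN}$$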
The boundary values of EI are thus 0 (when Acc = 0; Inacc = 1) and ∞ (Acc = 1, Inacc = 0), denoting respectively a useless classifier and a perfect classifier. EI values have an inflection point at 1, whereby a value >1 indicates correct classification and a value of <1 indicates incorrect classification, such that values >>1 are desirable and a value of ∞ is an optimal classifier [
3]. EI is of the form x/(1 – x) and hence is an odds ratio.
Because the “efficiency index” terminology is potentially ambiguous (there is a similarly named physical index of speed achieved in relation to power output, and various other efficiency indexes are described for energy efficiency or in business and finance), one might, instead of EI, use the term “selectivity”, since EI effectively selects true outcomes over false outcomes. Another possible way around any nomenclature issues is to calculate an “inefficiency index” (InI), defined as:
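That is:

$$\mathrm{InI} = \frac{\mathrm{Inacc}}{\mathrm{Acc}} = \frac{FP + FN}{TP + TN} = \frac{1}{\mathrm{EI}}$$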
Like EI, values of InI have an inflection point at 1, but in this formulation an InI value <1 indicates correct classification and a value of >1 indicates incorrect classification, such that InI values <<1 are desirable and a value of 0 is an optimal classifier (Inacc = 0, Acc = 1). InI is of the form (1 – x)/x and hence is an odds-against ratio.
The EI metric has a number of potential advantages. One is the simple calculation of confidence (or compatibility) intervals (CI) for EI values by applying the log method [7] to the base data from the four cells of the 2x2 contingency table. For the 95% CI, the formula for EI is [4]:
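Applying the standard log method for a ratio of two independent counts gives (a reconstruction under the usual assumptions; the exact form used in [4] may differ):

$$95\%\ \mathrm{CI} = \exp\!\left(\ln(\mathrm{EI}) \pm 1.96 \times SE\right), \quad \text{where } SE = \sqrt{\frac{1}{TP + TN} + \frac{1}{FP + FN}}$$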
Classification of EI values, both qualitatively and semi-quantitatively, is also possible. The boundary values of EI (0, ∞) are the same as those for likelihood ratios (LRs). LR values may be categorised qualitatively as slight, moderate, large, or very large [8]. It has been shown that this qualitative classification may also be applicable to EI values [3] (Table 1).
EI is a form of odds ratio; its boundary values (0, ∞) are the same as those for another outcome parameter derived from the 2x2 contingency table, the diagnostic odds ratio (DOR) or cross-product ratio [9]. DORs may be categorised qualitatively as small, medium, large, or very large [10]. The suggested qualitative classification scheme for DORs might therefore also be applicable to EI values (Table 1). Also, just as log(DOR) is sometimes used to compensate for small values in one or more cells of the 2x2 contingency table, log(EI) may be calculated in similar circumstances from Acc and Inacc, specifically from their logit, where:
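Since Inacc = 1 − Acc, this is:

$$\log(\mathrm{EI}) = \log\!\left(\frac{\mathrm{Acc}}{1 - \mathrm{Acc}}\right) = \log(\mathrm{Acc}) - \log(\mathrm{Inacc}) = \mathrm{logit}(\mathrm{Acc})$$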
A more quantitative classification of EI values may also be applied, based on the system for LRs derived by McGee [11], which calculates the (approximate) difference between pre- and post-test odds, since post-test odds = pre-test odds × LR. McGee showed that LR values of 2, 5, and 10 increase the probability of diagnosis by approximately 15%, 30%, and 45% respectively, whereas LR values of 0.5, 0.2, and 0.1 decrease the probability of diagnosis by approximately 15%, 30%, and 45% respectively. These figures derive from the almost linear relationship between probability and the natural logarithm of odds over the probability range 0.1 to 0.9. With the appropriate modification, the percentage change in probability of diagnosis may be calculated for EI values independent of pre-test probability [3], as:
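McGee's heuristic approximates the change in probability as 0.19 × ln(LR); applying the same form to EI (the exact constant used in [3] is assumed here to be the same) gives:

$$\%\ \text{change in probability} \approx 19 \times \ln(\mathrm{EI})$$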
The EI construct may be extended to other formulations. Although Sens and Spec, as strictly columnar ratios in the 2x2 contingency table, are mathematically independent of prevalence, P, Acc and Inacc as diagonal ratios are dependent on P:
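Specifically:

$$\mathrm{Acc} = P \cdot \mathrm{Sens} + (1 - P) \cdot \mathrm{Spec}, \qquad \mathrm{Inacc} = P(1 - \mathrm{Sens}) + (1 - P)(1 - \mathrm{Spec})$$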
EI is thus dependent on disease prevalence [4]. Both Sens and Spec, and hence Acc and Inacc, are also dependent on the level of the test, Q [6], or threshold. Balanced and unbiased EI measures (discussed in Methods, Sects. 2.2.1 and 2.2.4 respectively) were previously introduced to try to take into account the values of P and Q [4]. In these and other studies, EI values have found application in the assessment of cognitive screening instruments used in the evaluation of patients with memory complaints [3,4,12].
The aims of the current paper were twofold: firstly, to extend the EI construct to other formulations, namely “balanced level” and “quality” variants (discussed in Methods, Sects. 2.2.2 and 2.2.3 respectively); and secondly, to apply all the described EI variants, both previously described (standard, balanced, unbiased [3,4]) and new (balanced level, quality), to the dataset of a large prospective test accuracy study of a cognitive screening instrument, the Mini-Addenbrooke’s Cognitive Examination (MACE).
3. Results
In the MACE study dataset, at the optimal MACE cut-off of ≤20/30 (calculated from the maximal value of the Youden index [15]), the outcomes were TP = 104, FP = 188, FN = 10, and TN = 453. Hence:
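These headline values can be reproduced directly from the four cells (a minimal sketch of the standard EI calculation only; the BEI, BLEI, QEI, and UEI variants are defined in Methods and omitted here):

```python
# Worked example: standard Efficiency Index (EI) from the MACE study cells
# at cut-off <=20/30: TP = 104, FP = 188, FN = 10, TN = 453.

def efficiency_index(tp, fp, fn, tn):
    """Return (accuracy, inaccuracy, EI) from a 2x2 contingency table."""
    n = tp + fp + fn + tn
    acc = (tp + tn) / n          # fraction correct
    inacc = (fp + fn) / n        # fraction incorrect; equals 1 - acc
    ei = (tp + tn) / (fp + fn)   # EI = Acc / Inacc
    return acc, inacc, ei

acc, inacc, ei = efficiency_index(104, 188, 10, 453)
print(round(acc, 3), round(inacc, 3), round(ei, 2))  # 0.738 0.262 2.81
```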
The dataset was used to calculate values for EI, BEI, BLEI, QEI, and UEI across the range of meaningful MACE cut-offs (Table 2) and these were displayed graphically (Figure 2; extended from [4]).
These data show that those measures which do not take into account the value of Q, specifically EI and BEI, have higher maximal values and may therefore give a more optimistic view of test outcomes [4].
Conversely, measures taking into account Q, specifically BLEI, QEI, and UEI, are more stringent measures which give a more pessimistic view of test outcomes. QEI and UEI give relatively stable values across the range of MACE cut-offs.
4. Discussion
Reanalysis of the dataset from the MACE study has shown the feasibility of calculating the different EI formulations. All these variants are based on different formulations of Acc and its complement Inacc, and the extended forms try to take into account values of base rate or prevalence, P, and/or the level or bias of the test, Q. Like EI, they all have boundary values of 0 and ∞. Of these EI formulations, the quality and unbiased formulations (QEI, UEI) were found in this dataset to be the most stringent measures of test outcome when compared to the standard and balanced formulations (EI, BEI), with the balanced level formulation (BLEI) falling in between.
In what circumstances might EI or its extended formulations prove useful in clinical practice? It has been suggested [3,4] that EI may effectively communicate risk to clinicians, patients, and their families, specifically the risk of correct diagnosis versus misdiagnosis for a particular test, in a way that is more transparent than sensitivity, specificity, predictive values, and likelihood ratios. EI values have been calculated for various cognitive screening instruments [3,4,12]. This measure may be of particular value when the administration of invasive tests, or tests associated with morbidity and even mortality, is being proposed.
5. Limitations
EI as a measure for the evaluation of classifiers has hitherto been applied only to medical classifiers. More widespread application will therefore be required to assess the utility of EI and its various extensions.
Comparison with other frequently used measures of classifier evaluation, such as the area under the receiver operating characteristic (ROC) curve [19] or the F measure [20], as well as other odds ratios and the logit transform, might also help to define the place of EI in the catalogue of measures available for these purposes [1,2].
Further refinements of EI and its variants may also be explored, for example to address its application to imbalanced datasets [21] (although in the current example there was class imbalance in the dataset, with prevalence of dementia P = 0.15 [15]).
For example, high EI values could result from very high numbers of TN alone, even if numbers of TP were modest, as long as numbers of FP and FN were few, a situation which may be encountered, for example, when handling administrative health datasets [22] and polygenic hazard scores [23]. Addressing the class imbalance problem using methods which oversample the minority class (in the current example, TP cases), such as variants of the synthetic minority oversampling technique, SMOTE [21], or which undersample the majority class (in the current example, TN), might be applicable. Comparisons of EI with measures such as the F measure [2,20] or the critical success index [2,24], which eschew TN values, might be particularly pertinent in this class imbalance situation.
As well as statistical tests, data semantics may be crucial in applied research, such that a new semantics may need to be identified to improve the classifier [25]. In the particular case of cognitive screening instruments, as examined here, it is known that test performance may be influenced by factors other than cognitive ability alone, such as level of fatigue, mood disorder, and primary visual or hearing impairment. Restricting the dataset to take into account such confounding factors might be one strategy when further examining EI and its variants. Another might be to introduce a time element, with repeat or sequential testing to ensure intra-test reliability as a criterion for data semantics.