1. Introduction
The 2x2 contingency table remains the most fundamental of data structures in the categorical data analysis literature. Due to its relative simplicity in summarizing the joint frequencies of two dichotomous variables, understanding the nature of the association between these variables has a long and rich history. One may refer to the key texts of [
1](pp. 219–225), [
2](Chapter 2), [
3] (Chapter 3), [
4] (Chapter 6), [
5](Chapter 2), [
6](Section 2.1–2.3), [
7,
8](Section 4.4) and [
9] (Chapter 4) for comprehensive and varied discussions on the analysis of 2x2 contingency tables. The past three decades have seen new techniques developed for the analysis of 2x2 contingency tables that involve analyzing the association between the variables when the joint frequencies are unknown (or assumed unknown) or missing. This branch of analysis is known as
ecological inference (EI) and is especially popular in the political and social sciences where marginal, or aggregate, data of dichotomous variables is all that is available for analysis; see, for example [
10,
11,
12,
13,
14]. As a result, EI has also gained momentum as a topic of research in the statistics literature; see, for example, [
15,
16,
17,
18,
19,
20,
21,
22]. Furthermore, the growing availability of R and Python packages has also helped to provide researchers with the tools necessary to perform EI; see, for example, [
23,
24,
25,
26,
27,
28,
29]. Despite the relative youth of EI in the statistical discipline, studying the association structure between two dichotomous variables given only the marginal information was raised earlier by [
30]. Fisher’s view was that the marginal information of a single 2x2 table provides only ancillary information if inferring the joint cell frequencies is of interest. [
31,
32] also considered this same issue. [
33](p. 447) agreed with Fisher’s conclusion but argued that, for “extreme” marginal frequencies, the estimation of the cell values was possible. [
34] demonstrated that, when only the marginal information is available, the maximum likelihood estimates of the joint cell values do not exist unless one of the cells is zero. Others to have considered this issue are [
35,
36,
37].
A common feature of the EI strategies proposed to date is that they all rely on assumptions about the missing data that are either unknown, untestable, or both. In fact, [
20] (p. 198), in an empirical evaluation of EI techniques using data describing gender and voter turnout at New Zealand elections (1893–1919), said
“all EI methods make assumptions about the data to compensate for the loss of information due to aggregation”.
Noteworthy also is that none of the EI strategies mentioned above are applicable for the analysis of a single 2x2 contingency table. Therefore, rather than focusing on the estimation of the cell values of 2x2 tables, the focus can be and has been redirected to determining the association structure between the two dichotomous variables, given only the marginal information. In doing so, [
38,
39] developed an index that does exactly this and referred to it as the
aggregate association index or the AAI. The AAI quantifies, on a [0, 100] scale, the extent of association that may be present in the table, based only on the marginal information. It does this by identifying those cell values that lead to a statistically significant association between the variables, keeping in mind that the permissible cell values are constrained by a special case of the Fréchet bounds [
40].
Further development of the AAI has since been undertaken by [
41,
42,
43,
44]. See [
45] for an application of the AAI to the 1893 election data of New Zealand, the first country to permit female voting. We also refer the interested reader to [
46] who presented a novel application of the AAI for the clustering of stratified aggregated data using the New Zealand voter turnout data (1893–1919). These applications and developments were reported in [
47] and elaborated on earlier in [
48].
In this paper, we discuss the role of the AAI for assessing how likely it is that the association between the two dichotomous variables of a 2x2 contingency table is statistically significant, at the α level of significance, given only the row and column totals. The major contribution of this paper is the development of a new index, the
aggregate informative index (AII), which quantifies how much information, on a [0, 100] scale, there is in the row and column totals of a 2x2 contingency table for concluding that a statistically significant association exists between the variables. It is established in this paper that, unlike Pearson’s (and other forms of the) chi-squared statistic and the AAI, the new index, AII, is immune to changes in the sample size. The applicability of the AII is demonstrated by using the real-life classic data sets of R.A. Fisher’s criminal twin data [
30] and Irving Selikoff’s asbestosis data [
49].
This paper has been divided into 6 further sections. In
Section 2, we define the notation of a 2x2 table.
Section 3 provides a brief discussion of the AAI, its theory and some of its properties, while
Section 4 defines and describes the development of the AII; the origins of this index can be found in [
50](Chapter 10).
Section 5 and
Section 6 empirically study the features of the AII using Fisher’s criminal twin data [
30] and Selikoff’s asbestosis data [
49], respectively. Some final comments are made in
Section 7.
2. The 2x2 Contingency Table
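Consider a 2x2 contingency table, N, of sample size n. Denote $n_{ij}$ to be the joint frequency of the $(i, j)$th cell so that its relative joint frequency is $p_{ij} = n_{ij}/n$ for $i = 1, 2$ and $j = 1, 2$. Define the $i$th row and $j$th column marginal frequency by $n_{i\bullet}$ and $n_{\bullet j}$ respectively, such that $\sum_{i} n_{i\bullet} = \sum_{j} n_{\bullet j} = n$ is the total sample size. We shall also denote the $i$th row and $j$th column relative marginal frequency by $p_{i\bullet} = n_{i\bullet}/n$ and $p_{\bullet j} = n_{\bullet j}/n$ respectively.
Table 1 provides a description of notation used in this paper.
For the purposes of investigating how informative the marginal frequencies of N are for analyzing the association between the row and column variables, we consider the conditional probabilities $P_1 = n_{11}/n_{1\bullet}$ and $P_2 = n_{21}/n_{2\bullet}$. Here, $P_1$ is the conditional probability of the classification of an individual/unit into “Column 1” given that it has been classified into “Row 1”. Similarly, $P_2$ is the conditional probability of an individual/unit being classified into “Column 1” given that it has been classified into “Row 2”. Under the hypothesis of independence between the two dichotomous variables, the expected value of $P_1$ is $p_{\bullet 1}$. We shall also consider the overall mean cell frequency of the four cells of Table 1, which we denote by $\bar{n} = n/4$. Therefore, the overall mean cell proportion for the $(i, j)$th cell is $\bar{p} = \bar{n}/n = 0.25$.
When the cell values of Table 1 are not known, then $n_{11}$ lies within the Fréchet bounds
$$\max(0,\; n_{\bullet 1} - n_{2\bullet}) \le n_{11} \le \min(n_{\bullet 1},\; n_{1\bullet}). \tag{1}$$
These bounds have been considered for the analysis of the 2x2 contingency table, especially in the EI literature; see, for example, [
13,
51]. By considering (1), $P_1$ is therefore bounded by
$$L_1 = \max\left(0,\; \frac{n_{\bullet 1} - n_{2\bullet}}{n_{1\bullet}}\right) \le P_1 \le \min\left(\frac{n_{\bullet 1}}{n_{1\bullet}},\; 1\right) = U_1. \tag{2}$$
Using only the row and column marginal information of a 2x2 table, [
39] showed that when a test of the association between the variables is made at the α level of significance, the bounds of $P_1$ are narrowed to
$$L_\alpha = \max\left(0,\; p_{\bullet 1} - p_{2\bullet}\sqrt{\frac{\chi^2_\alpha}{n}\,\frac{p_{\bullet 1}p_{\bullet 2}}{p_{1\bullet}p_{2\bullet}}}\right) < P_1 < \min\left(1,\; p_{\bullet 1} + p_{2\bullet}\sqrt{\frac{\chi^2_\alpha}{n}\,\frac{p_{\bullet 1}p_{\bullet 2}}{p_{1\bullet}p_{2\bullet}}}\right) = U_\alpha. \tag{3}$$
Here, $\chi^2_\alpha$ is the 1 - α percentile of the chi-squared distribution with one degree of freedom.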
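The bounds described above can be computed directly from the margins. The following is a minimal sketch only: the function names and the margins used in the example (row totals 40 and 60, first column total 30) are hypothetical and are not taken from any table in this paper. It assumes that the α-level bounds are found by solving Pearson's statistic, written as a function of the conditional proportion for Row 1, for the chi-squared critical value, and it uses the fact that a chi-squared variable with one degree of freedom is the square of a standard normal variable.

```python
from statistics import NormalDist


def chi2_critical_1df(alpha):
    """1 - alpha percentile of the chi-squared distribution with 1 df.

    A chi-squared(1) variable is the square of a standard normal variable,
    so the critical value is the squared two-sided normal quantile.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * z


def p1_bounds(n1_row, n2_row, n1_col, alpha=0.05):
    """Bounds on P1 = n11 / n1. using only the margins (illustrative helper)."""
    n = n1_row + n2_row
    n2_col = n - n1_col

    # Frechet bounds on the unknown cell frequency n11.
    n11_min = max(0, n1_col - n2_row)
    n11_max = min(n1_col, n1_row)
    L1, U1 = n11_min / n1_row, n11_max / n1_row

    # Relative marginal frequencies.
    p1r, p2r = n1_row / n, n2_row / n
    p1c, p2c = n1_col / n, n2_col / n

    # Half-width of the interval on which Pearson's statistic stays
    # below the critical value (assumed form of the alpha-level bounds).
    crit = chi2_critical_1df(alpha)
    half_width = p2r * (crit / n * (p1c * p2c) / (p1r * p2r)) ** 0.5
    L_alpha = max(0.0, p1c - half_width)
    U_alpha = min(1.0, p1c + half_width)
    return {"L1": L1, "U1": U1, "L_alpha": L_alpha, "U_alpha": U_alpha}


# Hypothetical margins: rows (40, 60), first column 30, so n = 100.
example = p1_bounds(40, 60, 30, alpha=0.05)
print(example)
```

Values of the conditional proportion lying between the two α-level bounds are those for which the test would not reject independence; values outside them, but inside the Fréchet-based bounds, correspond to a statistically significant association.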
3. Aggregate Association Index (AAI)
By considering only the marginal information of a single 2x2 table, [
38,
39] developed the AAI. The AAI is bounded by [0, 100] and quantifies, for a given level of significance α, how likely a particular set of fixed marginal frequencies will enable the analyst to conclude that there exists a statistically significant association between the two dichotomous variables. An AAI close to zero indicates that there is virtually no information in the margins to suggest that such an association might exist, while an AAI close to 100 reflects that such an association is very likely to exist. This section briefly outlines the AAI and shows the impact that the sample size, n, and extreme margins have on its magnitude.
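When the four cell frequencies of N are unknown, $P_1$ is also unknown, but bounded by (2). Therefore, we may consider Pearson’s chi-squared statistic as a function of $P_1$ such that
$$X^2(P_1) = n\left(\frac{P_1 - p_{\bullet 1}}{p_{2\bullet}}\right)^2 \frac{p_{1\bullet}p_{2\bullet}}{p_{\bullet 1}p_{\bullet 2}}. \tag{4}$$
See, for example, [
38](eq. (16)). Therefore, by graphically depicting the relationship between (4) and $P_1$, we obtain a parabolic curve with positive concavity. This curve is referred to as the AAI curve and is depicted in Figure 1. Since we are interested in detecting where there exists a statistically significant association between the row and column variables of Table 1, this can then be assessed by observing those $X^2(P_1)$ values which exceed the critical value of $\chi^2_\alpha$, but lie under the AAI curve. This region is represented by the shaded area of Figure 1. Therefore, the proportion of this shaded area, when compared with the total area under the curve, is
$$A_\alpha = 100\left(1 - \frac{\chi^2_\alpha\left[(L_\alpha - L_1) + (U_1 - U_\alpha)\right] + \int_{L_\alpha}^{U_\alpha} X^2(P_1)\,dP_1}{\int_{L_1}^{U_1} X^2(P_1)\,dP_1}\right) \tag{5}$$
and is the AAI of N, where $L_1$ and $U_1$ are the bounds of (2) and $L_\alpha$ and $U_\alpha$ are the bounds of (3). [
39] showed that (5) can be alternatively, and equivalently, expressed free of the integrals so that
$$A_\alpha = 100\left(1 - \frac{\chi^2_\alpha\left[(L_\alpha - L_1) + (U_1 - U_\alpha)\right]}{kn\left[(U_1 - p_{\bullet 1})^3 - (L_1 - p_{\bullet 1})^3\right]} - \frac{(U_\alpha - p_{\bullet 1})^3 - (L_\alpha - p_{\bullet 1})^3}{(U_1 - p_{\bullet 1})^3 - (L_1 - p_{\bullet 1})^3}\right)$$
where
$$k = \frac{1}{3p_{2\bullet}^2}\,\frac{p_{1\bullet}p_{2\bullet}}{p_{\bullet 1}p_{\bullet 2}}.$$
The maximum value that the AAI can attain is 100 when the extent of association between both variables is very high. Similarly, the minimum possible value of the AAI is zero and indicates that, given only the marginal information of
Table 1, the likelihood of a statistically significant association existing between the variables, at the α level of significance, is very low. [
39](
Section 4) showed that the AAI can also be partitioned as follows
$$A_\alpha = A_\alpha^{+} + A_\alpha^{-}$$
where $A_\alpha^{+}$ is the aggregate positive association index and is that part of $A_\alpha$ that reflects the extent to which the marginal information of N reflects a statistically significant positive association at the α level of significance. Similarly, $A_\alpha^{-}$ is the aggregate negative association index and reflects a statistically significant negative association at this value of α.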
An important issue that needs to be considered when calculating an AAI is that the sample size of
N, n, has an impact on its magnitude. One can see that the Pearson’s chi-square statistic, (4), is greatly influenced by the magnitude of the sample size n; see also [
2](p. 56). As the sample size increases so does Pearson’s chi-squared statistic, a feature described by [
52]. For example, doubling the original sample size doubles the magnitude of the chi-squared statistic. Therefore, for a fixed level of significance, α, the AAI will also increase with the sample size. As with Pearson’s chi-squared statistic, this can create problems when assessing the association structure of the variables given only the marginal information. To help reduce this impact of the sample size on the magnitude of the AAI, [
42] derived alternative definitions of the AAI, (5). We shall not describe these alternatives here.
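To make the area-based definition of the AAI and its dependence on the sample size concrete, the following sketch computes the index from the margins alone. It is an illustration under stated assumptions rather than the authors' implementation: the function names and the margins are hypothetical, the area under the parabola is evaluated with its closed-form cubic antiderivative, and the α-level bounds are clipped to the Fréchet-based bounds so that no area is negative.

```python
from statistics import NormalDist


def pearson_chi2(P1, n, p1r, p1c):
    """Pearson's statistic as a function of P1 for fixed relative margins."""
    p2r, p2c = 1 - p1r, 1 - p1c
    return n * ((P1 - p1c) / p2r) ** 2 * (p1r * p2r) / (p1c * p2c)


def aai(n1_row, n2_row, n1_col, alpha=0.05):
    """Aggregate association index from the margins only (illustrative)."""
    n = n1_row + n2_row
    p1r, p2r = n1_row / n, n2_row / n
    p1c = n1_col / n
    p2c = 1 - p1c
    crit = NormalDist().inv_cdf(1 - alpha / 2) ** 2  # chi-squared(1) critical value

    # Bounds on P1 implied by the Frechet bounds on n11.
    L1 = max(0, n1_col - n2_row) / n1_row
    U1 = min(n1_col, n1_row) / n1_row

    # Alpha-level bounds, clipped to [L1, U1] (an assumption of this sketch).
    half_width = p2r * (crit / n * (p1c * p2c) / (p1r * p2r)) ** 0.5
    La = max(L1, p1c - half_width)
    Ua = min(U1, p1c + half_width)

    # Closed-form area under the parabola between a and b.
    k = p1r / (3 * p2r * p1c * p2c)

    def area(a, b):
        return k * n * ((b - p1c) ** 3 - (a - p1c) ** 3)

    # Area that does NOT correspond to a significant association:
    # rectangles below the critical value plus the curve between La and Ua.
    unshaded = crit * ((La - L1) + (U1 - Ua)) + area(La, Ua)
    return 100 * (1 - unshaded / area(L1, U1))


# Hypothetical margins, then the same proportions with double the sample size.
aai_n100 = aai(40, 60, 30)
aai_n200 = aai(80, 120, 60)
print(round(aai_n100, 2), round(aai_n200, 2))
```

With the relative margins held fixed, doubling n doubles `pearson_chi2` at every value of `P1`, and the index increases accordingly, which is the sample-size sensitivity discussed above.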
4. Aggregate Informative Index (AII)
To accommodate the feature that any change in the sample size of N impacts on the magnitude of Pearson’s statistic and, therefore, the AAI, this section introduces a new index that assesses how informative, on a scale from 0 to 100, the marginal frequencies of a 2x2 contingency table are for concluding whether a statistically significant association exists between the variables of the table. This index is referred to as the aggregate informative index, or the AII. To develop this index, we first need to establish a “benchmark” quantity that reflects no information in the marginal totals of N.
4.1. The Benchmark Situation (No-Information)
For any given sample size, n, of a 2x2 contingency table, the individuals/units can be classified into each of the two row and two column categories in a variety of ways. Here we shall define the benchmark situation to be the case where the sample size is equally distributed between the two row categories and the two column categories. For example, in the case where n is even, the benchmark situation arises when $n_{1\bullet} = n_{2\bullet} = n_{\bullet 1} = n_{\bullet 2} = n/2$. With no further information on the classifications made in the contingency table and assuming that the individuals/units are uniformly distributed between the two categories, this benchmark situation is considered to be the most conservative option. Allocations based on other criteria may also be considered to define the benchmark situation, but to keep the description of our new index simple, we shall not consider them here.
As described by [
39], when only marginal information is available, the benchmark situation is also where the least amount of information on the association structure exists. It is also then equally likely that the dichotomous variables are positively or negatively associated. As one moves closer to the case where the allocation of the sample size amongst the categories is deemed to be “extreme” (for example, when
or
), the information contained in the margins for establishing whether a statistically significant association exists between the variables becomes more apparent. Based on the underlying structure of the AAI, we shall now quantify how informative the marginal information is by comparing it with the benchmark situation.
In the benchmark situation, the expected cell frequency of the $(i, j)$th cell, under the null hypothesis of independence between the two dichotomous variables, is identical to the overall mean cell value of the cells. That is, $n_{i\bullet}n_{\bullet j}/n = n/4$. Therefore, in the benchmark situation, $P_1$ is bounded by
$$0 \le P_1 \le 1 \tag{6}$$
while Pearson’s chi-squared statistic is a parabolic function of $P_1$ with positive concavity such that
$$X_B^2(P_1) = 4n\left(P_1 - 0.5\right)^2. \tag{7}$$
Therefore, the AAI curve that describes this relationship is symmetric around $P_1 = 0.5$ and this is also where $X_B^2(P_1)$ attains its minimum value of zero. The AAI curve depicted using (7) is referred to as the benchmark curve. In this benchmark case, the maximum value of $X_B^2(P_1)$ will be equal to the sample size, n, and this arises at the bounds of (6).
Figure 2 provides a visual comparison of the benchmark and AAI curves given the margins of an unspecified 2x2 contingency table. The shaded region reflects how much information there is in the row and column totals to conclude that the association between the dichotomous variables is statistically significant at the α level of significance. We now describe how to quantify the area of this shaded region.
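The properties of the benchmark curve stated above can be checked numerically. The sketch below is illustrative only: the function names and the sample size are hypothetical, and the benchmark curve is obtained by substituting relative margins of 0.5 into Pearson's statistic written as a function of the conditional proportion for Row 1.

```python
def pearson_chi2(P1, n, p1r, p1c):
    """Pearson's statistic as a function of P1 for fixed relative margins."""
    p2r, p2c = 1 - p1r, 1 - p1c
    return n * ((P1 - p1c) / p2r) ** 2 * (p1r * p2r) / (p1c * p2c)


def benchmark_chi2(P1, n):
    """Benchmark curve: all four relative margins equal to 0.5."""
    return 4 * n * (P1 - 0.5) ** 2


n = 50  # hypothetical sample size

# The benchmark curve is the general curve evaluated at equal margins.
grid = [i / 20 for i in range(21)]
max_gap = max(abs(pearson_chi2(P, n, 0.5, 0.5) - benchmark_chi2(P, n)) for P in grid)

minimum = benchmark_chi2(0.5, n)    # centre of symmetry
at_lower = benchmark_chi2(0.0, n)   # lower bound of P1
at_upper = benchmark_chi2(1.0, n)   # upper bound of P1
print(max_gap, minimum, at_lower, at_upper)
```

The curve is zero at the centre of the permissible range and reaches n at both ends of it, so any asymmetry or shift in an observed AAI curve relative to this parabola comes from the margins rather than from the sample size.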
4.2. The Aggregate Informative Index
The rationale underlying the AII is to quantify the area arising from any deviation of the AAI curve from the benchmark situation. This area is quantified relative to the maximum possible area between the benchmark and AAI curves and is defined by
so that the AII lies between 0 and 100. For (11), the numerator (denoted D) is the area under the curve specified by the difference between the benchmark curve and the AAI curve and is dependent on the range of possible $P_1$ values. The denominator of (11) (denoted M) is the maximum possible area under the AAI or benchmark curve.
If the AII is close to 100 then the features of the AAI curve are as different from the benchmark curve as can be. Thus, the marginal information of the 2x2 contingency table varies considerably from the benchmark situation. Hence, the marginal information is deemed to be informative for determining the statistical significance of the association between the variables. Conversely, an AII close (or equal) to zero shows that the marginal information is consistent with the benchmark situation. Therefore, the marginal information of the 2x2 contingency table is deemed to be not very informative for determining the association between the variables.
We can simplify the AII of (11) by removing the integrals in the expression. In doing so
and
where
Therefore,
can be alternatively, and equivalently, expressed as
Therefore, the AII, (11), may be alternatively, and equivalently, expressed without the need for the integrals so that
Since the magnitudes of
,
and
do not depend on the sample size, n, the magnitude of the AII is independent of n. Therefore, unlike the AAI and Pearson’s chi-squared statistic, any change in the sample size of the 2x2 contingency table does not impact on the magnitude of the AII. This feature is shown in the two applications discussed in detail in
Section 5 and
Section 6.
When visualizing the AII, identifying the points where the benchmark and AAI curves intersect is important. Since both curves are parabolic, there will be either one or two points of intersection. Suppose we consider the case where there are two points of intersection, denoted by
and
. They can be derived by solving for
when
. Doing so yields
Depending on the configuration of the marginal information, there may also be a single point of intersection between the benchmark and AAI curves.
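The intersection points can be found by equating the two parabolas and taking square roots. The sketch below is a hedged illustration rather than the paper's own expression: the function name and the relative margins are hypothetical, and it assumes the AAI curve is n·c·(P1 − p.1)² with c = p1./(p2. p.1 p.2) and the benchmark curve is 4n(P1 − 0.5)², so that n cancels and the solutions depend only on the relative margins.

```python
import math


def intersection_points(p1r, p1c):
    """Values of P1 where the AAI and benchmark curves meet (illustrative).

    Solves c * (P1 - p1c)**2 == 4 * (P1 - 0.5)**2; the sample size cancels.
    Roots are returned even if they fall outside the permissible range of P1.
    """
    p2r, p2c = 1 - p1r, 1 - p1c
    c = p1r / (p2r * p1c * p2c)
    s = math.sqrt(c)

    # Branch s * (P1 - p1c) = -2 * (P1 - 0.5) always has a solution.
    points = [(s * p1c + 1) / (s + 2)]

    # Branch s * (P1 - p1c) = +2 * (P1 - 0.5) degenerates when s == 2,
    # leaving a single point of intersection (or identical curves).
    if not math.isclose(s, 2.0):
        points.append((s * p1c - 1) / (s - 2))
    return sorted(points)


# Hypothetical relative margins: p1. = 0.4 and p.1 = 0.3.
points = intersection_points(0.4, 0.3)
print(points)
```

Only those roots that lie inside the permissible range of the conditional proportion are visible when the two curves are plotted together, which is one way a single point of intersection can arise in practice.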
6. Application 2: Selikoff’s Asbestosis Data
In 1963, a study was conducted that involved collecting data from 1117 insulation workers in New York. This landmark epidemiological study, and its findings published by Irving Selikoff in 1981 [
49], established the link between long-term occupational exposure to asbestos fibers and the severity of asbestosis the workers were diagnosed with. This data, summarized in
Table 4, has also been a topic of statistical discussion by [
53,
54], where the latter studied the asbestosis data in terms of the AAI. For
Table 4,
and
so that, unlike the columns, the row marginal relative frequencies are notably different from
when compared with the marginal relative frequencies of
Table 2.
A study of
Table 4 shows that, when the cell frequencies are known, a chi-squared test of independence yields a p-value that is less than 0.0001. Therefore, the association between the two dichotomous variables of
Table 4 is statistically significant. In fact, this association is positive (confirmed by testing the correlation between the variables), a conclusion which helps to confirm Selikoff’s now famous “20-year rule” [
49](p. 948); this “rule” reflects the finding that workers who were exposed to asbestos fibers for at least 20 years are at a higher risk of being diagnosed with asbestosis than workers who were exposed to the fibers for less than 20 years.
Suppose that the joint cell frequencies of
Table 4 are assumed unknown. Given the marginal information of the data the AAI is
(at the 5% level of significance) with
and
showing the association is slightly more likely to be positive than negative, given the marginal information of
Table 4. Such a very high AAI value indicates that, given only the marginal information, it is highly likely that the association between the years of occupational exposure and whether a worker is diagnosed with asbestosis is statistically significant at the 5% level of significance. The magnitude of the AAI may in fact be due to the large sample size. However, it is also possible that the distribution of the marginal information is very informative when making a conclusion about the association based solely on this information. To investigate how informative the marginal information is, we shall determine the AII of
Table 4.
By considering (4) and (7) for
Table 4, the AAI and benchmark curves are defined by the Pearson chi-squared statistics
and
respectively.
Figure 4 provides a graphical depiction of these benchmark and AAI curves. Using (2), the bounds of $P_1$ for the AAI curve are
and the points of intersection between the AAI curve and the benchmark curve exist at
= 0.51 and
= 0.56. It is apparent from
Figure 4 that the area under the AAI curve, based on the marginal information in
Table 4, is vastly different from the area under the benchmark curve. In fact, D = 262.44 while M = 371.67 giving an AII of
Therefore, the configuration of the marginal information of
Table 4 suggests that it is informative for helping to detect the association structure of the two dichotomous variables when the cell frequencies are unknown.
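As a quick arithmetic check of the values of D and M reported above, the index can be recomputed as a simple ratio. This sketch assumes that (11) is the ratio of the numerator D to the denominator M expressed on a 0 to 100 scale, as the description of D and M suggests; the variable names are illustrative.

```python
# Areas reported in the text for the asbestosis data.
D = 262.44  # area between the benchmark and AAI curves
M = 371.67  # maximum possible area

# Assumed form of the index: ratio of the two areas on a 0-100 scale.
aii = 100 * D / M
print(round(aii, 2))
```

Under that assumption the index is roughly 70.6, that is, the margins account for a little over two thirds of the maximum possible departure from the benchmark.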
7. Discussion
This paper presents the development of a new index, the aggregate informative index (AII), that quantifies on a [0, 100] scale how informative the marginal information of a 2x2 contingency table is for detecting a statistically significant association between the variables. The calculation of the AII is shown to depend only on the relative marginal frequencies and is independent of the sample size. We have also shown that the AII is highly responsive to changes in the configuration of the relative marginal frequencies.
Future development of the AII can see it expanded for assessing how informative the marginal information of an IxJ contingency table is, where I > 2 and J > 2. Formalizing the mathematical links between the AII and the AAI also requires attention. At present, the AII is expressed in terms of the conditional proportion, $P_1$, although we see no reason why other measures cannot be considered to quantify the AII. These include the classic odds ratio or a more general linear transformation of $P_1$. Such extensions would supplement the work on the AAI by [
41,
43].
Author Contributions
Conceptualization, S.C.; methodology, S.C., E.J.B and I.L.H.; software, E.J.B.; validation, S.C., E.J.B. and I.L.H.; formal analysis, S.C. and E.J.B.; investigation, E.J.B. and I.L.H.; writing—original draft preparation, S.C.; writing—review and editing, E.J.B. and I.L.H.; visualization, E.J.B.; supervision, E.J.B. and I.L.H. All authors have read and agreed to the published version of the manuscript.