3.1. ANOVAs to Compare Memory and Metamemory Outcomes Across Conditions (These Analyses Are Not Related to Study Hypotheses But Are Included for Demonstrating the Validity of Our Data)
Recall.
Table 1 shows the mean scores from the memory test. As shown, performance was low for both the word list (M = 8.47, SD = 6.60, 28.23%) and the picture list (M = 9.97, SD = 7.20, 33.23%). Nevertheless, recall was higher for the pictures than for the words, consistent with the well-established picture superiority effect (e.g., Paivio & Csapo, 1973) [26]. One reason for the low recall level was that some participants recalled no items. Because excluding these participants did not change the subsequent correlational analyses, they were retained. The participants with zero recall were evenly distributed across conditions: 11 for immediate word recall, 17 for delayed word recall, 11 for immediate picture recall, and 12 for delayed picture recall. A 2 (list type: words and pictures) x 2 (judgment type: immediate and delayed) repeated-measures ANOVA showed that the main effect of list type was significant, F(1, 106) = 4.38, p = .04, ηp2 = .04, indicating that recall was higher for the pictures (M = 4.94, SD = 3.58) than for the words (M = 4.23, SD = 3.30). Further, the main effect of judgment type was significant, F(1, 106) = 17.72, p < .001, ηp2 = .14, indicating that recall was higher for the immediate (M = 4.94, SD = 3.15) than for the delayed judgment type (M = 4.24, SD = 3.03). The interaction was not significant, F(1, 106) = 0.01, p = .93, ηp2 = .00.
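For readers who want to check the effect sizes, partial eta squared can be recovered from a reported F statistic and its degrees of freedom using the identity ηp2 = (F x df_effect) / (F x df_effect + df_error). The sketch below is only an illustration of that identity; the function name is hypothetical and it is not the analysis code used in the study.

```python
def partial_eta_squared(f_value: float, df_effect: int, df_error: int) -> float:
    """Recover partial eta squared from an F statistic and its degrees of freedom.

    Hypothetical helper for illustration; follows from
    eta_p^2 = SS_effect / (SS_effect + SS_error).
    """
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# Main effect of list type on recall: F(1, 106) = 4.38
print(round(partial_eta_squared(4.38, 1, 106), 2))   # 0.04

# Main effect of judgment type on recall: F(1, 106) = 17.72
print(round(partial_eta_squared(17.72, 1, 106), 2))  # 0.14
```

Both values agree with the effect sizes reported for the recall ANOVA.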
JOLs. Next, JOL ratings were analyzed in terms of actual ratings, relative accuracy, and absolute accuracy for immediate and delayed judgments. As Table 1 shows, the mean JOL ratings were higher for the immediate than for the delayed judgments for both the words (immediate = 54.28, SD = 22.53 versus delayed = 33.57, SD = 19.08) and pictures (immediate = 56.85, SD = 20.38 versus delayed = 39.58, SD = 20.13). A 2 (list type: words and pictures) x 2 (judgment type: immediate and delayed) repeated-measures ANOVA showed that the main effect of list type was significant, F(1, 107) = 11.14, p = .001, ηp2 = .09, indicating that the ratings were higher for the pictures (M = 48.22, SD = 15.91) than for the word list (M = 43.93, SD = 17.13). Further, the main effect of judgment type was significant, F(1, 107) = 99.50, p < .001, ηp2 = .48, indicating that the ratings were higher for the immediate (M = 55.57, SD = 18.85) than for the delayed judgment (M = 36.58, SD = 17.26). The interaction was not significant, F(1, 107) = 1.54, p = .22, ηp2 = .01.
Relative accuracy was computed as the Goodman-Kruskal gamma correlation between JOL ratings and recall for each participant, which measures whether higher JOL ratings were associated with higher recall regardless of the absolute level of the ratings. For example, gamma can be the same whether an item is rated 30% or 60%, as long as that item is more likely to be recalled than items that are rated lower; what matters is that higher ratings go with higher recall and lower ratings go with lower recall. Like other correlational measures, gamma ranges from -1 to +1, with +1 indicating a perfect positive association between ratings and recall. One weakness of gamma is that it is undefined when the denominator of the formula is zero (for example, when a participant recalls all items or none), and participants with undefined gamma therefore have to be excluded from the analysis.
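The gamma computation described above can be sketched as follows. This is a minimal illustration, not the authors' analysis script: the function name and the example data are hypothetical, recall is assumed to be coded 1 (recalled) or 0 (not recalled), and pairs tied on either variable are ignored, as in the standard definition gamma = (C - D) / (C + D), where C and D are the numbers of concordant and discordant item pairs.

```python
from itertools import combinations
from typing import Optional, Sequence


def goodman_kruskal_gamma(jols: Sequence[float],
                          recalled: Sequence[int]) -> Optional[float]:
    """Goodman-Kruskal gamma for one participant (hypothetical helper).

    Returns None when gamma is undefined, i.e., when there are no
    concordant or discordant pairs and the denominator is zero.
    """
    concordant = discordant = 0
    for (j1, r1), (j2, r2) in combinations(zip(jols, recalled), 2):
        product = (j1 - j2) * (r1 - r2)
        if product > 0:
            concordant += 1      # higher JOL goes with higher recall
        elif product < 0:
            discordant += 1      # higher JOL goes with lower recall
        # product == 0: tie on JOL or on recall, pair is ignored
    if concordant + discordant == 0:
        return None
    return (concordant - discordant) / (concordant + discordant)


# Illustrative data: same item ordering, different absolute rating levels
print(goodman_kruskal_gamma([80, 60, 30, 20], [1, 1, 0, 0]))  # 1.0
print(goodman_kruskal_gamma([60, 40, 20, 10], [1, 1, 0, 0]))  # 1.0

# Zero recall: every pair is tied on recall, so gamma is undefined
print(goodman_kruskal_gamma([50, 60, 70], [0, 0, 0]))         # None
```

The first two calls show that gamma depends only on the ordering of the ratings, and the third shows why participants with zero recall drop out of the gamma analysis.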
As shown in Table 1, gamma was higher for the delayed JOLs than for the immediate JOLs for both the words (immediate = .33, SD = .49 versus delayed = .82, SD = .41) and the pictures (immediate = .39, SD = .36 versus delayed = .86, SD = .24), showing a well-established phenomenon referred to as the delayed JOL effect (e.g., Dunlosky & Nelson, 1992) [28]. A 2 (list type: words and pictures) x 2 (judgment type: immediate and delayed) repeated-measures ANOVA showed that the main effect of list type was not significant, F(1, 70) = 1.19, p = .28, ηp2 = .02, indicating that gamma was similar between the words (M = .57, SD = 0.28) and the pictures (M = .62, SD = 0.22). The main effect of judgment type was significant, F(1, 70) = 95.83, p < .001, ηp2 = .58, indicating that gamma was higher for the delayed (M = .84, SD = 0.24) than for the immediate judgment (M = .36, SD = 0.29). The interaction was not significant, F(1, 70) = 0.04, p = .84, ηp2 = .001.
Absolute accuracy was measured by averaging each participant's JOL ratings (0% to 100%) and subtracting that participant's percent recall from the mean JOL, with both values expressed on a 0% to 100% scale. A value of zero means that calibration was perfect (that is, the mean JOL rating matched recall). Note that this signed difference is problematic when averaged across participants: if one participant overestimates recall and another underestimates it, the errors cancel and the group appears to be accurate. To avoid this shortcoming, the absolute values of the difference scores were used to compare conditions, so that positive and negative values could not cancel each other out.
Table 2 shows mean absolute JOL accuracy, and as shown, absolute accuracy was higher (i.e., a lower mean closer to zero) for the delayed judgments than for the immediate judgments for both the words (immediate = 36.45, SD = 23.83 versus delayed = 14.31, SD = 13.06) and the pictures (immediate = 33.72, SD = 24.26 versus delayed = 14.34, SD = 11.37), again showing the delayed JOL effect. A 2 (list type: words and pictures) x 2 (judgment type: immediate and delayed) repeated-measures ANOVA showed that the main effect of list type was not significant, F(1, 106) = 0.80, p = .37, ηp2 = .01, indicating that absolute JOL accuracy was similar between the words (M = 25.48, SD = 15.45) and the pictures (M = 24.03, SD = 15.02). The main effect of judgment type was significant, F(1, 106) = 160.26, p < .001, ηp2 = .60, indicating that accuracy was higher for the delayed (M = 14.33, SD = 9.96) than for the immediate judgment (M = 35.19, SD = 19.24). The interaction was not significant, F(1, 106) = 0.99, p = .32, ηp2 = .01.
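The absolute accuracy score, and the reason absolute values are taken before averaging across participants, can be illustrated with a short sketch. The function names and the two example participants are hypothetical, and recall is assumed to be coded 1 (recalled) or 0 (not recalled) per item.

```python
from statistics import mean
from typing import Sequence


def signed_bias(jols: Sequence[float], recalled: Sequence[int]) -> float:
    """Mean JOL (0-100) minus percent recall (0-100) for one participant."""
    percent_recall = 100 * sum(recalled) / len(recalled)
    return mean(jols) - percent_recall


def absolute_accuracy(jols: Sequence[float], recalled: Sequence[int]) -> float:
    """Unsigned deviation from perfect calibration; 0 means a perfect match."""
    return abs(signed_bias(jols, recalled))


# Two made-up participants: one overconfident, one underconfident
over = ([70, 50], [1, 0])    # mean JOL 60, recall 50%
under = ([30, 50], [1, 0])   # mean JOL 40, recall 50%

# Signed scores cancel at the group level and suggest perfect accuracy
print(mean([signed_bias(*over), signed_bias(*under)]))              # 0.0

# Absolute scores keep each participant's 10-point error
print(mean([absolute_accuracy(*over), absolute_accuracy(*under)]))  # 10.0
```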
Global JOLs. For global JOL ratings, participants were asked to predict how many items they would recall out of 30 items. Note that for the global judgments, the items used for immediate and delayed JOLs were not separated.
Table 2 shows that global JOL ratings were similar between words (M = 8.81, SD = 5.04) and pictures (M = 9.28, SD = 5.74), t(107) = 0.91, p = .37, d = 0.09. Accuracy of global JOLs was measured by subtracting the number of items correctly recalled from the global JOL for each participant and taking the absolute value of the difference to eliminate the sign. The results showed that accuracy was similar between the words (M = 4.74, SD = 4.05) and pictures (M = 5.06, SD = 3.94), t(106) = 0.63, p = .53, d = 0.06.
RCJ. RCJ ratings were higher for the pictures (M = 2.40, SD = 0.87) than for the words (M = 2.18, SD = 0.78), t(107) = 2.86, p < .01, d = 0.28, reflecting the picture superiority effect (see Table 2).
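The d values reported for the paired comparisons are numerically consistent with the within-subject effect size d = t / sqrt(n), where n = df + 1. The text does not state which formula was used, so the sketch below is an inference from the reported numbers rather than a description of the authors' method; the function name is hypothetical.

```python
import math


def cohens_dz(t_value: float, df: int) -> float:
    """Effect size for a paired t test, assuming d = t / sqrt(n), n = df + 1."""
    n = df + 1
    return t_value / math.sqrt(n)


# RCJ ratings, pictures versus words: t(107) = 2.86
print(round(cohens_dz(2.86, 107), 2))  # 0.28

# Global JOL ratings, words versus pictures: t(107) = 0.91
print(round(cohens_dz(0.91, 107), 2))  # 0.09
```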
3.2. Correlations Between IORQ and Memory and Metamemory Outcomes
Correlations. The following variables were correlated with the IORQ overall scores (based on 15 items) to examine whether there is a relationship between attitudes toward remembering and other memory/metamemory variables: (1) word recall on immediate test, (2) word recall on delayed test, (3) picture recall on immediate test, (4) picture recall on delayed test, (5) immediate JOL ratings on words, (6) delayed JOL ratings on words, (7) immediate JOL ratings on pictures, (8) delayed JOL ratings on pictures, (9) gamma scores for immediate JOL for words, (10) gamma scores for delayed JOL for words, (11) gamma scores for immediate JOL for pictures, (12) gamma scores for delayed JOL for pictures, (13) absolute accuracy for immediate JOL for words, (14) absolute accuracy for delayed JOL for words, (15) absolute accuracy for immediate JOL for pictures, (16) absolute accuracy for delayed JOL for pictures, (17) global JOL for words, (18) global JOL for pictures, (19) global JOL accuracy for words, (20) global JOL accuracy for pictures, (21) RCJ ratings for words, and (22) RCJ ratings for pictures.
The correlation matrix for these variables is presented in Table 3, which is included in the supplemental materials due to its size. Only two variables, absolute accuracy of delayed JOLs for words and absolute accuracy of delayed JOLs for pictures, showed significant relationships with the IORQ scores. For both variables, the correlations were positive and significant: for the words, r(105) = .23, p = .02, and for the pictures, r(106) = .23, p = .02. Scatter plots are shown in Figure 1 (for the words) and Figure 2 (for the pictures). Both figures show that absolute accuracy of JOL becomes lower (i.e., the score deviates more from zero) as IORQ scores increase. In other words, absolute accuracy of JOL ratings became lower as participants rated the importance of remembering higher. Nevertheless, overall, the results of these correlational analyses fail to support Hypotheses 1 through 3 and 5, which predicted that higher IORQ scores would be positively correlated with higher accuracy and performance.
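As a reading aid for the r(df) notation, a Pearson correlation can be converted to a t statistic with df = n - 2 using t = r x sqrt(df / (1 - r^2)); the p value is then taken from the t distribution with those degrees of freedom. The sketch below shows only this conversion. The function name is hypothetical, and the text does not say what software produced the reported p values.

```python
import math


def t_from_r(r: float, df: int) -> float:
    """Convert a Pearson r with df = n - 2 into the equivalent t statistic."""
    return r * math.sqrt(df / (1 - r ** 2))


# Delayed-JOL absolute accuracy for words versus IORQ: r(105) = .23
t_words = t_from_r(0.23, 105)

# Delayed-JOL absolute accuracy for pictures versus IORQ: r(106) = .23
t_pictures = t_from_r(0.23, 106)

print(f"t(105) = {t_words:.2f}, t(106) = {t_pictures:.2f}")
```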
3.3. Mental Workload Results
The NASA TLX was used to measure subjective mental workload. This instrument uses a 21-point rating scale to measure six dimensions: mental demand, physical demand, temporal demand, performance, effort, and frustration. Because the tasks in this study did not involve physical activities, physical demand was omitted. The ratings for mental demand showed a higher mean for the words (M = 14.86, SD = 4.23) than for the pictures (M = 13.47, SD = 4.70), t(107) = 3.65, p < .001, d = 0.35. The ratings for temporal demand also showed a higher mean for the words (M = 10.46, SD = 4.83) than for the pictures (M = 9.67, SD = 4.81), t(107) = 2.28, p = .03, d = 0.22. The performance dimension was reverse-scored so that a higher score indicated higher performance, and the means were similar between the words (M = 13.13, SD = 5.43) and pictures (M = 12.82, SD = 5.51), t(107) = 0.47, p = .64, d = 0.05. The ratings for effort showed a higher mean for the words (M = 13.49, SD = 4.52) than for the pictures (M = 12.80, SD = 4.73), t(107) = 2.14, p = .04, d = 0.21. Lastly, the ratings for frustration were higher for the words (M = 11.10, SD = 5.63) than for the pictures (M = 10.82, SD = 5.56). The differences were consistent with the notion that the pictures were easier to remember than the words. The reliability of the NASA TLX was modest for both the words (α = .64) and the pictures (α = .73). Table 4 shows the correlations between the IORQ ratings and the ratings on each dimension. As shown, none of the correlations were significant, failing to support Hypothesis 4.
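The reliability coefficients (α) above are presumably Cronbach's alpha computed over the five retained NASA TLX dimensions; this is an assumption, since the text gives only the symbol. Under that assumption, alpha = k / (k - 1) x (1 - sum of item variances / variance of total scores), which can be sketched as below. The function name and the example ratings are hypothetical and are not the study data.

```python
from statistics import variance
from typing import Sequence


def cronbach_alpha(scores: Sequence[Sequence[float]]) -> float:
    """Cronbach's alpha; rows are participants, columns are items (dimensions)."""
    k = len(scores[0])                                   # number of items
    item_variances = [variance(col) for col in zip(*scores)]
    total_variance = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Made-up ratings for four participants on five dimensions
ratings = [
    [15, 10, 13, 14, 11],
    [12, 9, 12, 11, 10],
    [18, 14, 15, 16, 13],
    [9, 7, 10, 8, 12],
]
print(round(cronbach_alpha(ratings), 2))
```

Alpha approaches 1 when the dimensions rise and fall together across participants and drops toward 0 when they vary independently.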