Introduction
Breast cancer is the most common cancer diagnosed in women worldwide (1). Prevention and screening have decreased breast cancer mortality by 3 to 35%, depending on the country and the study (2, 3). However, mammography screening has limitations such as missed cancers and interval cancers (4), overdiagnosis resulting in overtreatment (5), subjective and variable human interpretation (6, 7), and workload challenges (8, 9).
Over the last few years, with the advent of deep learning and convolutional neural networks, artificial intelligence (AI) for medical research has advanced. Several studies have evaluated the performance of AI in mammography screening. According to Rodriguez-Ruiz et al. (10), AI algorithms performed better than average radiologists but worse than expert radiologists. Additionally, 94% of AI systems were found to be less accurate than radiologists according to a British Medical Journal study (11). On the other hand, AI assessment combined with radiologist expertise seems to be more efficient than either AI alone or the radiologist alone (12, 13), as shown by Watanabe et al. (14).
The expected benefits of AI are an overall improvement in radiologist performance, particularly that of average readers (10), an aid in the diagnosis of subtle cancers, and a reduction in reading time (15). The limitations of AI use are the lack of consideration of clinical data and previous mammograms (16) and the persistence of false positives leading to overalerting and overdiagnosis (11).
To our knowledge, only one independent Swedish study has evaluated the performance of three commercialized artificial intelligence algorithms as independent mammography readers compared with radiologists. That study showed that one of the AI algorithms was more efficient than radiologists alone (17).
Currently, in Europe, several companies offer breast cancer detection software that has obtained a CE (European Conformity) marking. In clinical use, AI systems target lesions and suggest cancer risk categories, generally based on three levels of risk. The intermediate category often concentrates a large proportion of false-positive lesions, which may affect radiologist interpretation. It therefore seems necessary to improve this categorization by determining a threshold that allows effective detection while limiting false positives. Furthermore, these systems are rarely required to locate the lesions precisely.
The main aim of the current study was to compare the performance of three AI algorithms on the same dataset at a freely chosen operating threshold. The secondary outcome was to evaluate performance variations when the operating threshold was modified.
Material and methods
Data selection and sample size
The study sample was extracted retrospectively from a dataset of Valenciennes Hospital (France). The oldest exam dates back to June 2012 and the most recent to March 2020. All mammograms were acquired with a Hologic Selenia 3D Dimension® system. The dataset only included screening exams. All patients were eligible for this retrospective institutional review board (IRB)-approved study (IRB number, CRM-2304-335). Written informed consent was waived by the IRB.
The included women were aged 40 to 74 years, were asymptomatic, had no personal history of breast cancer, and had undergone a complete screening examination prior to diagnosis. Women with multifocal cancers were excluded.
Mammograms and patients’ medical records were reviewed by an expert radiologist with 15 years of experience in breast imaging and 6 years of experience as a second reader in the French organized screening program. He checked the inclusion criteria, marked suspicious cancer lesions and assigned a BI-RADS score (American College of Radiology classification) (18).
A total of 314 bilateral mammograms that met the inclusion criteria were randomly selected for inclusion in the study. The sample was enriched in cancer cases, reaching a prevalence of 19.6% (Table 1).
Malignant cases comprised 60.2% masses, 26% calcifications, 5.7% focal asymmetries and 8.1% architectural distortions. Of the cancerous lesions, 89.4% were initially classified as BI-RADS 4 (suspicious) or 5 (highly suggestive of malignancy).
The gold standard was defined by histology, i.e., a positive biopsy, for cancer cases and by a negative control mammogram at 2 years for noncancer cases.
AI system
Among six available AI programs with 2D models, three agreed to take part in the study (Incepto, Therapixel, Hera-Mi) and three did not wish to participate or were unable to do so for technical or logistical reasons (I-CAD, Hologic, Lunit).
The following AI programs were used in the study:
- Transpara v.1.7.3 from the French company Incepto© and developed by the Dutch company Screenpoint©; this program circumscribes the lesion and gives a region score from 1 to 98 and a global risk category per patient: low risk if the region score is between 1 and 43, intermediate risk if the score is between 43 and 75 and high risk if the score is between 75 and 98.
- Mammoscreen™ v.1.2 from the French company Therapixel©; this program targets the lesion and gives a malignancy score on a scale from 1 to 10 per lesion and per breast. Three categories are identified: low risk from 1 to 4, intermediate risk from 5 to 7 and high risk from 8 to 10.
- Breast-SlimView® v1.8.0 from the French company Hera-Mi©; this program generates a synthetic image by blurring the normal breast to highlight the suspect zone.
All of these programs have received a CE marking.
For statistical analysis, the AIs were anonymized and randomly named AI 1 to 3.
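As an illustration only, the sketch below shows how the scores described above could map onto the three risk categories; the handling of boundary scores (e.g., 43 and 75 for Transpara) is an assumption of ours, not taken from vendor documentation.

```python
def transpara_category(region_score: int) -> str:
    """Map a Transpara-style region score (1-98) to a patient risk category.

    Cut-offs follow the ranges quoted above; the side on which the boundary
    values (43, 75) fall is assumed, not documented.
    """
    if region_score < 43:
        return "low"
    if region_score < 75:
        return "intermediate"
    return "high"


def mammoscreen_category(score: int) -> str:
    """Map a Mammoscreen-style malignancy score (1-10) to a risk category."""
    if score <= 4:
        return "low"
    if score <= 7:
        return "intermediate"
    return "high"
```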
Study design
The 314 anonymized mammograms from the dataset were processed by the AI vendors in January 2023. Each mammogram consisted of 2-view full-field digital mammography (FFDM) of each breast, without tomosynthesis or any other clinical information.
The dataset was available on a shared space with secure access for AI algorithm processing. The results were returned within 48 hours of the download link being sent.
Each algorithm defined an optimal cancer detection threshold for screening to distinguish positive from negative lesions. A lesion was considered positive if its score was above the threshold. If several lesions were positive in the same breast, only the most suspicious lesion was considered. A summary spreadsheet reported the results for each mammographic view, including the label (1 if there was a positive lesion, 0 if not), a probability score between 0 and 1, and the coordinates of the positive lesion. A screenshot of each case was also provided with the positive lesion marked.
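A minimal sketch of this labeling step is given below, assuming each spreadsheet row carries a breast identifier, a probability score between 0 and 1, and lesion coordinates; the field layout and function names are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical row format: (breast_id, probability_score, (x_mm, y_mm))
Lesion = Tuple[str, float, Tuple[float, float]]


def label_breasts(lesions: List[Lesion], threshold: float) -> Dict[str, dict]:
    """Keep only the most suspicious lesion per breast and apply the
    vendor-defined operating threshold (label 1 if above it, 0 otherwise)."""
    per_breast: Dict[str, dict] = {}
    for breast_id, score, coords in lesions:
        best = per_breast.get(breast_id)
        if best is None or score > best["score"]:
            per_breast[breast_id] = {"score": score, "coords": coords}
    for entry in per_breast.values():
        entry["label"] = 1 if entry["score"] > threshold else 0
    return per_breast
```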
Data analysis
A reviewer performed the data verification and analysis.
Vendor spreadsheets were compared to the ground truth file with the aid of a Python script. The coordinates of the positive lesions were also validated with the script, with a 15 mm deviation tolerance from the center of the marked lesion.
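The coordinate check can be reproduced, for instance, with a simple Euclidean distance test against the expert's mark; the 15 mm tolerance matches the study, while the assumption that coordinates are expressed in millimetres in a shared frame of reference is ours.

```python
import math

TOLERANCE_MM = 15.0  # deviation tolerance from the center of the marked lesion


def matches_ground_truth(ai_xy, expert_xy, tolerance=TOLERANCE_MM):
    """Return True if the AI-marked point lies within `tolerance` mm of the
    expert-marked lesion center (both points given as (x, y) in mm)."""
    return math.hypot(ai_xy[0] - expert_xy[0], ai_xy[1] - expert_xy[1]) <= tolerance
```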
The results were verified with the screenshots provided.
The results were reported in tables by transcribing the number of true positives (TPs), true negatives (TNs), false positives (FPs) and false negatives (FNs).
The analysis was performed per breast, i.e., cancer lesions were considered correctly classified if they were correctly marked on at least one view of the breast.
Statistical analysis
Categorical data are presented as numbers (percentages). Continuous variables are presented as medians (ranges). All statistical analyses were blinded. The diagnostic performance of each algorithm was evaluated on a per-breast basis using sensitivity (Se), specificity (Sp), positive predictive value (PPV), negative predictive value (NPV), and accuracy (Acc). These metrics were compared using a two-sided chi-square test. All statistical analyses were performed using SPSS version 23 and MedCalc. A p value < 0.05 was considered significant. Because the classes were unbalanced, balanced accuracy was calculated as the average of sensitivity and specificity.
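For reference, the reported metrics follow directly from the per-breast confusion counts; the sketch below reproduces these standard definitions in Python (independent of SPSS and MedCalc, which were used for the actual analysis).

```python
def diagnostic_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute the per-breast performance metrics used in the study."""
    se = tp / (tp + fn)                    # sensitivity
    sp = tn / (tn + fp)                    # specificity
    ppv = tp / (tp + fp)                   # positive predictive value
    npv = tn / (tn + fn)                   # negative predictive value
    acc = (tp + tn) / (tp + tn + fp + fn)  # accuracy
    bal_acc = (se + sp) / 2                # balanced accuracy for unbalanced classes
    return {"Se": se, "Sp": sp, "PPV": ppv, "NPV": npv,
            "Acc": acc, "BalancedAcc": bal_acc}
```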
Discussion
The main finding of this study was that there were significant differences between AI systems, depending on both the intrinsic performance of the algorithms and the threshold chosen.
If we consider all the performance parameters of the breast cancer screening algorithms at the initially chosen threshold, AI 1 achieved a compromise between an acceptable detection rate (Se = 74%) and a reasonable specificity (Sp = 79%), generating a moderate PPV (46.2%) that would partly limit overalerting.
In contrast, AI 2 had excellent accuracy, based on its high specificity (98.4%) and PPV (88.9%), but a low sensitivity in the context of cancer screening (52%).
Finally, AI 3 had a moderate sensitivity (69.9%), but its low specificity (45.4%) and PPV (23.8%) had a definite impact on the number of false positives and thus led to overalerting. Indeed, such false-positive lesions hinder radiologist interpretation and, in clinical practice, could lead to overmedicalization, loss of time and decreased radiologist confidence in the system.
Overall, the sensitivity rates of the AI algorithms appeared lower than those reported in the literature, ranging from 67% to 81.9% at a fixed specificity of 96.6% in the Swedish study (18) and 96.2% compared to a radiologist specificity of 66.9% in Lotter’s study (19). However, a retrospective study published in Nature found lower sensitivities for AI systems (56%) at a specificity of 84% (20), which is comparable to our study.
These rates could be explained by the fact that the dataset came from an expert center and contained several cancers that were difficult to detect, such as two cancers visible only on tomosynthesis, initially classified as BI-RADS 2, and 11 cancers classified as BI-RADS 3. Moreover, the correct location of lesions relative to those marked by the expert radiologist was required, whereas in recent studies the analysis was performed only per positive mammogram.
Sensitivity depends on false negatives. In our study, these can be divided into two categories. The first category consists of lesions that were detected by the AI algorithms but classified as negative because their score was below the chosen threshold, as shown in Figure 1, where two AI algorithms marked the right lesion but were considered misclassified. The second category includes lesions that were not detected by the algorithm at all, as shown in Figure 2, where none of the algorithms found the right lesion.
While most studies impose a fixed value of sensitivity or specificity to compare algorithm performance (10, 19, 21), the choice of the “optimal” threshold for each algorithm was left to the vendors. To choose it, vendors target a recall rate of approximately 20 to 40% for a cancer-enriched cohort. Currently, the recall rate in the United States is slightly above 10% in a screening population (22).
The choice of the threshold affects AI performance. By lowering AI 2’s threshold, sensitivity was significantly improved, reaching values not significantly different from those of AI 1 at a threshold of 0.5 (Table 4). In retrospect, after discussion with AI 2’s vendor, the threshold should have been set lower than the one chosen in order to achieve the best compromise between sensitivity and specificity. This example illustrates the impact of the threshold choice. We acknowledge that a “perfect model” that would detect all cancers without generating any false-positive results does not exist.
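To illustrate how the operating point shifts, sensitivity and specificity can be recomputed for a range of candidate thresholds applied to the continuous per-breast scores; the scores and labels in the sketch below are placeholders, not study data.

```python
import numpy as np


def se_sp_at_thresholds(scores, truth, thresholds):
    """Recompute sensitivity and specificity for each candidate threshold.

    `scores` are per-breast probability scores in [0, 1]; `truth` contains 1
    for cancer breasts and 0 for normal breasts (placeholder data only).
    """
    scores = np.asarray(scores, dtype=float)
    truth = np.asarray(truth, dtype=int)
    results = []
    for t in thresholds:
        pred = scores > t
        tp = int(np.sum(pred & (truth == 1)))
        fn = int(np.sum(~pred & (truth == 1)))
        tn = int(np.sum(~pred & (truth == 0)))
        fp = int(np.sum(pred & (truth == 0)))
        se = tp / (tp + fn) if (tp + fn) else float("nan")
        sp = tn / (tn + fp) if (tn + fp) else float("nan")
        results.append((t, se, sp))
    return results
```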
A strength of our study is that we conducted a clinical evaluation of several currently marketed AI solutions under identical conditions (including the 48-hour deadline for the return of results), with the correctness of the results verified against the screenshots provided.
However, this was a retrospective, single-center study with a cancer-enriched dataset and an analysis of 2D mammograms only. Its limitations are the low number of cases in the database and an artificially high prevalence due to the cancer-enriched cohort, which differs from an actual screening setting. Furthermore, only 2D AI models were evaluated, although studies have shown the usefulness of tomosynthesis, which increases the number of cancers detected in radiologists’ clinical practice (23), and the importance of medical history (17). AI algorithms integrating 3D analysis and prior mammograms are currently being developed and will require further studies.
At present, AI tools for breast cancer screening are intended mainly for centers performing routine mammography, as an aid to vigilance. They could improve practices and performance depending on the expertise and experience of the radiologist, which may vary from center to center. However, the added value for an expert center remains to be demonstrated.
Real-life conditions may introduce new biases, such as automation bias, whereby an AI system could influence the decisions made by radiologists and impair their performance (24). This tendency has already been observed with the use of CAD systems, which decreased the sensitivity of radiologists (25, 26).
Despite promising performance, the place of AI models in patient care remains to be determined, as highlighted in the state of the art by Sechopoulos et al. (27), and will need to be established through large-scale prospective studies at several centers under actual screening conditions (28). Recent prospective studies seem to demonstrate the non-inferiority of AI-supported screening compared with standard double reading, with a lower workload (29). This raises the question of whether screening will evolve toward replacement of the second reading.