Consistency of breast density categories in serial screening mammograms: A comparison between automated and human assessment
A computerized density measure provides more consistent readings than radiologists
In light of the subjectivity and variation that Sprague et al. had recently described for radiologist density assessment, a more reliable measure of density could prove useful. New research from Katarina Holland and a team from The Netherlands could help to find such an alternative. They set out to explore how much variation there is between radiologists when reading serially acquired mammograms and whether an automated density assessment would provide better consistency. From the Dutch breast screening program, 500 women were randomly selected for whom two sets of mammograms had been acquired: one “prior” set and one “current” set, with an average 30-month interval between them. Four radiologists individually scored the mammograms using BI-RADS 4th edition (while blinded to whether a mammogram was a “current” or a “prior”). In addition, to better replicate clinical practice, a “group reading” was performed. In these group readings, each mammogram set was scored by randomly selecting one of the four radiologists’ scores and assigning it to the study; the intention was to emulate the fact that serial mammograms are usually read by different radiologists. Finally, all mammograms were measured using Volpara 1.5.0, producing a Volpara Density Grade (VDG) for each study. Density for both the radiologists and Volpara was assessed on either a two-category (“fatty” versus “dense”) or a four-category (BI-RADS/VDG 1, 2, 3 or 4) scale.
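The “group reading” described above can be illustrated with a minimal Python sketch. It is not the study's code: the function names, the example scores and the random seed are hypothetical, and the mapping of categories 1 and 2 to “fatty” and 3 and 4 to “dense” is an assumption, since the summary does not spell out the cut-off.

```python
import random


def group_reading(scores_by_exam, rng):
    """Emulate a group reading: for each exam, keep the score of one
    randomly chosen radiologist out of those who read it."""
    return [rng.choice(exam_scores) for exam_scores in scores_by_exam]


def to_two_category(score):
    """Collapse a four-category score to two categories.
    Assumption: 1-2 count as "fatty", 3-4 as "dense"."""
    return "fatty" if score <= 2 else "dense"


# Hypothetical BI-RADS scores from four readers for three women.
prior_scores = [[2, 2, 3, 2], [1, 1, 1, 2], [4, 3, 4, 4]]
current_scores = [[2, 3, 2, 2], [1, 1, 1, 1], [3, 3, 4, 3]]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
prior_group = group_reading(prior_scores, rng)
current_group = group_reading(current_scores, rng)

# A woman "changes category" when her two-category label differs between exams.
changed = [
    to_two_category(p) != to_two_category(c)
    for p, c in zip(prior_group, current_group)
]
```

Because the prior and current scores are drawn independently, the two exams of one woman will often come from different readers, which is the clinical situation the group reading was meant to mimic.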
Volpara produced a significantly higher proportion of women who did not exhibit a change between the two density categories (90.4% of women) compared to the group reading of radiologists (86.8%). This may reflect the fact that Volpara produced more consistent density readings than radiologists did, an idea supported by the fact that Volpara’s agreement with its own readings between serial exams was significantly higher than the agreement between the group radiologist readings. On a two-category scale, Volpara maintained a kappa agreement value of 0.8 across screening exams, while the group radiologist readings only had a kappa of 0.7; on a four-category scale the kappa values were 0.85 and 0.75 respectively. When women did exhibit a density change between screens, most instances of change were from the “dense” to the “fatty” category (this happened in approximately 70% of cases of density change). This would be expected: over time, the dense fibroglandular tissue in the breast involutes and is replaced by fat, so the overall density decreases. Change in the other direction (“fatty” to “dense”) occurred as well, though less frequently; this could potentially be caused by use of hormone replacement therapy, weight loss or measurement error. There was no significant difference in the direction of density change between Volpara measurements and those of the radiologists.
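For readers unfamiliar with kappa, the following sketch shows how an agreement value between prior and current density categories can be computed. It uses plain (unweighted) Cohen's kappa as an assumption; the summary does not say whether the study used a weighted variant on the four-category scale. The function name and the example labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(first, second):
    """Unweighted Cohen's kappa between two equally long label sequences."""
    if len(first) != len(second) or not first:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(first)
    # Observed agreement: fraction of cases with identical labels.
    observed = sum(a == b for a, b in zip(first, second)) / n
    # Chance agreement: product of the marginal frequencies, summed per category.
    first_counts = Counter(first)
    second_counts = Counter(second)
    expected = sum(
        first_counts[cat] * second_counts[cat] for cat in first_counts
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both sequences use one identical category throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical two-category labels for four women at two screening rounds.
prior = ["dense", "dense", "fatty", "fatty"]
current = ["dense", "fatty", "fatty", "fatty"]
kappa = cohens_kappa(prior, current)  # observed 0.75, chance 0.5 -> 0.5
```

Kappa corrects the raw share of unchanged categories for the agreement expected by chance, which is why it is reported alongside the percentages above rather than in place of them.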
Overall, the indication was that an automated computer measurement such as Volpara provides better consistency than density readings done by radiologists (when emulating clinical screening practice). Thus, the use of an automated density measurement algorithm could prove desirable in screening practice, not only because of improved efficiency in terms of time and cost but also because of enhanced reliability. Furthermore, this indicates that Volpara could prove useful in temporal measurements of breast density. Such measurements could be valuable for examining breast cancer risk by tracking age-related involution, and for monitoring response to adjuvant or neoadjuvant hormonal therapy, which can be reflected in breast density.