Background
There is interest in artificial intelligence algorithms (machine learning and deep learning) to automate neuroimaging analysis in an effort to improve accuracy, reduce bias and aid in clinical decision-making.1
Structural imaging analysis tools have the potential to reduce variance and improve diagnostic and prognostic inferences from MRI scans. However, the tools must be trained and validated in a manner that provides generalizability to broader populations. When a single data set or a small number of individuals is used to train the programs, the results may be overestimated. A systematic review was conducted that describes and compares the available tools, with the aim of assessing their translational potential into real-world clinical settings.2 Of the 8 tools identified, 2 were not approved for medical use and 1 had no associated references. Most of the tools were found to have been validated using a small number of cases and a single data set. The review compared the tools based on the number of validation methods conducted for each. None of the tools account for scanner-related variability resulting from differences in scanner hardware, magnetic field strength and acquisition parameters, and they therefore lack generalizability. The authors conclude that the majority of available tools make use of multivariate machine learning methods and have the potential to open up new possibilities in personalized medicine. However, they caution that results should be interpreted with vigilance due to the limitations of these studies, especially small sample sizes and poor methodology. They also caution that results must be interpreted in light of the patient’s clinical history and symptomatology.
The American Society of Functional Neuroradiology (ASFNR) and the American Society of Neuroradiology (ASNR) acknowledged the challenges with artificial intelligence in neurology and created an Artificial Intelligence Workshop Technology Workgroup.3 This group published a critical appraisal of artificial intelligence (AI)-enabled imaging tools using the levels of evidence system in the American Journal of Neuroradiology. They call for critical appraisal of AI-enabled imaging tools throughout the life cycle from development to implementation using a systematic, standardized and objective approach that can verify both the technical and clinical efficacy of the tool. A challenge in developing AI models is access to the comprehensive and large data sets needed to train the technology. These data should represent the intended population and provide a diverse group from which results may be extrapolated. This paper provides a resource for clinicians to aid in the critical assessment of AI technologies to ensure safe and effective implementation into healthcare practices.
FDA-cleared devices as of the publication date of this LCD:
- NeuroQuant™ Medical Image Processing Software is registered as a Class II device under the FDA 510(k) pathway, intended for “automatic labeling, visualization and volumetric quantification of segmentable brain structures from a set of magnetic resonance images (MRI)”. NeuroQuant 4.0 uses the AI modalities of machine learning and deep learning to aid in identifying complex patterns in imaging data.4
- Icobrain aria is registered as a Class II device under the FDA 510(k) pathway.5 It is described as a software-only device for assisting radiologists with the detection and quantification of amyloid-related imaging abnormalities (ARIA) on brain MRI scans for patients undergoing amyloid beta-directed antibody therapy. Icobrain aria automatically processes inputs from brain MRI scans from 2 time points and calculates measures of ARIA-E (edema/sulcal effusion), namely the length of the longest axis computed from the segmented ARIA-E abnormalities and the number of brain sites affected by ARIA-E, and of ARIA-H (hemorrhage/superficial siderosis), namely the count of stable and new T2*-GRE hypo-intensities indicated as microhemorrhages or superficial siderosis.5 Using these measurements, the ARIA radiographic severity is automatically derived based on deep learning technology and reported electronically. The intended use of the device is “as a computer-assisted detection and diagnosis software to be used as a concurrent reading aid to help trained radiologists in the detection, assessment, and characterization of ARIA. The software provides information about the presence, location, size, severity and changes of ARIA-E and ARIA-H. Patient management decisions should not be made solely based on analysis by icobrain aria.” The device is not intended to replace radiologist review of images or clinical judgment and is not intended to be used to segment macrohemorrhages (diameter 10 mm or more).6
- Icobrain is registered as a Class II device under the FDA 510(k) pathway.7-10 It is intended for “automatic labeling, visualization and volumetric quantification of segmentable brain structures from a set of MRI images”. The predicate device is NeuroQuant.
- DeepBrain is registered as a Class II device under the FDA 510(k) pathway.11 The device is intended for “automatic labeling, quantification and visualization of segmentable brain structures from MRI images.” It is intended to be used by trained health professionals.
- Siemens Morphometry Analysis is registered as a Class II device under the FDA 510(k) pathway.12 This product is syngo-based post-acquisition image processing software for “viewing, manipulating, evaluating and analyzing MRI, MR-PET, CT, PET, CT-PET and MR spectra using deep learning algorithms”.
Other FDA-cleared devices may also be available but are not listed, as they were not found in the literature search.
The literature search for evidence related to quantitative analysis of brain MRI was conducted using PubMed and EBSCO with the search terms: Alzheimer’s, ARIA, imaging, artificial intelligence or AI, automated or software or computation or deep learning, machine learning or artificial neural network, multiple sclerosis, MRI and AI, brain or neurology, AI or MRI. Searches were also conducted under the names of known products commercially available in the United States, including NeuroQuant, NeuroGage, Icobrain, icobrain aria, Jung Diagnostic, quantib, Qure, and volBrain. No randomized controlled trials (RCTs) were identified. Unpublished reports, posters, abstracts, case reports and small case series were omitted from the review unless there was no other evidence available to consider. Review papers were utilized in the background and summary but not considered for evidence review. No guidelines or recommendations on the use of automated software were identified.
Alzheimer’s Disease
Amyloid beta (Aβ)-directed antibody therapies, such as aducanumab, lecanemab and donanemab, are approved in the United States (U.S.) for the treatment of patients with mild cognitive impairment or mild dementia due to Alzheimer’s disease. There is evidence of slowed disease progression and improved clinical outcomes in treated patients with mild cognitive impairment due to Alzheimer’s dementia (AD); however, it is known that Aβ-directed antibody therapies increase the risk of ARIA and ARIA-related complications.13 ARIA complications include autoimmune or inflammatory conditions, seizures, or disorders associated with extensive white matter pathology. Symptoms and signs of ARIA can include new focal neurological signs, headache, confusion, altered mental status, dizziness, nausea, vomiting, fatigue, blurred vision or vision disturbances, gait disturbance or seizure.
As novel therapies for the management of AD emerge, the need for surveillance for complications has grown. Radiologists have had to develop criteria, become familiar with the appearance of amyloid-related imaging abnormalities and develop appropriate imaging protocols, while clinicians must determine the optimal pathways for management of ARIA.14 This is an area of ongoing investigation; several studies have contributed to the current knowledge of these challenges, but the literature remains sparse.
Multiple grading schemes to determine the severity of ARIA have been proposed.15-18 The ARIA radiographic severity scale categorizes ARIA as mild, moderate or severe; this classification system was used in the pivotal clinical trials for anti-amyloid immunotherapies and by the FDA for drug approval. While it is becoming the accepted standard for classification, as of the date of this LCD it has not been published in peer-reviewed literature other than as a research poster.19,20 A comparison of the Barkhof Grand Total Scale (BGTS) with the 3- and 5-point severity scales for ARIA-E demonstrated a high degree of correlation between the scales.17
To monitor for ARIA, the Appropriate Use Criteria for aducanumab recommend MRI before the 5th, 7th, 9th and 12th infusions to improve detection.21 The criteria recommend discontinuation of aducanumab for any macrohemorrhage, more than 1 area of superficial siderosis, more than 10 microhemorrhages occurring since the initiation of treatment, more than 2 episodes of ARIA, severe symptoms of ARIA or development of any medical condition requiring anticoagulation. The protocol allows continuation of aducanumab for mild ARIA-E or ARIA-H with monthly MRI monitoring and discontinuation for worsening symptoms.
The Appropriate Use Criteria for lecanemab recommend obtaining MRI scans at baseline and prior to the 5th, 7th, 14th and 26th infusions. The authors explain that 81% of ARIA-E occurs early and resolves spontaneously within 4 months of radiographic detection.22 The protocol allows continuation of lecanemab for mild ARIA-E or ARIA-H with monthly MRI monitoring and discontinuation for worsening symptoms. Once the ARIA resolves or stabilizes, monthly imaging can be discontinued. The criteria state that the imaging should be read by knowledgeable MRI readers proficient in the detection and interpretation of ARIA or by clinicians skilled in performing lumbar puncture.
The Appropriate Use Recommendations for donanemab advise MRI prior to the 2nd, 3rd, 4th and 7th infusions, and prior to the 12th infusion in those at high risk for ARIA. The protocol allows continuation of donanemab for mild ARIA-E or ARIA-H with monthly MRI monitoring and discontinuation for worsening symptoms.6
Since ARIA is a new entity, standards for its radiographic interpretation are being developed and published. Those interpreting the imaging need education and training to ensure accuracy and consistency in reporting. The American College of Radiology, the Alzheimer’s Association, and the Radiological Society of North America all offer training and continuing medical education.19
Radiology
A retrospective report on ARIA reviewed the imaging of 262 subjects with mild to moderate AD treated in Phase 2 studies with bapineuzumab, a humanized monoclonal antibody against amyloid β. Two neuroradiologists independently reviewed 2572 MRI scans from the 262 participants. The readers were masked to the patients' treatment arm. Patients were included in the risk analysis (n=210) if they did not have evidence of ARIA-E on their pretreatment MRI, had received bapineuzumab, and had at least one MRI scan after treatment. Thirty-six patients (17%) developed ARIA-E during treatment with bapineuzumab, of whom 28 (78%) were asymptomatic and 8 were symptomatic. Fifteen (42%) of the ARIA-E cases detected on re-read of the MRIs had not been detected previously. All of these patients were asymptomatic and had fewer brain regions involved (mean 1.3, SD=0.5) than patients identified during the clinical studies (mean 2.6, SD=2.4, p=0.0193). Thirteen of the patients whose ARIA-E findings were not detected during the clinical trial continued the bapineuzumab infusions for up to 2 years and remained asymptomatic.18
Using the same study population, investigators sought to describe the imaging characteristics of the ARIA-E and ARIA-H identified.15 The rate of ARIA-H was reported as 12.4% (26/210). They also found that 49% of those with ARIA-E had an associated appearance of ARIA-H. The authors conclude this may suggest a common pathophysiologic mechanism. All scans were reviewed by local MR imaging readers and subsequently independently reviewed by the same 2 neuroradiologists as part of the study protocol. An inter-reader kappa value of 0.76 indicated high inter-reader reliability, with 94% agreement between the neuroradiologists regarding the presence or absence of ARIA-E.
In a retrospective analysis of 242 patients from this same population, ARIA-E was detected more frequently by trained neuroradiologists than by local site radiologists.23 The MRIs were performed in patients with mild to moderate AD in a Phase III trial of bapineuzumab. Seventy-six cases of ARIA-E were not detected on the initial read and were reported on the final MRI review by the expert radiologists, including 51 cases not identified by central/local readers. These represented low radiologic severity. A final read analysis found that the readers’ ability to detect ARIA-E improved as the study duration increased, with the majority of ARIA-E occurrences later in the study identified by the local site radiologists, suggesting that the ability to detect ARIA improved with increasing experience. It is unclear whether the outside readers were using the same imaging criteria, what duration and type of training they received, or whether this finding resulted in any treatment changes between the groups. This supports the need for appropriate training for radiologists reading these studies and continued standardization of findings to ensure consistency between readers. The clinical significance of this finding has not been determined, as none of those with mild ARIA-E had symptoms despite continuation of therapy and the sample size was too small to generalize to a larger population.
A volumetric analysis of structural brain MRI images was tested and reported to have a high correlation with independent computer-aided manual segmentation for the detection of atrophy changes seen in mild AD.24 Using the Open Access Series of Imaging Studies database, 40 subjects with mild probable AD were compared to healthy controls. These images were processed by the NeuroQuant software package. The investigators reported that the volumetric results obtained by the software showed a high correlation and could be of benefit in the evaluation of brain atrophy. The study is limited by a very small sample size and the lack of comparison to expert human readers in the same population, and represents very low-quality evidence.
A study reviewing the MRIs of 122 patients with dementia compared NeuroQuant readings with radiologist readings in an effort to determine whether the automated software could distinguish AD from other types of dementia. They concluded that the software could not be used alone, as the changes in brain segments were not specific for AD.25
A prospective study evaluated the brain MRIs of 40 patients, acquired on 6 scanners from 5 institutions, with both NeuroQuant quantitative analysis and neuroradiologist readings. Image processing was conducted with FAST-DL, a DICOM-based, convolutional neural network-dependent deep learning AI enhancement software product called SubtleMR. Clinical classification performance was compared for standard-of-care scans, FAST-DL and NeuroQuant. The authors reported FAST-DL was statistically superior to standard of care in subjective image quality for perceived signal-to-noise ratio (SNR), sharpness, artifact reduction, anatomic/lesion conspicuity, and image contrast (all P values < 0.008), despite a 60% reduction in sequence scan time. They conclude that deep learning can provide 60% faster image acquisition with preserved perceived image quality and accuracy comparable to standard-of-care scans.26 This study is limited by small sample size, use of a vendor-based software for comparison, risk of bias and conflicts of interest among investigators.
Clinical Validity (or Technical Efficacy)
Sima et al.27 conducted a diagnostic study to assess the clinical performance of an AI-based software tool for assisting radiological interpretation of brain MRI scans in patients monitored for ARIA. The study enrolled 16 US Board of Radiology-certified radiologists to perform radiological reading with (assisted) and without (unassisted) the software. A total of 199 cases, each consisting of a pre-dosing baseline and a post-dosing follow-up MRI from patients in the aducanumab clinical trials PRIME (NCT01677572), EMERGE (NCT02484547), and ENGAGE (NCT02477800), were retrospectively evaluated. End points were the difference in diagnostic accuracy between assisted and unassisted detection of ARIA-E and ARIA-H independently, assessed with the area under the receiver operating characteristic curve (AUC).
The mean (SD) age was 70.4 (7.2) years; 105 (52.8%) were female; 23 (11.6%) were Asian, 1 (0.5%) was Black, 157 (78.9%) were White, and 18 (9.0%) were of other or unreported race and ethnicity. Among the 16 radiological readers, 2 (12.5%) were specialized neuroradiologists, 11 (68.8%) were male, 7 (43.8%) worked in academic hospitals, and they had a mean (SD) of 9.5 (5.1) years of experience. Radiologists assisted by the software were significantly superior in detecting ARIA compared with unassisted radiologists, with a mean assisted AUC of 0.87 (95% CI, 0.84-0.91) for ARIA-E detection (AUC improvement of 0.05 [95% CI, 0.02-0.08]; P = .001) and 0.83 (95% CI, 0.78-0.87) for ARIA-H detection (AUC improvement of 0.04 [95% CI, 0.02-0.07]; P = .001). Sensitivity was higher in assisted reading compared with unassisted reading (87% vs 71% for ARIA-E detection; 79% vs 69% for ARIA-H detection). Specificity remained above 80% for the detection of both ARIA types. The software provided the greatest improvement in the detection of mild cases (70% compared to 47%). Unassisted readers distinguished ARIA grades well, but inter-reader agreement was higher with software assistance. Reading time was similar in both groups. The authors concluded that radiological reading performance for ARIA detection and diagnosis was significantly better when using the AI-based assistive software. The study is limited by concerns for indirectness (small sample size), lack of generalizability (limited representation in the population), and the risk of software errors because the model may have been trained on discrepant data from individual readers. The software was not trained to detect cerebral hemorrhages larger than 1 cm. The software assists but does not replace the need for qualified radiologists to read the studies. An additional validation study compared icobrain to FreeSurfer with comparable results.28 The lead author is employed by the device maker.
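For context, the assisted-versus-unassisted end point in this type of reader study is typically evaluated by computing, for each reader, an AUC from per-case suspicion scores against the ground-truth ARIA status and then comparing the reader-averaged values. The following is a minimal illustrative sketch of that calculation for a single reader; it is not the study's actual analysis code, and the variable names and example data are hypothetical.

```python
# Illustrative sketch: per-reader AUC for assisted vs. unassisted ARIA-E reads.
# Hypothetical data; not the analysis code used in the cited study.
import numpy as np
from sklearn.metrics import roc_auc_score

# Ground truth: 1 = ARIA-E present on the follow-up MRI, 0 = absent (one value per case).
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])

# Per-case suspicion scores (e.g., 0-100 confidence) from one reader,
# read once without and once with the assistive software.
unassisted_scores = np.array([55, 40, 35, 80, 20, 45, 60, 30])
assisted_scores = np.array([70, 35, 60, 90, 15, 40, 75, 25])

auc_unassisted = roc_auc_score(y_true, unassisted_scores)
auc_assisted = roc_auc_score(y_true, assisted_scores)

# The study's end point is the difference in AUC (assisted minus unassisted),
# averaged across readers in the full analysis.
print(f"Unassisted AUC:  {auc_unassisted:.2f}")
print(f"Assisted AUC:    {auc_assisted:.2f}")
print(f"AUC improvement: {auc_assisted - auc_unassisted:.2f}")
```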
A study compared the MRIs of healthy controls (n=90) to those of patients with subjective cognitive decline (n=930), mild cognitive impairment (n=357), and AD (n=820). Icobrain dm results were compared to FreeSurfer software, and the investigators reported icobrain dm had fewer failures, was faster and improved clinical accuracy.29 FreeSurfer is not available for clinical use.
An empirical study in South Korea analyzed the MRIs of 98 patients with AD using the VUNO Med-DeepBrain AD (DBAD) deep learning algorithm. They compared the results of the DBAD imaging reads to those of 3 expert readers (ME) and reported comparable accuracy (87.1% for DBAD and 84.3% for ME), sensitivity (93.3% for DBAD and 80.0% for ME), and specificity (85.5% for DBAD and 85.5% for ME).30
Clinical Utility (or Clinical Efficacy)
There are no published studies to date on clinical utility. The software was found to be particularly helpful in improving the detection of mild cases of ARIA-E; however, it has not been determined whether this improves patient outcomes.27 The Appropriate Use Criteria allow continuation of aducanumab for mild ARIA-E or ARIA-H with monthly MRI monitoring and discontinuation for worsening symptoms. There is a lack of evidence as to whether continuation of medication in these cases correlates with improved outcomes or increases the risk of serious adverse events.
Multiple Sclerosis
Multiple sclerosis (MS) is an autoimmune demyelinating disease affecting the central nervous system; diagnosis is made from MRI findings, laboratory findings and clinical data. This has led to ongoing investigations into developing machine learning tools to aid in the diagnosis of MS.
Several review papers have identified multiple investigations of AI models and the potential this technology may bring, but acknowledge that these investigations did not yield a clinically usable model.31-34
A 2018 systematic review included 30 articles, of which 18% utilized artificial neural network methods, and reported overall high sensitivity, specificity and accuracy of the reasoning methods.35
A 2022 systematic review included 38 studies focusing on deep learning or AI to analyze any imaging modality for the purpose of diagnosing MS. The authors conclude that this is a growing field and could result in drastic improvements in the future.36 Another systematic review with meta-analysis from the same group in 2023 included 41 articles (n=5989) and reported high precision in MS diagnosis for AI studies (95% CI: 88%, 97%), suggesting that AI can aid the clinician in accurate diagnosis of MS.37 The meta-analysis is limited by very high heterogeneity, with an overall I² of 93%, limiting the validity of these results. The authors conclude that more studies are necessary to create a generalizable algorithm.
Another 2022 systematic review included 66 papers that addressed developing classifiers for MS identification or measuring its progression. The authors also acknowledge the potential benefits of this approach if applied appropriately and provide guidance for further research.1
A retrospective study analyzed the MRIs of patients with MS using the icobrain software platform, reviewing 6826 MRIs with 1207 MRI pairs meeting inclusion criteria. The investigators reported that icobrain could be utilized for percentage brain volume change based on strict selection criteria.38 Another study explored the potential role of icobrain and the use of an MS app to inform treatment changes in a small population.39
Volumetric data for patients with MS were analyzed for same-scanner and different-scanner MRI pairs. Of 6826 MRIs, 85% had appropriate volumetric sequences; 4446 serial MRI pairs were analyzed and 3335 (75%) met inclusion criteria. The percentage brain volume change (PBVC) of the included MRI pairs showed a variance of 0.78% for same-scanner pairs and 0.80% for different-scanner pairs, but further selection of the included MRI pairs with the best variance resulted in 1885 (42%) MRI pairs with a PBVC variance of 0.34%. The authors acknowledge the challenges and limitations of brain volumetry measurements and the need for standardization for them to perform adequately. The authors conclude icobrain should be utilized for PBVC determination only on selected MRIs with the best alignment similarity and with strict selection criteria for the included MRI pairs to reduce PBVC variability.38
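For reference, PBVC as reported by such volumetric software is conventionally defined from the baseline and follow-up whole-brain volumes; the expression below is the standard definition and is not taken from the cited study:

\[ \text{PBVC} = 100 \times \frac{V_{\text{follow-up}} - V_{\text{baseline}}}{V_{\text{baseline}}} \]

A negative PBVC indicates brain volume loss (atrophy) between the two time points.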
Brain Tumors
Two poster abstracts and several review papers were identified.40-43
A report on a novel AI-driven application to aid in brain tumor detection from MRI images describes the development of a model based on EfficientNetB2. The report focuses on the proposed technology and evaluation of its performance in a non-clinical setting.44 EfficientNetB2 is not FDA cleared as a medical device.
Epilepsy
A comparison of MRI images read by neuroradiologists and analyzed with NeuroQuant software was performed in 144 patients with temporal lobe epilepsy (TLE). The investigators found specificity similar to neuroradiologist visual MR imaging analysis (90.4% versus 91.6%; P = .99) but lower sensitivity (69.0% versus 93.0%, P < .001). The positive predictive value of NeuroQuant analysis was comparable with visual MR imaging analysis (84.0% versus 89.1%), whereas the negative predictive value was not comparable (79.8% versus 95.0%). They conclude that the neuroradiologists had higher sensitivity, likely due to the software’s inability to evaluate changes in hippocampal T2 signal or architecture.45 They conclude the technology may aid in evaluations when a neuroradiologist is not available; however, product information states the software is intended as an adjunct to, not a replacement for, the radiologist.
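For reference, the reported metrics follow the standard definitions based on true positive (TP), false positive (FP), true negative (TN) and false negative (FN) counts; these are general definitions rather than formulas specific to the cited study:

\[ \text{Sensitivity} = \frac{TP}{TP+FN}, \qquad \text{Specificity} = \frac{TN}{TN+FP}, \qquad \text{PPV} = \frac{TP}{TP+FP}, \qquad \text{NPV} = \frac{TN}{TN+FN} \]

Because PPV and NPV depend on the prevalence of the finding in the studied population, they may not transfer directly to populations with a different prevalence.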
A prospective study measured volumetric MRI data for 34 patients with TLE and compared them to 116 control subjects.46 Structural volumes were calculated using automated quantitative MRI analysis software (NeuroQuant). Results of the quantitative MRI were compared with visual detection of atrophy and with histological specimens if available. Quantitative MRI results showed concordance with visual inspection of the volumetric MRI studies by two experienced neuroradiologists for hippocampal asymmetry (91%-97%). They reported the software discriminated patients with TLE from control subjects with high sensitivity (86.7%-89.5%) and specificity (92.2%-94.1%). The authors conclude that the software can provide “an expert eye” in centers that lack expertise; however, the FDA indication for this software states it is intended to be an adjunct to the radiologist reader, not a substitute. Limitations of the study include lack of generalizability to non-expert readers, small sample size, and histological confirmation available for only 12 patients (35%).
A retrospective report included 36 patients with mesial temporal sclerosis (MTS), which is important to detect in temporal lobe epilepsy as it often guides surgical intervention. One of the features of MTS is hippocampal volume loss. Using electronic medical records, researchers identified patients with proven MTS and analyzed the imaging with volumetric assessment software (NeuroQuant). They reported an estimated accuracy for the neuroradiologists of 72.6%, with a kappa statistic of 0.512 (95% CI, 0.388–0.787). They conclude that the NeuroQuant software compared favorably with trained neuroradiologists in predicting MTS.47
Other literature identified was limited to review papers, case reports and case series, and was not included in this assessment.
Traumatic Brain Injury
Twenty MRIs from patients with mild to moderate traumatic brain injury (TBI) were analyzed with NeuroQuant automated software and compared to the attending radiologist interpretation. The investigators reported that the radiologists' traditional approach found at least one sign of atrophy in 10.0% of patients, compared to 50.0% of patients with NeuroQuant, and concluded that NeuroQuant had higher sensitivity.48 A subsequent expanded study with 24 subjects found similar results.49 These studies are limited by very small sample size and uncertainty as to whether the atrophy was caused by TBI or by other conditions that can produce similar findings. The authors state “we have never seen an MRI report on a patient that used a qualitative rating scale to assess level of atrophy or ventricular enlargement. With the rapid advances in computer-based technology, instead of focusing on understanding and developing the approach based on qualitative ratings, it may be more advantageous to focus on the computer-automated approaches.” In the absence of qualitative ratings, a comparison between the gold standard visual approach and the quantitative approach is not established, and the claim that the software (finding atrophy in 50% of patients) was more accurate is not validated.
Comparison Between Technologies
A review paper focused on AI for neuroanatomical segmentation of the brain on MRI concludes that accuracy is high and performance fast overall. The technology is challenged by limited robustness to anatomical variability and pathology, related to the lack of the large datasets necessary for sufficient training.43 One of the challenges in developing automated volumetry software is the lack of a gold standard for brain measurements to establish whether the software correlates with reality. In current software products, measurements of brain segments are made with different methods and tools and therefore lack standardized measures for comparison. Efforts to understand the performance of different software modalities are ongoing, as consistency between programs is an important component of creating reliable standards that can be applied to clinical practice.
Multiple investigations compared the inter-method reliability of the NeuroQuant and FreeSurfer computer-automated programs for measuring MRI brain volume. These demonstrate high inter-method reliability between the modalities, with 2 of 21 brain regions being less reliable.50,51 Using 56 MRIs from patients with AD or MS, investigators compared results between the NeuroQuant and volBrain automated brain analysis software and found high reliability except in the thalamus and amygdala, where reliability was poor. Using 115 MRIs from patients with clinically isolated syndrome measured with both NeuroQuant and FMRIB's Integrated Registration and Segmentation Tool (FIRST), investigators found some variability between the modalities, with larger volumes achieving better agreement.52 Another investigation compared NeuroQuant to DeepBrain and found significant differences in many brain regions.30 A retrospective report compared the images of 87 patients with memory impairment using FreeSurfer, NeuroQuant, and Heuron AD and found significant differences between the programs. The Heuron AD indication is for amyloid PET scans.53
A study compared brain volumetrics in MS measured with Structural Image Evaluation using Normalisation of Atrophy-Cross-sectional (SIENAX), NeuroQuant and MSmetrix.54 SIENAX is widely used in cross-sectional MS studies, but its clinical application is limited. The authors compared the performance of NeuroQuant and MSmetrix to SIENAX and concluded that the results were comparable.
FreeSurfer, volBrain and FIRST are not FDA cleared and are used for research purposes.