Specific Lab Tests
Cxbladder (Detect, Triage, Monitor, Resolve)
The key weakness of the Cxbladder tests lies in their test design, but before discussing these design flaws, the concept of gene expression profiles (GEPs) requires further definition.
In brief, GEPs measure the quantity of gene-specific mRNA transcripts in a specimen (e.g., how many mRNA transcripts of the gene HOXA13 are found in the urine). Because the mRNA transcripts from different genes can be identified and counted, GEPs can approximate how much transcriptional activity is occurring per gene (how much and how fast mRNA is being transcribed from the DNA at that time). In urine, mRNA is found either within whole cells (e.g., desquamated urothelium) or extracellularly (e.g., from disrupted cells). A urinary GEP result is therefore an indirect assessment representing the sum total of transcriptional activity from both cells freely floating in the urine and cells lining the urinary tract (from kidney to urethra). As an indirect assessment, these results rest on many assumptions, such as:
- There is a normal pattern of mRNA transcriptional activity at baseline that is consistent across all healthy patients
- An increase or decrease in mRNA transcription above or below baseline represents pathophysiology
- Specific patterns of abnormal changes in mRNA transcription may be ascribed to specific pathologic processes
- Each specific, abnormal pattern of mRNA transcription applies to one and only one pathologic process
- When abnormal patterns of mRNA transcription overlap, each coexisting pathologic process they represent may still be clearly identified through the patterns
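For illustration, the sketch below shows how a multi-gene expression profile of this kind typically reduces per-gene mRNA abundances to a single classifier score. The gene names match markers discussed in this section, but the expression values, weights, normalization, and threshold are invented for illustration; this is not the Cxbladder algorithm.

```python
# Toy illustration of a multi-gene urinary expression score.
# All numbers are hypothetical and do not reflect any marketed test.
import math

# Hypothetical normalized transcript abundances (e.g., qRT-PCR derived) per gene.
expression = {"MDK": 180.0, "CDC2": 95.0, "IGFBP5": 40.0, "HOXA13": 12.0, "CXCR2": 300.0}

# Hypothetical weights; a real profile would be fit on training specimens.
# A negative weight is shown only to illustrate that a marker can be included to
# down-weight a competing signal (e.g., inflammation); no direction or magnitude
# is asserted for the actual test.
weights = {"MDK": 0.9, "CDC2": 0.7, "IGFBP5": 0.5, "HOXA13": 1.1, "CXCR2": -0.8}

score = sum(weights[g] * math.log10(expression[g] + 1.0) for g in expression)
print("GEP score:", round(score, 2), "->", "positive" if score > 3.0 else "negative")
```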
Cxbladder tests are founded on the premise that differences in gene expression between urothelial cancer and all other tissue (non-urothelial malignancies and non-neoplastic tissue) can be measured in urine to determine whether urothelial cancer is present. The authors suggest that a well-designed test would be able to discriminate not only between patients with urothelial cancer (a specific abnormal pattern of mRNA transcriptional activity) and healthy, unaffected patients (a normal pattern), but also between patients with urothelial cancer and patients affected by other diseases (other distinct abnormal patterns), whether non-malignant (e.g., urinary tract infection) or malignant (e.g., renal cell carcinoma). As will be shown below, the literature for development and validation of the Cxbladder tests does not support the assumptions underlying these tests and does not prove the tests’ power to discriminate between patients with and without transitional cell carcinoma (TCC).
However, before discussing the Cxbladder tests, it is critical to understand the limitations of the 2008 publication from Holyoake and colleagues, since the uRNA-D test described in that publication was used to create the Cxbladder tests.28,29 The main difference between uRNA-D and Cxbladder was the addition of a single gene, CXCR2, creating a 5-gene expression profile.28,29
In their 2008 paper, Holyoake and colleagues sought to answer 2 questions: First, could an mRNA expression test allow accurate identification of TCC in urine specimens; and second, could this test also differentiate between low grade and high grade TCC? The researchers proceeded to first identify candidate genes (14 genes selected out of about 26,600 genes) by comparing TCC tissue to normal tissue. Next the researchers whittled the 14 candidate genes down to 4 genes (CDC2, MDK, IGFBP5, and HOXA13) by comparing urine from patients with TCC to urine from patients without TCC. Successful and convincing execution of test development would in part require addressing known pitfalls (as described above) of mRNA expression profile tests. Holyoake and colleagues failed to prove their test overcame these pitfalls.28
Firstly, in the early development phase of the test (choosing 14 genes from over 26,000), Holyoake and colleagues utilized tissue, not urine. The methodology presumed that the ureter epithelium (tissue) taken from patients with kidney cancer would supply an mRNA expression profile comparable to urine from a patient without bladder cancer. This approach discounted the differences between tissue from a single cell type (urothelium) and urine, which contains both cell-free mRNA and cell-bound mRNA, all from a variety of urothelial and non-urothelial sources (e.g., kidney and prostate gland). A more accurate approach, if using tissue to design the test, would be to compare mRNA profiles between urothelial cancer and normal bladder urothelium from the same patient to minimize the confounding differences. After test design with tissue, there would need to be confirmation that mRNA expression profiling of tissue translated to urine testing, which could be best characterized by comparing tissue profiles with urine profiles of the same patient.28
Secondly, in the test finalization phase, urine from patients with TCC was compared to urine from patients with other diseases affecting the urologic tract, both malignant and non-malignant. No urine from healthy patients was used to design the final test. Moreover, the non-TCC malignancies were not identified in this paper (e.g., no diagnoses of prostate cancer or kidney cancer). Therefore, potential genes for an mRNA profile were discovered by comparing TCC tissue to benign ureter tissue and then subsequently honed to a final test design by comparing urine from patients with bladder cancer to urine from patients with other diseases (both malignant and non-malignant) without comparing to urine from healthy patients.28
Thirdly, of the over 26,000 genes investigated, ultimately only 4 genes (CDC2, MDK, IGFBP5, and HOXA13) were selected. In isolation, the selected genes were not considered unique to the development of urothelial carcinoma. For instance, the authors stated “TOP2A and CDC2, which are involved in DNA synthesis and cell cycle control, showed very high overexpression across the majority of tumors examined”. In fact, when selecting genes, the authors most frequently focused on the power of a gene to discriminate in 1 or more aspects of their test (e.g., HOXA13 and IGFBP5 were the best genes for discriminating between Ta tumors and T1-T4 tumors), but they often failed to adequately discuss the significance of the gene itself in the development of urothelial carcinoma. In the paper’s discussion, each of the 4 selected genes was described briefly, and a single literature citation was provided for each of CDC2, IGFBP5, and MDK without any assertion that these genes are unique to urothelial carcinogenesis. Altogether, this demonstrates that the test is based on correlation, not causation, and is thus an indirect assessment of the presence of TCC.28
Beyond these pitfalls in the development of an mRNA expression profile test, the study from Holyoake and colleagues introduced bias throughout their data, such as in the selection of patients, design of the test structure, and establishment of gold-standard references.28
Patients included in the test finalization portion of the study all received flexible cystoscopy and all presented with symptoms concerning for urinary tract disease. No asymptomatic patients were included. This selection process demonstrates potential bias in excluding baseline, “normal”, asymptomatic controls and selecting against patients with diseases that did not rise to a level of concern requiring cystoscopy. Also, note that the patients were selected from a Japanese population at a single institution in Kyoto, potentially limiting the relevance and applicability of the test in other, dissimilar populations, such as the predominantly Caucasian but still highly diverse population of USA Medicare patients. The generalizability of results from a Japanese patient population to more heterogeneous populations is questionable, thereby reducing the certainty of translating these results to the United States.28
A second source of bias was found in the design of uRNA-D, which was optimized by fixing the specificity of the test at 85% after collecting and reviewing the results. In this case, bias would be introduced into the estimates of test performance (e.g. overly optimistic assessment of test accuracy), thus potentially affecting the applicability of the index test for patient populations not represented by the optimization.28
A third source of bias was found in the selection and interpretation of the reference standards (cystoscopic and histologic results). Very few of the patient workups for TCC and other diseases were detailed in the study. While it is very likely that other diagnostic modalities, such as radiology, were employed to diagnose non-TCC disease, the study failed to detail these workups. In fact, the types of non-TCC malignancies were not classified in this paper. Moreover, TCC itself is not a monolithic disease but rather a heterogeneous cancer with many different origins that can include environmental and/or genetic etiologies. Thus, different subtypes of TCC would have different behaviors, such as increased aggressiveness or increased likelihood of metastasis. This paper primarily focused on the stage and grade of disease without consideration of the other complexities within the category of TCC. Thus, the study’s selection of reference standards could have introduced bias into the accuracy, performance, and applicability of the uRNA-D test.28
In summary, the 2008 paper from Holyoake and colleagues failed to convincingly prove that their 4-gene mRNA expression profile could accurately distinguish between patients with TCC and patients without TCC, or between patients with low-grade and high-grade TCC. Subsequent literature regarding this test, or tests utilizing this study’s data, would need to address these shortfalls before the accuracy and applicability of the 4-gene mRNA expression profile could be accepted.28
In 2012, the first paper describing the validation of a Cxbladder test was published by O’Sullivan and colleagues. The study was designed to assess whether Cxbladder and its predecessor, uRNA-D, were more sensitive in detecting bladder cancer than other tests (cytology, NMP22 BladderChek, and NMP22 ELISA) and if they were able to properly stratify positive bladder cancer specimens into low and high tumor grade. As a validation study for Cxbladder, O’Sullivan and colleagues needed to address the shortfalls of the uRNA-D test since 4 of the 5 genes used in Cxbladder were insufficiently evaluated in the Holyoake study.28,29
As stated above, the 2008 development of uRNA-D failed to address key assumptions inherent to mRNA expression profiles. Cxbladder was developed using the same patient specimens, test design strategy, and reference standards as used in the 2008 study. Notably, O’Sullivan and colleagues did not revisit these biases, and, thus, they tacitly reintroduced the biases of the 2008 study into the development of their 2012 Cxbladder test.28,29
Additionally, new biases were introduced in the 2012 paper via the patient population used in validating Cxbladder. The study consecutively enrolled patients from 9 urology clinics in Australia, and all subjects were evaluated using cystoscopy, cytology, uRNA-D, Cxbladder, and other bladder marker tests (NMP22 BladderChek and NMP22 ELISA). Per the study, “[p]atients were eligible … if they had a recent history of primary gross hematuria requiring further investigation for possible urological cancer, were age 45 years or older and had no history of urinary tract malignancy.”29 However, patients with gross hematuria occurring under 24 hours from testing, patients with “prior genitourinary manipulation,” and patients with urinary tract infection were all excluded.29 These exclusions related to blood and inflammation are particularly interesting considering that Holyoake and colleagues developed uRNA by “select[ing] markers with low expression in blood and inflammatory cells”28 and O’Sullivan augmented the Cxbladder test with a fifth marker, CXCR2, “that is highly expressed in neutrophils, and is predicted to improve discrimination between patients with nonmalignant disease and those with early stage, low grade UC.”29 In the O’Sullivan study, they found “[c]ontrol patients with microhematuria were more likely to have false-positive tests than those without microhematuria (p = 0.002),” which raises the question of why the mRNA content of blood would render a false positive result rather than obscuring other potentially relevant mRNA from other cells (namely, increased false negatives).29 Also note that most subsequent studies evaluating Cxbladder tests perpetuated the exclusion of patients with urinary tract infections and active gross hematuria (urine samples visually discolored by blood). Those studies that permitted testing of patients with significant inflammation of the urinary tract found that false positive rates appeared greatly increased, with one study demonstrating a false positive rate of 59% in patients with significant inflammation.41,42 Given that malignancies are often intimately associated with inflammation, one should question whether the development of the uRNA and Cxbladder tests adequately addressed discrimination between mRNA expression profiles from tumor cells and associated inflammatory cells. Therefore, the study from O’Sullivan and colleagues does NOT prove the premise that Cxbladder effectively discriminates between mRNA expression profiles of tumor cells versus white and red blood cells.29
Additionally, the exclusion of patients with a history of urinary malignancy weakens the validation of Cxbladder in this 2012 study, as well as the validations of subsequent Cxbladder tests, since each new Cxbladder test builds on its predecessors and often reuses patient populations. In fact, looking at the assessment of non-urothelial neoplasms throughout all major published uRNA and Cxbladder studies, we see the following:
- Holyoake 2008: 33 undefined cancers were noted28
- O’Sullivan 2012: 7 other neoplasms (undefined) were noted, all concurrent with urothelial carcinomas29
- Kavalieris 2015: Non-urothelial neoplasms were not discussed (study population included 517 patients from the O’Sullivan 2012 study)29,30
- Breen 2015: Non-urothelial neoplasms were not discussed (study population included patients from the O’Sullivan 2012 study)29,34
- Kavalieris 2017: Non-urothelial neoplasms were not discussed (same patient population as Lotan 2017)31,35
- Lotan 2017: Non-urothelial neoplasms were not discussed (same patient population as Kavalieris 2017)31,35
- Konety 2019: In some subpopulations, patients with history of prostate or renal cell carcinoma were excluded from the study; otherwise, non-urothelial neoplasms were not discussed (study population included patients from O’Sullivan 2012 and Kavalieris/Lotan 2017)29,31,35,38
- Davidson 2019: Other non-bladder malignancies and neoplasms were identified (but not subclassified) in a study evaluating hematuria; notably, Cxbladder-Triage was positive in most of these other malignancies (7 of 9 total) and neoplasms (2 of 3 total)41
- Koya 2020: Non-urothelial neoplasms were not discussed39
- Davidson 2020: Other non-bladder malignancies and neoplasms were identified in the study but data was not presented to allow association of these other malignancies and neoplasms with their respective positive or negative results from Cxbladder.42
- Raman 2021: In some subpopulations, patients with history of prostate or renal cell carcinoma were excluded from the study; otherwise, non-urothelial neoplasms were not discussed (study population included patients from O’Sullivan 2012 and Konety 2019)29,32,38
- Lotan 2022: In some subpopulations, patients with history of prostate or renal cell carcinoma were excluded from the study; otherwise, non-urothelial neoplasms were not discussed33
- Li 2023: Some patients (24 of 92) were noted to have “other cancers”; these “other” types of cancers are not described or significantly discussed, with the exception of 1 defined instance of a patient with breast cancer who missed their 9-month follow-up due to a conflict with breast cancer treatment40
There are numerous potential malignancies that can contribute to urine’s genetic composition, including but not limited to renal cell cancer, bladder cancer, and prostate cancer. Pacific Edge Diagnostics utilized the same 33 undefined malignancies to develop both the uRNA-D test and the first Cxbladder test. Throughout all validations of Cxbladder tests from Pacific Edge Diagnostics, only 7 undefined, non-TCC neoplasms (described in O’Sullivan’s 2012 paper) were used.29 Moreover, these 7 neoplasms were concurrent with urothelial carcinoma, and thus the validation of Cxbladder tests did not establish the tests’ ability to distinguish between TCC and non-TCC neoplasms. Highlighting this validation oversight, in a study by Davidson and colleagues in 2019 (which was not funded by Pacific Edge Diagnostics), 7 of 9 patients with malignant prostate or kidney lesions received false positive Cxbladder results.41 Therefore, the currently available literature does NOT prove the premise that Cxbladder effectively discriminates between TCC and non-TCC malignancies.
Altogether, the patient exclusions found in the 2012 study create a significant bias in the development and validation of the Cxbladder line of tests and indicate a failure to thoroughly assess significant confounding variables for their 5-gene expression profile.29
Following these foundational studies, Pacific Edge Diagnostics sought to establish a line of Cxbladder tests with subsequent literature supporting the clinical validity and utility of the tests. However, successive studies utilizing the 5-gene expression profile still needed to establish its analytic and clinical validity, since the foundational studies failed to do so. As seen above in the discussion of whether Cxbladder tests can effectively discriminate between TCC and non-TCC malignancies, successive studies failed to remedy core issues with the 5-gene expression profile. As such, the credibility and applicability of these successive studies and associated tests (including Cxbladder Detect, Triage, Monitor, and enhanced Triage) cannot be established when foundational traits of the test are in question.
When a line of tests fails to truly discriminate between the disease of interest and all other conditions, normal or pathophysiologic, there is increased concern that the tests could cause patient harm. Unsurprisingly, Cxbladder tests generally have low PPVs (down to 15-16%, as seen in Konety et al 2019 and Lotan et al 2023) and high numbers of false positives (in Konety’s paper there were 464 false positive results compared to 86 true positive results, and in Lotan’s paper there were 110 false positive results compared to 19 true positive results).38,33 In fact, the majority of Cxbladder papers avoid disclosing the PPV and number of false positives of their tests. Yet these statistics are significant in that false test results, particularly false positives, can lead to patient anxiety and distress, among other harms related to following up an inaccurate result. If numerous false positive results are accepted as an inherent trait of Cxbladder tests, providers may not be as vigilant in closely following patients with a positive Cxbladder result after a negative cystoscopy. In addition, providers may not search for other malignancies (e.g., papillary renal cell carcinoma) as a potential cause of the “false positive” Cxbladder result.
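For reference, PPV follows directly from the counts of true and false positive results (PPV = TP / (TP + FP)). The short calculation below is a simple arithmetic check using the counts quoted above, not a reproduction of either paper's analysis; it shows how those counts translate into PPVs of roughly 15%.

```python
# Worked check of positive predictive value (PPV) from the reported counts:
# PPV = TP / (TP + FP).
def ppv(true_positives: int, false_positives: int) -> float:
    return true_positives / (true_positives + false_positives)

print(f"Konety: PPV = {ppv(86, 464):.1%}")  # 86 TP, 464 FP -> ~15.6%
print(f"Lotan:  PPV = {ppv(19, 110):.1%}")  # 19 TP, 110 FP -> ~14.7%
```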
As described in the Summary of Evidence section above, systematic reviews, meta-analyses, and even guidelines from large expert organizations like the American Urological Association generally do not support the use of Cxbladder tests in patient care. At best, the guidelines “support their potential value in preventing unnecessary cystoscopies.”26,43-47
In conclusion, the Cxbladder line of tests all suffer from insufficient test validation, a foundational problem in which potentially confounding clinical circumstances, including non-TCC neoplasms and malignancies and inflammatory conditions of the urinary tract, were not adequately addressed. Cxbladder also demonstrates several population biases, including a foundational study from Holyoake in 2008 that only used Japanese patients from one location.28 Most of the primary literature regarding Cxbladder test development and performance is funded by, if not directly written by, the test’s parent company, Pacific Edge Diagnostics. This conflict of interest must be taken into account when reviewing these papers, particularly when there are issues not discussed in Pacific Edge Diagnostics-funded papers that are subsequently addressed by non-funded studies, such as the 2019 Davidson study, which identified increased false positives in patients with non-TCC malignancies.41 As a result, Cxbladder tests are not reasonable and necessary to support positive outcomes in the management of bladder cancer and are, therefore, not payable.
ThyroSeq CRC – CBLPath, Inc., University of Pittsburgh Medical Center
ThyroSeq CRC is a prognostic test for malignant cytology that predicts the 5-year likelihood of cancer recurrence (low, intermediate, or high risk) based on algorithmic synthesis of raw data from next generation sequencing (NGS) of DNA and RNA from 112 genes. Nodules sampled by fine needle aspiration (FNA) and proven malignant on cytology are typically surgically resected, sometimes with coincident lymph node dissection. The features of the resected cancers are then assessed on permanent pathology. Therefore, in cases where malignant cytology is identified and surgery is warranted, a prognostic test for risk of recurrence based on cytologic material (before evaluation of the resected material, including assessment of the lymph nodes for metastases) is premature and potentially misleading. However, ThyroSeq CRC is proposed to direct the extent of surgery for Bethesda VI nodules, increasing the aggressiveness of surgery for more aggressive cancers. Therefore, ThyroSeq CRC must not only supply information that is not obtained through standard clinical and pathologic procedures prior to a resection, but also provide results that are subsequently confirmed on patient follow-up after the resection. Ultimately, a prognostic test should provide information that predicts the course of a patient’s disease before therapy is implemented and thus informs future clinical management to preemptively reduce adverse outcomes. For a prognostic test to be clinically useful, it must ultimately improve patient outcomes.
In the first publication describing the evaluation of the ThyroSeq CRC test, a small population of patients (n=287) with differentiated thyroid cancer (DTC) was evaluated with the CRC prognostic algorithm, and each patient’s molecular risk group (low, intermediate, high) was compared to their outcome in terms of distant metastases (DM) as identified through pathology or whole body scans with iodine-131.48 Patients were divided into 2 groups: control (n=225) and DM within 5 years (n=62). In the control group, precise numbers of how many patients fell into each CRC risk category were not supplied by the paper. Instead, the control group was further segregated into a subcategory of propensity matched patients, where each DM-positive patient was compared with a control patient with similar demographic and pathologic characteristics, although the authors clearly state histologic subtype was not used to perform this propensity match. Using this propensity matching technique, comparisons were provided between 53 DM-positive patients and 55 control patients. In this subgroup comparison, the DM-positive patients demonstrated more high-risk scores (low=1 patient; intermediate=17 patients; high=35 patients) than the control patients (low=28 patients; intermediate=19 patients; high=8 patients). This comparison was felt to be adequate by the authors to conclude that their “molecular profile can robustly and quite accurately stratify the risk of aggressive DTC defined as DM.”
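For readers less familiar with the technique, the sketch below illustrates generic 1:1 propensity-score matching on simulated data. The covariates (age, tumor size, sex) and all numbers are assumptions chosen only to mirror the cohort sizes described above; this is a generic illustration of the method, not a reproduction of the matching procedure used by Yip and colleagues.

```python
# Minimal sketch of 1:1 propensity-score matching on hypothetical covariates.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Simulated cohort: 62 patients with distant metastases (DM) and 225 controls.
n_dm, n_ctrl = 62, 225
X = np.vstack([
    np.column_stack([rng.normal(55, 12, n_dm), rng.normal(3.0, 1.2, n_dm), rng.integers(0, 2, n_dm)]),
    np.column_stack([rng.normal(48, 13, n_ctrl), rng.normal(2.2, 1.0, n_ctrl), rng.integers(0, 2, n_ctrl)]),
])
group = np.array([1] * n_dm + [0] * n_ctrl)  # 1 = DM case, 0 = control

# Propensity score: modeled probability of being a DM case given the covariates.
ps = LogisticRegression(max_iter=1000).fit(X, group).predict_proba(X)[:, 1]

# Greedy 1:1 nearest-neighbor matching on the propensity score, without replacement.
dm_idx = np.where(group == 1)[0]
ctrl_pool = list(np.where(group == 0)[0])
matches = {}
for i in dm_idx:
    j = min(ctrl_pool, key=lambda c: abs(ps[i] - ps[c]))
    matches[i] = j
    ctrl_pool.remove(j)

print(f"Matched {len(matches)} DM patients to controls")
```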
This study had numerous limitations and drew dramatic conclusions from a very small sample size that was poorly presented in the paper.48 The immediate issue with this study was the lack of transparency. Thyroid cancer is a complex category of malignancy that includes many different subtypes of cancer, each with a variety of behaviors depending on numerous demographic, clinical, and pathologic factors. Management of cancer patients is thus a multifactorial and interdisciplinary process that requires careful evaluation. The study from Yip and colleagues not only oversimplifies the descriptions of the patient populations but also fails to provide the background data for each patient that would allow objective review by readers. We are not given crucial details such as key findings in pathology reports (mitoses, lymphovascular invasion, capsular invasion, histologic subtype of the cancer) or the number of patients with positive lymph nodes found during resection of the cancer. Instead, the patient demographics and molecular characteristics provided include simplifications such as generalized cancer types without subclassification (Papillary, Follicular, or Oncocytic) and non-specific metastatic locations (bone, lung, “>1” and other). Additionally, the propensity matched description table (Table 2) only lists mean age at diagnosis, mean tumor size, and gender ratio.
Yip and colleagues also did not provide significant insight into why some controls (n=8) were ranked as high risk while one patient with DM was categorized as low risk.48 The purpose of the intermediate risk category is unclear and concerningly unhelpful given that the number of patients in this risk category was essentially the same between propensity matched DM and control patients (n=17 versus n=19, respectively).
Ultimately, it was unclear how this test would be used in patient care.48 Given that the test is performed on cytology before resection, the authors conjectured their test could be used to guide extent of surgery (lobectomy versus total thyroidectomy) or help direct patients to therapeutic trials. However, these potential clinical utilities were not assessed in this paper.
In the second publication evaluating ThyroSeq CRC, Skaugen and colleagues performed a single-institution retrospective cohort study assessing 128 Bethesda V (suspicious for malignancy) cytology specimens.49 The study assessed both the ThyroSeq v3 diagnostic test and the ThyroSeq CRC test. For the CRC portion of the study, 100 specimens were assessed; 5 were excluded due to a benign diagnosis upon resection and 3 due to concurrent metastatic disease discovered at resection. For the remaining 92 specimens, there was a mean follow-up of 51.2 months (about 4 years). The shortest follow-up time was less than 1 month, and the longest follow-up time was 470 months (nearly 40 years). It must immediately be noted here that the ThyroSeq CRC test claims to predict a 5-year risk of DM, which means over half of the CRC-tested specimens (more than 46 specimens) demonstrated potentially inadequate follow-up to assess the core 5-year prognostic claim. The importance of these follow-ups becomes even more evident when the authors drew conclusions about the prognostic power of the CRC’s 3 risk categories: high, intermediate, and low. Distant metastases were identified in 12 of the 92 specimens: 6 of 11 specimens with a high-risk result and 6 of 63 specimens with an intermediate-risk result. The authors did not provide a deeper analysis of the 5 high-risk specimens without DM, including no speculation as to why the test potentially misclassified these specimens. Additionally, the authors did not provide significant discussion of the meaning of the intermediate-risk result, which was given to 66 of the 100 specimens tested. In the paper’s conclusions, ThyroSeq CRC was again proposed as potentially helpful in deciding the extent of surgery required.
Much like the first paper, the second paper (Skaugen and colleagues) lacked data transparency, making further assessments by readers difficult.49 While Table 3 provided patient characteristics, surgical findings, and pathologic findings, all to a much greater extent than the first paper, readers were still unable to synthesize how data categories corresponded to each other (e.g., of the patients who received lymph node dissection, what subtypes of thyroid cancer were represented).
Ultimately, Skaugen and colleagues lacked sufficient follow-up to draw significant conclusions about the accuracy of the ThyroSeq CRC results.49 The paper, while data rich, was neither transparent nor thorough enough for readers to draw their own conclusions about the validity of the test. Moreover, the conclusions given by the authors regarding the prognostic test were overly simplified, such as highlighting the presence of DMs in some patients with intermediate- and high-risk results and considering this correlation to be significant while also noting the absence of DMs in patients with low-risk results. Finally, the actual use of ThyroSeq CRC in the clinical setting remains unclear based on the discussion of the paper.
In the third publication Liu and colleagues assessed their 3 tier classification system (low-risk, intermediate-risk, and high-risk for recurrence) in the context of primary thyroid cancer recurrence after a primary thyroidectomy and subsequent initial oncologic therapy.50 Notably, the test name ThyroSeq CRC was never used in this paper, even though the 3 tier system of risk stratification appeared to be the same. This raises a concern that the classification system used in this paper may not be the same methodology as used for the marketed ThyroSeq CRC. With that caveat and for the purposes of this Analysis of Evidence, this third paper will be considered contributory to the body of literature evaluating ThyroSeq CRC.
Just from the methodology section of the publication alone, we can see immediate differences between this paper and the previous 2 papers described.48-50 Firstly, surgical specimens were permitted in the study, not just cytology specimens. This allowance of a non-cytology specimen type (“final surgical specimens,” without specification of post-resection handling of tissue, formalin fixation versus fresh-frozen preservation) in a test that was presumably designed for cytologic specimens would require a separate validation of the test for the new type of specimen. Validation for this change in pre-analytic procedure was not evidenced in this paper nor in either of the 2 prior publications. Secondly, the study was not blinded due to its retrospective nature. Thirdly, in cases where multifocal cancer was identified, only samples from the “most aggressive biology” were selected for molecular testing; however, the paper does not define what constitutes “most aggressive biology.” Fourthly, the study included patients with preoperative Bethesda I, II, III, and IV cytology as well as Bethesda V and VI cytology, which starkly contrasts with the inclusion criteria of the prior 2 studies. Overall, these methodologic differences between papers reduce the comparability of results between the 3 studies.
Data collection in this study from Liu and colleagues was also different from the previous 2 papers.48-50 For instance, Liu and colleagues recorded several details on the surgical and post-surgical treatments of the patients. This data included types of lymph node dissection (central versus lateral and prophylactic versus therapeutic), postoperative complications (e.g., hematoma, hypercalcemia, surgical site infection), and long-term complications (such as hypocalcemia and recurrent laryngeal nerve paresis). Several of these data categories were similar to those seen in Skaugen and colleagues’ study, but differences found in Liu’s publication included post-operative details, grouping of several types of papillary thyroid cancer (such as tall cell variant) into a more general category (i.e., “Papillary, high risk”), evaluating only all-cause mortality (without substratifying into disease-specific mortality), and detailing American Joint Committee on Cancer (AJCC) prognostic stages. Note that, as mentioned above, there was a paucity of clinical and pathologic data provided for samples in the study from Yip and colleagues, and the data from Liu and colleagues were far more diverse than in that prior study.
The study followed up patients for a median of 19 months (Interquartile range [IQR] of 10-31 months).50 None of the patients were followed for a total of 5 years, which means the data in this study is insufficient to substantiate the 5-year prognostication claims of the ThyroSeq CRC test.
The above analysis captures only some of the issues identified with the study from Liu and colleagues.50 In fact, careful reading of the paper’s discussion brings up numerous other “limitations” to the study, not already described above, as identified by the authors. While the authors’ discussion remains upbeat, statements such as “how to manage the intermediate group?” draw attention to the novelty of this classification schema and the uncertainty of how the results can impact patient care and outcomes. While the authors suggest numerous ways their classifications could affect patient management, and even indicate that they use this test within their institution to guide their decision-making, the lack of carefully designed studies demonstrating this prognostic test’s clinical utility suggests that the test may not be adequately studied for use in patient care. Despite the extensive data supplied, this paper still failed to adequately evaluate the clinical validity and utility of the prognostic 3 tier system.
In the fourth publication evaluating ThyroSeq CRC, Chiosea and colleagues performed a retrospective analysis of 50,734 FNA specimens from BCIII-VI nodules.51 The samples were first analyzed using the ThyroSeq v3 assay, which classified results as negative or positive. All test-positive samples were then re-examined using ThyroSeq CRC to establish their result in the 3 tier classification system (low-risk, intermediate-risk, and high-risk for recurrence). Of the above-mentioned FNA samples, 65.3% were test-negative and 33.9% were test-positive. Among test-positive results for follicular lesions, 73.3% had mutations, 11.3% had gene fusions, and 10.8% had isolated copy number alterations. The ThyroSeq Cancer Risk Classifier identified high-risk profiles in 6% of samples, more frequently in BCV-VI nodules.
This study had multiple limitations centering around data availability and certainty of evidence. First, the study provided no confidence intervals with which to measure or examine the certainty of the data. Second, ThyroSeq CRC is used to assign risk of cancer recurrence and thus is designed to be utilized on specimens that are suspicious for malignancy or definitively malignant (BCV and BCVI, respectively). However, the vast majority of specimens included in the study were BCIII-IV (48,347 of 50,734 total samples), representing indeterminate cytology (atypia of undetermined significance and suspicious for follicular neoplasm, respectively) that could be benign or malignant. Therefore, these BCIII-IV samples, when benign, would not require the output of the CRC test. Third, the cytologic diagnoses were rendered by local cytopathologists with different diagnostic thresholds, without centralized cytopathology review. Therefore, the results may not be truly comparable, depending on the degree of variability in diagnostic behavior between local cytopathologists. Fourth, the study design did not include association of results with subsequent histologic diagnoses or clinical follow-up. Thus, the paper did not adequately evaluate the clinical validity and utility of ThyroSeq CRC.51
In the fifth publication Liu and colleagues examined whether preoperative factors are linked to incomplete response to initial therapy and if molecular testing results can be used as a substitute for the ATA Risk Stratification System (RSS) to estimate risk of recurrence.52 Similar to the first Liu et al study, this paper used the 3 tier classification system (low-risk, intermediate-risk, and high-risk for recurrence) in the context of recurrence through the analysis of 108 molecular permutations using previously reported methodology. As in the previous Liu study, the test name ThyroSeq CRC was never used in this paper, even though the 3 tier system of risk stratification appeared to be the same. This raises a concern that the classification system used in this paper may not be the same methodology as is used for the marketed ThyroSeq CRC. However, this paper will still be considered contributory to the body of literature evaluating ThyroSeq CRC.
A careful reading of the paper’s discussion brings up numerous limitations to the study that were identified by the authors. The study had a short follow-up, with a median of 18 months. For the analysis of recurrence, unknown preoperative factors that were not analyzed, such as ultrasound characteristics, could have been used to predict recurrence. The study also had possible selection bias due to the nature of the patient selection (as noted by the authors in the limitations section). Consecutive patients who underwent index thyroidectomy for any clinical indication and had pathologically identified primary DTC between November 1, 2017, and October 31, 2021, were abstracted from the electronic health record. However, patients who had initial thyroid surgery earlier or elsewhere for benign disease and underwent a thyroidectomy for thyroid carcinoma during the study period, as well as patients who had previous thyroid surgery earlier or elsewhere for DTC requiring reoperation during the study period, were not abstracted. In cases of multifocal cancer, the thyroid cancer type with the most aggressive histology was recorded, as this dictated clinical care. Additionally, the retrospective nature of the study and the lack of blinding increase the concern for subjectivity in patient selection and, as a result, lower the certainty in the clinical significance of this test in uncurated patient populations. Finally, the authors themselves noted that additional validation data are needed regarding molecular risk groups (MRGs), and until such data are available, MRGs should not replace the gold standard method in routine thyroid cancer care.52
In the final publication, a retrospective cohort study by Schumm and colleagues, patients with Bethesda V and VI nodules who underwent surgery with histopathology showing differentiated thyroid cancer were examined using ThyroSeq v3 and ThyroSeq CRC.53 Structural disease persistence or recurrence, distant metastasis, and recurrence-free survival were assessed by ThyroSeq CRC molecular risk group (low, RAS-like; intermediate, BRAF-like; high, combination of BRAF/RAS plus TERT or other high-risk alterations). Of the 105 patients, genomic alterations were found in 100 samples: 6 MRG-low, 88 MRG-intermediate, and 6 MRG-high.
As with the first Liu et al study, surgical specimens were used in the study, not cytology specimens.50,53 This allowance of a non-cytology specimen type in a test that was presumably designed for cytologic specimens would require a separate validation of the test for the new type of specimen. Validation for this change in pre-analytic procedure was not evident in this paper nor in the prior publications. Secondly, the study followed patients for a median of 3.8 years (IQR of 3.0-4.7 years). No data were provided about how many of the patients were followed for a total of 5 years, which does not allow the 5-year prognostication claims of the ThyroSeq CRC test to be validated. Third, the study was not blinded due to its retrospective nature.
Outside of the above analysis, the authors described numerous other “limitations” to the study. The study was nonrandomized and had a small sample size. Most patients in the MRG-intermediate group underwent total thyroidectomy or radioactive iodine (RAI) ablation, which may have contributed to their low risk of disease recurrence. While the authors suggest numerous ways their classifications could affect patient management, they specifically note that long-term associations between MRGs and recurrence have yet to be established and that further studies are therefore needed before MRGs can replace the current gold standard risk stratification system recommended by guidelines. This indicates that the test may not be adequately studied for use in patient care and has not been validated in a prognostic setting; the study therefore failed to adequately evaluate the clinical validity and utility of ThyroSeq CRC.53
In summary, the validity of the ThyroSeq CRC test is not sufficiently supported by the 6 peer-reviewed papers identified. The 6 papers were exceptionally difficult to compare to each other due to differences in information provided, types of samples tested, and methodologies described. A number of other shortcomings were identified, including missing confidence intervals, insufficient longitudinal data, and a lack of transparency. The certainty of evidence regarding the analytic and clinical validity of ThyroSeq CRC testing is low, given these limitations. Additionally, no paper has been published specifically establishing the clinical utility of ThyroSeq CRC. Due to the inadequate quality of the papers and the insufficiency of data, this test does not have sufficient evidence to prove clinical reasonableness and necessity and will be considered non-covered in Medicare patients.
PancraGEN – Interpace Diagnostics
PancraGEN (also known as Pathfinder TG and integrated molecular pathology [IMP]) has received multiple updates to its input data and algorithmic categorization of risk since its initial release. Comparison of early example reports to the most recent example report (available on the PancraGEN website) clearly demonstrates this evolution.136,137 The current version of the PancraGEN report relies on algorithmic assessment of molecular data, cyst fluid test results, and radiologic findings to determine a patient’s risk of developing high grade dysplasia (HGD) and/or carcinoma. The algorithm is clearly described and diagrammed in the sample report.137 The algorithmic stratification of risk is heavily weighted toward molecular data, with the absence of “significant molecular alterations” automatically resulting in “Benign” categorization and the presence of 2 or more “significant molecular alterations” automatically resulting in “Aggressive” categorization. The “Benign” category confers a “97% probability of benign disease over the next 3 years” and the “Aggressive” category confers a “91% probability of HGD/carcinoma”. Per the sample report, 5 “significant molecular alterations” are described137:
- “High levels of DNA”
- “High clonality KRAS point mutation”
- “High clonality GNAS point mutation”
- “Single high clonality LOH tumor suppressor gene mutation”
- “Two or more low clonality LOH tumor suppressor gene mutations”
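The categorization logic described in the sample report can be summarized in the sketch below. Only the "Benign" (no significant alterations) and "Aggressive" (2 or more) rules are stated in the report as quoted above; how a single alteration is combined with the non-molecular inputs to yield the intermediate categories is an assumption made here purely for illustration, and the category split on an assumed "other worrisome features" flag does not represent the proprietary algorithm.

```python
# Minimal sketch of the PancraGEN risk stratification as described in the
# sample report; intermediate handling is an illustrative assumption.

SIGNIFICANT_ALTERATIONS = {
    "high_dna_level",
    "high_clonality_kras",
    "high_clonality_gnas",
    "single_high_clonality_loh",
    "two_or_more_low_clonality_loh",
}

def pancragen_category(alterations: set, other_worrisome_features: bool) -> str:
    n = len(alterations & SIGNIFICANT_ALTERATIONS)
    if n == 0:
        return "Benign"        # reported 97% probability of benign disease over 3 years
    if n >= 2:
        return "Aggressive"    # reported 91% probability of HGD/carcinoma
    # Exactly one alteration: assumed here to be split by non-molecular inputs.
    return "Statistically Higher Risk" if other_worrisome_features else "Statistically Indolent"

print(pancragen_category({"high_clonality_kras"}, other_worrisome_features=False))
```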
The current body of literature does not support the clinical validity of the PancraGEN test. Since the algorithm primarily categorizes risk via “significant molecular alterations” and the “Benign” category is defined as the absence of these alterations, failing to test for any of the above 5 alterations would result in an underestimation of patients with a potentially higher risk of HGD/carcinoma. Therefore, adequate assessment of the clinical validity of the current version of PancraGEN would require assessment of all 5 alterations in study populations. None of the 4 studies assessing the current version of PancraGEN fully assesses all 5 alterations.55,87-89 For example, in the 3 retrospective studies derived from the National Pancreatic Cyst Registry, a significant number of patients (“468/492 IMP diagnoses”) were NOT tested for GNAS because their data were collected from earlier versions of the PancraGEN test that did not include GNAS testing.55,87,88 Another study, from Khosravi and colleagues, addresses the 4 categorical results (simplifying them into 2 categories for the paper: low and high risk) but does not discuss GNAS or the clonality of identified mutations. As a result, PancraGEN studies lack the statistical integrity required to establish the clinical validity of the current version of PancraGEN.
Additionally, there are no studies supporting the clinical utility of the PancraGEN test.
First, there are no prospective studies for PancraGEN that directly evaluate its effect on patient management and outcome. The 4 studies evaluating the current version of PancraGEN are all retrospective, assessing patient populations who received PancraGEN testing as part of their clinical care; however, the assessment of PancraGEN’s effect on patient management and outcome is extrapolated from reading patient charts after the fact.
Second, cysts with the potential to develop into pancreatic cancer, like an intraductal papillary mucinous neoplasm (IPMN), can take over a decade to become malignant.138-140 Thus, when PancraGEN categorizes a specimen as “Benign” or “Statistically Indolent” with a “97% probability of benign disease over the next 3 years”, the results may provide patients with a false sense of security and/or delay instituting a longer-term follow-up plan, potentially resulting in patient harm. Moreover, none of the studies available for PancraGEN followed their entire patient populations for over 10 years. In fact, the “97% probability of benign disease over the next 3 years” is based on a 492-patient 2015 study from Al-Haddad and colleagues in which patients were followed for 23 months to 7 years and 8 months. Notably, 54% of the patients were followed for less than 3 years.
Third, current society and expert guidelines do not endorse or mention the current version of PancraGEN as necessary in the work-up of pancreatic cysts,73,80,140,141 which further demonstrates the lack of evidence for PancraGEN’s clinical utility.
In summary, the body of literature for PancraGEN is insufficient to establish both clinical validity and clinical utility.
DecisionDx-SCC – Castle Biosciences
Cutaneous squamous cell carcinoma (cSCC) is a malignant process arising from keratinocytes in the epidermis. In patients with light pigmentation, cSCC typically develops in areas of photodamaged skin. However, in patients with darkly pigmented skin, such as those with African ancestry, the most common sites at which cSCC develops include the lower legs, anogenital regions, and areas of chronic inflammation or scarring, suggesting that ultraviolet radiation may not be an important etiologic factor.142 A skin biopsy is required to establish the diagnosis and provide information essential for staging and management. Once the diagnosis is made, assessing the risk of locoregional recurrence and regional or distant metastases is critical to informing management. However, there is no consensus on the specific clinicopathologic characteristics defining high-risk cSCC. Moreover, the staging systems of national consensus organizations, such as the National Comprehensive Cancer Network (NCCN), the American Joint Committee on Cancer (AJCC), and Brigham and Women’s Hospital (BWH), have dissimilar criteria. These staging systems are notable for their low positive predictive value (PPV) (14-38%), resulting in many patients being categorized as high risk who do not develop advanced disease.143-147 DecisionDx-SCC is a genomic test that was developed to predict metastatic risk for cSCC patients with one or more risk factors. The test classifies patients as low (Class 1), higher (Class 2A), or highest (Class 2B) biological risk of metastasis.96
In evaluating DecisionDx-SCC, much of the analysis focused on 3 clinical validation studies.96,99,100 Wysong et al developed a 40-GEP test, DecisionDx-SCC, incorporating changes in gene expression of 34 metastasis-associated genes and 6 control genes to improve risk stratification of patients at high risk for metastatic cSCC disease.96 The intended use of this mRNA test is to assess patients with known localized, invasive cSCC disease and any single clinicopathological feature that would increase a patient's T-stage above T1 or render the patient NCCN high risk.
Ibrahim et al provided some additional data to support these findings, albeit using most of the same patients included in the Wysong study. The sensitivity of a 40-GEP Class 2B versus Class 1/2A result was 19%, specificity was 96.9%, PPV was 52.2%, and NPV was 87.2%, all of which are comparable to the values reported in the Wysong study. The authors concluded that the 40-GEP test demonstrated significant prognostic value and that risk classification was improved by integrating the 40-GEP results with clinicopathologic risk factor-based assessment.100
Lastly, Aaron and colleagues published a subgroup analysis derived from the same set of patients examined by Wysong and Ibrahim; however, this time the authors focused on patients with cSCC of the head and neck. They again concluded that the 40-GEP test enhanced the accuracy of predicting metastatic risk (p < .02) when combined with the AJCC8 or BWH staging systems.99 While these studies are notable for their consistent results, they use essentially the same set of patients, thereby adding no incremental certainty to their conclusions.
Insufficient evidence to prove analytic validity
DecisionDx-SCC is a laboratory-developed test (LDT) and thus is not regulated by the FDA. Fundamentally, DecisionDx-SCC is a GEP that analyzes 34 genes of interest (considered by Castle Biosciences to be significantly relevant to the prognosis of cSCC) and 6 control genes. Additionally, the literature has proposed several other genes (e.g., PLAUR, MMP1, MMP10, MMP13, TIMP4, and VEGFA) implicated in driving metastasis in cSCC that are not part of the 40-GEP panel.147
Although the stated intended use of DecisionDx-SCC is as a prognostic test in patients with known invasive cSCC disease, no literature has proven that the test accurately predicts metastatic disease risk or provides clinically meaningful or actionable information. Though Wysong, Ibrahim, and Aaron et al concluded that prognostic accuracy was improved using DecisionDx-SCC in comparison to widely accepted clinicopathologic staging systems (e.g., AJCC8 and BWH), the retrospective trial design substantively diminished the certainty of their results.96,99,100 Moreover, no prospective randomized clinical trials have been published to date showing improvement in patient-centered outcomes from using the 40-GEP test results to guide management versus standard of care (SOC).
Insufficient evidence to prove clinical validity
Understanding the potential pitfalls of GEP testing is critical for understanding the reliability, performance, and accuracy of a GEP test. Eighteen of the 34 discriminant genes in the 40-GEP signature do not have an established role in cSCC biology; therefore, there is only indirect, uncorroborated evidence of their significance and how they contribute to the progression of cSCC.96 The authors state that ‘future studies have the potential to identify how these genes promote cSCC metastasis.’96 Moreover, the clinical validation studies published to date are notable for their observational retrospective design, leaving the results vulnerable to confounding variables and significant bias. As a result, clinical validation study results for DecisionDx-SCC have a low level of certainty. Moreover, all validation studies were funded by the test manufacturer, compounding the risk of bias and conflict of interest. Independent, prospective comparative or randomized clinical trials are necessary to enhance the quality of the available literature.
Patient population is not generalizable to Medicare beneficiaries
Clinical validation studies, all using the same study population, included mostly male patients (73%) of non-Hispanic (97%), “White” ancestry (99.7%).96,99,100 Therefore, there is a low level of certainty that the performance of the DecisionDx-SCC GEP is applicable to the diverse Medicare patient population. For instance, given the significant differences in presentation and disease progression in patients with African ancestry, there is insufficient evidence to determine whether the GEP signature and algorithm (developed in a distinctly homogenous non-Hispanic, “White” patient population) are applicable to Medicare beneficiaries with Asian, Hispanic, or African ancestry.
Insufficient evidence for clinical utility
Several clinical utility studies have been published; however, none are prospective randomized (or comparative) clinical trials demonstrating patient-centered outcomes attributable to management decisions (e.g., de-escalation of surveillance) based on DecisionDx-SCC test results. Instead, the published clinical utility literature for DecisionDx-SCC consists of physician surveys incorporating selected patient scenarios, as opposed to patients directly under the physicians’ care. The reported results suffer from a high risk of bias and a low level of certainty, related not only to the study design but also to conflicts of interest due to funding by the manufacturer.
Aside from the paper from Au and colleagues, papers addressing clinical utility included surveys, a panel review, and literature reviews.90-95 These papers had several shortcomings and limitations, including, but not limited to:
- A high likelihood of selection and response bias in the surveys
- No description of survey participant recruitment methods
- An expert panel composed of Castle Biosciences employees, consultants, and researchers
- Respondents not treating the example patients themselves
- Hand-picked survey cases
Based on these factors, there is insufficient evidence to determine the clinical utility for DecisionDx-SCC.
Molecular diagnostic testing for cSCC is not endorsed by specialty societies (e.g., NCCN)
There are no published society guidelines or widely accepted consensus statements published in peer-reviewed journals endorsing the use of DecisionDx-SCC to inform management of cSCC. NCCN guidelines do not include a recommendation for molecular diagnostic testing to inform diagnosis, staging, or treatment.148
In summary, the body of peer-reviewed literature concerning DecisionDx-SCC is insufficient to establish the analytic validity, clinical validity, and clinical utility of this test in a population of patients analogous to Medicare beneficiaries. As such, this test does not currently meet reasonable and necessary criteria for Medicare patients.
UroVysion fluorescence in situ hybridization (FISH) – Abbott
The product page on Abbott Molecular’s website states that the uFISH test “is designed to detect aneuploidy for chromosomes 3, 7, 17, and loss of the 9p21 locus via fluorescence in situ hybridization (FISH) in urine specimens from persons with hematuria suspected of having bladder cancer.”149 UroVysion fluorescence in situ hybridization (uFISH) is often compared to urine cytology, and the manufacturer specifically states that the uFISH test has “greater sensitivity in detecting bladder cancer than cytology across all stages and grades.” A positive result is defined by the manufacturer as 4 or more of 25 analyzed cells showing gains of 2 or more of chromosomes 3, 7, or 17 in the same cell, or 12 or more of 25 cells having no detectable 9p21 signal. However, not all bladder cancers have these alterations, and these chromosomal changes can also be seen occasionally in healthy tissues and other types of cancer, as noted by Bonberg and colleagues (2014) and Ke and colleagues (2022).103,112 Additionally, genomic profiles and chromosomal abnormalities can vary between low grade and high grade bladder cancer, which can make the detection of low grade cancer less likely.
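The manufacturer's positivity rule, as quoted above, can be summarized in the sketch below. The data structure and names are illustrative, and the interpretation of a "gain" as more than 2 signals per chromosome is an assumption stated here for clarity; this is not a reproduction of Abbott's scoring software.

```python
# Minimal sketch of the stated UroVysion (uFISH) positivity rule: among 25
# analyzed cells, >=4 cells with gains of 2 or more of chromosomes 3, 7, or 17
# in the same cell, OR >=12 cells with no 9p21 signal.
from dataclasses import dataclass

@dataclass
class Cell:
    chr3: int   # signal count for chromosome 3
    chr7: int   # signal count for chromosome 7
    chr17: int  # signal count for chromosome 17
    p9p21: int  # signal count for the 9p21 locus

def ufish_positive(cells: list) -> bool:
    assert len(cells) == 25, "rule is stated for 25 analyzed cells"
    # Assumed interpretation: a "gain" means more than the normal 2 signals.
    polysomic = sum(
        1 for c in cells
        if sum(signal > 2 for signal in (c.chr3, c.chr7, c.chr17)) >= 2
    )
    homozygous_9p21_loss = sum(1 for c in cells if c.p9p21 == 0)
    return polysomic >= 4 or homozygous_9p21_loss >= 12

normal = [Cell(chr3=2, chr7=2, chr17=2, p9p21=2) for _ in range(25)]
print(ufish_positive(normal))  # False: no polysomy, no 9p21 loss
```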
Clinical validation limits role due to low positive predictive value
As noted by Lavery and colleagues in 2017 and Mettman and colleagues in 2021, much of the literature assessing uFISH uses a variety of definitions for positivity.117,119 Lavery aimed to overcome this shortcoming by using a strict definition for a positive uFISH test that combined the manufacturer’s definition with the addition of “tetraploidy in at least 10 morphologically abnormal cells.”117 Tetraploidy can be seen in normal cell division and in other non-cancerous processes, so this addition was made to account for false-positive results from the uFISH test. The blinded study described in the paper found no significant difference between uFISH and urine cytology, with sensitivities of 67% and 69% and specificities of 72% and 76%, respectively. Additionally, the authors found that inclusion of the tetraploidy requirement in their definition effectively reduced false-positive rates, but they also determined that some bladder cancer tumors do not have the chromosomal alterations for which uFISH assesses (30% of the tumors tested by the authors). Mettman similarly attempted to increase the accuracy of the uFISH test by including tetraploidy in their positivity definition.119 The authors reported considerably different results than the paper from Lavery, with the sensitivity of uFISH ranging from 58-95% depending on the definition used and a specificity of 99% for each definition. However, the study specifically evaluated the test in patients suspected of having pancreatobiliary stricture malignancies, which could account for the differences seen between the 2 papers.
Sassa and colleagues (2019) compared the uFISH test to urine cytology in 113 patients prior to nephroureterectomy and 23 volunteers with no history of urothelial carcinoma.122 In cases of high-grade urothelial carcinoma (HGUC), the sensitivity, specificity, PPV, and NPV for detection by urinary cytology were 28.0%, 100.0%, 100.0%, and 31.6%, respectively. For uFISH, these values were 60.0%, 84.0%, 93.8%, and 41.2%, respectively. In cases of low-grade urothelial carcinoma (LGUC), however, the results were markedly worse, with sensitivities for both UroVysion and urine cytology of only 30%.
Other observational studies identified included 2 cohort studies from Nagai and colleagues (2019) and Gomella and colleagues (2017), a case-control study from Freund and colleagues (2018), and a cross-sectional study from Todenhöfer and colleagues (2014).113,114,121,123 Each of these studies reported similar results and limitations to the papers described above. Additionally, Breen and colleagues (2015)34 evaluated uFISH in a comparative study with other tests used to detect urothelial carcinoma in urine. The other tests included Cxbladder Detect, cytology, and NMP22. The study utilized 5 cohorts of patients, only 1 of which evaluated all 4 tests for the entire cohort. Data from the 5 cohorts were evaluated and integrated, with several different imputation analyses utilized to fill in missing test values and create a “new, imputed, comprehensive dataset.” The authors reported that before imputation uFISH had a sensitivity of 40% (the lowest of the 4 tests) and a specificity of 87.3% (the second lowest of the 4 tests). The several imputation methodologies yielded similar comparative sensitivities and specificities, which the authors took as evidence that the imputed datasets were valid; the best-performing methodology was the 3NN model. In this 3NN model, uFISH had considerably lower sensitivity than the other 3 tests and lower specificity than 2 of the 3 tests.
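The “3NN model” referenced by Breen and colleagues presumably denotes a 3-nearest-neighbor imputation; their exact implementation is not reproduced here. The sketch below shows only the general form of such an imputation, using scikit-learn’s KNNImputer with hypothetical values, to make clear that every imputed test result is an estimate borrowed from the most similar patients rather than a measured value.

```python
# Generic sketch of 3-nearest-neighbor imputation (not Breen and colleagues'
# actual code); the matrix below uses made-up values purely for illustration.
import numpy as np
from sklearn.impute import KNNImputer

# Rows = patients; columns = test results (e.g., Cxbladder Detect score,
# cytology, NMP22, uFISH). np.nan marks a test not performed in that cohort.
results = np.array([
    [0.82, 1.0, 5.1, np.nan],
    [0.10, 0.0, np.nan, 0.0],
    [np.nan, 1.0, 9.8, 1.0],
    [0.35, 0.0, 2.2, 0.0],
    [0.77, 1.0, 7.4, 1.0],
])

# Each missing entry is filled with the average of that column among the
# 3 patients whose observed results are closest to this patient's.
imputed = KNNImputer(n_neighbors=3).fit_transform(results)
print(imputed)
```

Sensitivity and specificity estimated from such a completed dataset therefore depend on the imputation assumptions, which is why the cross-check across imputation methodologies reported by the authors matters.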
In recent years, other authors have conducted reviews and meta-analyses to better address the clinical validity of uFISH and other urinary biomarkers in general. In 2022, Zheng and colleagues published a meta-analysis and review that assessed the prognostic value of uFISH for detecting recurrence during surveillance of non-muscle invasive bladder cancer (NMIBC).111 They identified 15 studies from 2005-2019 that met their inclusion criteria and determined in their meta-analysis that the pooled sensitivity of uFISH in detecting recurrence was 68% (95% CI: 58-76%) and the pooled specificity was 64% (95% CI: 53-74%).
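For readers unfamiliar with how a pooled estimate such as Zheng’s is produced, the sketch below shows a deliberately simplified fixed-effect pooling of study-level sensitivities on the logit scale. Published diagnostic meta-analyses generally use more sophisticated (e.g., bivariate random-effects) models, and the study counts below are invented for illustration only; none of the numbers correspond to studies in Zheng’s analysis.

```python
# Simplified illustration of inverse-variance pooling of sensitivities on
# the logit scale. Hypothetical study counts; not data from Zheng et al.
import math

# (true positives, total patients with recurrence) for 4 hypothetical studies
studies = [(30, 45), (22, 30), (51, 80), (18, 30)]

weights, logits = [], []
for tp, n in studies:
    fn = n - tp
    logit = math.log(tp / fn)      # log-odds of the study's sensitivity
    var = 1.0 / tp + 1.0 / fn      # approximate variance of that log-odds
    logits.append(logit)
    weights.append(1.0 / var)

pooled_logit = sum(w * l for w, l in zip(weights, logits)) / sum(weights)
se = math.sqrt(1.0 / sum(weights))

def inv_logit(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

print(f"pooled sensitivity {inv_logit(pooled_logit):.2f} "
      f"(95% CI {inv_logit(pooled_logit - 1.96 * se):.2f}"
      f"-{inv_logit(pooled_logit + 1.96 * se):.2f})")
```

The point of the sketch is that a pooled figure inherits the definitions and case mixes of its component studies, which is why the heterogeneous positivity definitions noted above limit how much weight any single pooled estimate can carry.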
Sciarra and colleagues (2019) conducted a systematic review to evaluate the diagnostic performance of urinary biomarkers for the initial diagnosis of bladder cancer.109 The review identified 12 studies addressing uFISH, with a combined sample size of 5,033 uFISH test results. The mean sensitivity was 64.3% and the median was 64.4%, with a range of 37-100%. Additionally, the mean specificity was 88.4% and the median was 91.3%, with a range of 48-100%.
Another recent paper identified was from Soputro and colleagues (2022), who conducted a literature review and meta-analysis to evaluate the diagnostic performance of urinary biomarkers for detecting bladder cancer in primary hematuria.110 The authors identified only 2 studies assessing uFISH that met their inclusion criteria. The pooled sensitivity and specificity of uFISH in the identified studies were 0.712 and 0.818, respectively. The authors noted that the “current diagnostic abilities of the FDA-approved biomarkers remain insufficient for their general application as a rule out test for bladder cancer diagnosis and as a triage test for cystoscopy in patients with primary hematuria.”110
Sathianathen and colleagues also conducted a literature review and meta-analysis to evaluate the performance of urinary biomarkers in the evaluation of primary hematuria.108 The authors identified only 1 paper addressing uFISH that met their inclusion criteria; that paper found uFISH to be comparable to the other biomarker tests evaluated. However, because only 1 paper met the inclusion criteria, the findings regarding uFISH could not be properly assessed.
The most recent meta-analysis identified was written by Papavasiliou and colleagues (2023), who assessed the diagnostic performance of urinary biomarkers potentially suitable for use in primary and community care settings.107 The authors identified 10 studies addressing the diagnostic performance of uFISH published between 2000 and 2022. These studies reported a wide range of sensitivities (0.38-0.96) but a narrower range of specificities (0.76-0.99).
Three additional literature reviews were identified from Bulai and colleagues (2022), Miyake and colleagues (2018), and Nagai and colleagues (2021). Each of these papers noted significant issues with the literature supporting these biomarkers in general, and uFISH in particular, but each also lacked unambiguous inclusion criteria, defined search methods, and other information necessary to validate their assessments.102,105,106
Clinical utility is limited to use as an adjunctive diagnostic test
UroVysion fluorescence in situ hybridization (uFISH) has also been assessed as a prognostic test for the recurrence of bladder cancer and as a means of identifying recurrence sooner. A paper from Guan and colleagues in 2018 evaluated the value of uFISH as a prognostic risk factor for bladder cancer recurrence and survival in patients with upper tract urothelial cancer (UTUC).115 One hundred and fifty-nine patients in China received a uFISH test prior to surgery and were then monitored for recurrence. While the authors did indicate that there was a relationship between uFISH results and recurrence, the results were non-significant (p=.07). Liem and colleagues (2017) conducted a prospective cohort study to evaluate whether uFISH can be used to identify recurrence early during treatment with Bacillus Calmette–Guerin (BCG).118 During the study, 3 bladder washouts at different time points during treatment (t0 = week 0, pre-BCG; t1 = 6 weeks following transurethral resection of bladder tumor [TURBT]; t2 = 3 months following TURBT) were collected for uFISH from patients with bladder cancer who were treated with BCG. The authors found no significant association between a positive uFISH result at t0 or t1 and recurrence but found that a positive uFISH result at t2 was associated with a higher risk of recurrence. Additionally, in 2020, Ikeda and colleagues published a paper that aimed to evaluate the relationship between uFISH test results following TURBT and subsequent intravesical recurrence.114 They indicated that uFISH test positivity was a prognostic indicator for recurrence following TURBT. However, the recurrence rate in patients with 2 positive uFISH tests was only 33.3%, and in patients with 1 positive uFISH test (out of 2 tests total) the recurrence rate was only 16.5%.
Limited patient follow-up was a repeated weakness in papers evaluating uFISH to detect or predict recurrence. For example, the paper from Guan had a median follow-up of 27 months (range: 3-55 months), the paper from Liem had a median follow-up of 23 months (range: 2-32 months), and the paper from Ikeda had a median follow-up of 27 months (range: 1-36.4 months).115,116,118 These ranges indicate that at least 1 patient was followed for only 1 month, and, by definition, at least half of all patients in each study had less than the median follow-up time. This limited follow-up means that cases of recurrence were likely overlooked in the studies. Even in cases where shorter follow-up may have been due to the early detection of recurrence, the lack of continued follow-up could result in overlooking a patient with reduced survival following a recurrence; this additional information would be relevant to the prognostic claims for uFISH.
Only 2 identified papers significantly addressed the clinical utility of the uFISH test: Guan (2018) and Meleth (2014).104,115 Guan noted that they did not find any association between a positive uFISH test and survival; however, as noted above, limited follow-up was a significant shortcoming of their study.115 Meleth and colleagues conducted a review of the available literature and were unable to find any papers meeting their inclusion criteria that directly assessed patient survival, physician decision-making, or downstream health outcomes in relation to uFISH test results.104 This lack of information regarding clinical utility is notable; without studies assessing for improvement in patient outcomes in a real-world setting, the clinical utility of uFISH cannot be established.
It is also important to note that no studies were identified establishing that uFISH can accurately distinguish between urothelial carcinoma and other cancers or non-cancerous urological conditions. As noted above, the specific chromosomal changes that uFISH uses to identify urothelial carcinoma have been identified in non-cancerous tissues and in other types of carcinoma. This notable gap in the identified research included a lack of details or definitions for non-urothelial cancers, many of which can shed cells into the urinary tract, including prostate cancers, renal cancers, and metastatic or locally invasive cancers from other organs. Because the chromosomal changes that uFISH uses to identify urothelial carcinoma can also be found in other malignancies and in non-malignant conditions, and because their identification in urine may not coincide with clinically detectable (e.g., cystoscopically visible) carcinoma, false-positive results can create confusion, especially when the PPV of uFISH tends to be low. If numerous false-positive results are accepted as an inherent trait of the test, providers may not be as vigilant in closely following patients with a positive uFISH result after a negative cystoscopy. In addition, providers may not search for other malignancies as a potential cause of the “false positive” uFISH result.
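The dependence of PPV on disease prevalence, which underlies the concern above, can be made concrete with a standard calculation. The sketch below uses round sensitivity and specificity values in the vicinity of the means reported by Sciarra and colleagues; the prevalence figures are purely illustrative and do not come from any single study.

```python
# Illustrative only: how PPV and NPV shift with prevalence for fixed test
# accuracy. Sensitivity/specificity are round values near the means reported
# by Sciarra and colleagues; the prevalences are hypothetical.

def ppv(sens: float, spec: float, prev: float) -> float:
    true_pos = sens * prev
    false_pos = (1 - spec) * (1 - prev)
    return true_pos / (true_pos + false_pos)

def npv(sens: float, spec: float, prev: float) -> float:
    true_neg = spec * (1 - prev)
    false_neg = (1 - sens) * prev
    return true_neg / (true_neg + false_neg)

sens, spec = 0.65, 0.88
for prev in (0.05, 0.20, 0.80):
    print(f"prevalence {prev:.0%}: PPV {ppv(sens, spec, prev):.0%}, "
          f"NPV {npv(sens, spec, prev):.0%}")
```

At the low prevalences typical of a hematuria population, a large share of positive results are false positives, whereas in case-enriched cohorts such as Sassa’s the same test shows a high PPV and a low NPV; the low-PPV scenario is the one relevant to the concern that providers may dismiss positive results after a negative cystoscopy.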
AUA/SUO endorses limited role in diagnosis of bladder cancer and surveillance
The most current version of the NCCN guidelines promoting best practice for diagnosing and managing bladder cancer does not include the use of urinary biomarkers at all.24 The guidelines for diagnosis and management of bladder cancer published by the American Urological Association and the Society of Urologic Oncology (AUA/SUO)26,46 significantly limit the use of urinary biomarkers for any purpose. While the AUA/SUO guidelines endorse the use of UroVysion, they assign it a low strength of recommendation based on expert opinion, and only for specific clinical circumstances, including indeterminate or nondiagnostic cytology and surveillance for recurrence after BCG administration.26
Colvera – Clinical Genomics
In April 2015, Pedersen and colleagues published a validation paper describing a blood test that would later be named Colvera.125 The test was designed to identify 2 methylated genes, branched-chain amino acid transaminase 1 (BCAT1) and ikaros family zinc finger protein 1 (IKZF1). Clinical Genomics had previously identified both genes as important in screening for colorectal cancer (CRC). The study used methylation-specific PCR assays to measure the level of methylated BCAT1 and IKZF1 in DNA extracted from plasma obtained from 144 colonoscopy-confirmed healthy controls and 74 CRC cases. The authors found that their test was positive in 77% of cancer cases and 7.6% of controls. This study, however, failed to sufficiently address many pre-analytic variables, such as the protocols for pathologic review and the ultimate diagnosis of the patients from whom plasma samples were obtained.
Later that same year, another validation paper (also led by Pedersen) was published.126 This cohort study used both prospective and retrospective methods to collect plasma samples from 2,105 volunteers and reported a test sensitivity of 66% (95% CI: 57%–74%). For CRC stages I-IV, the respective positivity rates were 38% (95% CI: 21%–58%), 69% (95% CI: 53%–82%), 73% (95% CI: 56%–85%), and 94% (95% CI: 70%–100%). Specificity was 94% (95% CI: 92%–95%) in all 838 cases with non-neoplastic pathology and 95% (95% CI: 92%–97%) in those with no colonic pathology detected (n = 450). It is important to note that case diagnosis was performed by 1 independent physician and that there were no protocol controls for the colonoscopy or pathology procedures. The authors stated that this was due to their aim “to assess marker performance relative to outcomes determined in usual clinical practice.”
An additional validation paper was published by Murray and colleagues in 2017, which assessed both the analytic and clinical validity of the Colvera blood test.128 The authors reported using archived samples from the previous study by Pedersen and colleagues (n=2,105 samples), but used only a subset of these archived samples (n=222 specimens, 26 with cancer).125,126 The authors did not describe the selection criteria for these samples, namely, whether sample selection was randomized or why the majority of the archived specimens were not selected. Murray and colleagues found that the Colvera test had good reproducibility and repeatability, with a reported sensitivity of 73.1% (95% CI: 52.2%–88.4%) and specificity of 89.3% (95% CI: 84.1%–93.2%). In addition to questions regarding sample selection, other questions were left unanswered in the paper, including, but not limited to:
- Does the accuracy of the test vary in different stages of cancer?
- Does treatment (such as chemotherapy/radiation) impact the precision of the test?
- For apparent false positives, would a longer follow-up reveal them to be true positives?
- In general, would serial sampling or longitudinal data impact the precision estimates of the test?
An additional paper published in 2018 by Murray and colleagues sought to establish the clinical validity of the Colvera test.129 In the paper, the authors tested patients post-surgery (a median of 2.3 months after surgery) and followed them to establish whether or not recurrence was detected. Median follow-up for recurrence was 23.3 months, with an IQR of 14.3-29.5 months. Twenty-three participants were diagnosed with recurrence, but the Colvera test was positive in 28 participants. It should be noted that cancer treatment varied considerably between cases, even between patients with a positive Colvera test result and those with a negative result. Only 61% of patients with a positive Colvera result completed their initial course of treatment, compared with 87% of patients with a negative result. The authors state that this was due “to either patients declining ongoing therapy, or due to comorbidities or complications precluding a full course of treatment.”129 This could have significantly confounded the results, given the higher likelihood of recurrence in patients who did not receive a full course of treatment compared with those who did. Additionally, while the median follow-up was 23.3 months, half of the patients had a shorter follow-up than the median, and without long-term follow-up, additional cases of recurrence were likely missed.
Five other papers, all cohort studies, were identified that assessed the clinical validity of the Colvera test, in particular Colvera’s performance compared to carcinoembryonic antigen (CEA) and/or fecal immunochemical tests (FIT).130-134 These papers from Clinical Genomics found the sensitivity of Colvera to be 62-68% (with wide 95% CIs, including 48%–84% in one study and 42.4%–80.6% in another) and the specificity to be 87-97.9%, values better than the sensitivity and specificity reported for the CEA and FIT tests.
Young and colleagues (2016) assessed 122 patients being monitored for recurrent CRC (28 of whom had confirmed recurrence) to determine whether Colvera or CEA was more accurate.134 The study obtained blood samples only within 12 months prior to, or 3 months following, verification of a patient’s recurrence status. This method of determining test accuracy was problematic, in particular because the follow-up lengths varied considerably between patients. In patients with confirmed recurrence, the median follow-up was 28.3 months, with an IQR of 21.9-41.0 months. In patients without confirmed recurrence, the median follow-up was only 17.3 months, with an IQR of 12.0-29.2 months. This indicates that the majority of “confirmed” cases of no recurrence were followed for less time than the median follow-up in recurrent cases. Without an adequate length of follow-up, it is certainly possible that cases of recurrence were missed. Additionally, while the authors did report some longitudinal data (the concordance of test results taken from the same patient at different times), those data were limited to only 30 of the 122 cases. Of the cases that did have longitudinal data, multiple were reported to have false-positive test results. When combined with the insufficient follow-up already discussed, the likelihood of incorrectly identified false-positive tests increases considerably.
Musher and colleagues (2020) and Symonds and colleagues (2020) also compared Colvera to CEA for detecting recurrent CRC.130,133 Musher, similar to the paper from Young (2016), had short follow-up periods and insufficient longitudinal data (median follow-up was 15 months, range: 1-60 months).130,134 Symonds (2020), however, did obtain relatively more longitudinal data and longer follow-up periods, and showed that Colvera could return a positive result months prior to imaging confirmation of recurrence.133 However, without any assessment of the impact of test results on clinical outcomes, the utility of the test cannot be ascertained. Also, in the papers from Young, Musher, and Symonds, CEA sensitivity was considerably lower than normally reported in other literature (32%, 48%, and 32%, respectively).130,133,134 While not a direct reflection on the validity or utility of Colvera, it is important to note this discrepancy since the authors were comparing Colvera to the CEA test.
Two additional cohort studies evaluating Colvera, Symonds and colleagues (2016) and Symonds and colleagues (2018), had similar findings and shortcomings as the 3 studies described above, with test sensitivities of 62% in both papers and similar study designs.131,132
One other study evaluating Colvera, Pedersen and colleagues (2023), prospectively followed 142 patients who had been treated for CRC to assess the clinical validity of Colvera in identifying the patients most likely to develop recurrence.127 The follow-up periods were longer than in the previously described studies, with a median follow-up of 4.2 years (IQR 2.7-6.5) overall and a median follow-up of 5.3 years (IQR 3.7-6.9) in the recurrence-free group. Of the 142 patients followed, 33 developed recurrence. Only 9 (27%) of those patients had positive Colvera results, while the remaining 24 (73%) had negative results. An additional 10 patients who had positive Colvera results did not have an identified recurrence during the follow-up period. The hazard ratio for recurrence in Colvera-positive cases was 5.7 (95% CI: 1.9–17.3, p = 0.002). The 3-year recurrence-free survival was 56.5% and 83.3% for Colvera-positive and Colvera-negative cases, respectively.
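The counts quoted above from Pedersen and colleagues can be rearranged into a simple 2x2 computation to make the test’s performance in this cohort explicit. The specificity and PPV below are implied by subtraction and are not figures reported by the authors; they also treat every patient without an identified recurrence as a true negative, which is subject to the same follow-up limitation noted for the other studies.

```python
# Back-of-the-envelope recomputation from the counts quoted above
# (142 patients, 33 recurrences, 9 true-positive Colvera results,
# 10 Colvera-positive patients with no identified recurrence).
# Implied values only; not reported by the authors.

total_patients = 142
recurrences = 33
true_pos = 9
false_pos = 10

false_neg = recurrences - true_pos                     # 24 recurrences missed
true_neg = (total_patients - recurrences) - false_pos  # 99 of 109 without recurrence

sensitivity = true_pos / (true_pos + false_neg)   # 9/33  ~ 27%
specificity = true_neg / (true_neg + false_pos)   # 99/109 ~ 91% (implied)
ppv = true_pos / (true_pos + false_pos)           # 9/19  ~ 47% (implied)
print(f"sensitivity {sensitivity:.0%}, specificity {specificity:.0%}, PPV {ppv:.0%}")
```

In other words, even with the longest follow-up among the Colvera studies reviewed, fewer than half of positive results corresponded to an identified recurrence, and nearly three quarters of recurrences were missed.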
Finally, the paper from Cock and colleagues (2019) assessed the precision of both Colvera and FIT testing in the detection of sessile serrated adenomas/polyps (SSPs).124 For this study, the authors used the same samples that were used in Symonds and colleagues (2016).131 While the paper addressed pre-analytic variables and other shortcomings more sufficiently than the previously discussed studies, the results do not support the use of Colvera for the detection of SSPs. Forty-nine SSPs were identified during the colonoscopies of 1,403 participants who were also tested with the Colvera test. In those patients with SSPs, the Colvera test had a sensitivity of only 8.8%, and when combined with FIT, the sensitivity increased only to 26.5%.
Notably, there are no studies assessing patient outcomes or clinician treatment decisions in a real-world setting following a Colvera test. Without such data, clinical utility cannot be determined. One of the key factors in determining clinical utility is a test’s impact on patient outcomes. For example, a demonstration of clinical utility could be accomplished in a clinical trial where patients’ overall survival is compared between patients tested with Colvera and patients managed without this test. To date, such a trial has not been performed for Colvera.
In general, papers assessing the validity of the Colvera test for CRC have a number of shortcomings, including short follow-up times, large confidence intervals, insufficient longitudinal data, insufficient description of study methodology, and a failure to sufficiently address important pre-analytic variables. Given these limitations, the certainty of evidence regarding the analytic and clinical validity of Colvera testing is low. Additionally, no paper has been published establishing the clinical utility of Colvera; a test that does not improve patient outcomes is not clinically useful for the purposes of Medicare coverage.
In summary, the body of peer-reviewed literature concerning Colvera is insufficient to establish the analytic validity, clinical validity, and clinical utility of this test in the Medicare population. As such, this test does not currently meet reasonable and necessary criteria for Medicare patients and will not be covered.
PancreaSeq® Genomic Classifier, Molecular and Genomic Pathology Laboratory, University of Pittsburgh Medical Center
PancreaSeq Genomic Classifier is currently unproven in both its clinical validity and clinical utility. Nikiforova and colleagues concluded in their own retrospective study, which utilized archived specimens that were 3-10 years old:
“Prospective testing is therefore required to determine the true diagnostic performance of PancreaSeq GC and is currently underway. However, additional studies will be required to ascertain the optimal approach for PancreaSeq GC testing and how PancreaSeq GC should be incorporated into current and future pancreatic cyst guidelines.”135
Medicare coverage of the test is thus not reasonable and necessary, since, according to the study from Nikiforova and colleagues, research into the test’s clinical validity and utility remains underway and incomplete.135