Benchmarking AIVA: How a phenotype-aware agentic system improves variant classification
AIVA is not just an endpoint classifier, but also a phenotype-aware reasoner that raises the ceiling of automated variant interpretation
TL;DR: Rule-based variant classifiers hit a ceiling because they ignore context. AIVA incorporates phenotype, inheritance, and functional evidence into every classification. On FDA-approved eRepo data, this achieves an overall F1 of 80.5%—outperforming InterVar (60.6%) and BIAS-2015 (75.3%) across 11 disease categories.
Automated ACMG-based variant interpretation has become a foundational part of modern genomics pipelines. Many tools now offer automated classification, enabling quick, consistent scoring of large numbers of variants, and, for many use cases, they work well. But anyone who has spent time interpreting variants knows there is a ceiling to what these systems can do. Most of these tools are optimized for what is easiest to encode from a variant alone, not for what actually determines pathogenicity in a patient. Here, the distinction is between variants that are deleterious at the molecular level and those that are pathogenic in the context of a patient’s phenotype.
In practice, most systems rely on a narrow slice of the guideline framework, emphasizing population frequency thresholds, in silico predictions, variant consequences, and overlap with existing assertions. These signals are tractable, scalable, and variant-centric. What they largely do not capture are criteria that depend on context: whether a variant is de novo, whether a patient’s phenotype is specific to a known disease mechanism, how to weight functional evidence, or how to reason when evidence is partially supportive or internally conflicting. As a result, most automated tools are effectively answering a limited question: given this variant in isolation, how suspicious does it look? They are not designed to ask whether a variant makes sense for this patient or this disease.
AIVA (our AI clinical analyst agent) was built around that distinction. It reasons over the evidence in context, integrating phenotype, inheritance, and disease expectations when available. Users can explicitly specify phenotype, inheritance, and family history, so that classifications are evaluated in a clinical context rather than in isolation.
To understand what this change enables in practice, we benchmarked AIVA against two commonly used rule-based systems, BIAS-2015 (v2.1.1) and InterVar, on a curated set of 8,387 clinically classified variants from eRepo (ClinGen), spanning 11 disease categories (see Figure 1). Pathogenic/Likely Pathogenic (P/LP) variants account for 42% of the dataset (3,518/8,387), variants of uncertain significance (VUS) for 34% (2,851/8,387), and Benign/Likely Benign (B/LB) variants for 24% (2,018/8,387). The mix is not uniform across disease areas: for example, metabolic disorders are strongly P/LP-dominated, whereas categories such as cancer predisposition and neurodevelopmental disorders are B/LB-skewed. These compositional differences reflect the history of disease curation and the trends that have shaped it. They directly influence how AIVA is evaluated and how per-category performance metrics (including F1) should be interpreted.
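The class proportions quoted above follow directly from the reported counts. A minimal Python check, using only the numbers stated in this post (the variable names are ours):

```python
# Class counts reported for the eRepo benchmark set in this post.
counts = {"P/LP": 3518, "VUS": 2851, "B/LB": 2018}
total = sum(counts.values())

# Share of each class, rounded to the nearest whole percent.
shares = {label: round(100 * n / total) for label, n in counts.items()}

print(total)   # 8387
print(shares)  # {'P/LP': 42, 'VUS': 34, 'B/LB': 24}
```

Because no class holds a clear majority overall while individual disease categories are heavily skewed, a per-category F1 can be dominated by whichever class is most common in that category.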

Across all 8,387 variants, AIVA achieved an overall F1 score of 80.5%, compared with 75.3% for BIAS-2015 and 60.6% for InterVar, but the more interesting story appears when those results are broken down by class. AIVA outperforms the other two tools (see Figure 2) in classifying variants as Pathogenic (87.4%) and VUS (73.3%), but takes a conservative approach when classifying variants as Benign (80.9%). This conservatism is a result of AIVA performing an extensive and up-to-date literature review of the gene's function in the context of the given phenotype or disease.
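For readers who want to reproduce this kind of breakdown on their own data, here is a minimal sketch of per-class F1 computed one-vs-rest, followed by an unweighted (macro) mean. The function names and the toy labels are illustrative assumptions, not part of AIVA or the benchmark. The post does not state which averaging method produced the overall score; we only note that the macro mean of the three per-class values reported above comes out to 80.5.

```python
from collections import Counter

LABELS = ("P/LP", "VUS", "B/LB")

def per_class_f1(truth, predicted, labels=LABELS):
    """Return {label: F1}, treating each label one-vs-rest."""
    pairs = Counter(zip(truth, predicted))
    scores = {}
    for label in labels:
        tp = pairs[(label, label)]
        fp = sum(n for (t, p), n in pairs.items() if p == label and t != label)
        fn = sum(n for (t, p), n in pairs.items() if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores

def macro_f1(scores):
    """Unweighted mean of per-class scores."""
    return sum(scores.values()) / len(scores)

# Toy labels invented for illustration only (not benchmark data).
truth     = ["P/LP", "P/LP", "VUS", "VUS", "B/LB", "B/LB"]
predicted = ["P/LP", "P/LP", "VUS", "P/LP", "VUS", "B/LB"]
toy = per_class_f1(truth, predicted)
print(toy["P/LP"], toy["VUS"])  # 0.8 0.5

# Macro mean of the per-class values reported in this post.
reported = {"Pathogenic": 87.4, "VUS": 73.3, "Benign": 80.9}
print(round(macro_f1(reported), 1))  # 80.5
```

A macro mean weights each class equally regardless of size, so a weaker score on a smaller class pulls the overall number down as much as a weaker score on the largest class would.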

We also evaluated AIVA's performance by disease category. It outperforms the other classifiers in 9 of 11 categories (see Figure 2). The largest gains appear in hematologic disorders, immunodeficiencies, ophthalmologic conditions, and RASopathies, areas that are relatively well studied, with richer gene–phenotype associations and more structured clinical knowledge. In these domains, incorporating context enables AIVA to leverage information present in the literature and clinical records that is difficult to encode as fixed rules. On the other hand, for neurodevelopmental disorders, AIVA disagreed with historical benign classifications in several cases, flagging them for review based on recent literature. We saw this pattern across other categories too: variants historically curated as B/LB were reclassified to VUS or P/LP when AIVA recognized literature consistent with the disease mechanism. Whether this reflects over-caution or appropriate updating is worth examining.
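One simple way to examine that question is to tally the direction of each disagreement rather than only counting errors. The sketch below is an assumption about how such a tally could be done, with a hypothetical three-tier ordering and made-up labels; it is not AIVA's own analysis.

```python
from collections import Counter

# Hypothetical three-tier ordering, from least to most concerning.
SEVERITY = {"B/LB": 0, "VUS": 1, "P/LP": 2}

def disagreement_directions(curated, predicted):
    """Count upgrades, downgrades, and matches relative to the curated label."""
    tally = Counter()
    for c, p in zip(curated, predicted):
        delta = SEVERITY[p] - SEVERITY[c]
        if delta > 0:
            tally["upgrade"] += 1
        elif delta < 0:
            tally["downgrade"] += 1
        else:
            tally["match"] += 1
    return tally

# Made-up labels for illustration only (not benchmark data).
curated   = ["B/LB", "B/LB", "VUS", "P/LP", "B/LB"]
predicted = ["VUS", "B/LB", "VUS", "VUS", "P/LP"]
tally = disagreement_directions(curated, predicted)
print(tally["upgrade"], tally["match"], tally["downgrade"])  # 2 2 1
```

A tally dominated by upgrades from B/LB would match the pattern described above, and the upgraded variants are the natural candidates for manual review against recent literature.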
Taken together, these results suggest that using AIVA not only as an endpoint classifier but also as a phenotype-aware reasoner raises the ceiling of automated variant interpretation. By incorporating clinical context, inheritance, and disease knowledge that traditional classifiers do not use, AIVA outperforms rule-based systems across most disease categories. As genomic interpretation moves beyond variant-centric scoring, systems that reason in the context of a patient’s phenotype will ultimately reduce the gap between genomic data and meaningful clinical decisions.
We’re excited to release a free-to-use AIVA classifier that accepts phenotype and any additional context about the variant or patient as input for classification.

