Benchmarking AIVA (Part 2): How a phenotype-aware agentic system improves gene prioritization
AIVA is not just an automated ranking algorithm: it is an agentic clinical analyst with access to specialized genomics tools and real-time literature review for rare disease diagnosis
In our last post, we showed how AIVA uses phenotype information to classify genetic variants across 8,387 ClinGen-curated variants. Now we tackle an even harder problem: gene prioritization. Given a patient's phenotype and a VCF file, can AIVA identify the causal gene and prioritize it for review?
TL;DR: We benchmarked AIVA against LIRICAL and Exomiser on 1,396 simulated rare disease patients derived from real UDN cases (Jacobsen et al., Nat Commun 2023). Given HPO phenotype terms and an annotated VCF, AIVA correctly ranked the true causal gene to the top of the list in 66.7% of cases vs. 59.5% for LIRICAL and 28.4% for Exomiser. When extended to top-20, AIVA correctly ranked the true gene in 94.4% of cases vs. 86.4% for LIRICAL and 90.0% for Exomiser. Unlike traditional scoring pipelines, AIVA, as an agentic system, dynamically queries databases, literature, and gene-phenotype associations at inference time. The takeaway: Gene prioritization at a clinical scale may require systems that reason, not just systems that score.
Roughly 400 million people worldwide live with a rare disease, and for most of them, the journey to diagnosis is measured in years. The average rare disease patient sees multiple specialists over 5 to 7 years before receiving a diagnosis, and many never receive one. This is the diagnostic odyssey.
The bottleneck is not sequencing. A typical clinical genome contains roughly 5 million variants spread across thousands of genes. Current bioinformatics platforms handle the first pass well: filtering by quality, frequency, and predicted pathogenicity to reduce the number to around 150 candidate variants per case. But that is where the hard work begins.
Connecting those remaining variants to a patient’s specific phenotype requires deep expertise across thousands of rare diseases, each with its own constellation of symptoms. Few clinicians possess that breadth of knowledge, and those who do are in short supply. Reviewing each candidate variant against the clinical presentation, the literature, and gene-disease databases is a manual, time-consuming process. This is what stretches weeks into months and months into years.
AIVA was built to close that gap. It is a phenotype-aware agentic system that approaches gene prioritization the way a skilled clinical analyst would: given a patient’s phenotype and variant data, it queries biomedical databases, searches the literature, traces gene-phenotype associations, and synthesizes evidence across sources to produce a ranked list of candidate genes with reasoning for each. Instead of relying on a fixed scoring algorithm, AIVA uses a rich set of tools to actively gather and weigh up-to-date evidence in real time.
To test whether this agentic approach can outperform established methods, we benchmarked AIVA head-to-head against two widely used gene prioritization tools, LIRICAL and Exomiser, on 1,396 simulated patients derived from real UDN cases and designed to mimic real-world clinical presentations. Here is what we found.
The Dataset: A Clinical Simulator Built From Real Disease Data
Meaningful benchmarking requires patient cases that reflect actual diagnostic complexity, not idealized phenotype-gene pairings. We used the rare disease patient simulator described by Jacobsen et al. (Nature Communications, 2023; data at Zenodo: 8190872), which generates synthetic patients from established ORPHA gene-disease associations, incorporating phenotype noise and clinical variability to reflect real diagnostic conditions.
Each of the 1,396 simulated patients presents with:
HPO terms: Each case includes a realistic mix of phenotype terms reflecting the imprecision of real clinical phenotyping:
Core diagnostic features associated with the true disease
Obfuscated terms where specific phenotypes are generalized to broader parent terms in the ontology
Noise terms sampled by age-stratified prevalence from population data to mimic unrelated comorbidities
Distractor terms from phenotypically similar but incorrect genes
Random dropout of true phenotype terms to simulate incomplete clinical observation
Both positive and negative HPO terms (observed and absent features) are included. AIVA supports both for prioritization.
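The noise model described above can be sketched in a few lines. This is an illustrative reimplementation, not the Jacobsen et al. simulator itself: the function name, default probabilities, and HPO IDs are all assumptions for demonstration.

```python
import random

def simulate_patient_phenotypes(true_terms, parent_of, noise_pool,
                                distractor_pool, p_obfuscate=0.3,
                                p_dropout=0.2, n_noise=2, n_distractor=1,
                                rng=None):
    """Sketch of the phenotype noise model (parameters are illustrative)."""
    rng = rng or random.Random(0)
    observed = []
    for term in true_terms:
        # Random dropout: simulate incomplete clinical observation.
        if rng.random() < p_dropout:
            continue
        # Obfuscation: generalize to a broader parent term in the ontology.
        if rng.random() < p_obfuscate and term in parent_of:
            observed.append(parent_of[term])
        else:
            observed.append(term)
    # Noise terms: unrelated comorbidities sampled from a population pool.
    observed += rng.sample(noise_pool, n_noise)
    # Distractor terms: drawn from phenotypically similar but wrong genes.
    observed += rng.sample(distractor_pool, n_distractor)
    return observed

# Toy usage with a tiny hand-made ontology fragment:
parents = {"HP:0001250": "HP:0012638"}  # Seizure -> broader CNS term
terms = simulate_patient_phenotypes(
    ["HP:0001250", "HP:0001263"], parents,
    noise_pool=["HP:0000822", "HP:0002099"],
    distractor_pool=["HP:0001382"])
```

A real simulator would sample noise terms by age-stratified prevalence rather than uniformly, as the Jacobsen et al. pipeline does.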
VCF files: Simulated filtered VCFs mimicking the output of a standard rare disease analysis workflow, containing on average 14 candidate genes per case (range: 3 to 28). The resulting VCFs are annotated with CADD scores, PolyPhen-2 predictions, minor allele frequency, consequence types, and gene symbols.
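Extracting candidate genes from such an annotated VCF reduces to parsing INFO fields and applying rarity and deleteriousness cutoffs. A minimal sketch, with the caveat that the INFO tag names (GENE, CADD, AF) and thresholds here are assumptions; real annotation pipelines (e.g. VEP's CSQ field) use their own encodings:

```python
def parse_info(info_field):
    """Parse a semicolon-delimited VCF INFO field into a dict."""
    out = {}
    for item in info_field.split(";"):
        key, _, value = item.partition("=")
        out[key] = value if value else True
    return out

def candidate_genes(vcf_lines, max_af=0.001, min_cadd=15.0):
    """Collect genes whose variants pass allele-frequency and CADD cutoffs."""
    genes = set()
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        info = parse_info(fields[7])  # INFO is the 8th fixed VCF column
        af = float(info.get("AF", 0.0))
        cadd = float(info.get("CADD", 0.0))
        if af <= max_af and cadd >= min_cadd and "GENE" in info:
            genes.add(info["GENE"])
    return sorted(genes)

records = [
    "1\t12345\t.\tA\tG\t50\tPASS\tGENE=PKD1;CADD=24.1;AF=0.0001",
    "2\t67890\t.\tC\tT\t50\tPASS\tGENE=TTN;CADD=8.0;AF=0.01",
]
# Only the first record passes both cutoffs in this toy example.
```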
Other information, including age of onset, disease ID, and causal gene symbols, was extracted from the simulator.
Three Tools, Different Architectures
All three tools received the same patient information, adapted to each tool’s required format.
LIRICAL v2.2.1 received the VCF file path and HPO terms and was run in --global mode with the hg19 assembly and RefSeq transcripts. It ranks candidates using a likelihood-ratio framework that combines the probability of a phenotype-gene match with variant pathogenicity scores. The key structural constraint: LIRICAL outputs only genes already linked to an entry in its HPO disease database. It cannot rank a gene that its knowledge base does not contain.
Exomiser v14.0.0 received a YAML input file specifying HPO terms, the VCF, and output settings. It uses the hiPHIVE algorithm, combining phenotype similarity scores across human, mouse, and fish model-organism data with variant pathogenicity scores via a logistic regression model.
AIVA received all patient data in a unified JSON format: HPO terms with positive/negative status, the disease ID, age of onset, and the full annotated variant list. AIVA is an agentic system with a rich set of tools to dynamically query biomedical databases, literature, and HPO-gene associations, reasoning over each patient’s evidence in real time. Each candidate gene is assigned a confidence score from 0 to 1, with 1 indicating a likely causal role. AIVA is powered by Google Gemini 3.0 Pro Preview.
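To make the input concrete, here is what such a unified patient payload might look like. The field names, the ORPHA ID, and the variant values below are illustrative placeholders, not AIVA's actual schema:

```python
import json

# Hypothetical patient payload; all field names and values are
# illustrative, not AIVA's real input schema.
patient = {
    "hpo_terms": [
        {"id": "HP:0001250", "label": "Seizure", "status": "observed"},
        {"id": "HP:0001263", "label": "Global developmental delay",
         "status": "observed"},
        {"id": "HP:0001249", "label": "Intellectual disability",
         "status": "absent"},
    ],
    "disease_id": "ORPHA:12345",   # placeholder disease ID
    "age_of_onset": "Infantile onset",
    "variants": [
        {"gene": "SCN1A", "consequence": "missense_variant",
         "cadd": 28.3, "polyphen2": 0.98, "maf": 0.00002},
    ],
}
payload = json.dumps(patient, indent=2)
```

Negative (absent) phenotypes travel alongside positive ones in the same list, which is what lets AIVA use both during prioritization.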
Five Levels of Diagnostic Difficulty
A critical design feature of this benchmark is stratifying the 1,396 patients by the degree of characterization of their gene-disease combination. This matters because the tools differ fundamentally in how they handle novel versus established cases, and aggregated metrics alone would mask important differences in their performance.
The five categories (Known-Gene-Disease, Known-Gene + Known-Disease, Known-Gene + New-Disease, New-Gene + Known-Disease, and New-Gene + New-Disease) create a gradient of increasing novelty, from fully characterized gene-disease pairs to cases in which neither the gene nor the disease association appears in any reference database. This allows us to test not just overall accuracy, but how each tool degrades as available knowledge decreases.
Overall Results: AIVA Outperforms Established Tools
AIVA ranks the causal gene first in 66.7% of cases, about 7 percentage points ahead of LIRICAL and 38 points ahead of Exomiser. When extended to top-20, AIVA correctly ranked the causal gene in 94.4% of cases vs. 86.4% for LIRICAL and 90.0% for Exomiser. This shows that AIVA can reduce the analyst's review from ~150 candidate variants down to roughly 20 prioritized genes, reducing analyst burnout and shortening turnaround time. Notably, Exomiser jumps from 28.4% at top-1 to 90.0% at top-20, indicating that it identifies the correct gene in most cases but ranks it lower in the list, requiring more manual effort to surface it.
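The top-k hit rate behind these numbers is straightforward to compute; a sketch on toy data (not benchmark results):

```python
def top_k_accuracy(ranked_gene_lists, causal_genes, k):
    """Fraction of cases whose causal gene appears in the top k ranks."""
    hits = sum(1 for ranked, truth in zip(ranked_gene_lists, causal_genes)
               if truth in ranked[:k])
    return hits / len(causal_genes)

# Toy cases: causal gene ranked 1st, then 3rd, then absent from the list.
ranks = [["BRCA1", "TP53"], ["TTN", "MYH7", "LMNA"], ["CFTR"]]
truth = ["BRCA1", "LMNA", "PKD1"]
# top-1 = 1/3, top-20 = 2/3 on this toy set
```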
We also compared these tools on per-category metrics using top-20 as the review baseline. AIVA outperforms or matches both tools across all five categories: 96.4% on Known-Gene-Disease (vs. 91.5% LIRICAL, 91.7% Exomiser), 86.3% on Known-Gene + Known-Disease (vs. 70.9%, 76.9%), 94.7% on Known-Gene + New-Disease (vs. 87.7%, 90.2%), 99.2% on New-Gene + Known-Disease (vs. 83.7%, 96.3%), and 93.6% on New-Gene + New-Disease, where LIRICAL matches at 93.6% (vs. 92.8% Exomiser).
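The per-category metrics follow the same pattern, grouping cases by novelty category before scoring. A minimal sketch with made-up cases (the tuple layout is an assumption for illustration):

```python
from collections import defaultdict

def per_category_top_k(cases, k=20):
    """cases: iterable of (category, ranked_genes, causal_gene) tuples.
    Returns the top-k hit rate per category."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for category, ranked, truth in cases:
        totals[category] += 1
        hits[category] += truth in ranked[:k]
    return {c: hits[c] / totals[c] for c in totals}

cases = [
    ("Known-Gene-Disease", ["SCN1A", "KCNQ2"], "SCN1A"),
    ("New-Gene + New-Disease", ["TTN", "MYH7"], "LMNA"),
]
```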
What This Means
In our first benchmark, we showed that AIVA uses phenotype information to accurately classify variants. This benchmark addresses the harder upstream problem: given a patient’s phenotype and a VCF file, can the system identify the correct causal gene? The results show that an agentic approach, which actively queries databases, literature, and gene-phenotype associations in real time, consistently outperforms tools that rely on fixed scoring algorithms and static knowledge bases.
The clinical implication is practical. When AIVA places the causal gene in the top 20 in 94.4% of cases, an analyst can focus on roughly 20 prioritized genes instead of manually working through 150 candidate variants. That is the difference between a workflow that takes days and one that takes minutes to hours.
Taken together with our variant classification results, these benchmarks suggest that the next generation of genomic interpretation tools will not just score variants or rank genes in isolation. They will reason across the full clinical picture, combining phenotype, variant evidence, and up-to-date biomedical knowledge to help analysts work faster and with greater confidence. For the 400 million people living with a rare disease and the families waiting for answers, shorter turnaround times and higher diagnostic quality are not incremental improvements. They are what finally end the odyssey.
Next Steps
We are actively looking for clinical collaborators to validate AIVA on real clinical cases. If you work in rare disease diagnostics and are interested in piloting AIVA in your workflow, we would love to hear from you.
Questions or collaboration: Tarun Mamidi, tarun@mamidi.ai