Serverless GPU Orchestration for Genomic LLMs

bioinformatics · serverless-gpu · modal · evo2 · genomics · python

Running state-of-the-art biological foundation models like Evo2 requires massive VRAM (H100/A100 GPUs), which are prohibitively expensive to keep running 24/7. To make Biofly a viable tool, I needed a way to trigger high-performance inference on-demand without incurring thousands of dollars in monthly cloud overhead.

Architected a decoupled inference pipeline using Modal for serverless GPU execution: the system spins up NVIDIA H100 workers on demand when a genomic variant is submitted and scales back to zero when idle.

Persistent G5/P4 EC2 Instances

Pros
  • Zero 'cold start' latency
  • Simpler deployment via standard Docker containers
Cons
  • Extremely high idle costs ($2k+/month for H100 availability)
  • Complex manual scaling logic required for traffic spikes

On-Premise GPU Workstation

Pros
  • One-time capital expenditure
  • Total data privacy/control
Cons
  • Limited scalability for concurrent users
  • Significant maintenance and power overhead

Modal provided a Python-native way to bridge the gap between AI research and production. By defining the infrastructure as code, I could utilize H100s for the 30-60 seconds needed for genomic tokenization and pathogenicity prediction, only paying for the exact compute time used while maintaining a seamless API connection to the Next.js frontend.

Bridging Biological Data and High-Performance Compute

Biofly transforms raw DNA sequences into clinical insights. The primary engineering hurdle was managing the massive computational requirements of genomic foundation models within a responsive web application.

1. The Decoupled Architecture

To ensure a smooth user experience, I separated the concerns of the application into two distinct layers:

  • The Orchestrator: A T3-stack Next.js application that handles user state, genomic visualizations, and cross-referencing established clinical databases like ClinVar.
  • The Inference Engine: A FastAPI backend deployed on Modal that encapsulates the Evo2 weights. When a prediction is requested, Modal dynamically provisions an H100, loads the necessary genomic context, and returns the pathogenicity score.
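The inference layer described above can be sketched as Modal infrastructure-as-code. This is a minimal illustration, not Biofly's actual deployment: the image contents, timeout, and the scoring logic inside the function are assumptions (the GC-content delta is a stand-in for Evo2's real likelihood comparison, which would load model weights into VRAM here).

```python
import modal

app = modal.App("biofly-inference")

# Pre-baked image layers: heavy dependencies install once at build time,
# not on every cold start. Package list is illustrative.
evo2_image = modal.Image.debian_slim(python_version="3.11").pip_install(
    "torch", "fastapi[standard]"
)

@app.function(gpu="H100", image=evo2_image, timeout=600)
def predict_pathogenicity(ref_seq: str, alt_seq: str) -> dict:
    """Score a variant by comparing the reference and alternate sequences.

    Placeholder scoring: a real deployment would load Evo2 weights and
    compare model log-likelihoods. GC-content delta stands in here.
    """
    def gc(s: str) -> float:
        return (s.count("G") + s.count("C")) / max(len(s), 1)

    return {"delta_score": gc(alt_seq) - gc(ref_seq)}

@app.local_entrypoint()
def main():
    # .remote() provisions an H100, runs the function, and lets the
    # container scale back to zero after the idle timeout.
    print(predict_pathogenicity.remote("ACGTACGT", "ACTTACGT"))
```

The key property is that the GPU requirement lives on the function decorator, so the Next.js orchestrator never needs to know anything about GPU provisioning.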

2. High-Fidelity Sequence Streaming

Genomic data is too large to pass through traditional REST payloads efficiently. I implemented a “Proxy-Fetch” strategy:

  • The user provides a genomic coordinate (e.g., chr17:43044295).
  • The backend streams the relevant 7kb context window directly from the UCSC Genome Browser REST API.
  • This ensures that the AI model always sees the most accurate, up-to-date reference genome without Biofly needing to store terabytes of reference sequence files locally.
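The coordinate-to-request translation in the steps above can be sketched with stdlib Python. The endpoint path and parameter names reflect my understanding of UCSC's public REST API (`api.genome.ucsc.edu/getData/sequence`), and centering the 7kb window on the variant is an assumption about how Biofly positions it:

```python
from urllib.parse import urlencode

UCSC_API = "https://api.genome.ucsc.edu/getData/sequence"
WINDOW = 7_000  # 7kb context window around the variant

def context_window_url(coordinate: str, genome: str = "hg38") -> str:
    """Build a UCSC sequence-fetch URL for the window centred on a variant.

    `coordinate` is a 1-based position like "chr17:43044295".
    """
    chrom, pos_str = coordinate.split(":")
    pos = int(pos_str)
    # UCSC's REST API uses 0-based, half-open [start, end) coordinates.
    start = max(pos - 1 - WINDOW // 2, 0)
    end = start + WINDOW
    params = urlencode({"genome": genome, "chrom": chrom, "start": start, "end": end})
    return f"{UCSC_API}?{params}"

url = context_window_url("chr17:43044295")
```

The backend then forwards the fetched sequence straight to the inference engine, so no reference data is ever persisted on Biofly's side.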

3. Mitigating Cold Starts

Serverless GPUs suffer from “cold starts” while the model weights load into VRAM. I mitigated this by implementing a Warm-up Pattern:

  • Small, frequent health checks keep a pool of workers warm during peak research hours.
  • Modal’s optimized image layers ensure the Evo2 environment (Python/PyTorch/CUDA) initializes in seconds rather than minutes.

Impact on Genomic Research

  • Democratized Access: Researchers can now run Evo2-level predictions from a browser on a standard laptop, bypassing the need for local Linux clusters.
  • Cost Efficiency: Infrastructure costs were reduced by over 90% compared to a persistent GPU instance model, as compute only triggers during active analysis.
  • Clinical Contextualization: By visualizing AI predictions alongside established ClinVar classifications, the tool provides a “sanity check” for researchers, identifying where AI aligns with—or challenges—known clinical literature.
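The cost-efficiency claim above can be checked with back-of-the-envelope arithmetic. The persistent-instance figure comes from the trade-off analysis earlier in this post; the serverless H100 hourly rate and monthly request volume are assumptions for illustration:

```python
H100_HOURLY_USD = 4.50         # assumed serverless H100 rate; real pricing varies
PERSISTENT_MONTHLY_USD = 2000  # always-on availability figure cited above
SECONDS_PER_PREDICTION = 45    # midpoint of the 30-60s inference window

def serverless_monthly_cost(predictions_per_month: int) -> float:
    """Billed GPU-seconds only: compute runs solely during active analysis."""
    gpu_hours = predictions_per_month * SECONDS_PER_PREDICTION / 3600
    return gpu_hours * H100_HOURLY_USD

cost = serverless_monthly_cost(1000)          # 12.5 GPU-hours => $56.25
savings = 1 - cost / PERSISTENT_MONTHLY_USD   # well over 90% cheaper
```

Even at several thousand predictions a month, per-second billing stays an order of magnitude below the cost of keeping an H100 provisioned around the clock.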

Results

Biofly successfully demonstrates that the “SaaS-ification” of complex biological models is possible through serverless orchestration. The platform currently supports the full hg38 human assembly, providing real-time, LLM-driven variant effect predictions that were previously locked behind complex CLI research scripts.

