


Systems Reading Group with Arc Institute + LatchBio
The intersection of computing and engineering biology is a playground for systems: operating systems, file systems, virtualization, programming languages, databases, compilers, fuzzers, distributed systems, etc.
In this biotech-flavored version of the SF systems reading group, we'll hear from three awesome speakers who will walk through design decisions, paper highlights + snippets of source code:
Noam Teyssier | Arc Institute: BINSEQ: A Family of High-Performance Binary Formats for Nucleotide Sequences
Aidan Abdulali | LatchBio: A Distributed Filesystem Built on Postgres and S3
Abhinav Adduri | Arc Institute: Scaling Deep Learning to 1B+ Single Cells
Event space is provided by LatchBio, and Greylock is generously sponsoring food / refreshments.
Agenda
5:30 - 6:30 Meet others. Eat + drink.
6:30 - 8:00 Talks + Q&A
8:00 - TBD Socialize
Abstracts
BINSEQ: A Family of High-Performance Binary Formats for Nucleotide Sequences
Noam Teyssier | Arc Institute
> Modern genomics produces billions of sequencing records per run, which are typically stored as gzip-compressed FASTQ files. While this format is widely used, it is not optimal for high-throughput processing due to its reliance on single-threaded decompression and sequential parsing of irregularly sized records. Here, we present BINSEQ, a family of simple binary formats that enable high-throughput parallel processing of sequencing data. We demonstrate that BINSEQ files are up to 32x faster than compressed FASTQ for parallel processing and can reduce analysis time from hours to minutes for large-scale genome and transcriptome analyses, particularly for resource-intensive applications like alignment, mapping, and de novo assembly.
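For a feel of why fixed-size binary records help with parallelism, here is a minimal sketch. It is not the actual BINSEQ layout: the 2-bit packing, the headerless record layout, and every function name below are assumptions made up for illustration. The point is only that when every record has the same byte size, record `i` lives at a computable offset, so workers can each take a contiguous slice without parsing anything that comes before it.

```python
# Illustrative sketch only: NOT the real BINSEQ format.
# Assumes every read has the same length and contains only A, C, G, T.
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}
DECODE = "ACGT"


def pack(seq: str) -> bytes:
    """Pack a nucleotide string at 2 bits per base (4 bases per byte)."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= ENCODE[base] << (2 * (i % 4))
    return bytes(out)


def unpack(buf: bytes, n_bases: int) -> str:
    """Inverse of pack(): recover the first n_bases bases from buf."""
    return "".join(
        DECODE[(buf[i // 4] >> (2 * (i % 4))) & 3] for i in range(n_bases)
    )


def write_records(seqs: list[str], read_len: int) -> bytes:
    """Concatenate equal-sized packed records with no delimiters."""
    if any(len(s) != read_len for s in seqs):
        raise ValueError("all reads must have length read_len")
    return b"".join(pack(s) for s in seqs)


def read_record(blob: bytes, index: int, read_len: int) -> str:
    """Random access: the byte offset of a record is index * record_size."""
    size = (read_len + 3) // 4
    start = index * size
    return unpack(blob[start:start + size], read_len)


def split_ranges(n_records: int, n_workers: int) -> list[tuple[int, int]]:
    """Contiguous [start, end) record ranges, one per worker."""
    base, extra = divmod(n_records, n_workers)
    ranges, start = [], 0
    for w in range(n_workers):
        end = start + base + (1 if w < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges


reads = ["ACGTAC", "TTGACA", "GGGCCC"]
blob = write_records(reads, read_len=6)
second = read_record(blob, 1, read_len=6)  # jumps straight to record 1
```

With gzip-compressed FASTQ, by contrast, finding record `i` means decompressing and parsing everything before it, which is the sequential bottleneck the abstract describes.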
LData: A Distributed Filesystem Built on Postgres and S3
Aidan Abdulali | LatchBio
> LatchBio builds data infrastructure to store, analyze and visualize large volumes of molecular data. A core component of this platform is a distributed file system called LData. This talk walks through its architecture and illustrates how to build a complex distributed system with little more than a database.
Scaling Deep Learning to 1B+ Single Cells
Abhinav Adduri | Arc Institute
> Single cell transcriptomics data repositories have experienced dramatic growth in recent years. Similar to how internet-scale data enabled a new intelligence frontier for language models, the wealth of observational and perturbational data being generated will enable cellular models that reveal new biological insights. However, computational tools have not kept pace with the rapid development of single cell assays, presenting challenges in training and evaluating models on these datasets. In this short talk, I’ll describe how we scaled STATE to 300M cells, what avoidable mistakes we made, and what advancements are needed to efficiently scale to 1B+ cells.
Edge of Tomorrow Algorithms
James Braza | FutureHouse
> Imagine you're given a model, a benchmark, and just one day to saturate the benchmark. Normally training the model takes a week, but if you do not succeed in one day, the day resets. This talk is on a progression of algorithms from FutureHouse's aviary and ether0 papers that solve this exact problem, bringing us to the edge of tomorrow.
Excited to see you guys here and learn a bit more about computers.