170 Going

Systems Reading Group with Arc Institute + LatchBio

Hosted by Kenny Workman & Shreya Shekhar
Register to See Address
San Francisco, California
About Event

The intersection of computing and engineering biology is a playground for systems: operating systems, file systems, virtualization, programming languages, databases, compilers, fuzzers, distributed systems, etc.

In this biotech-flavored version of the SF systems reading group, we'll hear from three awesome speakers who will walk through design decisions, paper highlights + snippets of source code:

  • Noam Teyssier | Arc Institute: BINSEQ: A Family of High-Performance Binary Formats for Nucleotide Sequences

  • Aidan Abdulali | LatchBio: A Distributed Filesystem Built on Postgres and S3

  • Abhinav Adduri | Arc Institute: Scaling Deep Learning to 1B+ Single Cells

Event space is provided by LatchBio, and Greylock is generously sponsoring food / refreshments.

Agenda

  • 5:30 - 6:30 Meet others. Eat + drink.

  • 6:30 - 8:00 Talks + Q&A

  • 8:00 - TBD Socialize

Abstracts

BINSEQ: A Family of High-Performance Binary Formats for Nucleotide Sequences
Noam Teyssier | Arc Institute

> Modern genomics produces billions of sequencing records per run, which are typically stored as gzip-compressed FASTQ files. While this format is widely used, it is not optimal for high-throughput processing due to its reliance on single-threaded decompression and sequential parsing of irregularly sized records. Here, we present BINSEQ, a family of simple binary formats that enable high-throughput parallel processing of sequencing data. We demonstrate that BINSEQ files are up to 32x faster than compressed FASTQ for parallel processing and can reduce analysis time from hours to minutes for large-scale genome and transcriptome analyses, particularly for resource-intensive applications like alignment, mapping, and de novo assembly.


LData: A Distributed Filesystem Built on Postgres and S3
Aidan Abdulali | LatchBio

> LatchBio builds data infrastructure to store, analyze and visualize large volumes of molecular data. A core component of this platform is a distributed file system called LData. This talk walks through its architecture and illustrates how to build a complex distributed system with little more than a database.

Scaling Deep Learning to 1B+ Single Cells
Abhinav Adduri | Arc Institute

> Single cell transcriptomics data repositories have experienced dramatic growth in recent years. Similar to how internet-scale data enabled a new intelligence frontier for language models, the wealth of observational and perturbational data being generated will enable cellular models that reveal new biological insights. However, computational tools have not kept pace with the rapid development of single cell assays, presenting challenges in training and evaluating models on these datasets. In this short talk, I’ll describe how we scaled STATE to 300M cells, what avoidable mistakes we made, and what advancements are needed to efficiently scale to 1B+ cells.

Edge of Tomorrow Algorithms 
James Braza | FutureHouse

> Imagine you're given a model, a benchmark, and just one day to saturate the benchmark. Normally training the model takes a week, but if you do not succeed in one day, the day resets. This talk is on a progression of algorithms from FutureHouse's aviary and ether0 papers that solve this exact problem, bringing us to the edge of tomorrow.

Excited to see you guys here and learn a bit more about computers.
