Bioengineering

BIOE20: The Tetramers

AI-guided Experimental Design for Active Data Selection in Protein Developability Models

Identifying high-impact protein candidates that bridge the gap between known experimental space and uncharted sequences.

Project Description:

Antibody development is central to modern therapeutics, but generating experimental data is prohibitively expensive: measuring even around 10,000 sequences can cost between $1.5M and $8M. Machine learning offers a way to predict developability properties such as polyreactivity, but accurate models require large, diverse, high-quality labeled datasets, which are rarely publicly available and costly to produce. In this project, we develop a computational experimental design framework that simulates data acquisition. Normalized Levenshtein distance between sequences defines the similarity structure, and each sequence is embedded with ESM-2 (650M parameters, 1280-dimensional embeddings) to capture biologically meaningful features. The goal is to identify the most informative protein sequences to label, maximizing predictive performance while minimizing experimental cost.
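To make the similarity measure concrete, here is a minimal Python sketch of normalized Levenshtein distance, paired with a simple diversity-based way of choosing which sequences to label next. It is an illustration under stated assumptions, not the team's implementation: dividing by the longer sequence length is one common normalization convention, the greedy max-min selection rule is a stand-in for whatever acquisition strategy the project actually uses, and the function names and toy sequences are hypothetical. The ESM-2 embedding step is omitted.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) by dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalized_levenshtein(a: str, b: str) -> float:
    """Scale the edit distance to [0, 1] by the longer length (assumed convention)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def select_diverse(pool: List[str], labeled: List[str], budget: int) -> List[str]:
    """Greedy max-min selection (hypothetical strategy): repeatedly pick the
    pool sequence whose nearest already-covered sequence is farthest away."""
    reference = list(labeled)
    candidates = list(pool)
    chosen: List[str] = []
    for _ in range(min(budget, len(candidates))):
        best = max(
            candidates,
            key=lambda s: min(
                (normalized_levenshtein(s, r) for r in reference),
                default=1.0,  # nothing labeled yet: every candidate is maximally novel
            ),
        )
        chosen.append(best)
        reference.append(best)
        candidates.remove(best)
    return chosen


if __name__ == "__main__":
    # Toy sequences for illustration only; real inputs would be antibody sequences.
    labeled = ["ACDE"]
    pool = ["ACDF", "WWWW", "ACDE"]
    print(select_diverse(pool, labeled, budget=2))
```

A selection rule like this favors sequences far from anything already measured. A fuller framework would presumably also weigh model uncertainty or distances in the ESM-2 embedding space, which this sketch does not attempt.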

Advisor/Instructor:

Dr. Lan Ma

Sponsor:

Dr. Valentin Stanev, AstraZeneca

Team Members:

Arinze Ezeifeka, Bioengineering
Blake Gilbert, Bioengineering
Gabriel Lipman, Bioengineering
Janet Mwebi, Bioengineering

Poster:

Tetramers_Poster.pdf (1.24 MB)

Table #:

F4