Bioengineering

BIOE20: The Tetramers

AI-guided Experimental Design for Active Data Selection in Protein Developability Models

Project image: Identifying high-impact protein candidates that bridge the gap between known experimental space and uncharted sequences.

Project Description:

Antibody development is central to modern therapeutics, but generating experimental data is prohibitively expensive: measuring roughly 10,000 sequences can cost between $1.5M and $8M. Machine learning can predict developability properties such as polyreactivity, but accurate models require large, diverse, high-quality labeled datasets, which are rarely public and are costly to produce. In this project, we develop a computational experimental design framework that simulates data acquisition. Pairwise similarity between sequences is defined by normalized Levenshtein distance, and each sequence is embedded with ESM-2 (650M parameters, 1280-dimensional embeddings) to capture biologically meaningful features. The goal is to identify the most informative protein sequences to label, maximizing predictive performance while minimizing experimental cost.
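To make the similarity measure concrete, the following is a minimal sketch of normalized Levenshtein distance, followed by one possible way to use it for choosing which sequences to label. Two points are assumptions rather than details from the project: the edit distance is normalized by the length of the longer sequence, and the selection rule (`select_diverse`, a hypothetical greedy farthest-point heuristic) is only an illustrative stand-in for whatever acquisition strategy the team actually used. The ESM-2 embeddings are not included here.

```python
from typing import List, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance computed with the classic two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalized_levenshtein(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1] by the longer length (an assumption)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def select_diverse(pool: Sequence[str],
                   labeled: Sequence[str],
                   budget: int) -> List[int]:
    """Hypothetical greedy farthest-point selection.

    Repeatedly picks the unlabeled sequence whose nearest already-covered
    sequence (labeled or previously chosen) is farthest away, and returns
    the chosen indices into `pool` in selection order.
    """
    # Distance from each pool sequence to its nearest labeled sequence.
    nearest = [
        min((normalized_levenshtein(s, t) for t in labeled),
            default=float("inf"))
        for s in pool
    ]
    chosen: List[int] = []
    remaining = set(range(len(pool)))
    for _ in range(min(budget, len(pool))):
        # Ties are broken by the lowest index so results are deterministic.
        idx = max(sorted(remaining), key=lambda i: nearest[i])
        chosen.append(idx)
        remaining.discard(idx)
        # The new pick now counts as covered; refresh nearest distances.
        for i in remaining:
            d = normalized_levenshtein(pool[i], pool[idx])
            if d < nearest[i]:
                nearest[i] = d
    return chosen
```

For example, with `labeled = ["AAAA"]` and `pool = ["AAAT", "GGGG", "AATT"]`, a budget of 2 selects `"GGGG"` first because it shares no residues with the labeled sequence. This heuristic rewards coverage of sequence space only; a strategy that targets informativeness more directly would also weigh model uncertainty.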

Advisor/Instructor:

Dr. Lan Ma

Sponsor:

Dr. Valentin Stanev, AstraZeneca

Team Members:

Arinze Ezeifeka	Bioengineering

Blake Gilbert	Bioengineering

Gabriel Lipman	Bioengineering

Janet Mwebi	Bioengineering

Poster:

CapstonePoster_Tetramers.pdf (3.91 MB)