Project Description:
Antibody development is central to modern therapeutics, but generating experimental data is prohibitively expensive: measuring even around 10,000 sequences can cost between $1.5M and $8M. Machine learning offers a way to predict developability properties such as polyreactivity, but accurate models require large, diverse, high-quality labeled datasets, which are rarely publicly available and costly to produce. In this project, we develop a computational experimental design framework that simulates data acquisition. Pairwise similarity between sequences is defined by normalized Levenshtein distance, and each sequence is embedded with ESM-2 (650M parameters, 1280-dimensional embeddings) to capture biologically meaningful features. The goal is to identify the most informative protein sequences to label, maximizing predictive performance while minimizing experimental cost.
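
As an illustration of the similarity structure mentioned above, the following is a minimal sketch of a normalized Levenshtein distance between sequences. The description does not say which normalization is used; dividing by the length of the longer sequence is an assumption here, chosen because it keeps the value in [0, 1]. The function names and the toy sequences are hypothetical and are not taken from the project.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) via two-row DP."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def normalized_levenshtein(a: str, b: str) -> float:
    """Assumed normalization: edit distance divided by the longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0  # two empty sequences are identical
    return levenshtein(a, b) / longest


def pairwise_distances(sequences: List[str]) -> List[List[float]]:
    """Symmetric matrix of normalized distances with a zero diagonal."""
    n = len(sequences)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = normalized_levenshtein(sequences[i], sequences[j])
            matrix[i][j] = matrix[j][i] = d
    return matrix


if __name__ == "__main__":
    # Made-up short fragments, for illustration only.
    toy_sequences = ["EVQLVESGGG", "EVQLVQSGAE", "QVQLQESGPG"]
    for row in pairwise_distances(toy_sequences):
        print(["%.2f" % value for value in row])
```

The full pairwise matrix grows quadratically with the number of sequences, so for a pool on the order of 10,000 sequences an optimized edit-distance library would likely be preferable to this pure-Python version.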