×
 
Submitted by charlie@sginno… on Thu, 08/10/2023 - 16:37
Description

In natural language processing, large language models, which use massive neural networks to model statistics of natural text, are redefining how companies think about search, chatbots, copywriting, text editing, code development, and more. Models like ChatGPT use massive transformer networks trained on large web text corpora to be able to respond to user queries via extending user-supplied text prompts with uncanny ability. These responses are based on learning natural statistical patterns in the training text and generating responses based on the probability assigned to each following word given some prefix text. The remarkable capabilities of these models have been unlocked by the ability to scale transformer language models to massive sizes using huge amounts of GPU computing and to train them on enormous text corpora scraped from the internet.

Much like natural language, proteins are sequences of amino acids that fold into three-dimensional structures to carry out most functions at the molecular level of life. Also, much like natural language, we now have enormous databases containing the amino acid sequences of natural, functional proteins. Although most of these proteins have not been characterised (we only know their sequences), it turns out that statistical analysis of just these sequences can reveal evolutionary pressures, and, therefore, structural, and functional characteristics.

Over the past few years, large-scale deep learning methods, like protein language models, have transformed our ability to understand and predict the structural and functional properties of proteins by learning from these evolutionary patterns. However, current protein language models and their extensions (e.g., AlphaFold2 or ESMfold) have only scratched the surface of what large protein language models can enable for protein design and optimisation.

We have developed large protein language models that enable functional, prompt-based protein generation.

The objective of this project is to work with our machine learning team to develop the next generation of protein language models and to integrate these into our OpenProtein.AI platform for solving function-driven protein design tasks.

Heading
Next Generation Protein Language Models for Generative Protein Design
Logo
Summation Category

Be part of Singapore's fastest growing Deep Tech Community

Receive the latest updates on ecosystem events and programmes!