Methods for determining a representation of a protein complex, given a constituent target complex of that protein complex, are presented; wherein the constituent target complex is a single-entity constituent or a subcomplex of the protein complex; and wherein a protein complex is a complex of some combination of one or more of proteins, nucleic acids, metal ions, and small molecules. A recursive neural network is devised, wherein for each iteration of the recursion, a representation of the output constituent of the protein complex together with the input constituent target complex is passed into the neural network as input for the next iteration. Some embodiments of the invention include the design and manufacturing of effective synthetic biologic drugs, monoclonal antibody (mAb) drugs, antibody drug conjugates (ADCs), peptide ligand drugs, and small molecule drugs (SMDs).
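The recursion in this abstract can be sketched as a loop in which each predicted constituent is joined to the current target before the next call. This is a minimal illustration under stated assumptions, not the claimed method: `assemble_complex` and `toy_model` are hypothetical names, `toy_model` stands in for the trained network, and a complex is represented as a plain list of constituent labels.

```python
from typing import Callable, List, Optional

Complex = List[str]  # assumed representation: a list of constituent labels


def assemble_complex(
    target: Complex,
    predict_next_constituent: Callable[[Complex], Optional[str]],
    max_iterations: int = 10,
) -> Complex:
    """Grow a complex by feeding each output back in with the current input."""
    current = list(target)
    for _ in range(max_iterations):
        nxt = predict_next_constituent(current)
        if nxt is None:  # the model signals that the complex is complete
            break
        # The output constituent together with the input target complex
        # becomes the input for the next iteration.
        current = current + [nxt]
    return current


def toy_model(current: Complex) -> Optional[str]:
    """Hypothetical stand-in for the network: proposes from a fixed order."""
    for name in ["heavy_chain", "light_chain", "zinc_ion"]:
        if name not in current:
            return name
    return None


full_complex = assemble_complex(["antigen"], toy_model)
print(full_complex)
```

The `max_iterations` cap is a design choice of this sketch so that the loop terminates even if the model never signals completion.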
Methods and apparatus for protein and drug design using neural networks with two or more output heads, wherein one head, a sequence head, is trained to generate the sequence of a protein, and another head, a structure head, is trained to generate the structure of the protein; and wherein the neural network is configured to accept a representation of a specified condition as input, and output a representation of a protein's sequence and structure. The structure head and sequence head each have their own loss functions, and the weights of the neural network body are shared and jointly updated during training. Non-limiting examples of specified input conditions include representations of associated proteins and/or sets of properties of the desired output protein. Some embodiments of the invention include the design and synthesis of effective peptide drug ligands, synthetic biologic antibody drugs, antibody drug conjugates, and monoclonal antibody (mAb) drugs.
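The training arrangement described here (a shared body, two heads with separate losses, joint updates) can be illustrated with a deliberately tiny scalar model. This is a sketch under assumptions, not the disclosed architecture: every weight is a single number, both losses are squared errors, and the names `train_two_headed`, `w_body`, `a_seq`, and `a_str` are hypothetical.

```python
from typing import List, Tuple


def train_two_headed(
    x: float, y_seq: float, y_str: float, steps: int = 2000, lr: float = 0.02
) -> Tuple[Tuple[float, float, float], List[float]]:
    """Gradient descent on a scalar 'body' shared by two scalar 'heads'."""
    w_body, a_seq, a_str = 0.5, 1.0, 1.0  # shared body weight, two head weights
    losses: List[float] = []
    for _ in range(steps):
        h = w_body * x               # shared body representation
        err_seq = a_seq * h - y_seq  # sequence-head error
        err_str = a_str * h - y_str  # structure-head error
        # Each head has its own loss; the total is recorded for monitoring.
        losses.append(err_seq ** 2 + err_str ** 2)
        # Each head weight receives a gradient from its own loss only.
        grad_a_seq = 2 * err_seq * h
        grad_a_str = 2 * err_str * h
        # The shared body weight receives gradients from both losses.
        grad_body = 2 * err_seq * a_seq * x + 2 * err_str * a_str * x
        a_seq -= lr * grad_a_seq
        a_str -= lr * grad_a_str
        w_body -= lr * grad_body
    return (w_body, a_seq, a_str), losses


params, losses = train_two_headed(x=1.0, y_seq=2.0, y_str=-1.0)
print(losses[0], losses[-1])
```

The point of the sketch is the gradient flow: `grad_body` sums contributions from both heads, which is what "shared and jointly updated" amounts to in this toy setting.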
G16H 70/40 - ICT specially adapted for the handling or processing of medical references relating to drugs, e.g. their side effects or intended usage
G06F 30/27 - Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
G06N 3/084 - Backpropagation, e.g. using gradient descent
3.
CLUE: dynamic context retrieval in reasoning models for AI-based protein and drug design
Systems, methods, and apparatus for obtaining protein and small molecule representations for manufacture, using a herein disclosed dynamic Context Load Update Engine (CLUE) during output generation by reasoning models. Pre-trained neural networks equipped with retrieval augmentation and trained on chain-of-thought data for reasoning capacity are used. The pre-trained models are further equipped with an indicator mechanism. During the course of output generation, the indicator mechanism indicates when a need for an update to the context arises; wherein the context is a combination of the input query and the theretofore generated output. Output generation continues between context updates until completion. In one embodiment of the invention, transfer learning is used to train the pre-trained neural network in conjunction with its associated indicator and retrieval mechanisms. The trained system is used to generate representations of proteins or small molecule drugs in response to specifying queries. The represented proteins or small molecule drugs are then manufactured.
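The generation loop with indicator-triggered context updates can be sketched as follows. This is a toy illustration under assumptions, not the disclosed engine: `generate_with_context_updates` and the three `toy_*` callables are hypothetical stand-ins for the model, its indicator mechanism, and its retrieval mechanism, and the "database" is a short residue string.

```python
from typing import Callable, Dict, List

DATABASE = "MKTAYI"  # hypothetical retrieval source: a short residue string


def generate_with_context_updates(
    query: str,
    next_token: Callable[[Dict[int, str], List[str]], str],
    needs_update: Callable[[List[str]], bool],
    retrieve: Callable[[List[str]], Dict[int, str]],
    max_tokens: int = 20,
    end_token: str = "<end>",
) -> List[str]:
    output: List[str] = []
    loaded: Dict[int, str] = {}  # retrieved material currently in the context
    for _ in range(max_tokens):
        context = [query] + output  # input query plus output generated so far
        if needs_update(context):   # indicator: the context needs an update
            loaded = retrieve(context)
        token = next_token(loaded, context)
        if token == end_token:
            break
        output.append(token)
    return output


def toy_needs_update(context: List[str]) -> bool:
    # Toy indicator: fire at every third output position.
    return (len(context) - 1) % 3 == 0


def toy_retrieve(context: List[str]) -> Dict[int, str]:
    # Toy retrieval: load the next three positions from DATABASE.
    start = len(context) - 1
    return {i: DATABASE[i] for i in range(start, min(start + 3, len(DATABASE)))}


def toy_next_token(loaded: Dict[int, str], context: List[str]) -> str:
    # Toy model: emit the loaded residue for this position, else stop.
    return loaded.get(len(context) - 1, "<end>")


result = generate_with_context_updates(
    "design-a-peptide", toy_next_token, toy_needs_update, toy_retrieve
)
print("".join(result))
```

In this sketch, generation proceeds uninterrupted between updates, and each update sees both the query and everything generated so far.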
G16B 40/00 - ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
G16B 50/30 - Data warehousing; Computing architectures
4.
CAM-guided transformers for AI-based protein and drug design
Systems, methods, and apparatus for peptide ligand and small molecule drug design, given a target protein sequence and structure, are presented. The methods use class activation mapping (CAM)-guided transformers to generate the ligand. Given a target protein structure, a CAM-guided structure refinement process is used to optimize the structure towards the desired ligand effect classification. The embedding of the target protein's refined structure, along with its residue embeddings, forms the input array into a transformer architecture.
G06F 30/27 - Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
Methods and apparatus for determining protein and ligand sequence, structure, and docking site given a target protein sequence and structure are presented. A multicapitate transformer architecture with a number of heads including a sequence head and a structure head is introduced, wherein given a target protein sequence and structure, a candidate ligand is generated, wherein the transformer's sequence head yields the ligand sequence and the structure head yields the ligand structure and docking site. Non-capitate weights are shared between the output heads. In one embodiment, a discriminative feature localization method is used to optimize the target protein's input structure representation towards the desired ligand effect class. The methods and apparatus presented enable design and synthesis of both peptide ligands and small molecule drugs each with specified ligand effect categories.
Methods and apparatus for protein and drug design using multicapitate (“two or more headed”) neural networks, wherein one head, a sequence head, is trained to generate the sequence of a protein, and another head, a structure head, is trained to generate the structure of the protein; and wherein the neural network is configured to accept a representation of a specified condition as input, and output a representation of a protein's sequence and structure. The structure head and sequence head each have their own loss functions, and the weights of the neural network body are shared and jointly updated during training. Non-limiting examples of specified input conditions include representations of associated proteins and/or sets of properties of the desired output protein. Some embodiments of the invention include the design and synthesis of effective peptide drug ligands, synthetic biologic antibody drugs, antibody drug conjugates, and monoclonal antibody (mAb) drugs.
Methods and apparatus for obtaining representations of proteins and small molecule drugs for synthesis; wherein input queries into trained mixed modality protein and natural language models are augmented with relevant query-related documents. In one embodiment, the relevant query-related documents are obtained by maximum inner product search of an embedding latent vector space into which the query and the documents are projected. The top-k most relevant documents to the query are then combined with the query as input into the trained mixed modality language model. In one embodiment, the mixed modality model is an autoregressive multicapitate transformer whose decoder output heads correspond to the represented modalities. The method returns mixed modality output representations of proteins or small molecule drugs for synthesis or manufacture.
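The retrieval step in this abstract (maximum inner product search followed by top-k augmentation) can be sketched with an exhaustive search over hand-written vectors. This is a minimal illustration under assumptions: the embeddings, the document texts, and the names `top_k_documents` and `augment_query` are hypothetical, and a practical system would likely use a learned encoder and an approximate index rather than a full scan.

```python
from typing import List, Sequence


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def top_k_documents(
    query_vec: Sequence[float], doc_vecs: List[Sequence[float]], k: int
) -> List[int]:
    """Exhaustive maximum inner product search; returns document indices."""
    scored = [(inner_product(query_vec, d), i) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda s: (-s[0], s[1]))  # highest score first
    return [i for _, i in scored[:k]]


def augment_query(query: str, documents: List[str], indices: List[int]) -> str:
    """Combine the retrieved documents with the query as one model input."""
    return "\n".join([documents[i] for i in indices] + [query])


# Hypothetical documents and embeddings in a shared 3-dimensional space.
documents = ["kinase inhibitor notes", "antibody CDR loop data", "zinc finger motifs"]
doc_vecs = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.3], [0.0, 0.2, 0.9]]
query = "design an antibody binder"
query_vec = [0.2, 0.9, 0.3]

top = top_k_documents(query_vec, doc_vecs, k=2)
model_input = augment_query(query, documents, top)
print(top)
print(model_input)
```

The augmented string is what would be passed to the mixed modality model in place of the bare query.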
Methods and apparatus using a mixture of representation modalities including natural language, protein sequence, protein structure, property-vector, and small molecule drug representations to jointly train a neural network which accepts mixed modality queries as input and produces mixed modality output responses including representations of proteins for synthesis and of small molecule drugs for manufacture. In one embodiment of the invention, multicapitate transformers wherein each decoder head has a distinct loss function and represents a distinct modality, are used. Modality-specific embeddings are implemented for the mixed modality input query, and an autoregressive process yields the output protein for synthesis or small molecule drug for manufacture.
Methods and apparatus for obtaining representations of proteins and small molecule drugs for synthesis; wherein pre-trained mixed modality protein and natural language fusion models are further trained by supervised fine-tuning using pairs of reasoning-oriented queries and chain-of-thought (CoT) responses. The resulting reasoning-oriented neural network is then used to obtain representations of output proteins or small molecule drugs, in response to mixed modality reasoning-oriented input queries specifying conditions on the output. In one embodiment, the neural network is an autoregressive multicapitate transformer whose decoder output heads correspond to the represented modalities. The method returns mixed modality output representations of proteins or small molecule drugs for synthesis or manufacture.
Methods and apparatus for determining a representation of a protein-protein complex, given a constituent target complex of the protein-protein complex, are presented; wherein the constituent target complex is some subset of the protein-protein complex. A recursive transformer neural network is devised, wherein for each iteration of the recursion, a representation of the output constituent protein complexed with the input constituent target complex is passed into the transformer as input for the next iteration. Some embodiments of the invention include the design and manufacturing of effective synthetic biologic drugs, monoclonal antibody (mAb) drugs, antibody drug conjugates (ADCs), peptide ligand drugs, and small molecule drugs (SMDs).
Methods and apparatus for determining protein and ligand structure, for identifying ligand docking sites, and for obtaining both peptide and non-peptide drug ligand candidates for target proteins are presented. Methods include receiving a plurality of protein-ligand complex structures at a processor, converting to volumetric probability representation, and generating a training dataset by sequentially transforming the voxel-wise probability distributions. A discrepancy measure between consecutive transformations is bounded; that discrepancy measure between each state and the final diffused state progressively decreases; and localization probability of each residue summed over the diffusion volume is constant. A neural network is trained to learn protein and ligand residue localization, given a diffused representation. The methods serve to generate a protein structure given its sequence; or to generate a candidate ligand structure for a given target protein, given only ligand residue composition; or to determine promising candidate peptide and non-peptide drug ligands for synthesis.
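The three stated properties of the transformation sequence (bounded step-to-step discrepancy, decreasing discrepancy to the diffused state, and constant total localization probability) can be checked on a toy example. The sketch below is an assumption-laden stand-in, not the disclosed procedure: it uses a 1D periodic grid in place of a 3D voxel volume, a nearest-neighbour smoothing step, the L1 distance as the discrepancy measure, and the uniform distribution as the fully diffused state.

```python
from typing import List


def diffuse_step(p: List[float], alpha: float = 0.25) -> List[float]:
    """One smoothing step on a periodic grid; total probability is preserved."""
    n = len(p)
    return [
        (1 - 2 * alpha) * p[i] + alpha * p[(i - 1) % n] + alpha * p[(i + 1) % n]
        for i in range(n)
    ]


def l1_distance(a: List[float], b: List[float]) -> float:
    return sum(abs(x - y) for x, y in zip(a, b))


def diffusion_trajectory(p0: List[float], steps: int, alpha: float = 0.25) -> List[List[float]]:
    states = [list(p0)]
    for _ in range(steps):
        states.append(diffuse_step(states[-1], alpha))
    return states


# One residue localized entirely in voxel 3 of an 8-voxel grid.
p0 = [0.0] * 8
p0[3] = 1.0
states = diffusion_trajectory(p0, steps=20)
uniform = [1.0 / 8] * 8  # assumed fully diffused state

totals = [sum(s) for s in states]
step_sizes = [l1_distance(states[t], states[t + 1]) for t in range(20)]
to_uniform = [l1_distance(s, uniform) for s in states]
print(totals[-1], max(step_sizes), to_uniform[0], to_uniform[-1])
```

For this particular step, the L1 change between consecutive states cannot exceed `4 * alpha`, and the distance to the uniform state never increases because the smoothing weights are non-negative and sum to one in both directions.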
Methods, systems, and apparatus for determining a conformational structure of a protein by using discriminative feature localization to iteratively update the protein structure locally, optimizing with respect to a physical or biological property of the structure representation. In one aspect, a method comprises initializing a plurality of structure parameters; selecting a physical or biological property of interest; training a neural network to score protein structural conformations on their measure of the selected property; using the neural network to perform inference yielding both a classification score and a discriminative feature localization map; and iteratively updating the structure parameters over the discriminative feature map, optimizing with respect to the physical or biological property of interest.