A computer-implemented method for identifying off-target proteins comprises: receiving an indication of a first protein comprising residues of interest for targeting; receiving data indicative of a first whole protein sequence corresponding to the first protein; comparing the first whole protein sequence against a protein sequence database to identify whole protein sequences of other proteins having a threshold level of sequence resemblance to the first whole protein sequence; performing multiple sequence alignment on the other whole protein sequences with respect to the first whole protein sequence; identifying residues within each of the aligned whole protein sequences which positionally correspond with the residues of interest in the first whole protein sequence; determining a measure of similarity between the first protein and each other protein; and identifying one or more of the other proteins as off-target proteins with respect to the drug target based on the measures of similarity.
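The positional-correspondence step in the abstract above can be illustrated with a minimal sketch. It assumes the sequences have already been aligned (gaps written as `-`) and uses the fraction of identical residues at the positions of interest as the similarity measure. The function names, the gap symbol and the choice of similarity measure are illustrative assumptions and are not taken from the patent.

```python
from typing import Dict, List

GAP = "-"  # assumed gap symbol in the aligned sequences


def column_of_residue(aligned_seq: str) -> Dict[int, int]:
    """Map 1-based residue numbers of a sequence to 0-based alignment columns."""
    mapping: Dict[int, int] = {}
    residue_number = 0
    for column, symbol in enumerate(aligned_seq):
        if symbol != GAP:
            residue_number += 1
            mapping[residue_number] = column
    return mapping


def corresponding_residues(aligned_ref: str, aligned_other: str,
                           residues_of_interest: List[int]) -> str:
    """Residues (or gaps) in the other sequence at the reference positions of interest."""
    columns = column_of_residue(aligned_ref)
    return "".join(aligned_other[columns[r]] for r in residues_of_interest)


def site_identity(aligned_ref: str, aligned_other: str,
                  residues_of_interest: List[int]) -> float:
    """Hypothetical similarity measure: fraction of identical residues at the site."""
    columns = column_of_residue(aligned_ref)
    matches = sum(
        1 for r in residues_of_interest
        if aligned_ref[columns[r]] == aligned_other[columns[r]]
    )
    return matches / len(residues_of_interest)


# Toy alignment: residues 2, 3, 5 and 6 of the reference form the site of interest.
reference = "MK-TAYIA"
candidate = "MKQTA-LA"
site = [2, 3, 5, 6]
print(corresponding_residues(reference, candidate, site))
print(site_identity(reference, candidate, site))
```

Proteins whose score exceeds a chosen cut-off could then be flagged as potential off-targets; the cut-off value is a design choice that the abstract does not specify.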
A computer-implemented method of training a machine learning model to identify biological entities for drug discovery is disclosed. The method comprises providing a training data set comprising a plurality of entity-linked text sequences, each text sequence including a mention of a biological entity, where the biological entity is linked to a corresponding biological entity identifier from a set of possible biological entity identifiers; masking the mention of the biological entity within each text sequence; encoding each masked text sequence into an input representation for a machine learning model; and training a machine learning model to predict the unique entity identifier of the masked biological entity based on the input representation. The described method is able to utilise the full breadth of the rich contextual information available in the biomedical text corpus to predict new biological targets for drug discovery and avoids the restrictions intrinsic to relationship prediction using knowledge graphs. The ability to identify more promising, biologically relevant targets in an automated manner significantly reduces the requirement for human input and reduces the failure rate of targets that are progressed in the drug delivery pipeline.
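The masking step described in the abstract above can be sketched as follows. This is a minimal illustration only: the `[MASK]` token, the `LinkedMention` record and the identifier `ENTITY:0001` are hypothetical, and the encoding and training steps are not shown.

```python
from typing import NamedTuple, Tuple

MASK_TOKEN = "[MASK]"  # assumed placeholder; the abstract does not name a token


class LinkedMention(NamedTuple):
    """One entity-linked text sequence (hypothetical record layout)."""
    text: str
    start: int       # character offset where the mention begins
    end: int         # character offset just past the mention
    entity_id: str   # identifier the mention is linked to


def mask_mention(example: LinkedMention) -> Tuple[str, str]:
    """Replace the mention with the mask token; return (masked text, label)."""
    masked = example.text[:example.start] + MASK_TOKEN + example.text[example.end:]
    return masked, example.entity_id


example = LinkedMention(
    text="Inhibition of EGFR reduced tumour growth.",
    start=14,
    end=18,
    entity_id="ENTITY:0001",  # made-up identifier for illustration
)
masked_text, label = mask_mention(example)
print(masked_text)
print(label)
```

Each (masked text, identifier) pair would then serve as one training example, with the identifier acting as the class label the model learns to predict.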
G16H 50/70 - ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
G16B 15/30 - Drug targeting using structural data; Docking or binding prediction
A computer-implemented method for determining a measure of relative gene expression is disclosed. The method comprises: receiving a plurality of gene expression datasets, wherein each gene expression dataset comprises gene expression levels for a respective sample, and wherein the plurality of gene expression datasets are all measured using a first transcriptomic platform; computing a distribution of gene expression levels across the plurality of gene expression datasets; fitting a number of Gaussian components to the distribution of gene expression levels using a Gaussian mixture model; defining, based on the fitted Gaussian components, a set of relative gene expression thresholds for the plurality of gene expression datasets; and determining a measure of relative gene expression for each of a plurality of genes across the plurality of gene expression datasets based on the set of relative gene expression thresholds.
A computer-implemented method of predicting a biological entity meeting a user-defined biological requirement using a knowledge base, the method comprising: providing an inference knowledge base comprising a corpus of textual data; receiving a user query defining a biological requirement for which a biological entity is to be predicted; obtaining, based on the query, a query sentence text describing the biological requirement and including mention of a biological entity, in which the biological entity itself is masked for prediction; selecting a candidate biological entity for the masked biological entity and retrieving a plurality of evidence sentences from the knowledge base, each evidence sentence including mention of the candidate biological entity, wherein the evidence sentences are retrieved based on computing a similarity of the query sentence to sentences within the knowledge base; inputting each training query sentence and a plurality of retrieved evidence sentences into a reasoner model, where mention of the candidate biological entity is masked in the query sentence and evidence sentences, the reasoner model trained to predict a probability that the candidate biological entity is the masked biological entity based on the retrieved evidence sentences.
G16H 50/70 - ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
Disclosed herein are methods and systems of SAMMI, a machine learning-based workflow that uses human annotations as labels for training models, used to predict human-based annotations for drug discovery. SAMMI receives an input to a model trained using human-annotated data, wherein the human-annotated data comprises at least one annotation associated with a triage-progressability annotation of whether to progress the input for the drug discovery. SAMMI also receives a set of features. The set of features is associated with the input, the model, and the triage-progressability of the input. The set of features is applied to the model to predict whether the input is triage-progressible. A model output is provided based on the prediction.
A computer-implemented method of training a machine learning model to predict a clinical outcome or characteristic based on a patient's clinical history is disclosed. The method comprises: providing training data comprising structured electronic health record data for a plurality of patients, the structured electronic health record data comprising a plurality of clinical observations, each clinical observation having a text description and an associated time stamp, wherein the training data for each patient is labelled with one or more labels, each representing a clinical outcome or characteristic; converting each patient's electronic health record data into a text sequence comprising the text descriptions concatenated in sequence of the time stamps; inputting the text sequence into a machine learning model; and training the machine learning model to predict a clinical outcome or characteristic based on the input text sequence.
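The conversion of a structured record into a time-ordered text sequence, as described in the abstract above, can be sketched in a few lines. The ISO-format time stamps, the separator string and the function name `record_to_text` are assumptions made for illustration; the abstract does not prescribe them.

```python
from datetime import datetime
from typing import List, Tuple

# Each observation is (ISO time stamp, text description); this layout is assumed.
Observation = Tuple[str, str]


def record_to_text(observations: List[Observation], separator: str = " ; ") -> str:
    """Concatenate the text descriptions in order of their time stamps."""
    ordered = sorted(observations, key=lambda obs: datetime.fromisoformat(obs[0]))
    return separator.join(description for _, description in ordered)


patient_record = [
    ("2021-03-01", "metformin prescribed"),
    ("2019-06-12", "elevated blood pressure"),
    ("2020-01-20", "type 2 diabetes diagnosed"),
]
print(record_to_text(patient_record))
```

The resulting string, paired with the patient's outcome labels, would form one training example for a text-based model.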
G16H 50/70 - ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
Methods and apparatus are provided for generating an embedding of a graph. The graph includes a plurality of nodes, and each node includes a connection to one or more other nodes. The method includes, and/or the apparatus is configured for: receiving data representative of at least a portion of the graph; transforming the nodes of the graph into a non-Euclidean geometry; and iteratively updating an embedding model based on the transformed nodes in the non-Euclidean geometry and on a causal loss function and a link prediction function associated with the non-Euclidean geometry.
Embodiments of the present disclosure provide a system, apparatus and method(s) for generating a set of metrics for evaluating entities used with a predictive machine learning model, the method comprising: selecting one or more sets of entities from a data source for generating a plurality of predictions aggregated from said one or more sets of entities using one or more pre-trained predictive models; selecting a subset of predictions from the plurality of predictions based on said one or more sets of entities in relation to the data source; extracting metadata from the data source associated with the subset of predictions, where the metadata comprises entity metadata and predicted metadata; generating the set of metrics based on the metadata extracted and the subset of predictions; and outputting the set of metrics for evaluation.
G16B 40/00 - ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
G16B 50/30 - Data warehousing; Computing architectures
G16B 15/30 - Drug targeting using structural data; Docking or binding prediction
Methods, apparatus, systems and a computer-implemented method are provided for automatically extracting entities associated with one or more domain(s) of interest from a corpus of text. A plurality of portions of text are received from the corpus of text, each portion of text comprising data representative of at least two entities and/or relationships thereto. For each received portion of text, one or more subject-verb-object (SVO) entity data item(s) are identified, each comprising data representative of at least two entities, a relationship associated with the at least two entities, a subject entity corresponding to an entity of said at least two entities, an object entity corresponding to an entity of the at least two entities, a verb portion associated with the relationship, and a direction of the relationship associated with the at least two entities. A graph structure based on the set of identified SVO entity data items is output, the graph structure comprising a graph of entity nodes and relationship edges linking the entity nodes, with each relationship edge including an indication of directionality of said relationship.
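The final step of the abstract above, turning SVO data items into a directed graph, can be sketched as follows. The sketch assumes the SVO items have already been extracted; the `SVO` record, the adjacency-list representation and the example entities are illustrative assumptions rather than the patented implementation.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class SVO(NamedTuple):
    """One subject-verb-object item; direction runs from subject to object."""
    subject: str
    verb: str
    obj: str


def build_graph(items: List[SVO]) -> Dict[str, List[Tuple[str, str]]]:
    """Build a directed adjacency list: node -> [(verb, target node), ...]."""
    graph: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for item in items:
        graph[item.subject].append((item.verb, item.obj))
        graph.setdefault(item.obj, [])  # make sure object-only entities are nodes too
    return dict(graph)


items = [
    SVO("drug A", "inhibits", "protein B"),
    SVO("protein B", "activates", "pathway C"),
]
for node, edges in build_graph(items).items():
    print(node, edges)
```

Storing each edge on its subject node keeps the direction of the relationship explicit, which is the property the abstract requires of the relationship edges.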
G16B 40/00 - ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
Methods, apparatus, systems and computer-implemented method(s) are provided for creating a graph of entities of interest and relationships thereto. A search query corresponding to entities of interest is received, the search query including data representative of a first set of entities. An expanded search query is generated based on inputting the received search query to one or more entity expansion process(es) or engine(s), the expanded search query including data representative of a second set of entities and the first set of entities. A graph of entities of interest and relationships thereto is created based on processing the expanded search query with data representative of a corpus of text. The graph can be created by processing the expanded search query to filter an existing graph of entities of interest and relationships thereto, the existing graph having been previously generated based on the corpus of text.
A computer-implemented method of querying a graph to assess relationships amongst graph nodes comprises determining a query node on the graph, identifying one or more target nodes on the graph in relation to the query node based on a set of connectivity patterns; generating graph-based statistics for each target node of the one or more target nodes, wherein the graph-based statistics are extracted for subgraphs associated with each target node and the query node; and assessing the graph-based statistics of each target node to determine predicted relationships between the one or more target nodes and the query node.
Method(s), apparatus, and system(s) are provided for selecting a data model configuration for use in training predictive models, comprising: receiving two or more data model configurations; extracting a data model for each of the two or more data model configurations from a knowledge graph; generating a separate predictive model for each of the extracted data models; scoring the output of each separate predictive model based on a benchmark data set; and selecting at least one data model configuration of the two or more data model configurations based on the output scores.
A system for identifying a target for the treatment of a primary disease is provided. The system comprises: an input module configured to receive data for studying the primary disease, the data relating to individuals of a cohort; an encoder configured to use machine learning to encode the data as latent variables; an interpretation module configured to interpret the latent variables to stratify the individuals of the cohort into endotypes of the primary disease; and an identification module configured to identify a target that is associated with one of the endotypes.
G16H 50/20 - ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
G16H 70/60 - ICT specially adapted for the handling or processing of medical references relating to pathologies
G16H 10/60 - ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
15.
DISTRIBUTIONS OVER LATENT POLICIES FOR HYPOTHESIZING IN NETWORKS
Embodiments of the present disclosure provide a system, apparatus and method(s) for determining one or more target nodes and associated paths from a query of a graph structure. The method receives the query to the graph structure, where the query comprises a data representation of at least one query node. The method identifies one or more target nodes in response to the query based on a policy network, where the policy network is configured to determine the one or more target nodes in accordance with a latent policy distribution associated with the policy network. The method traverses the graph structure by a search in relation to the policy network, where the search is configured to navigate from the query node to the one or more identified target nodes to determine the associated paths. The method outputs a list of the one or more target nodes and the associated paths for the query, where the list is ranked in relation to the latent policy distribution.
A computer-implemented method of stratifying a population of patients into disease endotypes is provided. The method comprises: encoding data relating to the patients as latent variables; determining one or more importance measures of the latent variables; prioritising the latent variables using the importance measures; interpreting one or more of the ranked latent variables; and identifying a disease endotype that is represented by one or more of the interpreted latent variables.
G16H 50/70 - ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
17.
METHOD AND SYSTEM FOR IDENTIFYING BIOLOGICAL ENTITIES FOR DRUG DISCOVERY
A computer-implemented method of training a machine learning model to identify biological entities for drug discovery is disclosed. The method comprises providing a training data set comprising a plurality of entity-linked text sequences, each text sequence including a mention of a biological entity, where the biological entity is linked to a corresponding biological entity identifier from a set of possible biological entity identifiers; masking the mention of the biological entity within each text sequence; encoding each masked text sequence into an input representation for a machine learning model; and training a machine learning model to predict the unique entity identifier of the masked biological entity based on the input representation. The described method is able to utilise the full breadth of the rich contextual information available in the biomedical text corpus to predict new biological targets for drug discovery and avoids the restrictions intrinsic to relationship prediction using knowledge graphs. The ability to identify more promising, biologically relevant targets in an automated manner significantly reduces the requirement for human input and reduces the failure rate of targets that are progressed in the drug delivery pipeline.
A computer-implemented method and a system for selecting a cell line for an assay are provided. The computer-implemented method and system encode data, which comprises one or more features, as one or more latent variables. The one or more features encoded in the one or more latent variables are identified and mapped to cell lines based on the one or more features. A relevance of one or more targets to each of one or more of the latent variables is determined, and the one or more targets are matched to the cell lines via the one or more latent variables.
A computer-implemented method of prioritising biological targets is disclosed. The method comprises: receiving a selection of classes of one or more categories; and, for each of a plurality of biological targets, determining an extent of alignment of the biological target to each selected class. The method also comprises prioritising the biological targets based on the extents of alignment; and outputting a representation of one or more prioritised biological targets.
G16B 5/00 - ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
G16B 20/00 - ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
A computer-implemented method of designing a molecule and determining a route to synthesise the molecule is provided. The method comprises: receiving one or more desired properties of the molecule; generating one or more candidate molecules using a first machine learning technique that uses the one or more desired properties of the molecule as an input; and for at least one candidate molecule, computing one or more routes to synthesise the candidate molecule using a second machine learning technique.
A computer-implemented method of identifying a tool compound is provided. The method comprises: searching a database for first candidate compounds that each target one or more first target genes; generating a first fingerprint for each first candidate compound by: searching the database for genes associated with the first candidate compound, and predicting genes associated with the first candidate compound; and filtering the first candidate compounds using the first fingerprints to identify a first optimum compound for targeting the one or more first target genes.
G16H 50/70 - ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
Disclosed herein are methods and systems of SAMMI, a machine learning-based workflow that uses human annotations as labels for training models, used to predict human-based annotations for drug discovery. SAMMI receives an input to a model trained using human-annotated data, wherein the human-annotated data comprises at least one annotation associated with a triage-progressability annotation of whether to progress the input for the drug discovery. SAMMI also receives a set of features. The set of features is associated with the input, the model, and the triage-progressability of the input. The set of features is applied to the model to predict whether the input is triage-progressible. A model output is provided based on the prediction.
G16H 10/20 - ICT specially adapted for the handling or processing of patient-related medical or healthcare data for electronic clinical trials or questionnaires
G16H 70/40 - ICT specially adapted for the handling or processing of medical references relating to drugs, e.g. their side effects or intended usage
23.
EVALUATION FRAMEWORK FOR TARGET IDENTIFICATION IN PRECISION MEDICINE
A computer-implemented method for evaluating a target identification workflow in precision medicine is provided. The target identification workflow comprises: an endotype detection module configured to detect endotypes from cohort data, and a target prediction module configured to predict targets for each of the endotypes. The method comprises: mapping endotypes detected by the endotype detection module to assays; assessing targets predicted by the target prediction module for endotype specificity; and evaluating the workflow for its ability to predict endotype-specific targets. It is intended that the abstract, when published, will be accompanied by Figure 6.
G16B 5/00 - ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
G16B 25/10 - Gene or protein expression profiling; Expression-ratio estimation or normalisation
A computer-implemented method of electronically mining medical and scientific datasets to determine a ranking indicating a level of evidence for an association between two entities is disclosed. The method comprises receiving a representation of an entity pair, performing first data mining on one or more unstructured datasets to generate one or more first scores each representing an extent of association between the entities of the entity pair, and performing second data mining on one or more structured datasets to generate one or more second scores each representing an extent of association between the entities of the entity pair. The method also comprises using a classifier to determine a predicted ranking for the entity pair using the one or more first scores and the one or more second scores, and providing the predicted ranking to a user as an indication of the strength of evidence for an association between the entities of the entity pair.
G16H 10/20 - ICT specially adapted for the handling or processing of patient-related medical or healthcare data for electronic clinical trials or questionnaires
G16H 50/70 - ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
Methods and apparatus are provided for generating an embedding of a graph. The graph includes a plurality of nodes, and each node includes a connection to one or more other nodes. The method includes, and/or the apparatus is configured for: receiving data representative of at least a portion of the graph; transforming the nodes of the graph into a non-Euclidean geometry; and iteratively updating an embedding model based on the transformed nodes in the non-Euclidean geometry and on a causal loss function and a link prediction function associated with the non-Euclidean geometry.
Embodiments of the present disclosure provide a system, apparatus and method(s) for generating a set of metrics for evaluating entities used with a predictive machine learning model, the method comprising: selecting one or more sets of entities from a data source for generating a plurality of predictions aggregated from said one or more sets of entities using one or more pre-trained predictive models; selecting a subset of predictions from the plurality of predictions based on said one or more sets of entities in relation to the data source; extracting metadata from the data source associated with the subset of predictions, where the metadata comprises entity metadata and predicted metadata; generating the set of metrics based on the metadata extracted and the subset of predictions; and outputting the set of metrics for evaluation.
G16B 5/00 - ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
G16H 50/20 - ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
G16B 40/00 - ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
Method(s), apparatus and system(s) are provided for entity type identification and/or disambiguation of entities within a corpus of text, the method including: receiving one or more entity results, each entity result comprising data representative of an identified entity and a location of the identified entity within the corpus of text; identifying an entity type for each entity of the received entity results by inputting text associated with the location of said each entity in the corpus of text to a trained entity type (ET) model configured for predicting or extracting an entity type of said each entity from the corpus of text; and outputting data representative of the identified entity type of each entity in the received entity results.
Systems, methods and apparatus are provided for identifying entities in a corpus of text. The system comprises: a first named entity recognition (NER) system comprising one or more entity dictionaries, the first NER system configured to identify entities and/or entity types within a corpus of text based on the one or more entity dictionaries; a second NER system comprising an NER model configured for predicting entities and/or entity types within the corpus of text; and a comparison module configured for identifying entities based on comparing the entity results output from the first and second NER systems, where the identified entities are different to the entities identified by the first NER system. The system may further include an updating module configured to update the one or more entity dictionaries based on the identified entities. The system may further include a dictionary building module configured to build a set of entity dictionaries based on at least the identified entities. The system may further comprise a training module configured to generate or update the NER model by training a machine learning (ML) technique for predicting entities and/or entity types from the corpus of text using a training dataset based on data representative of the identified entities and/or entity types.
A system for identifying a target for the treatment of a primary disease is provided. The system comprises: an input module configured to receive data for studying the primary disease, the data relating to individuals of a cohort; an encoder configured to use machine learning to encode the data as latent variables; an interpretation module configured to interpret the latent variables to stratify the individuals of the cohort into endotypes of the primary disease; and an identification module configured to identify a target that is associated with one of the endotypes.
G16H 20/10 - ICT specially adapted for therapies or health-improving plans, e.g. for handling prescriptions, for steering therapy or for monitoring patient compliance relating to drugs or medications, e.g. for ensuring correct administration to patients
G16H 50/70 - ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
G06N 3/04 - Architecture, e.g. interconnection topology
Embodiments of the present disclosure provide a system, apparatus and method(s) for determining one or more target nodes and associated paths from a query of a graph structure. The method receives the query to the graph structure, where the query comprises a data representation of at least one query node. The method identifies one or more target nodes in response to the query based on a policy network, where the policy network is configured to determine the one or more target nodes in accordance with a latent policy distribution associated with the policy network. The method traverses the graph structure by a search in relation to the policy network, where the search is configured to navigate from the query node to the one or more identified target nodes to determine the associated paths. The method outputs a list of the one or more target nodes and the associated paths for the query, where the list is ranked in relation to the latent policy distribution.
Method(s), apparatus, and system(s) are provided for selecting a data model configuration for use in training predictive models, comprising: receiving two or more data model configurations; extracting a data model for each of the two or more data model configurations from a knowledge graph; generating a separate predictive model for each of the extracted data models; scoring the output of each separate predictive model based on a benchmark data set; and selecting at least one data model configuration of the two or more data model configurations based on the output scores.
A computer-implemented method of querying a graph to assess relationships amongst graph nodes comprises determining a query node on the graph, identifying one or more target nodes on the graph in relation to the query node based on a set of connectivity patterns; generating graph-based statistics for each target node of the one or more target nodes, wherein the graph-based statistics are extracted for subgraphs associated with each target node and the query node; and assessing the graph-based statistics of each target node to determine predicted relationships between the one or more target nodes and the query node.
A computer-implemented method of training a machine learning model to learn ligand binding similarities between protein binding sites is disclosed. The method comprises inputting to the machine learning model: a representation of a first binding site; a representation of a second binding site, wherein the representations of the first and second binding sites comprise structural information; and a label comprising an indication of ligand binding similarity between the first binding site and the second binding site. The method also comprises outputting from the machine learning model a similarity indicator based on the representations of the first and second binding sites; performing a comparison between the similarity indicator and the label; and updating the machine learning model based on the comparison.
A computer-implemented method of stratifying a population of patients into disease endotypes is provided. The method comprises: encoding data relating to the patients as latent variables; determining one or more importance measures of the latent variables; prioritising the latent variables using the importance measures; interpreting one or more of the ranked latent variables; and identifying a disease endotype that is represented by one or more of the interpreted latent variables.
G16H 50/20 - ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
G16H 50/70 - ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
35.
AUTOMATIC QUERY CONSTRUCTION FOR KNOWLEDGE DISCOVERY
A system for discovering biological knowledge patterns of interest is described. The system comprises: a receive module configured to receive information defining a base pattern and a generalised base pattern, the base pattern comprising one or more entity nodes each representing a biological entity and one or more biological relationships indicated between the nodes, the generalised base pattern being related to the base pattern by virtue of replacing at least one entity node representing a respective biological entity by an associated set node representing a set of biological entities that includes the respective biological entity; a query module configured to generate a first query portion that, in combination with the generalised base pattern, defines a first query that retrieves a first set of results including the base pattern; and a control module configured to cause the query module to generate a second query portion that, in combination with the first query, defines a second query that retrieves a second set of results including the base pattern.
Methods, apparatus, systems and computer-implemented methods are provided for embedding a portion of text describing one or more entities of interest and a relationship. The portion of text describes a relationship for the one or more entity(ies) of interest, where the portion of text includes multiple separable entities describing the relationship and the entity(ies). The multiple separable entities include the one or more entity(ies) of interest and one or more relationship entity(ies). A set of embeddings for each of the separable entities is generated, where the set of embeddings for a separable entity includes an embedding for the separable entity and an embedding for at least one entity associated with the separable entity. One or more composite embeddings may be formed based on at least one embedding from each of the sets of embeddings. The composite embedding(s) may be sent for input to a machine learning model or classifier.
A computer-implemented method and a system of selecting a cell line for an assay are provided. The computer-implemented method and system encode data, which comprises one or more features, as one or more latent variables. The one or more features encoded in the one or more latent variables are identified and mapped to cell lines based on the one or more features. A relevance of one or more targets to each of one or more of the one or more latent variables is determined, and the one or more targets are matched to the cell lines via the one or more latent variables.
Methods, apparatus, systems and computer-implemented methods are provided for identifying candidate entities of interest associated with disease selection information. The method includes: receiving a first set of entities that are predicted to be associated with the disease selection information; retrieving a second set of entities that are known to be associated with the disease selection information; generating a set of entity mappings between entities of the first set of entities, entities of the second set of entities, and entities of a graph structure in relation to the disease selection information, the graph structure based on an entity hierarchy, ontology or taxonomy of an entity family associated with the first and second sets of entities; linking entities from the first and second sets of entities to the graph structure based on the generated set of entity mappings; and identifying candidate entities of interest from those linked entities of the first and second sets of entities on the graph structure based on determining where each entity from the first set of entities is located on the graph structure relative to one or more entities of the second set of entities on the graph structure.
Methods, apparatus, systems and computer-implemented methods are provided for automatically extracting entities associated with one or more domain(s) of interest from a corpus of text. A plurality of portions of text are received from the corpus of text, each portion of text comprising data representative of at least two entities and/or relationships thereto. For each received portion of text, one or more subject-verb-object (SVO) entity data item(s) are identified, each comprising data representative of at least two entities, a relationship associated with the at least two entities, a subject entity corresponding to an entity of said at least two entities, an object entity corresponding to an entity of the at least two entities, a verb portion associated with the relationship, and a direction of the relationship associated with the at least two entities. A graph structure based on the set of identified SVO entity data items is output, the graph structure comprising a graph of entity nodes and relationship edges linking the entity nodes, with each relationship edge including an indication of directionality of said relationship.
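The final step, turning SVO entity data items into a directed graph, might look like the sketch below. It assumes the SVO items have already been extracted from text; the dictionary keys (`subject`, `verb`, `object`, `direction`), the direction labels and the function `build_graph` are hypothetical choices made for this example.

```python
from typing import Dict, List, Tuple

def build_graph(svo_items: List[dict]) -> Dict[str, List[Tuple[str, str]]]:
    """Build a directed graph: node -> list of (verb, target) relationship edges."""
    graph: Dict[str, List[Tuple[str, str]]] = {}
    for item in svo_items:
        if item["direction"] == "forward":
            source, target = item["subject"], item["object"]
        else:
            # A "reverse" direction means the relationship points object -> subject,
            # as in a passive construction.
            source, target = item["object"], item["subject"]
        graph.setdefault(source, []).append((item["verb"], target))
        graph.setdefault(target, [])  # make sure every entity appears as a node
    return graph

# Hand-written SVO items standing in for the output of a text parser (assumed data).
svo_items = [
    {"subject": "imatinib", "verb": "inhibits", "object": "BCR-ABL", "direction": "forward"},
    {"subject": "KIT", "verb": "inhibits", "object": "imatinib", "direction": "reverse"},
]
graph = build_graph(svo_items)
print(graph)
```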
G16B 40/00 - ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
Methods, apparatus, systems and computer-implemented method(s) are provided for creating a graph of entities of interest and relationships thereto. A search query is received corresponding to entities of interest, the search query including data representative of a first set of entities. An expanded search query is generated based on inputting the received search query to one or more entity expansion process(es) or engine(s), the expanded search query including data representative of a second set of entities and the first set of entities. A graph of entities of interest and relationships thereto is created based on processing the expanded search query with data representative of a corpus of text. The graph may be created by processing the expanded search query to filter an existing graph of entities of interest and relationships thereto, where the existing graph was previously generated based on the corpus of text.
A computer-implemented method of prioritising biological targets is disclosed. The method comprises: receiving a selection of classes of one or more categories; and, for each of a plurality of biological targets, determining an extent of alignment of the biological target to each selected class. The method also comprises prioritising the biological targets based on the extents of alignment; and outputting a representation of one or more prioritised biological targets.
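A minimal sketch of this prioritisation is given below. It assumes that the extent of alignment is a number between 0 and 1 per class and that extents are combined by a simple mean; both choices, the function `prioritise` and the example scores are illustrative assumptions rather than details from the abstract.

```python
from typing import Dict, List

def prioritise(targets: Dict[str, Dict[str, float]],
               selected_classes: List[str]) -> List[str]:
    """Rank targets by their mean alignment to the selected classes."""
    if not selected_classes:
        raise ValueError("at least one class must be selected")
    ranked = []
    for target, alignments in targets.items():
        # A class with no recorded alignment counts as 0.0.
        extents = [alignments.get(cls, 0.0) for cls in selected_classes]
        ranked.append((sum(extents) / len(extents), target))
    # Highest mean alignment first; ties broken alphabetically for determinism.
    ranked.sort(key=lambda pair: (-pair[0], pair[1]))
    return [target for _, target in ranked]

# Made-up alignment scores per target and class (assumed data).
targets = {
    "TNF": {"inflammation": 0.9, "oncology": 0.2},
    "EGFR": {"inflammation": 0.1, "oncology": 0.95},
    "IL6": {"inflammation": 0.8},
}
by_inflammation = prioritise(targets, ["inflammation"])
by_both = prioritise(targets, ["inflammation", "oncology"])
print(by_inflammation, by_both)
```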
G16B 5/00 - ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
G16B 20/00 - ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
A computer-implemented method of designing a molecule and determining a route to synthesise the molecule is provided. The method comprises: receiving one or more desired properties of the molecule; generating one or more candidate molecules using a first machine learning technique that uses the one or more desired properties of the molecule as an input; and for at least one candidate molecule, computing one or more routes to synthesise the candidate molecule using a second machine learning technique.
Method(s), apparatus and system(s) are provided for generating and using an ensemble model. The ensemble may be generated by training a plurality of models based on a plurality of datasets associated with compounds; calculating model performance statistics for each of the plurality of trained models; selecting and storing a set of optimal trained model(s) from the trained models based on the calculated model performance statistics; and forming one or more ensemble models, each ensemble model comprising multiple models from the set of optimal trained model(s). The ensemble model may be used by retrieving the ensemble model and inputting, to the ensemble model, data representative of one or more labelled dataset(s) used to generate and/or train the model(s) of the ensemble model; and receiving, from the ensemble model, output data associated with labels of the one or more labelled dataset(s).
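The generate-then-use flow for the ensemble can be sketched as follows. The sketch assumes the models are already trained and are plain callables, uses mean absolute error as the performance statistic, and averages the selected models' outputs; the function `build_ensemble` and all toy models are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

def build_ensemble(models: Dict[str, Callable[[float], float]],
                   validation: List[Tuple[float, float]],
                   keep: int = 2):
    """Select the `keep` best models by mean absolute error and average them."""
    def mean_abs_error(model):
        errors = [abs(model(x) - y) for x, y in validation]
        return sum(errors) / len(errors)

    # Performance statistic per trained model, lowest error first.
    selected = sorted(models, key=lambda name: mean_abs_error(models[name]))[:keep]

    def ensemble(x: float) -> float:
        outputs = [models[name](x) for name in selected]
        return sum(outputs) / len(outputs)

    return selected, ensemble

# Stand-ins for models trained on different compound datasets (assumed).
models = {
    "low": lambda x: 0.5 * x,
    "good": lambda x: 1.0 * x,
    "close": lambda x: 1.1 * x,
    "bad": lambda x: -x,
}
validation = [(1.0, 1.0), (2.0, 2.0)]
selected, ensemble = build_ensemble(models, validation, keep=2)
print(selected, ensemble(10.0))
```

Averaging is only one way to combine the selected models; voting or stacking would fit the same selection step.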
Method(s), apparatus, and system(s) are provided for filtering a set of data, the set of data comprising multiple data instances, by: receiving a set of scores for the set of data; determining attention filtering information based on prior knowledge of one or more relationships between the data instances in said set of data and calculating attention relevancy weights corresponding to the data instances and the set of scores; and providing the attention filtering information to a machine learning (ML) technique or ML model.
Method(s), apparatus and system(s) are provided for designing a compound exhibiting one or more desired property(ies) using a machine learning (ML) technique. This may be achieved by generating a second compound using the ML technique to modify a first compound based on the desired property(ies) and a set of rules for modifying compounds; scoring the second compound based on the desired property(ies); determining whether to repeat the generating step based on the scoring; and updating the ML technique based on the scoring prior to repeating the generating step.
Methods and apparatus are provided for generating a graph neural network (GNN) model based on an entity-entity graph. The entity-entity graph comprises a plurality of entity nodes in which each entity node is connected to one or more entity nodes of the plurality of entity nodes by one or more corresponding relationship edges. The method comprises: generating an embedding based on data representative of the entity-entity graph for the GNN model, wherein the embedding comprises an attention weight assigned to each relationship edge of the entity-entity graph; and updating weights of the GNN model including the attention weights by minimising a loss function associated with at least the embedding; wherein the attention weights indicate the relevancy of each relationship edge between entity nodes of the entity-entity graph. The entity-entity graph may be filtered based on the attention weights of a trained GNN model. The filtered entity-entity graph may be used to update the GNN model or train another GNN model. The trained GNN model may be used to predict a link relationship between a first entity and a second entity associated with the entity-entity graph.
A system for determining biological entities of interest is described. The system comprises a user input module configured to receive a search term comprising a representation of a biological entity; a search module configured to determine which biological entities of a set have a known association with the biological entity of the search term, those having a known association being results and those not having a known association being non-results, wherein biological entities of the set are related to each other by parent-child relationships in a relationship tree; and an analysis module configured to determine biological entities of interest by identifying non-results that have one or more results within a boundary in the relationship tree.
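The analysis module's logic can be illustrated with a short sketch. It assumes the "boundary" is a maximum number of parent-child edges between a non-result and a result; that interpretation, the function `entities_of_interest` and the toy tree are assumptions for illustration.

```python
from collections import deque
from typing import Dict, Optional, Set

def entities_of_interest(parent: Dict[str, Optional[str]],
                         results: Set[str], boundary: int) -> Set[str]:
    """Return non-results lying within `boundary` tree edges of any result."""
    # Build an undirected adjacency map from the child -> parent map.
    neighbours: Dict[str, Set[str]] = {}
    for child, par in parent.items():
        neighbours.setdefault(child, set())
        if par is not None:
            neighbours[child].add(par)
            neighbours.setdefault(par, set()).add(child)

    interesting: Set[str] = set()
    for start in results:
        if start not in neighbours:
            continue
        seen = {start}
        queue = deque([(start, 0)])
        while queue:  # breadth-first search, capped at `boundary` edges
            node, dist = queue.popleft()
            if dist == boundary:
                continue
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, dist + 1))
        interesting |= seen - results
    return interesting

# Toy relationship tree as a child -> parent map (assumed data).
parent = {
    "root": None,
    "kinase": "root", "EGFR": "kinase", "ERBB2": "kinase",
    "GPCR": "root", "ADRB2": "GPCR",
}
near = entities_of_interest(parent, {"EGFR"}, boundary=1)
wider = entities_of_interest(parent, {"EGFR"}, boundary=2)
print(near, wider)
```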
Method(s), apparatus, and computer-implemented method(s) are provided for training a machine learning (ML) technique to generate a property model for predicting whether a compound has a particular property. An iterative procedure/feedback loop may be performed for generating the property model, the procedure including: generating a prediction result list for a plurality of compounds and their association with the particular property based on the property model; validating the property model based on compounds from the prediction result list having an association with the particular property; and updating the property model based on the property model validation. The procedure/loop may be repeated using the updated property model until it is determined the property model has been validly trained. The property model validation may include selecting a shortlist of compounds, performing simulation analysis and/or laboratory analysis on the shortlist of compounds in relation to the particular property and using the simulation and/or laboratory results in updating the property model.
A computer-implemented method of electronically mining medical and scientific datasets to determine a ranking indicating a level of evidence for an association between two entities is disclosed. The method comprises receiving a representation of an entity pair, performing first data mining on one or more unstructured datasets to generate one or more first scores each representing an extent of association between the entities of the entity pair, and performing second data mining on one or more structured datasets to generate one or more second scores each representing an extent of association between the entities of the entity pair. The method also comprises using a classifier to determine a predicted ranking for the entity pair using the one or more first scores and the one or more second scores, and providing the predicted ranking to a user as an indication of the strength of evidence for an association between the entities of the entity pair.
G16H 50/70 - ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
Method(s) and apparatus are provided for generating a selection model based on a machine learning (ML) technique, the selection model for selecting a shortlist of compounds requiring validation with a particular property. An iterative procedure or feedback loop for generating the selection model may include: receiving a prediction result list output from a property model for predicting whether a plurality of compounds are associated with a particular property and a property model score; retraining the selection model based on the property model score and/or the prediction result list; selecting a shortlist of compounds using the retrained selection model from the plurality of compounds associated with the prediction result list; sending the selected shortlist of compounds for validation with the particular property, where another ML technique is used to update the property model based on the validation; and repeating the receiving and retraining of the selection model until determining the selection model has been validly trained.
A computer-implemented method of identifying a tool compound is provided. The method comprises: searching a database for first candidate compounds that each target one or more first target genes; generating a first fingerprint for each first candidate compound by: searching the database for genes associated with the first candidate compound, and predicting genes associated with the first candidate compound; and filtering the first candidate compounds using the first fingerprints to identify a first optimum compound for targeting the one or more first target genes.
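A rough sketch of the fingerprint-and-filter idea follows. It treats a fingerprint as the union of known and predicted gene associations and takes the "optimum" compound to be the candidate with the fewest off-target genes; that criterion, the function `select_tool_compound` and the example data are assumptions, since the abstract does not define how the filtering is scored.

```python
from typing import Dict, Optional, Set

def select_tool_compound(known: Dict[str, Set[str]],
                         predicted: Dict[str, Set[str]],
                         target_genes: Set[str]) -> Optional[str]:
    """Pick the candidate whose fingerprint has the fewest off-target genes."""
    best, best_off_target = None, None
    for compound in sorted(known):
        # Candidates must be known to hit at least one target gene.
        if not (known[compound] & target_genes):
            continue
        # Fingerprint: genes found in the database plus predicted genes.
        fingerprint = known[compound] | predicted.get(compound, set())
        off_target = len(fingerprint - target_genes)
        if best_off_target is None or off_target < best_off_target:
            best, best_off_target = compound, off_target
    return best

# Made-up database and prediction results (assumed data).
known = {"cmpdA": {"EGFR", "KDR"}, "cmpdB": {"EGFR"}, "cmpdC": {"BRAF"}}
predicted = {"cmpdA": set(), "cmpdB": {"ERBB2", "SRC"}}
choice = select_tool_compound(known, predicted, {"EGFR"})
print(choice)
```

Note that `cmpdB` looks cleaner in the database alone, and it is the predicted associations that change the ranking.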
G16H 50/70 - ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
G16B 15/30 - Drug targeting using structural data; Docking or binding prediction
Method(s), apparatus and system(s) are provided for entity type identification and/or disambiguation of entities within a corpus of text. The method includes: receiving one or more entity results, each entity result comprising data representative of an identified entity and a location of the identified entity within the corpus of text; identifying an entity type for each entity of the received entity results by inputting text associated with the location of said each entity in the corpus of text to a trained entity type (ET) model configured for predicting or extracting an entity type of said each entity from the corpus of text; and outputting data representative of the identified entity type of each entity in the received entity results.
Systems, methods and apparatus are provided for identifying entities in a corpus of text. The system comprises: a first named entity recognition (NER) system comprising one or more entity dictionaries, the first NER system configured to identify entities and/or entity types within a corpus of text based on the one or more entity dictionaries; a second NER system comprising an NER model configured for predicting entities and/or entity types within the corpus of text; and a comparison module configured for identifying entities based on comparing the entity results output from the first and second NER systems, where the identified entities are different to the entities identified by the first NER system. The system may further include an updating module configured to update the one or more entity dictionaries based on the identified entities. The system may further include a dictionary building module configured to build a set of entity dictionaries based on at least the identified entities. The system may further comprise a training module configured to generate or update the NER model by training a machine learning (ML) technique for predicting entities and/or entity types from the corpus of text using a training dataset based on data representative of the identified entities and/or entity types.
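The comparison and dictionary-update steps can be sketched as below. The dictionary matcher is a deliberately naive token lookup, and the second NER system is represented by a hand-written result set rather than a trained model; the names `dictionary_ner` and `novel_entities` and the example sentence are hypothetical.

```python
from typing import Dict, Set, Tuple

Entity = Tuple[str, str]  # (surface form, entity type)

def dictionary_ner(text: str, dictionary: Dict[str, str]) -> Set[Entity]:
    """First NER system: naive lower-cased token lookup in an entity dictionary."""
    tokens = text.lower().replace(".", " ").replace(",", " ").split()
    return {(token, dictionary[token]) for token in tokens if token in dictionary}

def novel_entities(dict_results: Set[Entity], model_results: Set[Entity]) -> Set[Entity]:
    """Comparison module: entities the model found that the dictionary missed."""
    return model_results - dict_results

text = "Imatinib inhibits BCR-ABL and KIT in leukaemia."
dictionary = {"imatinib": "drug", "kit": "gene"}

dict_results = dictionary_ner(text, dictionary)
# Stand-in for the output of a trained NER model (assumed, not computed here).
model_results = {("imatinib", "drug"), ("bcr-abl", "gene"), ("kit", "gene")}

new_entities = novel_entities(dict_results, model_results)
# Updating module: fold the newly identified entities back into the dictionary.
dictionary.update(dict(new_entities))
print(new_entities)
```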
G06N 5/00 - Computing arrangements using knowledge-based models
G16C 20/70 - Machine learning, data mining or chemometrics
G16H 50/70 - ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
G16B 40/00 - ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
A system is disclosed for searching a set of biological entities. The system comprises: a user input module configured to receive a user input comprising a representation of a biological entity; a search module configured to determine which entities of a set of biological entities are associated with the user input; a visualisation module configured to render a visualisation of multiple biological entities of the set and of parent-child relationships between them; and an overlay module configured to render an association indicator visually indicating one or more biological entities of the visualisation that are associated with the user input.
G16H 50/20 - ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems