The invention provides methods for comparing one set of genetic sequences to another without discarding any information within either set. A set of genetic sequences is represented using a directed acyclic graph (DAG), avoiding any unwarranted reduction to a linear data structure. The invention provides a way to align one sequence DAG to another to produce an alignment that can itself be stored as a DAG. DAG-to-DAG alignment is a natural choice wherever a set of genomic information consisting of more than one string needs to be compared to any non-linear reference. For example, a subpopulation DAG could be compared to a population DAG in order to compare the genetic features of that subpopulation to those of the population.
C12Q 1/6883 - Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
G16B 30/00 - ICT specially adapted for sequence analysis involving nucleotides or amino acids
Systems and methods for analyzing genomic information can include obtaining a sequence read including genetic information; identifying, within a graph representing a reference genome, a plurality of candidate mapping positions that relate to the genetic information, the graph comprising nodes representing genetic sequences and edges connecting pairs of nodes; determining, by means of a computer system, whether an alignment with the graph surrounding each of the plurality of candidate mapping positions is advanced or basic; and performing for each candidate mapping position, by means of the computer system, a local alignment based on whether the local alignment is advanced or basic. The advanced local alignment can include a first-local-alignment algorithm, and the basic local alignment includes a second-local-alignment algorithm. Based on the local alignments, the mapped position of the sequence read can be identified within the genome.
The invention provides methods for identifying rare variants near a structural variation in a genetic sequence, for example, in a nucleic acid sample taken from a subject. The invention additionally includes methods for aligning reads (e.g., nucleic acid reads) to a reference sequence construct accounting for the structural variation, methods for building a reference sequence construct accounting for the structural variation or the structural variation and the rare variant, and systems that use the alignment methods to identify rare variants. The method is scalable, and can be used to align millions of reads to a construct thousands of bases long, or longer.
The invention provides systems and methods for determining patterns of modification to a genome of a subject by representing the genome using a graph, such as a directed acyclic graph (DAG) with divergent paths for regions that are potentially subject to modification, profiling segments of the genome for evidence of epigenetic modification, and aligning the profiled segments to the DAG to determine locations and patterns of the epigenetic modification within the genome.
The invention provides systems and methods for analyzing viruses by representing viral genetic diversity with a directed acyclic graph (DAG), which allows genetic sequencing technology to detect rare variations and represent otherwise difficult-to-document diversity within a sample. Additionally, a host-specific sequence DAG can be used to effectively segregate viral nucleic acid sequence reads from host sequence reads when a sample from a host is subjected to sequencing. Known viral genomes can be represented using a viral reference DAG, and the viral sequence reads from the sample can be compared to the viral DAG to identify viral species or strains from which the reads were derived. Where the viral sequence reads indicate great genetic diversity in the virus that was infecting the host, those reads can be assembled into a DAG that itself properly represents that diversity.
C12Q 1/6809 - Methods for determination or identification of nucleic acids involving differential detection
C12Q 1/70 - Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving viruses or bacteriophages
G16B 30/00 - ICT specially adapted for sequence analysis involving nucleotides or amino acids
The invention provides methods of analyzing an individual's mtDNA by transforming available reference sequences into a directed graph that compactly represents all the information without duplication and comparing sequence reads from the mtDNA to the graph to identify the individual or describe their mtDNA. A directed graph can represent all of the genetic variation found among the mitochondrial genomes across all of a number of reference organisms while providing a single article to which sequence reads can be aligned or compared. Thus any sequence read or other sequence fragment can be compared, in a single operation, to the article that represents all of the reference mitochondrial sequences.
C12Q 1/6874 - Methods for sequencing involving nucleic acid arrays, e.g. sequencing by hybridisation [SBH]
C12Q 1/6888 - Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for detection or identification of organisms
G16B 20/00 - ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
G16B 30/10 - Sequence alignment; Homology search
7.
SYSTEMS AND METHODS FOR ANALYZING CIRCULATING TUMOR DNA
The invention provides oncogenomic methods for detecting tumors by identifying circulating tumor DNA. A patient-specific reference directed acyclic graph (DAG) represents known human genomic sequences and non-tumor DNA from the patient as well as known tumor-associated mutations. Sequence reads from cell-free plasma DNA from the patient are mapped to the patient-specific genomic reference graph. Any of the known tumor-associated mutations found in the reads and any de novo mutations found in the reads are reported as the patient’s tumor mutation burden.
C12Q 1/6886 - Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material for cancer
G16B 5/00 - ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
G16B 30/00 - ICT specially adapted for sequence analysis involving nucleotides or amino acids
The invention provides methods for identifying rare variants near a structural variation in a genetic sequence, for example, in a nucleic acid sample taken from a subject. The invention additionally includes methods for aligning reads (e.g., nucleic acid reads) to a reference sequence construct accounting for the structural variation, methods for building a reference sequence construct accounting for the structural variation or the structural variation and the rare variant, and systems that use the alignment methods to identify rare variants. The method is scalable, and can be used to align millions of reads to a construct thousands of bases long, or longer.
The invention includes methods and systems for identifying disease-induced mutations by producing multi-dimensional reference sequence constructs that account for variations between individuals, different diseases, and different stages of those diseases. Once constructed, these reference sequence constructs can be used to align sequence reads corresponding to genetic samples from patients suspected of having a disease, or who have had the disease and are in suspected remission. The reference sequence constructs also provide insight into the genetic progression of the disease.
C12Q 1/6886 - Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material for cancer
C12Q 1/6883 - Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
G16B 20/20 - Allele or variant detection, e.g. single nucleotide polymorphism detection
G16B 20/00 - ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
G16B 30/10 - Sequence alignment; Homology search
10.
SYSTEMS AND METHODS FOR GENERATING GRAPH REFERENCES
Techniques for generating a graph reference construct. The techniques include: obtaining a plurality of variants associated with a reference sequence construct; generating the graph reference construct using the plurality of variants and the reference sequence construct; and outputting the generated graph reference construct. Generating the graph reference construct includes: filtering the plurality of variants to obtain a filtered set of variants, the filtering including a first filtering stage and a second filtering stage, and generating the graph reference construct using the filtered set of variants. The first filtering stage includes identifying a first subset of variants at least in part by excluding one or more structural variants from the plurality of variants. The second filtering stage includes identifying the filtered set of variants at least in part by excluding one or more multiply-alignable variants from the first subset of variants.
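The two filtering stages described above can be sketched as follows. This is a minimal illustration, not the claimed method: the `Variant` record, the length-based `is_structural` test, and the flank-based `is_multiply_alignable` test are all assumptions introduced here, since the abstract does not say how either property is determined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    pos: int   # 0-based position on the linear reference
    ref: str   # reference allele
    alt: str   # alternate allele


def is_structural(v: Variant, sv_min_len: int = 50) -> bool:
    # Assumption: a variant counts as "structural" if either allele is long.
    return max(len(v.ref), len(v.alt)) >= sv_min_len


def is_multiply_alignable(reference: str, v: Variant, flank: int = 2) -> bool:
    # Assumption: the variant is "multiply alignable" if its alternate allele,
    # together with `flank` reference bases on each side, already occurs
    # somewhere in the unmodified reference.
    left = reference[max(0, v.pos - flank):v.pos]
    right = reference[v.pos + len(v.ref):v.pos + len(v.ref) + flank]
    return (left + v.alt + right) in reference


def filter_variants(reference, variants, sv_min_len=50, flank=2):
    # Stage 1: exclude structural variants.
    stage1 = [v for v in variants if not is_structural(v, sv_min_len)]
    # Stage 2: exclude multiply-alignable variants from the stage-1 subset.
    return [v for v in stage1 if not is_multiply_alignable(reference, v, flank)]
```

The filtered list would then be passed to whatever routine adds alternate paths to the graph reference; that step is outside this sketch.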
Methods of the invention include representing biological data in a memory subsystem within a computer system with a data structure that is particular to a location in the memory subsystem and serializing the data structure into a stream of bytes that can be deserialized into a clone of the data structure. In a preferred genomic embodiment, the biological data comprises genomic sequences and the data structure comprises a genomic directed acyclic graph (DAG) in which objects have adjacency lists of pointers that indicate the location of any object adjacent to that object. After serialization and deserialization, the clone genomic DAG has the same structure as the original to represent the same sequences and relationships among them as the original.
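One way to picture the serialization step is to replace each in-memory pointer with the index of the object it points to, so the byte stream no longer depends on memory locations. The sketch below assumes a hypothetical `Node` class and uses JSON as the byte format purely for illustration; the abstract does not specify either.

```python
import json


class Node:
    def __init__(self, sequence):
        self.sequence = sequence
        self.adjacent = []  # direct object references, valid only in this process


def serialize(nodes):
    # Map each object's identity to a stable index, then store indices
    # instead of pointers.
    index = {id(n): i for i, n in enumerate(nodes)}
    records = [
        {"seq": n.sequence, "adj": [index[id(m)] for m in n.adjacent]}
        for n in nodes
    ]
    return json.dumps(records).encode("utf-8")


def deserialize(data):
    records = json.loads(data.decode("utf-8"))
    clones = [Node(r["seq"]) for r in records]
    # Rebuild the adjacency lists as references to the new objects.
    for node, record in zip(clones, records):
        node.adjacent = [clones[i] for i in record["adj"]]
    return clones
```

The clone has new objects at new addresses but the same sequences and the same adjacency structure as the original.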
A method for screening for disease in a genomic sample includes receiving a representation of a reference genome comprising a sequence of symbols. The presence of a predicted mutational event is identified in a location of the reference genome. An alternate path is created in the reference genome representing the predicted mutational event. A plurality of sequence reads are obtained from a genomic sample, wherein at least one sequence read comprises at least a portion of the predicted mutational event. The at least one sequence read is then mapped to the reference genome and a location is determined corresponding to the predicted mutational event. The predicted mutational event is then identified as present in the genomic sample. The method may be used to detect evidence of non-allelic homologous recombination (NAHR) occurring in genomic samples.
G16B 20/20 - Allele or variant detection, e.g. single nucleotide polymorphism detection
G16H 50/50 - ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for simulation or modelling of medical disorders
G16B 20/00 - ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
G16B 30/00 - ICT specially adapted for sequence analysis involving nucleotides or amino acids
G16B 30/10 - Sequence alignment; Homology search
15.
Computer Method and System of Identifying Genomic Mutations Using Graph-Based Local Assembly
Computer-implemented methods and systems for performing a local assembly of a genomic region of interest include the de novo or assisted creation of a directed graph, such as a directed acyclic graph (DAG), from a plurality of obtained nucleotide sequence reads. First and second sequence reads are aligned to each other to define at least one node of the DAG. Successive alignments of the remaining sequence reads to the then-defined DAG are performed to extend nodes and/or add nodes to the DAG. Graph-aware alignment techniques that produce alignment scores or indicators are employed in defining the nodes of the DAG from the sequence reads. The created DAG represents and describes in detail the genomic region of interest and can be used to perform variant calls.
The invention includes methods for aligning reads (e.g., nucleic acid reads) comprising repeating sequences, methods for building reference sequence constructs comprising repeating sequences, and systems that can be used to align reads comprising repeating sequences. The method is scalable, and can be used to align millions of reads to a construct thousands of bases long. The methods and systems can additionally account for variability within a repeating sequence, or near to a repeating sequence, due to genetic mutation.
The invention includes methods for aligning reads (e.g., nucleic acid reads, amino acid reads) to a reference sequence construct, methods for building the reference sequence construct, and systems that use the alignment methods and constructs to produce sequences. The invention also includes methods and systems for evaluating the quality of the alignment between the reads and the reference sequence construct. The method is scalable, and can be used to align millions of reads to a construct thousands of bases or amino acids long. The invention additionally includes methods for identifying a disease or a genotype based upon alignment of nucleic acid reads to a location in the construct.
The invention provides methods for comparing one set of genetic sequences to another without discarding any information within either set. A set of genetic sequences is represented using a directed acyclic graph (DAG) avoiding any unwarranted reduction to a linear data structure. The invention provides a way to align one sequence DAG to another to produce an alignment that can itself be stored as a DAG. DAG-to-DAG alignment is a natural choice wherever a set of genomic information consisting of more than one string needs to be compared to any non-linear reference. For example, a subpopulation DAG could be compared to a population DAG in order to compare the genetic features of that subpopulation to those of the population.
C12Q 1/6883 - Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
The invention generally relates to genomic studies and specifically to improved methods for read mapping using identified nucleotides at known locations. The invention provides methods of using identified nucleotides at known places in a genome to guide the analysis of sequence reads from that genome by excluding potential mappings or assemblies that are not congruent with the identified nucleotides. Information about a plurality of SNPs in the subject's genome is used to identify candidate paths through a genomic directed acyclic graph (DAG). Sequence reads are mapped to the candidate paths.
A method for stream-processing biomedical data includes receiving, by a file system on a computing device, a first request for access to at least a first portion of a file stored on a remotely located storage device. The method includes receiving, by the file system, a second request for access to at least a second portion of the file. The method includes determining, by a pre-fetching component executing on the computing device, whether the first request and the second request are associated with a sequential read operation. The method includes automatically retrieving, by the pre-fetching component, a third portion of the requested file, before receiving a third request for access to at least the third portion of the file, based on a determination that the first request and the second request are associated with the sequential read operation.
G06F 12/0862 - Addressing of a memory level in which access to the desired data or data blocks requires associative addressing means, e.g. caches with prefetch
G16B 50/00 - ICT programming tools or database systems specially adapted for bioinformatics
G06F 12/02 - Addressing or allocation; Relocation
G16B 50/30 - Data warehousing; Computing architectures
H04L 67/1097 - Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
H04L 65/80 - Arrangements, protocols or services in data packet communication networks for supporting real-time applications by responding to quality of service [QoS]
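The sequential-read detection in the stream-processing abstract above can be sketched as a small reader that prefetches the next portion once two consecutive requests turn out to be contiguous. `PrefetchingReader` and the `fetch` callable are hypothetical names standing in for the file system and the remote storage device, and the prefetch here is synchronous for simplicity, whereas a real component would presumably fetch in the background.

```python
class PrefetchingReader:
    def __init__(self, fetch):
        self.fetch = fetch      # callable(offset, length) -> bytes (remote read)
        self.cache = {}         # (offset, length) -> prefetched bytes
        self.last_end = None    # end offset of the previous request

    def read(self, offset, length):
        # Serve from the prefetch cache when possible, else go to storage.
        data = self.cache.pop((offset, length), None)
        if data is None:
            data = self.fetch(offset, length)
        end = offset + length
        # Treat the access as sequential if this request starts exactly
        # where the previous one ended; then retrieve the next portion early.
        if self.last_end is not None and offset == self.last_end:
            nxt = (end, length)
            if nxt not in self.cache:
                self.cache[nxt] = self.fetch(end, length)
        self.last_end = end
        return data
```

With this policy, the third contiguous request is answered from the cache without a new round trip for that portion.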
21.
Display screen with animated graphical user interface with pill shape
Genomic data is written to disk in a compact format by dividing the data into segments and encoding each segment with the smallest number of bits per character necessary for whatever alphabet of characters appears in that segment. A computer system dynamically chooses the segment boundaries for maximum space savings. A first one of the segments may use a different number of bits per character than a second one of the segments. In one embodiment, dividing the data into segments comprises scanning the data and keeping track of a number of unique characters, noting positions in the sequence where the number increases to a power of two, calculating a compression that would be obtained by dividing the genomic data into one of the plurality of segments at ones of the noted positions, and dividing the genomic data into the plurality of segments at the positions that yield the best compression.
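The segmentation idea can be illustrated with a small sketch. It assumes a fixed per-segment header cost (`header_bits`, a made-up parameter), notes the positions where a single left-to-right scan would need one more bit per character, and then tries every combination of those positions by brute force. The abstract does not say how the best split is searched for, so the exhaustive search is only a stand-in that works for a handful of candidates.

```python
from itertools import combinations


def bits_per_char(segment):
    # Smallest number of bits that can index every distinct character.
    k = len(set(segment))
    return max(1, (k - 1).bit_length())


def encoded_bits(segments, header_bits=16):
    return sum(header_bits + len(s) * bits_per_char(s) for s in segments)


def candidate_positions(data):
    # Positions where the running count of unique characters grows past a
    # power of two, so one more bit per character would be needed from here.
    seen, positions = set(), []
    for i, ch in enumerate(data):
        if ch not in seen:
            before = max(1, (len(seen) - 1).bit_length()) if seen else 0
            seen.add(ch)
            after = max(1, (len(seen) - 1).bit_length())
            if i > 0 and after > before:
                positions.append(i)
    return positions


def split_at(data, cuts):
    bounds = [0, *cuts, len(data)]
    return [data[a:b] for a, b in zip(bounds, bounds[1:])]


def best_split(data, header_bits=16):
    cands = candidate_positions(data)
    options = (c for r in range(len(cands) + 1) for c in combinations(cands, r))
    best = min(options, key=lambda c: encoded_bits(split_at(data, c), header_bits))
    return list(best), encoded_bits(split_at(data, best), header_bits)
```

For a sequence that is plain `ACGT` for a long stretch and then starts using ambiguity codes, the search keeps the first stretch at 2 bits per character and only pays 3 bits per character for the tail.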
The invention provides systems and methods for determining patterns of modification to a genome of a subject by representing the genome using a graph, such as a directed acyclic graph (DAG) with divergent paths for regions that are potentially subject to modification, profiling segments of the genome for evidence of epigenetic modification, and aligning the profiled segments to the DAG to determine locations and patterns of the epigenetic modification within the genome.
The invention provides systems and methods for analyzing viruses by representing viral genetic diversity with a directed acyclic graph (DAG), which allows genetic sequencing technology to detect rare variations and represent otherwise difficult-to-document diversity within a sample. Additionally, a host-specific sequence DAG can be used to effectively segregate viral nucleic acid sequence reads from host sequence reads when a sample from a host is subjected to sequencing. Known viral genomes can be represented using a viral reference DAG, and the viral sequence reads from the sample can be compared to the viral DAG to identify viral species or strains from which the reads were derived. Where the viral sequence reads indicate great genetic diversity in the virus that was infecting the host, those reads can be assembled into a DAG that itself properly represents that diversity.
C12Q 1/70 - Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving viruses or bacteriophages
G16B 30/00 - ICT specially adapted for sequence analysis involving nucleotides or amino acids
C12Q 1/6809 - Methods for determination or identification of nucleic acids involving differential detection
25.
System and method for dynamic control of workflow execution
Some embodiments relate to systems for processing one or more computational workflows. In one embodiment, a description of a computational workflow comprises a plurality of applications, in which applications are represented as nodes and edges connecting the nodes indicate the flow of data elements between applications. A task execution module is configured to create and execute tasks. An application programming interface (API) is in communication with the task execution module and comprises a plurality of function calls for controlling at least one function of the task execution module. An API script includes instructions to the API to create and execute a plurality of tasks corresponding to the execution of the computational workflow for a plurality of samples. A graphical user interface (GUI) is in communication with the task execution module and configured to receive input from an end user to initiate execution of the API script.
In one embodiment, a method of processing a computational workflow comprises receiving a description of a computational workflow. The description comprises a plurality of steps, in which each step has at least one input and at least one output, and further wherein an input from a second step depends on an output from a first step. The description is translated into a static workflow graph stored in a memory, the static workflow graph comprising a plurality of nodes having input ports and output ports, wherein dependencies between inputs and outputs are specified as edges between input ports and output ports. Information about a first set of nodes is then extracted from the static workflow graph and placed into a dynamic graph. A first actionable job is identified from the dynamic graph and executed.
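As a rough illustration of the static-to-dynamic idea, the sketch below stores port-to-port edges as the static graph and repeatedly asks which steps have all upstream steps finished. The function names and the use of a simple `finished` set in place of a full dynamic graph are assumptions made for brevity.

```python
def dependencies(steps, edges):
    # Each edge links (source step, output port) to (target step, input port).
    deps = {s: set() for s in steps}
    for (src, _out_port), (dst, _in_port) in edges:
        deps[dst].add(src)
    return deps


def actionable_jobs(deps, finished):
    # A job is actionable when it has not run and all its inputs are available.
    return sorted(s for s, d in deps.items() if s not in finished and d <= finished)


def run(steps, edges):
    deps = dependencies(steps, edges)
    finished, order = set(), []
    while len(finished) < len(steps):
        ready = actionable_jobs(deps, finished)
        if not ready:
            raise ValueError("cycle or unsatisfiable dependency")
        job = ready[0]
        order.append(job)   # a real executor would launch the job here
        finished.add(job)
    return order
```

Picking the alphabetically first ready job is arbitrary; any scheduling policy over the actionable set would respect the same dependencies.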
The invention provides methods of analyzing an individual's mtDNA by transforming available reference sequences into a directed graph that compactly represents all the information without duplication and comparing sequence reads from the mtDNA to the graph to identify the individual or describe their mtDNA. A directed graph can represent all of the genetic variation found among the mitochondrial genomes across all of a number of reference organisms while providing a single article to which sequence reads can be aligned or compared. Thus any sequence read or other sequence fragment can be compared, in a single operation, to the article that represents all of the reference mitochondrial sequences.
C12Q 1/6874 - Methods for sequencing involving nucleic acid arrays, e.g. sequencing by hybridisation [SBH]
C12Q 1/6888 - Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for detection or identification of organisms
G16B 20/00 - ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
G16B 30/10 - Sequence alignment; Homology search
Embodiments of the invention utilize a graph-based approach for simulating genomic datasets from large scale populations. Genomic data may be represented as a directed acyclic graph (DAG) that incorporates individual sample data including variant type, position, and zygosity. A simulator may operate on the DAG to generate variant datasets based on probabilistic traversal of the DAG. This probabilistic traversal reflects genomic variant types associated with the subpopulation used to build the DAG, and as a result, the generated variant datasets maintain statistical fidelity to the original sample data.
G16B 5/00 - ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
G16B 20/00 - ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
G16B 20/20 - Allele or variant detection, e.g. single nucleotide polymorphism detection
G16B 45/00 - ICT specially adapted for bioinformatics-related data visualisation, e.g. displaying of maps or networks
G16B 50/00 - ICT programming tools or database systems specially adapted for bioinformatics
G16B 50/30 - Data warehousing; Computing architectures
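The probabilistic traversal mentioned in the simulation abstract above can be pictured as a weighted random walk over the DAG. In this sketch the edge probabilities are invented numbers standing in for frequencies estimated from sample data, and zygosity is not modeled.

```python
import random

# node -> list of (next node, probability); the values are illustrative only.
dag = {
    "start": [("ref_A", 0.9), ("alt_G", 0.1)],
    "ref_A": [("end", 1.0)],
    "alt_G": [("end", 1.0)],
    "end": [],
}


def sample_path(dag, start, rng):
    # Walk from `start` until a node with no outgoing edges is reached,
    # choosing each successor according to its edge probability.
    path = [start]
    while dag[path[-1]]:
        successors, weights = zip(*dag[path[-1]])
        path.append(rng.choices(successors, weights=weights, k=1)[0])
    return path
```

Sampling many paths yields a simulated cohort in which the alternate allele appears at roughly the frequency encoded on its edge.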
29.
Hashing data-processing steps in workflow environments
Various approaches for data storage and retrieval for a computer memory include processing a computational workflow having multiple data-processing steps, generating and storing a first hash value associated with a first step of the data-processing steps based on an input to the first step, generating and storing a second hash value associated with a second step of the data-processing steps based on the generated first hash value, and reconstructing a computational state of the workflow based on the second hash value, thereby avoiding re-execution of a portion of the workflow corresponding to the second hash value.
G06F 9/48 - Program initiating; Program switching, e.g. by interrupt
G06Q 10/06 - Resources, task management, human resources or project management; Enterprise or organisation planning; Enterprise or organisation modelling
H04L 9/06 - Arrangements for secret or secure communications; Network security protocols, the encryption apparatus using shift registers or memories for blockwise coding, e.g. DES systems
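The chained hashing in the workflow abstract above can be sketched with a cache keyed by step hashes: the first step's hash depends on the workflow input, and each later hash depends on the hash before it. SHA-256, the `cache` dictionary, and the `log` list used to show which steps actually ran are all illustrative choices, not details from the abstract.

```python
import hashlib


def step_hash(step_name, upstream_hash):
    # A step's identity is its name combined with the hash of what feeds it.
    return hashlib.sha256(f"{step_name}|{upstream_hash}".encode()).hexdigest()


def run_workflow(steps, workflow_input, cache, log):
    upstream = hashlib.sha256(workflow_input.encode()).hexdigest()
    value = workflow_input
    for name, fn in steps:
        h = step_hash(name, upstream)
        if h in cache:
            value = cache[h]       # state reconstructed from the stored hash
        else:
            value = fn(value)      # step is executed only on a cache miss
            cache[h] = value
            log.append(name)
        upstream = h
    return value
```

Running the same workflow twice on the same input executes each step once; changing the input changes every downstream hash, so every step runs again.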
The invention provides oncogenomic methods for detecting tumors by identifying circulating tumor DNA. A patient-specific reference directed acyclic graph (DAG) represents known human genomic sequences and non-tumor DNA from the patient as well as known tumor-associated mutations. Sequence reads from cell-free plasma DNA from the patient are mapped to the patient-specific genomic reference graph. Any of the known tumor-associated mutations found in the reads and any de novo mutations found in the reads are reported as the patient's tumor mutation burden.
G01N 33/50 - Chemical analysis of biological material, e.g. blood or urine; Testing involving biospecific ligand binding methods; Immunological testing
C12Q 1/6886 - Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material for cancer
G16B 5/00 - ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
G16B 30/00 - ICT specially adapted for sequence analysis involving nucleotides or amino acids
The invention provides methods for identifying rare variants near a structural variation in a genetic sequence, for example, in a nucleic acid sample taken from a subject. The invention additionally includes methods for aligning reads (e.g., nucleic acid reads) to a reference sequence construct accounting for the structural variation, methods for building a reference sequence construct accounting for the structural variation or the structural variation and the rare variant, and systems that use the alignment methods to identify rare variants. The method is scalable, and can be used to align millions of reads to a construct thousands of bases long, or longer.
Various embodiments of the disclosure relate to systems and methods for aligning a sequence read to a graph reference. In one embodiment, the method comprises selecting a first node from a graph reference, the graph reference comprising a plurality of nodes connected by a plurality of directed edges, at least one node of the plurality of nodes having a nucleotide sequence. The method further comprises traversing the graph reference according to a depth-first search, and comparing a sequence read to nucleotide sequences generated from the traversal of the graph reference. The traversal of the graph is then modified in response to a determination that each and every node associated with a given nucleotide sequence was previously evaluated.
G06K 9/00 - Methods or arrangements for reading or recognising printed or written characters or for recognising patterns, e.g. fingerprints
G06K 9/68 - Methods or arrangements for recognition using electronic means using sequential comparisons of the image signals with a plurality of references, e.g. addressable memory
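A depth-first traversal of a graph reference that skips already-evaluated work, as in the read-alignment abstract above, might look like the sketch below. It uses exact substring matching instead of a scored alignment, and it prunes a branch when the same node has already been visited with the same trailing sequence context. Both simplifications are assumptions made here, and `find_read` is a hypothetical name.

```python
def find_read(graph, sequences, start, read):
    # graph: node -> list of successor nodes; sequences: node -> nucleotides.
    k = len(read) - 1          # context that can still overlap the next node
    seen = set()               # (node, trailing context) pairs already evaluated
    stack = [(start, "", [start])]
    while stack:
        node, tail, path = stack.pop()
        if (node, tail) in seen:
            continue           # same node, same context: nothing new to learn
        seen.add((node, tail))
        window = tail + sequences[node]
        if read in window:
            return path
        new_tail = window[-k:] if k else ""
        # Push successors in reverse so the first listed is explored first.
        for nxt in reversed(graph[node]):
            stack.append((nxt, new_tail, path + [nxt]))
    return None
```

The function returns the path from the start node to the node where the read was found, or `None` if no path spells the read.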
33.
Systems and methods for adaptive local alignment for graph genomes
Systems and methods for analyzing genomic information can include obtaining a sequence read including genetic information; identifying, within a graph representing a reference genome, a plurality of candidate mapping positions that relate to the genetic information, the graph comprising nodes representing genetic sequences and edges connecting pairs of nodes; determining, by means of a computer system, whether an alignment with the graph surrounding each of the plurality of candidate mapping positions is advanced or basic; and performing for each candidate mapping position, by means of the computer system, a local alignment based on whether the local alignment is advanced or basic. The advanced local alignment can include a first-local-alignment algorithm, and the basic local alignment includes a second-local-alignment algorithm. Based on the local alignments, the mapped position of the sequence read can be identified within the genome.
G01N 33/50 - Chemical analysis of biological material, e.g. blood or urine; Testing involving biospecific ligand binding methods; Immunological testing
G16B 30/10 - Sequence alignment; Homology search
G16B 30/00 - ICT specially adapted for sequence analysis involving nucleotides or amino acids
34.
Methods and systems for genotyping genetic samples
The invention provides methods and systems for making specific base calls at specific loci using a reference sequence construct, e.g., a directed acyclic graph (DAG) that represents known variants at each locus of the genome. Because the sequence reads are aligned to the DAG during alignment, the subsequent step of comparing a mutation, vis-a-vis the reference genome, to a table of known mutations can be eliminated. The disclosed methods and systems are notably efficient in dealing with structural variations within a genome or mutations that are within a structural variation.
In one embodiment, a method for identifying candidate sequences for genotyping a genomic sample comprises obtaining a plurality of sequence reads mapping to a genomic region of interest. The plurality of sequence reads are assembled into a directed acyclic graph (DAG) comprising a plurality of branch sites representing variation present in the set of sequence reads, each branch site comprising two or more branches. A path through the DAG comprises a set of successive branches over two or more branch sites and represents a possible candidate sequence of the genomic sample. One or more paths through the DAG are ranked by calculating scores for one or more branch sites, wherein the calculated score comprises a number of sequence reads that span multiple branch sites in a given path. At least one path is selected as a candidate sequence based at least in part on its rank.
G16B 20/20 - Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
C12Q 1/68 - Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
G16B 30/00 - ICT specially adapted for sequence analysis involving nucleotides or amino acids
G16B 30/10 - Sequence alignment; Homology search
The invention includes methods and systems for identifying disease-induced mutations by producing multi-dimensional reference sequence constructs that account for variations between individuals, different diseases, and different stages of those diseases. Once constructed, these reference sequence constructs can be used to align sequence reads corresponding to genetic samples from patients suspected of having a disease, or who have had the disease and are in suspected remission. The reference sequence constructs also provide insight into the genetic progression of the disease.
G16B 20/20 - Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
G16B 20/00 - ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
G16B 30/10 - Sequence alignment; Homology search
C12Q 1/6886 - Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material for cancer
C12Q 1/6883 - Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
37.
System and method for dynamic control of workflow execution
Some embodiments relate to systems for processing one or more computational workflows. In one embodiment, a description of a computational workflow comprises a plurality of applications, in which applications are represented as nodes and edges connecting the nodes indicate the flow of data elements between applications. A task execution module is configured to create and execute tasks. An application programming interface (API) is in communication with the task execution module and comprises a plurality of function calls for controlling at least one function of the task execution module. An API script includes instructions to the API to create and execute a plurality of tasks corresponding to the execution of the computational workflow for a plurality of samples. A graphical user interface (GUI) is in communication with the task execution module and configured to receive input from an end user to initiate execution of the API script.
The invention includes methods for aligning reads (e.g., nucleic acid reads, amino acid reads) to a reference sequence construct, methods for building the reference sequence construct, and systems that use the alignment methods and constructs to produce sequences. The method is scalable, and can be used to align millions of reads to a construct thousands of bases or amino acids long. The invention additionally includes methods for identifying a disease or a genotype based upon alignment of nucleic acid reads to a location in the construct.
In one aspect, a method for scheduling jobs in a computational workflow includes identifying, from a computational workflow by a workflow execution engine executing on a processor, a plurality of jobs ready for execution. The method includes sorting, based on computational resource requirements associated with each identified job, the identified jobs into a prioritized queue. The method includes provisioning one or more computational instances based on the computational resource requirements of the identified jobs in the prioritized queue, wherein at least one computational instance is provisioned based on a highest priority job in the queue. The method includes submitting the prioritized jobs for execution to the one or more computational instances.
G06F 9/48 - Program initiating; Program switching, e.g. by interrupt
G06F 9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
G06F 9/455 - Arrangements for executing specific programs; Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
H04L 12/24 - Arrangements for maintenance or administration
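The scheduling abstract above (sort ready jobs by resource requirements, then provision instances starting from the highest-priority job) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and does not come from the filing: the `Job` and `Instance` types, the `INSTANCE_TYPES` catalog, the choice to rank larger jobs first, and the first-fit packing rule.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical instance catalog as (cpus, mem_gb), smallest first.
INSTANCE_TYPES: List[Tuple[int, int]] = [(2, 4), (8, 32), (32, 128)]


@dataclass(frozen=True)
class Job:
    name: str
    cpus: int
    mem_gb: int


@dataclass
class Instance:
    free_cpus: int
    free_mem_gb: int
    jobs: List[str] = field(default_factory=list)


def provision_for(job: Job) -> Instance:
    """Pick the smallest catalog entry that can hold the given job."""
    for cpus, mem_gb in INSTANCE_TYPES:
        if cpus >= job.cpus and mem_gb >= job.mem_gb:
            return Instance(cpus, mem_gb)
    raise ValueError(f"no instance type fits job {job.name!r}")


def schedule(ready_jobs: List[Job]) -> List[Instance]:
    # Prioritized queue: most demanding jobs first (an assumed ordering).
    queue = sorted(ready_jobs, key=lambda j: (j.cpus, j.mem_gb), reverse=True)
    instances: List[Instance] = []
    for job in queue:
        # First fit: reuse an already provisioned instance with enough room.
        target = next(
            (i for i in instances
             if i.free_cpus >= job.cpus and i.free_mem_gb >= job.mem_gb),
            None,
        )
        if target is None:
            # Nothing fits, so provision for the highest-priority unplaced job.
            target = provision_for(job)
            instances.append(target)
        target.free_cpus -= job.cpus
        target.free_mem_gb -= job.mem_gb
        target.jobs.append(job.name)
    return instances
```

Placing the largest jobs first means each new instance is sized for the most demanding job still unplaced, and smaller jobs then fill the leftover capacity.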
40.
Methods and systems for detecting sequence variants
The invention provides methods for identifying rare variants near a structural variation in a genetic sequence, for example, in a nucleic acid sample taken from a subject. The invention additionally includes methods for aligning reads (e.g., nucleic acid reads) to a reference sequence construct accounting for the structural variation, methods for building a reference sequence construct accounting for the structural variation or the structural variation and the rare variant, and systems that use the alignment methods to identify rare variants. The method is scalable, and can be used to align millions of reads to a construct thousands of bases long, or longer.
G06F 19/22 - for sequence comparison involving nucleotides or amino acids, e.g. homology search, motif or single-nucleotide polymorphism [SNP] discovery or sequence alignment
G16B 30/00 - ICT specially adapted for sequence analysis involving nucleotides or amino acids
G16B 50/00 - ICT programming tools or database systems specially adapted for bioinformatics
41.
Watermarking for data security in bioinformatic sequence analysis
Embodiments of the invention protect information stored in graph-based sequence references by “watermarking” the graph with uniquely identifiable information. The watermark identifies the graph or version thereof in a detectable but nonintrusive manner. In one embodiment, insertions and/or deletions are introduced into regions of the graph.
G06F 21/16 - Program or content traceability, e.g. by watermarking
G16B 30/10 - Sequence alignment; Homology search
G06F 16/901 - Indexing; Data structures therefor; Storage structures
G16B 5/00 - ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
G16B 30/00 - ICT specially adapted for sequence analysis involving nucleotides or amino acids
G16B 50/00 - ICT programming tools or database systems specially adapted for bioinformatics
Systems and methods for protecting information stored in private references that are available to be queried—e.g., graph-based sequence references that users query through an interface, providing short reads to obtain the results of an alignment against the reference sequence—analyze the query and/or alignment results to determine whether the query represents an attack. The analysis may be performed before returning results to a user, and in some cases before performing the alignment.
G01N 33/50 - Chemical analysis of biological material, e.g. blood or urine; Testing involving biospecific ligand binding methods; Immunological testing
G06F 21/55 - Detecting local intrusion or implementing counter-measures
G16B 30/00 - ICT specially adapted for sequence analysis involving nucleotides or amino acids
G16B 50/00 - ICT programming tools or database systems specially adapted for bioinformatics
G16B 30/10 - Sequence alignment; Homology search
43.
SYSTEMS AND METHODS FOR PROVIDING ASSISTED LOCAL ALIGNMENT
A method of aligning a data sequence to one or more reference sequences represented as a sequence variation graph (SVG) is disclosed. The method can comprise receiving one or more alignment candidate regions and corresponding ordered seeding information. For each of the received alignment candidate regions, a current seed is determined, the current seed being a next-in-order unprocessed seed based on the ordered seeding information. Data paths in the alignment candidate region are then traversed to identify potential next seeds relative to the current seed. If at least one potential next seed is found, a next seed is selected and alignment results are generated by applying a local alignment procedure to align query data in portions of the query data sequence between the current seed and the next seed with reference data in portions of the alignment candidate region located between the current seed and the next seed.
G06F 19/22 - for sequence comparison involving nucleotides or amino acids, e.g. homology search, motif or single-nucleotide polymorphism [SNP] discovery or sequence alignment
G06F 19/16 - for molecular structure, e.g. structure alignment, structural or functional relations, protein folding, domain topologies, drug targeting using structure data, involving two-dimensional or three-dimensional structures
The invention provides methods for comparing one set of genetic sequences to another without discarding any information within either set. A set of genetic sequences is represented using a directed acyclic graph (DAG) avoiding any unwarranted reduction to a linear data structure. The invention provides a way to align one sequence DAG to another to produce an alignment that can itself be stored as a DAG. DAG-to-DAG alignment is a natural choice wherever a set of genomic information consisting of more than one string needs to be compared to any non-linear reference. For example, a subpopulation DAG could be compared to a population DAG in order to compare the genetic features of that subpopulation to those of the population.
C12Q 1/6883 - Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
09 - Scientific and electric apparatus and instruments
Goods and services
Computer software for use in software development, namely, the development, testing, and execution of computational workflows; and computer software for use in storage, analysis, and manipulation of documents, content, media, and data, namely, genetic, genomic, biological, biochemical, biomedical, clinical, scientific, engineering, business, and operations data
Embodiments of the invention utilize a graph-based approach for simulating genomic datasets from large scale populations. Genomic data may be represented as a directed acyclic graph (DAG) that incorporates individual sample data including variant type, position, and zygosity. A simulator may operate on the DAG to generate variant datasets based on probabilistic traversal of the DAG. This probabilistic traversal reflects genomic variant types associated with the subpopulation used to build the DAG, and as a result, the generated variant datasets maintain statistical fidelity to the original sample data.
G16B 45/00 - ICT specially adapted for bioinformatics-related data visualisation, e.g. displaying of maps or networks
G16B 20/00 - ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
G16B 50/00 - ICT programming tools or database systems specially adapted for bioinformatics
G16B 5/00 - ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
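The probabilistic traversal described in the simulation abstract above can be pictured as a weighted random walk over a variant DAG. The sketch below is illustrative only. The edge-weight representation, the node names, and the function `sample_path` are assumptions, and the filing does not state how its traversal probabilities are derived.

```python
import random
from typing import Dict, List, Tuple

# Each node maps to its outgoing edges as (target, weight) pairs.
# In this toy model the weights stand in for allele frequencies.
Edges = Dict[str, List[Tuple[str, float]]]


def sample_path(edges: Edges, source: str, sink: str,
                rng: random.Random) -> List[str]:
    """Walk from source to sink, choosing each outgoing edge by its weight."""
    path = [source]
    while path[-1] != sink:
        successors = edges[path[-1]]
        targets = [target for target, _ in successors]
        weights = [weight for _, weight in successors]
        path.append(rng.choices(targets, weights=weights, k=1)[0])
    return path


# Hypothetical single-site graph: a reference allele and a rarer alternate.
EXAMPLE_EDGES: Edges = {
    "start": [("ref_A", 0.9), ("alt_G", 0.1)],
    "ref_A": [("end", 1.0)],
    "alt_G": [("end", 1.0)],
}
```

Sampling many paths from `EXAMPLE_EDGES` yields the alternate allele in roughly one path in ten, which is the sense in which a simulated dataset can track the frequencies encoded in the graph.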
47.
SYSTEMS AND METHODS FOR ALIGNING SEQUENCES TO PERSONALIZED REFERENCES
Techniques for generating a personalized reference sequence construct for an individual to align sequence reads obtained for the individual. The techniques include: obtaining a plurality of sequence reads for an individual; obtaining information identifying a plurality of locations; genotyping the plurality of sequence reads for the plurality of locations to obtain a first set of variants for the individual for at least some of the plurality of locations; identifying a second set of variants associated with the first set of variants; generating a personalized reference sequence construct using the second set of variants; and aligning the plurality of sequence reads to the personalized reference sequence construct.
G06F 19/22 - for sequence comparison involving nucleotides or amino acids, e.g. homology search, motif or single-nucleotide polymorphism [SNP] discovery or sequence alignment
G06F 19/18 - for functional genomics or proteomics, e.g. genotype-phenotype associations, linkage disequilibrium, mutagenesis, genotyping or genome annotation, protein-protein interactions or protein-nucleic acid interactions
G06F 19/24 - for machine learning, data mining or biostatistics, e.g. pattern finding, knowledge discovery, rule extraction, correlation, clustering or classification
48.
Systems and methods for aligning sequences to graph references
Various embodiments of the disclosure relate to systems and methods for aligning a sequence read to a graph reference. In one embodiment, the method comprises selecting a first node from a graph reference, the graph reference comprising a plurality of nodes connected by a plurality of directed edges, at least one node of the plurality of nodes having a nucleotide sequence. The method further comprises traversing the graph reference according to a depth-first search, and comparing a sequence read to nucleotide sequences generated from the traversal of the graph reference. The traversal of the graph is then modified in response to a determination that each and every node associated with a given nucleotide sequence was previously evaluated.
G16B 45/00 - ICT specially adapted for bioinformatics-related data visualisation, e.g. displaying of maps or networks
G06K 9/00 - Methods or arrangements for reading or recognising printed or written characters or for recognising patterns, e.g. fingerprints
G06K 9/68 - Methods or arrangements for recognition using electronic means using sequential comparisons of the image signals with a plurality of references, e.g. addressable memory
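As a rough illustration of the depth-first comparison described in the graph-alignment abstract above, the following sketch walks a small sequence graph and reports the paths whose spelled-out sequence begins with a read. It uses exact prefix matching and simple pruning, which is a deliberate simplification. It does not implement the filing's rule for modifying the traversal once all nodes of a sequence have been evaluated, and the names `dfs_match`, `nodes`, and `edges` are hypothetical.

```python
from typing import Dict, List, Tuple


def dfs_match(nodes: Dict[str, str], edges: Dict[str, List[str]],
              start: str, read: str) -> List[Tuple[str, ...]]:
    """Return each path from `start` whose concatenated sequence starts with `read`."""
    hits: List[Tuple[str, ...]] = []
    # Explicit stack gives depth-first order without recursion.
    stack = [(start, (start,), nodes[start])]
    while stack:
        node, path, seq = stack.pop()
        if len(seq) >= len(read):
            if seq.startswith(read):
                hits.append(path)
            continue
        if not read.startswith(seq):
            continue  # prune: this branch already disagrees with the read
        for successor in edges.get(node, []):
            stack.append((successor, path + (successor,), seq + nodes[successor]))
    return hits
```

For a graph with nodes `a="AC"`, `b="G"`, `c="T"`, `d="TA"` and edges `a->b`, `a->c`, `b->d`, `c->d`, the read `ACTT` matches only the path through `a`, `c`, and `d`. The branch through `b` is abandoned as soon as `ACG` fails to match.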
49.
Systems and methods for sequence encoding, storage, and compression
Genomic data is written to disk in a compact format by dividing the data into segments and encoding each segment with the smallest number of bits per character necessary for whatever alphabet of characters appears in that segment. A computer system dynamically chooses the segment boundaries for maximum space savings. A first one of the segments may use a different number of bits per character than a second one of the segments. In one embodiment, dividing the data into segments comprises scanning the data and keeping track of a number of unique characters, noting positions in the sequence where the number increases to a power of two, calculating a compression that would be obtained by dividing the genomic data into one of the plurality of segments at ones of the noted positions, and dividing the genomic data into the plurality of segments at the positions that yield the best compression.
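The segment-selection idea in the abstract above can be made concrete with a small cost model. The sketch below is a simplified illustration under stated assumptions. It evaluates only a single split point rather than a full segmentation, it charges a hypothetical fixed `HEADER_BITS` overhead per segment, and it treats a position as a candidate when the running alphabet grows enough to need one more bit per character.

```python
from typing import List, Optional, Tuple

HEADER_BITS = 32  # hypothetical per-segment overhead (alphabet table, length)


def bits_per_char(alphabet_size: int) -> int:
    """Smallest fixed width that can index `alphabet_size` symbols."""
    return max(1, (alphabet_size - 1).bit_length())


def segment_cost(segment: str) -> int:
    """Bits needed to store one segment at its own minimal width."""
    if not segment:
        return 0
    return HEADER_BITS + len(segment) * bits_per_char(len(set(segment)))


def candidate_splits(data: str) -> List[int]:
    """Positions where the running alphabet starts to need one more bit."""
    seen = set()
    positions: List[int] = []
    width = 0
    for i, ch in enumerate(data):
        seen.add(ch)
        new_width = bits_per_char(len(seen))
        if i > 0 and new_width > width:
            positions.append(i)
        width = new_width
    return positions


def best_split(data: str) -> Tuple[Optional[int], int]:
    """Return (split position or None, total bits) for the cheapest option."""
    best_pos, best_cost = None, segment_cost(data)
    for pos in candidate_splits(data):
        cost = segment_cost(data[:pos]) + segment_cost(data[pos:])
        if cost < best_cost:
            best_pos, best_cost = pos, cost
    return best_pos, best_cost
```

With `"ACGT" * 100 + "ACGTNRYK" * 10` as input, storing everything at 3 bits per character costs 1472 bits in this model, while splitting at position 404, where the fifth symbol first appears, lets the long four-letter prefix use 2 bits per character and brings the total to 1100 bits.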
Systems and methods for data storage and retrieval for a computer memory include processing a computational workflow having multiple data-processing steps, generating and storing a first hash value associated with a first step of the data-processing steps based on an input to the first step, generating and storing a second hash value associated with a second step of the data-processing steps based on the generated first hash value, and reconstructing a computational state of the workflow based on the second hash value, thereby avoiding re-execution of a portion of the workflow corresponding to the second hash value.
G06F 9/48 - Program initiating; Program switching, e.g. by interrupt
G06Q 10/06 - Resources, task management, human resources or project management; Enterprise or organisation planning; Enterprise or organisation modelling
H04L 9/06 - Arrangements for secret or secure communications; Network security protocols the encryption apparatus using shift registers or memories for blockwise coding, e.g. DES systems
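The chained-hash scheme in the data storage abstract above (each step's hash derived from the previous step's hash, so a stored result can stand in for re-execution) can be sketched as follows. This is an illustration under assumptions, not the filed implementation. SHA-256 via Python's `hashlib`, the in-memory `cache` dictionary, and the function names are all choices made here.

```python
import hashlib
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[str], str]]


def chain_hash(step_name: str, upstream_hash: str) -> str:
    """Hash for a step, derived from its name and the hash feeding into it."""
    return hashlib.sha256(f"{step_name}\n{upstream_hash}".encode()).hexdigest()


def run_workflow(steps: List[Step], data: str,
                 cache: Dict[str, str]) -> Tuple[str, List[str]]:
    """Run steps in order, reusing cached outputs; return (result, steps executed)."""
    executed: List[str] = []
    key = hashlib.sha256(data.encode()).hexdigest()  # hash of the first input
    value = data
    for name, func in steps:
        key = chain_hash(name, key)
        if key in cache:
            value = cache[key]  # state restored from the hash alone
        else:
            value = func(value)
            cache[key] = value
            executed.append(name)
    return value, executed
```

Because each key folds in every upstream key, changing the input or any earlier step changes all downstream hashes, so only the affected tail of the workflow runs again.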
A method for screening for disease in a genomic sample includes receiving a representation of a reference genome comprising a sequence of symbols. The presence of a predicted mutational event is identified in a location of the reference genome. An alternate path is created in the reference genome representing the predicted mutational event. A plurality of sequence reads are obtained from a genomic sample, wherein at least one sequence read comprises at least a portion of the predicted mutational event. The at least one sequence read is then mapped to the reference genome and a location is determined corresponding to the predicted mutational event. The predicted mutational event is then identified as present in the genomic sample. The method may be used to detect evidence of non-allelic homologous recombination (NAHR) occurring in genomic samples.
G16B 20/20 - Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
G16H 50/50 - ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for simulation or modelling of medical disorders
G16B 20/00 - ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
G16B 30/00 - ICT specially adapted for sequence analysis involving nucleotides or amino acids
G16B 30/10 - Sequence alignment; Homology search
52.
Systems and methods for processing computational workflows
In one embodiment, a method of processing a computational workflow comprises receiving a description of a computational workflow. The description comprises a plurality of steps, in which each step has at least one input and at least one output, and further wherein an input from a second step depends on an output from a first step. The description is translated into a static workflow graph stored in a memory, the static workflow graph comprising a plurality of nodes having input ports and output ports, wherein dependencies between inputs and outputs are specified as edges between input ports and output ports. Information about a first set of nodes is then extracted from the static workflow graph and placed into a dynamic graph. A first actionable job is identified from the dynamic graph and executed.
Computer-implemented methods and systems for performing a local assembly of a genomic region of interest include the de novo or assisted creation of a directed graph, such as a directed acyclic graph (DAG), from a plurality of obtained nucleotide sequence reads. First and second sequence reads are aligned to each other to define at least one node of the DAG. Successive alignments of the remaining sequence reads to the then-defined DAG are performed to extend nodes and/or add nodes to the DAG. Graph-aware alignment techniques that produce alignment scores or indicators are employed in defining the nodes of the DAG from the sequence reads. The created DAG represents and describes in detail the genomic region of interest and can be used to perform variant calls.
Techniques for identifying variations in sequence data relative to reference sequence data. The techniques include accessing information specifying multiple sets of variants in the sequence data relative to reference sequence data, each of the multiple sets of variants being generated by using a respective variant identification technique; and determining, using the information specifying the multiple sets of variants in the sequence data, a reconciled set of variants in the sequence data relative to the reference sequence data, the determining comprising: determining whether a first variant is present at a first position in the sequence data based, at least in part, on one or more variants at one or more other positions in the sequence data.
A method for stream-processing biomedical data includes receiving, by a file system on a computing device, a first request for access to at least a first portion of a file stored on a remotely located storage device. The method includes receiving, by the file system, a second request for access to at least a second portion of the file. The method includes determining, by a pre-fetching component executing on the computing device, whether the first request and the second request are associated with a sequential read operation. The method includes automatically retrieving, by the pre-fetching component, a third portion of the requested file, before receiving a third request for access to at least the third portion of the file, based on a determination that the first request and the second request are associated with the sequential read operation.
G06F 12/0862 - Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches with prefetch
G16B 50/00 - ICT programming tools or database systems specially adapted for bioinformatics
G06F 12/02 - Addressing or allocation; Relocation
56.
METHODS AND SYSTEMS FOR STREAM-PROCESSING OF BIOMEDICAL DATA
A method for stream-processing biomedical data includes receiving, by a file system on a computing device, a first request for access to at least a first portion of a file stored on a remotely located storage device. The method includes receiving, by the file system, a second request for access to at least a second portion of the file. The method includes determining, by a pre-fetching component executing on the computing device, whether the first request and the second request are associated with a sequential read operation. The method includes automatically retrieving, by the pre-fetching component, a third portion of the requested file, before receiving a third request for access to at least the third portion of the file, based on a determination that the first request and the second request are associated with the sequential read operation.
G06F 19/28 - for programming tools or database systems, e.g. ontologies, heterogeneous data integration, data warehousing or computing architectures
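The pre-fetching decision in the stream-processing abstract above reduces to two questions: do two requests look like a sequential read, and if so, which byte range should be fetched next? The sketch below shows one plausible reading, assuming requests are `(offset, length)` pairs and that "sequential" means the second request begins exactly where the first ended. The filing does not specify either detail, and the function names are hypothetical.

```python
from typing import Optional, Tuple

ByteRange = Tuple[int, int]  # (offset, length)


def is_sequential(first: ByteRange, second: ByteRange) -> bool:
    """True when the second request starts where the first one stopped."""
    return second[0] == first[0] + first[1]


def next_prefetch(first: ByteRange, second: ByteRange,
                  file_size: int) -> Optional[ByteRange]:
    """Range to fetch ahead of a third request, or None if not warranted."""
    if not is_sequential(first, second):
        return None
    start = second[0] + second[1]
    if start >= file_size:
        return None  # nothing left to read
    # Fetch a chunk the same size as the last request, clipped to the file end.
    return (start, min(second[1], file_size - start))
```

For a 10,000-byte file read in 4,096-byte chunks, requests at offsets 0 and 4,096 trigger a prefetch of the remaining 1,808 bytes starting at offset 8,192, while a jump from offset 0 to offset 8,192 triggers none.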
57.
SYSTEMS AND METHODS FOR GENOTYPING WITH GRAPH REFERENCE
Genomic references are structured as a reference graph that represents diploid genotypes in organisms. A path through a series of connected nodes and edges represents a genetic sequence. Genetic variation within a diploid organism is represented by multiple paths through the reference graph. The graph may be transformed into a traversal graph in which a path represents a diploid genotype. Genetic analysis using the traversal graph allows an organism's diploid genotype to be elucidated, e.g., by mapping sequence reads to the reference graph and scoring paths in the traversal graph based on the mapping to determine the path through the traversal graph that best fits the sequence reads.
G06F 19/18 - for functional genomics or proteomics, e.g. genotype-phenotype associations, linkage disequilibrium, mutagenesis, genotyping or genome annotation, protein-protein interactions or protein-nucleic acid interactions
58.
Systems and methods for genotyping with graph reference
Genomic references are structured as a reference graph that represents diploid genotypes in organisms. A path through a series of connected nodes and edges represents a genetic sequence. Genetic variation within a diploid organism is represented by multiple paths through the reference graph. The graph may be transformed into a traversal graph in which a path represents a diploid genotype. Genetic analysis using the traversal graph allows an organism's diploid genotype to be elucidated, e.g., by mapping sequence reads to the reference graph and scoring paths in the traversal graph based on the mapping to determine the path through the traversal graph that best fits the sequence reads.
G01N 33/50 - Chemical analysis of biological material, e.g. blood or urine; Testing involving biospecific ligand binding methods; Immunological testing
G06F 19/18 - for functional genomics or proteomics, e.g. genotype-phenotype associations, linkage disequilibrium, mutagenesis, genotyping or genome annotation, protein-protein interactions or protein-nucleic acid interactions
G06F 19/26 - for data visualisation, e.g. graphics generation, display of maps or networks or other visual representations
59.
SYSTEMS AND METHODS FOR ENCODING GENETIC VARIATION FOR A POPULATION
In one embodiment, a method of encoding variation data for a population comprises receiving, by a variant encoding engine executing on a processor, information describing genetic variation of a population of individuals. The information comprises a plurality of variable sites within the reference genome of the population and the genotypes of a plurality of individuals in the population with respect to those variable sites. The method further comprises selecting an encoding strategy for the information based on the characteristics of the genetic variation across the population, and encoding the information according to the selected encoding strategy. In certain embodiments, selecting an encoding strategy may comprise determining the variability of a variable site within the population, and encoding information associated with the variable site based on the variability.
G06F 19/18 - for functional genomics or proteomics, e.g. genotype-phenotype associations, linkage disequilibrium, mutagenesis, genotyping or genome annotation, protein-protein interactions or protein-nucleic acid interactions
G06F 19/28 - for programming tools or database systems, e.g. ontologies, heterogeneous data integration, data warehousing or computing architectures
60.
Systems and methods for encoding genetic variation for a population
In one embodiment, a method of encoding variation data for a population comprises receiving, by a variant encoding engine executing on a processor, information describing genetic variation of a population of individuals. The information comprises a plurality of variable sites within the reference genome of the population and the genotypes of a plurality of individuals in the population with respect to those variable sites. The method further comprises selecting an encoding strategy for the information based on the characteristics of the genetic variation across the population, and encoding the information according to the selected encoding strategy. In certain embodiments, selecting an encoding strategy may comprise determining the variability of a variable site within the population, and encoding information associated with the variable site based on the variability.
G16B 35/00 - ICT specially adapted for in silico combinatorial libraries of nucleic acids, proteins or peptides
H03M 7/30 - Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
G16H 10/60 - ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
G16B 20/00 - ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
A method for generating a query of a genomic data store includes receiving, by a query generator executing on a computing device, from a graphical user interface, an identification of a first entity of a first entity class for inclusion in a resource description framework (RDF) query. The method includes receiving from the graphical user interface, an identification of a second entity of the first entity class, the second entity having a bi-directional relationship with the first entity. The method includes automatically generating an RDF query based upon the received identification of the first entity and the received identification of the second entity. The method includes executing the RDF query to select, from a plurality of genomic data sets, at least one genomic data set for at least one patient cohort. The method includes providing a listing of genomic data sets resulting from executing the RDF query.
The invention provides oncogenomic methods for detecting tumors by identifying circulating tumor DNA. A patient-specific reference directed acyclic graph (DAG) represents known human genomic sequences and non-tumor DNA from the patient as well as known tumor-associated mutations. Sequence reads from cell-free plasma DNA from the patient are mapped to the patient-specific genomic reference graph. Any of the known tumor-associated mutations found in the reads and any de novo mutations found in the reads are reported as the patient's tumor mutation burden.
C12Q 1/68 - Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
G01N 33/50 - Chemical analysis of biological material, e.g. blood or urine; Testing involving biospecific ligand binding methods; Immunological testing
G01N 33/574 - Immunoassay; Biospecific binding assay; Materials therefor for cancer
G06F 19/22 - for sequence comparison involving nucleotides or amino acids, e.g. homology search, motif or single-nucleotide polymorphism [SNP] discovery or sequence alignment
63.
SYSTEMS AND METHODS FOR ADAPTIVE LOCAL ALIGNMENT FOR GRAPH GENOMES
Systems and methods for analyzing genomic information can include obtaining a sequence read including genetic information; identifying, within a graph representing a reference genome, a plurality of candidate mapping positions that relate to the genetic information, the graph comprising nodes representing genetic sequences and edges connecting pairs of nodes; determining, by means of a computer system, whether an alignment with the graph surrounding each of the plurality of candidate mapping positions is advanced or basic; and performing for each candidate mapping position, by means of the computer system, a local alignment based on whether the local alignment is advanced or basic. The advanced local alignment can include a first-local-alignment algorithm, and the basic local alignment includes a second-local-alignment algorithm. Based on the local alignments, the mapped position of the sequence read can be identified within the genome.
G06F 19/22 - for sequence comparison involving nucleotides or amino acids, e.g. homology search, motif or single-nucleotide polymorphism [SNP] discovery, or sequence alignment
Methods of the invention include representing biological data in a memory subsystem within a computer system with a data structure that is particular to a location in the memory subsystem and serializing the data structure into a stream of bytes that can be deserialized into a clone of the data structure. In a preferred genomic embodiment, the biological data comprises genomic sequences and the data structure comprises a genomic directed acyclic graph (DAG) in which objects have adjacency lists of pointers that indicate the location of any object adjacent to that object. After serialization and deserialization, the clone genomic DAG has the same structure as the original to represent the same sequences and relationships among them as the original.
Methods of the invention include representing biological data in a memory subsystem within a computer system with a data structure that is particular to a location in the memory subsystem and serializing the data structure into a stream of bytes that can be deserialized into a clone of the data structure. In a preferred genomic embodiment, the biological data comprises genomic sequences and the data structure comprises a genomic directed acyclic graph (DAG) in which objects have adjacency lists of pointers that indicate the location of any object adjacent to that object. After serialization and deserialization, the clone genomic DAG has the same structure as the original to represent the same sequences and relationships among them as the original.
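The serialization idea in the abstract above can be sketched as follows. This is a minimal illustration, not the patented method: the `Node` class, the use of JSON as the byte format, and the index-for-pointer substitution are all assumptions chosen for brevity. In-memory references, which are only meaningful at a particular memory location, are replaced by list indices before encoding and re-linked after decoding.

```python
import json

class Node:
    """A DAG object holding a sequence and an adjacency list of object references."""
    def __init__(self, seq):
        self.seq = seq
        self.adjacent = []  # direct references (the in-memory "pointers")

def serialize(nodes):
    """Replace location-specific references with list indices, then encode as bytes."""
    index = {id(n): i for i, n in enumerate(nodes)}
    records = [{"seq": n.seq, "adj": [index[id(m)] for m in n.adjacent]} for n in nodes]
    return json.dumps(records).encode("utf-8")

def deserialize(data):
    """Rebuild a clone: create the nodes first, then re-link the adjacency lists."""
    records = json.loads(data.decode("utf-8"))
    clone = [Node(r["seq"]) for r in records]
    for node, r in zip(clone, records):
        node.adjacent = [clone[i] for i in r["adj"]]
    return clone

# Small DAG: ACG -> (T | G) -> CA, i.e. a single-base bubble.
a, t, g, c = Node("ACG"), Node("T"), Node("G"), Node("CA")
a.adjacent = [t, g]
t.adjacent = [c]
g.adjacent = [c]
original = [a, t, g, c]
clone = deserialize(serialize(original))
```

The clone holds new objects at new memory locations, but its adjacency lists describe the same sequences and relationships as the original.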
The invention provides systems and methods for determining patterns of modification to a genome of a subject by representing the genome using a graph, such as a directed acyclic graph (DAG) with divergent paths for regions that are potentially subject to modification, profiling segments of the genome for evidence of epigenetic modification, and aligning the profiled segments to the DAG to determine locations and patterns of the epigenetic modification within the genome.
G01N 31/00 - Investigating or analysing non-biological materials by the use of the chemical methods specified in the subgroups; Apparatus specially adapted for such methods
C12Q 1/6806 - Preparing nucleic acids for analysis, e.g. for polymerase chain reaction [PCR] assay
C12Q 1/6874 - Methods for sequencing involving nucleic acid arrays, e.g. sequencing by hybridisation [SBH]
G16B 30/00 - ICT specially adapted for sequence analysis involving nucleotides or amino acids
The invention provides methods of analyzing an individual's mtDNA by transforming available reference sequences into a directed graph that compactly represents all the information without duplication and comparing sequence reads from the mtDNA to the graph to identify the individual or describe their mtDNA. A directed graph can represent all of the genetic variation found among the mitochondrial genomes across all of a number of reference organisms while providing a single article to which sequence reads can be aligned or compared. Thus any sequence read or other sequence fragment can be compared, in a single operation, to the article that represents all of the reference mitochondrial sequences.
G06G 7/48 - Analogue computers for specific processes, systems or devices, e.g. simulators
C12Q 1/6874 - Methods for sequencing involving nucleic acid arrays, e.g. sequencing by hybridisation [SBH]
C12Q 1/6888 - Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for detection or identification of organisms
68.
Systems and methods for analyzing viral nucleic acids
The invention provides systems and methods for analyzing viruses by representing viral genetic diversity with a directed acyclic graph (DAG), which allows genetic sequencing technology to detect rare variations and represent otherwise difficult-to-document diversity within a sample. Additionally, a host-specific sequence DAG can be used to effectively segregate viral nucleic acid sequence reads from host sequence reads when a sample from a host is subject to sequencing. Known viral genomes can be represented using a viral reference DAG, and the viral sequence reads from the sample can be compared to the viral DAG to identify the viral species or strains from which the reads were derived. Where the viral sequence reads indicate great genetic diversity in the virus that was infecting the host, those reads can be assembled into a DAG that itself properly represents that diversity.
C12Q 1/70 - Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving viruses or bacteriophages
G16B 30/00 - ICT specially adapted for sequence analysis involving nucleotides or amino acids
C12Q 1/6809 - Methods for determination or identification of nucleic acids involving differential detection
69.
SYSTEMS AND METHODS FOR IDENTIFYING MICROORGANISMS
The invention provides methods for identifying a microorganism by aligning sequence reads to a graph, such as a directed acyclic graph (DAG), that contains condensed sequence information of a conserved region from multiple known microorganisms. The DAG can be constructed by obtaining sequence information of known reference microorganisms. The DAG also includes the identities of the known microorganisms that correspond to particular paths. Sequence reads obtained from an unknown sample can thus be aligned to paths in the DAG using an alignment algorithm, and the identity of a microorganism in the sample can be determined based on the path in the DAG to which the sequence reads align best.
C12Q 1/68 - Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
G06F 19/10 - Bioinformatics, i.e. methods or systems for genetic or protein-related data processing in computational molecular biology (in silico methods of screening virtual chemical libraries C40B 30/02; in silico or mathematical methods of creating virtual chemical libraries C40B 50/02)
G06F 19/22 - for sequence comparison involving nucleotides or amino acids, e.g. homology search, motif or single-nucleotide polymorphism [SNP] discovery, or sequence alignment
G06F 19/28 - for programming tools or database systems, e.g. ontologies, heterogeneous data integration, data warehousing or computing architectures
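For the microorganism-identification entry above, the idea of labelling DAG paths with organism identities and picking the best-aligning path can be sketched as follows. This is an illustration under stated assumptions: the node sequences and organism names are invented, and a sliding-window Hamming distance stands in for a real alignment algorithm.

```python
# Shared nodes of a toy DAG for a conserved region; "v1" and "v2" are the
# divergent branches. All sequences and names here are hypothetical.
NODES = {"s": "ACGTAC", "v1": "GG", "v2": "TT", "e": "CATG"}

# Each labelled path records which known organism it corresponds to.
PATHS = {
    "Organism A": ["s", "v1", "e"],
    "Organism B": ["s", "v2", "e"],
}

def path_sequence(path):
    """Concatenate node sequences along a path."""
    return "".join(NODES[n] for n in path)

def best_mismatches(read, ref):
    """Fewest mismatches of `read` against any same-length window of `ref`.

    A stand-in for a real aligner; assumes len(read) <= len(ref).
    """
    return min(
        sum(a != b for a, b in zip(read, ref[i:i + len(read)]))
        for i in range(len(ref) - len(read) + 1)
    )

def identify(reads):
    """Return the organism whose path accumulates the fewest mismatches."""
    totals = {
        name: sum(best_mismatches(r, path_sequence(path)) for r in reads)
        for name, path in PATHS.items()
    }
    return min(totals, key=totals.get)

organism = identify(["ACTTCA", "TTCATG"])
```

Both reads span the `v2` branch, so the path labelled "Organism B" scores best in this toy case.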
42 - Scientific, technological and industrial services, research and design
Goods and services
Management of databases containing research subjects' clinical information obtained for scientific research purposes; providing temporary use of on-line non-downloadable software and applications for analyzing scientific and medical data and for generating and managing corresponding reports; providing temporary use of on-line non-downloadable software and applications for managing and analyzing genomic, phenotypic, demographic and electronic medical record data and for generating and analyzing corresponding reports; analyzing data relating to research subjects' clinical information for scientific research purposes
42 - Scientific, technological and industrial services, research and design
Goods and services
Management of databases containing research subjects' clinical information obtained for scientific research purposes; providing temporary use of on-line non-downloadable software and applications for analyzing scientific and medical data and for generating and managing corresponding reports; providing temporary use of on-line non-downloadable software and applications for managing and analyzing genomic, phenotypic, demographic and electronic medical record data and for generating and analyzing corresponding reports; analyzing data relating to research subjects' clinical information for scientific research purposes
The invention relates to methods for determining a haplotype for an organism by using a system for transforming SNP alleles found in sequence fragments into vertices in a graph with edges connecting vertices for alleles that appear together in a sequence fragment. A community detection operation is used to infer the haplotype from the graph. The system may produce a report that includes the haplotype of the SNPs found in the genome of that organism.
C12Q 1/68 - Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
G06F 19/18 - for functional genomics or proteomics, e.g. genotype-phenotype associations, linkage disequilibrium, mutagenesis, genotyping or genome annotation, protein-protein interactions or protein-nucleic acid interactions
G06F 19/24 - for machine learning, data mining or biostatistics, e.g. pattern finding, knowledge discovery, rule extraction, correlation, clustering or classification
The invention relates to methods for determining a haplotype for an organism by using a system for transforming SNP alleles found in sequence fragments into vertices in a graph with edges connecting vertices for alleles that appear together in a sequence fragment. A community detection operation can be used to infer the haplotype from the graph. The system may produce a report that includes the haplotype of the SNPs found in the genome of that organism.
G01N 33/50 - Chemical analysis of biological material, e.g. blood or urine; Testing involving biospecific ligand binding methods; Immunological testing
G06F 19/18 - for functional genomics or proteomics, e.g. genotype-phenotype associations, linkage disequilibrium, mutagenesis, genotyping or genome annotation, protein-protein interactions or protein-nucleic acid interactions
C12Q 1/6827 - Hybridisation assays for detection of mutation or polymorphism
G06F 19/24 - for machine learning, data mining or biostatistics, e.g. pattern finding, knowledge discovery, rule extraction, correlation, clustering or classification
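The haplotyping entries above describe turning SNP alleles into graph vertices, with edges between alleles that appear together in a sequence fragment. A minimal sketch follows. The fragment data are invented, and connected components are used as a deliberately simple stand-in for the community detection operation named in the abstract; on real, error-containing data a single erroneous fragment would merge the two haplotypes, which is why a proper community detection method would be needed.

```python
from collections import defaultdict
from itertools import combinations

def build_allele_graph(fragments):
    """Vertices are (position, allele); edges join alleles seen in one fragment."""
    graph = defaultdict(set)
    for frag in fragments:
        vertices = sorted(frag.items())
        for v in vertices:
            graph[v]  # touch the key so isolated vertices are kept
        for u, v in combinations(vertices, 2):
            graph[u].add(v)
            graph[v].add(u)
    return graph

def connected_components(graph):
    """Group vertices that are linked by any chain of co-occurrence edges."""
    seen, components = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(graph[v] - comp)
        seen |= comp
        components.append(sorted(comp))
    return components

# Hypothetical error-free fragments, each mapping SNP position -> observed allele.
fragments = [
    {100: "A", 250: "C"},
    {250: "C", 400: "T"},
    {100: "G", 250: "T"},
    {250: "T", 400: "G"},
]
haplotypes = connected_components(build_allele_graph(fragments))
```

In this toy case the two components correspond to the two haplotypes A-C-T and G-T-G across positions 100, 250 and 400.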
74.
Methods and systems for detecting sequence variants
The invention provides methods for identifying rare variants near a structural variation in a genetic sequence, for example, in a nucleic acid sample taken from a subject. The invention additionally includes methods for aligning reads (e.g., nucleic acid reads) to a reference sequence construct accounting for the structural variation, methods for building a reference sequence construct accounting for the structural variation or the structural variation and the rare variant, and systems that use the alignment methods to identify rare variants. The method is scalable, and can be used to align millions of reads to a construct thousands of bases long, or longer.
G06F 19/22 - for sequence comparison involving nucleotides or amino acids, e.g. homology search, motif or single-nucleotide polymorphism [SNP] discovery, or sequence alignment
G06F 19/28 - for programming tools or database systems, e.g. ontologies, heterogeneous data integration, data warehousing or computing architectures
The invention provides methods for analyzing sequence data in which a large amount and variety of reference data are efficiently modeled as a reference graph, such as a directed acyclic graph (DAG). The method includes determining positions of k-mers within a reference graph that represents a genomic sequence and known variation, storing the positions of each k-mer in a table entry indexed by a hash of that k-mer, and identifying a region within the reference graph that includes a threshold number of the k-mers by reading from the table entries indexed by hashes of substrings of a subject sequence. The subject sequence may subsequently be mapped to the candidate region.
The invention provides methods for analyzing sequence data in which a large amount and variety of reference data are efficiently modeled as a reference graph, such as a directed acyclic graph (DAG). The method includes determining positions of k-mers within a reference graph that represents a genomic sequence and known variation, storing the positions of each k-mer in a table entry indexed by a hash of that k-mer, and identifying a region within the reference graph that includes a threshold number of the k-mers by reading from the table entries indexed by hashes of substrings of a subject sequence. The subject sequence may subsequently be mapped to the candidate region.
G01N 33/50 - Chemical analysis of biological material, e.g. blood or urine; Testing involving biospecific ligand binding methods; Immunological testing
G06F 19/16 - for molecular structure, e.g. structure alignment, structural or functional relations, protein folding, domain topologies, drug targeting using structure data, involving two-dimensional or three-dimensional structures
G06F 19/22 - for sequence comparison involving nucleotides or amino acids, e.g. homology search, motif or single-nucleotide polymorphism [SNP] discovery, or sequence alignment
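The k-mer indexing entries above can be illustrated with a small sketch. Several simplifications are assumptions of this example rather than details from the abstracts: Python's built-in `hash` is the hash function (collisions are ignored), a position is a `(node, offset)` pair, and a "region" is reduced to a single node.

```python
from collections import defaultdict

def kmers_from(seqs, edges, node, offset, k):
    """Yield every k-mer starting at (node, offset), following edges as needed."""
    seq = seqs[node][offset:offset + k]
    if len(seq) == k:
        yield seq
        return
    for nxt in edges.get(node, []):
        for tail in kmers_from(seqs, edges, nxt, 0, k - len(seq)):
            yield seq + tail

def build_index(seqs, edges, k):
    """Table keyed by hash(k-mer); each entry stores the k-mer's start positions."""
    index = defaultdict(set)
    for node, seq in seqs.items():
        for offset in range(len(seq)):
            for kmer in kmers_from(seqs, edges, node, offset, k):
                index[hash(kmer)].add((node, offset))
    return index

def candidate_nodes(index, read, k, threshold):
    """Nodes in which at least `threshold` k-mer hits from the read begin."""
    hits = defaultdict(int)
    for i in range(len(read) - k + 1):
        for node, _offset in index.get(hash(read[i:i + k]), ()):
            hits[node] += 1
    return {node for node, count in hits.items() if count >= threshold}

# Toy reference graph: node 0 branches to node 1 or node 2, both rejoin at node 3.
seqs = {0: "ACGTAC", 1: "G", 2: "T", 3: "CCATGA"}
edges = {0: [1, 2], 1: [3], 2: [3]}
index = build_index(seqs, edges, k=4)
regions = candidate_nodes(index, "TACGCCA", k=4, threshold=3)
```

Three of the read's four k-mers start in node 0, so node 0 is the only candidate at a threshold of 3; the read would then be aligned in detail around that node.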
77.
SYSTEMS AND METHODS FOR SMART TOOLS IN SEQUENCE PIPELINES
The invention relates to bioinformatics pipelines and wrapper scripts that call executables in those pipelines and that also identify beneficial changes to the pipelines. A tool in a pipeline has a smart wrapper that can cause the tool to analyze the sequence data it receives but that can also select a change to the pipeline when circumstances warrant. In certain aspects, the invention provides a system for genomic analysis. The system includes a processor coupled to a non-transitory memory. The system is operable to present to a user a plurality of genomic tools organized into a pipeline. At least a first one of the tools comprises an executable and a wrapper script. The system can receive sequence data and instructions from the user that call for the sequence data to be analyzed by the pipeline, and can select, using the wrapper script, a change to the pipeline.
G06F 19/22 - for sequence comparison involving nucleotides or amino acids, e.g. homology search, motif or single-nucleotide polymorphism [SNP] discovery, or sequence alignment
78.
SYSTEMS AND METHODS FOR SMART TOOLS IN SEQUENCE PIPELINES
The invention relates to bioinformatics pipelines and wrapper scripts that call executables in those pipelines and that also identify beneficial changes to the pipelines. A tool in a pipeline has a smart wrapper that can cause the tool to analyze the sequence data it receives but that can also select a change to the pipeline when circumstances warrant. In certain aspects, the invention provides a system for genomic analysis. The system includes a processor coupled to a non-transitory memory. The system is operable to present to a user a plurality of genomic tools organized into a pipeline. At least a first one of the tools comprises an executable and a wrapper script. The system can receive sequence data and instructions from the user that call for the sequence data to be analyzed by the pipeline, and can select, using the wrapper script, a change to the pipeline.
G16B 50/00 - ICT programming tools or database systems specially adapted for bioinformatics
G16B 20/00 - ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
G16B 30/00 - ICT specially adapted for sequence analysis involving nucleotides or amino acids
G16B 40/00 - ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
C12Q 1/68 - Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
79.
Systems and methods for smart tools in sequence pipelines
A tool in a bioinformatics pipeline can include a smart wrapper and an executable. The smart wrapper can cause the executable to analyze the sequence data it receives and can also select a change to the pipeline when circumstances warrant. In certain aspects, a system for genomic analysis includes a processor coupled to a non-transitory memory. The system is operable to present to a user a plurality of genomic tools organized into a pipeline. At least a first one of the tools comprises an executable and a wrapper script. The system can receive sequence data and instructions from the user that call for the sequence data to be analyzed by the pipeline, and can select, using the wrapper script, a change to the pipeline.
G06F 9/44 - Arrangements for executing specific programs
G06F 19/18 - for functional genomics or proteomics, e.g. genotype-phenotype associations, linkage disequilibrium, mutagenesis, genotyping or genome annotation, protein-protein interactions or protein-nucleic acid interactions
G06F 19/28 - for programming tools or database systems, e.g. ontologies, heterogeneous data integration, data warehousing or computing architectures
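The "smart tools" entries above describe a wrapper that both runs an executable and may select a change to the pipeline. The sketch below is one possible reading, with everything concrete being an assumption: the tool names, the read-length rule, and the use of Python callables in place of real command-line executables are all hypothetical.

```python
# Callables stand in for real command-line executables (an assumption that
# keeps the sketch runnable without external binaries).
EXECUTABLES = {
    "short_read_aligner": lambda reads: [(r, "short") for r in reads],
    "long_read_aligner": lambda reads: [(r, "long") for r in reads],
}

def select_change(tool_name, reads, max_short_len=300):
    """Wrapper logic: propose a replacement tool when the data warrant it."""
    if tool_name == "short_read_aligner" and any(len(r) > max_short_len for r in reads):
        return "long_read_aligner"
    return None

def run_pipeline(pipeline, reads):
    """Run each tool through its wrapper, recording any change it selects."""
    pipeline = list(pipeline)  # work on a copy so the caller's list is untouched
    changes = []
    data = reads
    for step, tool_name in enumerate(pipeline):
        replacement = select_change(tool_name, data)
        if replacement:
            pipeline[step] = replacement
            changes.append((tool_name, replacement))
        data = EXECUTABLES[pipeline[step]](data)
    return data, changes

# A 400-base read exceeds the hypothetical short-read limit of 300 bases.
output, changes = run_pipeline(["short_read_aligner"], ["ACGT" * 100])
```

Here the wrapper swaps in the long-read tool before execution; with reads at or under the limit, `changes` would be empty and the original tool would run.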
80.
VARIANT-CALLING DATA FROM AMPLICON-BASED SEQUENCING METHODS
The invention provides systems and methods for calling variants in data from amplicon-based sequencing methods by aligning and assembling reads, associating the reads with their source amplicons, treating each amplicon as a separate sample or file, and calling variants on the reads. A portion of each read is aligned to the primer binding site of the associated amplicon. Variants called at sites in the mapped portions of each read are discarded. The remaining variant calls are merged to provide a set of variant calls across the original target region.
G06F 19/22 - for sequence comparison involving nucleotides or amino acids, e.g. homology search, motif or single-nucleotide polymorphism [SNP] discovery, or sequence alignment
G06F 19/10 - Bioinformatics, i.e. methods or systems for genetic or protein-related data processing in computational molecular biology (in silico methods of screening virtual chemical libraries C40B 30/02; in silico or mathematical methods of creating virtual chemical libraries C40B 50/02)
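The discard-and-merge step of the amplicon entry above can be sketched as follows. The data layout is an assumption made for illustration: each amplicon is a half-open interval with forward and reverse primer lengths, and each variant call carries the identifier of the amplicon whose reads produced it.

```python
def merge_calls(calls, amplicons):
    """Drop calls that fall in their own amplicon's primer sites, then merge.

    calls:     iterable of (amplicon_id, position, ref, alt)
    amplicons: amplicon_id -> (start, end, fwd_primer_len, rev_primer_len),
               with [start, end) half-open (a convention assumed here)
    """
    merged = set()
    for amp_id, pos, ref, alt in calls:
        start, end, fwd_len, rev_len = amplicons[amp_id]
        in_primer = pos < start + fwd_len or pos >= end - rev_len
        if not in_primer:
            merged.add((pos, ref, alt))  # the set removes duplicate calls
    return sorted(merged)

# Two hypothetical overlapping amplicons with 20-base primers at each end.
amplicons = {"A": (100, 200, 20, 20), "B": (170, 270, 20, 20)}
calls = [
    ("A", 150, "A", "G"),  # inside A's insert: kept
    ("A", 185, "C", "T"),  # inside A's reverse primer site: discarded
    ("B", 175, "T", "C"),  # inside B's forward primer site: discarded
    ("B", 195, "G", "A"),  # inside B's insert: kept
]
variants = merge_calls(calls, amplicons)
```

Position 195 lies in amplicon A's reverse primer site but in amplicon B's insert, so the call supported by B's reads survives. This is the reason for treating each amplicon separately before merging.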
81.
Methods and systems for detecting sequence variants
The invention provides methods for identifying rare variants near a structural variation in a genetic sequence, for example, in a nucleic acid sample taken from a subject. The invention additionally includes methods for aligning reads (e.g., nucleic acid reads) to a reference sequence construct accounting for the structural variation, methods for building a reference sequence construct accounting for the structural variation or the structural variation and the rare variant, and systems that use the alignment methods to identify rare variants. The method is scalable, and can be used to align millions of reads to a construct thousands of bases long, or longer.
G05D 1/00 - Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. automatic pilot
G05D 3/00 - Control of position or direction
G06F 17/00 - Digital computing or data processing equipment or methods, specially adapted for specific functions
G06F 7/00 - Methods or arrangements for processing data by operating upon the order or content of the data handled
G06F 19/22 - for sequence comparison involving nucleotides or amino acids, e.g. homology search, motif or single-nucleotide polymorphism [SNP] discovery, or sequence alignment
G06F 19/28 - for programming tools or database systems, e.g. ontologies, heterogeneous data integration, data warehousing or computing architectures
82.
Systems and methods for using paired-end data in directed acyclic structure
Methods of analyzing a transcriptome involve obtaining at least one pair of paired-end reads from a transcriptome from an organism, finding an alignment with an optimal score between a first read of the pair and a node in a directed acyclic data structure (the data structure has nodes representing RNA sequences such as exons or transcripts and edges connecting pairs of nodes), identifying candidate paths that include the node connected to a downstream node by a path having a length substantially similar to an insert length of the pair of paired-end reads, and aligning the paired-end reads to the candidate paths to determine an optimal-scoring alignment.
G06F 19/10 - Bioinformatics, i.e. methods or systems for genetic or protein-related data processing in computational molecular biology (in silico methods of screening virtual chemical libraries C40B 30/02; in silico or mathematical methods of creating virtual chemical libraries C40B 50/02)
G06F 19/22 - for sequence comparison involving nucleotides or amino acids, e.g. homology search, motif or single-nucleotide polymorphism [SNP] discovery, or sequence alignment
G06F 19/28 - for programming tools or database systems, e.g. ontologies, heterogeneous data integration, data warehousing or computing architectures
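The candidate-path step of the paired-end entry above can be sketched as a bounded walk through an exon DAG. The exon names and lengths are invented, and two definitions are assumptions of this example: path length is measured from the start of the first read's node to the start of the downstream node, and "substantially similar" is modelled as a fixed tolerance.

```python
def candidate_paths(lengths, edges, start, insert_len, tolerance):
    """Paths from `start` whose end node begins about `insert_len` bases downstream."""
    results = []

    def walk(node, path, dist):
        # dist = bases from the start of `start` to the start of `node`
        if node != start and abs(dist - insert_len) <= tolerance:
            results.append(path)
        if dist > insert_len + tolerance:
            return  # distances only grow further downstream, so stop here
        for nxt in edges.get(node, []):
            walk(nxt, path + [nxt], dist + lengths[node])

    walk(start, [start], 0)
    return results

# Hypothetical exon DAG: e2 is a skippable exon, and e3 can also be skipped.
lengths = {"e1": 100, "e2": 50, "e3": 200, "e4": 80}
edges = {"e1": ["e2", "e3"], "e2": ["e3", "e4"], "e3": ["e4"]}
paths = candidate_paths(lengths, edges, "e1", insert_len=300, tolerance=20)
```

Only the isoform that skips e2 places e4 about 300 bases downstream of e1, so the second read of the pair would be aligned against that path alone rather than against every isoform.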
The invention generally provides systems and methods for analysis of RNA-Seq reads in which an annotated reference is represented as a directed acyclic graph (DAG) or similar data structure. Features such as exons and introns from the reference provide nodes in the DAG and those features are linked as pairs in their canonical genomic order by edges. The DAG can scale to any size and can in fact be populated in the first instance by import from an extrinsic annotated reference.
G06F 19/22 - for sequence comparison involving nucleotides or amino acids, e.g. homology search, motif or single-nucleotide polymorphism [SNP] discovery, or sequence alignment
G06F 19/28 - for programming tools or database systems, e.g. ontologies, heterogeneous data integration, data warehousing or computing architectures
The invention provides methods for comparing one set of genetic sequences to another without discarding any information within either set. A set of genetic sequences is represented using a directed acyclic graph (DAG) avoiding any unwarranted reduction to a linear data structure. The invention provides a way to align one sequence DAG to another to produce an alignment that can itself be stored as a DAG. DAG-to-DAG alignment is a natural choice wherever a set of genomic information consisting of more than one string needs to be compared to any non-linear reference. For example, a subpopulation DAG could be compared to a population DAG in order to compare the genetic features of that subpopulation to those of the population.
C12Q 1/68 - Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
G06F 19/22 - for sequence comparison involving nucleotides or amino acids, e.g. homology search, motif or single-nucleotide polymorphism [SNP] discovery, or sequence alignment
G06F 19/10 - Bioinformatics, i.e. methods or systems for genetic or protein-related data processing in computational molecular biology (in silico methods of screening virtual chemical libraries C40B 30/02; in silico or mathematical methods of creating virtual chemical libraries C40B 50/02)
G06F 17/30 - Information retrieval; Database structures therefor
The invention provides methods for comparing one set of genetic sequences to another without discarding any information within either set. A set of genetic sequences is represented using a directed acyclic graph (DAG) avoiding any unwarranted reduction to a linear data structure. The invention provides a way to align one sequence DAG to another to produce an alignment that can itself be stored as a DAG. DAG-to-DAG alignment is a natural choice wherever a set of genomic information consisting of more than one string needs to be compared to any non-linear reference. For example, a subpopulation DAG could be compared to a population DAG in order to compare the genetic features of that subpopulation to those of the population.
G06F 19/22 - for sequence comparison involving nucleotides or amino acids, e.g. homology search, motif or single-nucleotide polymorphism [SNP] discovery, or sequence alignment
G06F 19/26 - for data visualisation, e.g. graphics generation, display of maps or networks or other visual representations
C12Q 1/68 - Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
86.
Methods and systems for genotyping genetic samples
The invention provides methods and systems for making specific base calls at specific loci using a reference sequence construct, e.g., a directed acyclic graph (DAG) that represents known variants at each locus of the genome. Because the sequence reads are aligned to a DAG that already represents the known variants, the subsequent step of comparing a mutation, vis-à-vis the reference genome, to a table of known mutations can be eliminated. The disclosed methods and systems are notably efficient in dealing with structural variations within a genome or mutations that are within a structural variation.
G06F 19/22 - for sequence comparison involving nucleotides or amino acids, e.g. homology search, motif or single-nucleotide polymorphism [SNP] discovery, or sequence alignment
The invention includes methods and systems for identifying disease-induced mutations by producing multi-dimensional reference sequence constructs that account for variations between individuals, different diseases, and different stages of those diseases. Once constructed, these reference sequence constructs can be used to align sequence reads corresponding to genetic samples from patients suspected of having a disease, or who have had the disease and are in suspected remission. The reference sequence constructs also provide insight into the genetic progression of the disease.
G06F 19/00 - Digital computing or data processing equipment or methods, specially adapted for specific applications (specially adapted for specific functions G06F 17/00; data processing systems or methods specially adapted for administrative, commercial, financial, managerial, supervisory or forecasting purposes G06Q; healthcare informatics G16H)
C12Q 1/6886 - Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material for cancer
C12Q 1/6883 - Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
88.
SYSTEMS AND METHODS FOR USE OF KNOWN ALLELES IN READ MAPPING
The invention generally relates to genomic studies and specifically to improved methods for read mapping using identified nucleotides at known locations. The invention provides methods of using identified nucleotides at known places in a genome to guide the analysis of sequence reads from that genome by excluding potential mappings or assemblies that are not congruent with the identified nucleotides. Information about a plurality of SNPs in the subject's genome is used to identify candidate paths through a genomic directed acyclic graph (DAG). Sequence reads are mapped to the candidate paths.
C12Q 1/68 - Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
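The known-allele entry above uses identified nucleotides to exclude paths through a genomic DAG before reads are mapped. A minimal sketch of that filtering follows; the node names, the SNP identifiers `rs1` and `rs2`, and the annotation table are hypothetical.

```python
def consistent_paths(edges, allele_at, known, source, sink):
    """Enumerate source-to-sink paths, skipping nodes that contradict a known SNP.

    allele_at: node -> (snp_id, allele) for nodes that encode a SNP allele
    known:     snp_id -> allele already identified in the subject
    """
    paths = []

    def walk(node, path):
        site = allele_at.get(node)  # None for invariant sequence
        if site and site[0] in known and known[site[0]] != site[1]:
            return  # this branch is incongruent with the subject's genotype
        if node == sink:
            paths.append(path)
            return
        for nxt in edges.get(node, []):
            walk(nxt, path + [nxt])

    walk(source, [source])
    return paths

# Toy DAG with two SNP bubbles.
edges = {
    "n0": ["n1a", "n1g"], "n1a": ["n2"], "n1g": ["n2"],
    "n2": ["n3c", "n3t"], "n3c": ["n4"], "n3t": ["n4"],
}
allele_at = {
    "n1a": ("rs1", "A"), "n1g": ("rs1", "G"),
    "n3c": ("rs2", "C"), "n3t": ("rs2", "T"),
}
known = {"rs1": "G"}  # only the first SNP has been identified in the subject
candidates = consistent_paths(edges, allele_at, known, "n0", "n4")
```

Knowing one allele halves the search space here, from four paths to two; each additional known SNP prunes further branches before any read is aligned.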
89.
SYSTEMS AND METHODS FOR USE OF KNOWN ALLELES IN READ MAPPING
The invention generally relates to genomic studies and specifically to improved methods for read mapping using identified nucleotides at known locations. The invention provides methods of using identified nucleotides at known places in a genome to guide the analysis of sequence reads from that genome by excluding potential mappings or assemblies that are not congruent with the identified nucleotides. Information about a plurality of SNPs in the subject's genome is used to identify candidate paths through a genomic directed acyclic graph (DAG). Sequence reads are mapped to the candidate paths.
G16B 20/20 - Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
G16B 20/00 - ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
G16B 30/00 - ICT specially adapted for sequence analysis involving nucleotides or amino acids
G16B 30/10 - Sequence alignment; Homology search
C12Q 1/68 - Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
90.
Systems and methods for use of known alleles in read mapping
The invention generally relates to genomic studies and specifically to improved methods for read mapping using identified nucleotides at known locations. The invention provides methods of using identified nucleotides at known places in a genome to guide the analysis of sequence reads from that genome by excluding potential mappings or assemblies that are not congruent with the identified nucleotides. Information about a plurality of SNPs in the subject's genome is used to identify candidate paths through a genomic directed acyclic graph (DAG). Sequence reads are mapped to the candidate paths.
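The path-exclusion idea described above can be illustrated with a minimal sketch. It is not the patented method: the toy graph, the node names, the SNP identifiers (`snp1`, `snp2`) and the helper functions are all hypothetical, and the graph is small enough that every path can simply be enumerated. Paths that carry an allele contradicting a known genotype are dropped, and the survivors are the candidate paths to which reads would then be mapped.

```python
from typing import Dict, Iterator, List, Optional, Tuple

# (sequence, (snp_id, allele)) for variant nodes, (sequence, None) otherwise.
Node = Tuple[str, Optional[Tuple[str, str]]]


def enumerate_paths(edges: Dict[str, List[str]], start: str, end: str) -> Iterator[List[str]]:
    """Yield every path from start to end in a DAG given as an adjacency dict."""
    stack = [[start]]
    while stack:
        path = stack.pop()
        if path[-1] == end:
            yield path
            continue
        for nxt in edges.get(path[-1], []):
            stack.append(path + [nxt])


def congruent(path: List[str], nodes: Dict[str, Node], known: Dict[str, str]) -> bool:
    """True if no node on the path carries an allele that contradicts `known`."""
    for name in path:
        tag = nodes[name][1]
        if tag is not None:
            snp, allele = tag
            if snp in known and known[snp] != allele:
                return False
    return True


def candidate_paths(nodes, edges, start, end, known) -> List[List[str]]:
    """Keep only the paths that agree with every known allele."""
    return [p for p in enumerate_paths(edges, start, end) if congruent(p, nodes, known)]


# Hypothetical toy graph with two biallelic SNP sites.
nodes: Dict[str, Node] = {
    "s": ("ACG", None),
    "a1": ("T", ("snp1", "T")),
    "a2": ("C", ("snp1", "C")),
    "m": ("GGA", None),
    "b1": ("A", ("snp2", "A")),
    "b2": ("G", ("snp2", "G")),
    "e": ("TTC", None),
}
edges = {"s": ["a1", "a2"], "a1": ["m"], "a2": ["m"],
         "m": ["b1", "b2"], "b1": ["e"], "b2": ["e"]}

known = {"snp1": "C"}  # assumed genotype call for the subject
paths = candidate_paths(nodes, edges, "s", "e", known)
sequences = sorted("".join(nodes[n][0] for n in p) for p in paths)
print(sequences)
```

With one of the two SNPs known, the four possible paths are reduced to two. A realistic graph would need a filter that prunes during traversal rather than enumerating every path first.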
The invention includes methods for aligning reads (e.g., nucleic acid reads, amino acid reads) to a reference sequence construct, methods for building the reference sequence construct, and systems that use the alignment methods and constructs to produce sequences. The invention also includes methods and systems for evaluating the quality of the alignment between the reads and the reference sequence construct. The method is scalable, and can be used to align millions of reads to a construct thousands of bases or amino acids long. The invention additionally includes methods for identifying a disease or a genotype based upon alignment of nucleic acid reads to a location in the construct.
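As a rough illustration of aligning a read to a graph-shaped reference construct, the sketch below applies a Smith-Waterman-style local alignment with a linear gap penalty to a small DAG. It is an assumption-laden toy, not the algorithm claimed in the abstract: the function name `align_to_dag`, the scoring values and the example graph are all hypothetical. The only graph-specific step is that the first base of a node looks back at the best value among all predecessor nodes instead of a single previous column.

```python
from typing import Dict, List


def align_to_dag(read: str,
                 seqs: Dict[str, str],
                 preds: Dict[str, List[str]],
                 order: List[str],
                 match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Best local-alignment score of `read` against any path through the DAG.

    `order` must be a topological order of the nodes; `preds` maps each node
    to its predecessor nodes (sources may be absent or map to an empty list).
    """
    n = len(read)
    last_col: Dict[str, List[int]] = {}  # DP column after each node's final base
    best = 0
    for node in order:
        parents = preds.get(node, [])
        if parents:
            # Each cell may come from whichever predecessor scores best.
            prev = [max(last_col[p][i] for p in parents) for i in range(n + 1)]
        else:
            prev = [0] * (n + 1)
        for base in seqs[node]:
            cur = [0] * (n + 1)
            for i in range(1, n + 1):
                diag = prev[i - 1] + (match if read[i - 1] == base else mismatch)
                # prev[i] + gap skips a reference base; cur[i - 1] + gap skips a read base.
                cur[i] = max(0, diag, prev[i] + gap, cur[i - 1] + gap)
            best = max(best, max(cur))
            prev = cur
        last_col[node] = prev
    return best


# Hypothetical construct: ACGT, then either A or G, then TTCA.
seqs = {"a": "ACGT", "b": "A", "c": "G", "d": "TTCA"}
preds = {"b": ["a"], "c": ["a"], "d": ["b", "c"]}
score = align_to_dag("CGTGTT", seqs, preds, ["a", "b", "c", "d"])
print(score)
```

The read matches the path through node `c` exactly, so it scores higher against the graph than against the single linear sequence that passes through `b` alone.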
The invention includes methods for aligning reads (e.g., nucleic acid reads) comprising repeating sequences, methods for building reference sequence constructs comprising repeating sequences, and systems that can be used to align reads comprising repeating sequences. The method is scalable, and can be used to align millions of reads to a construct thousands of bases long. The methods and systems can additionally account for variability within or near a repeating sequence due to genetic mutation.
Methods of analyzing a transcriptome that involve obtaining at least one pair of paired-end reads from a transcriptome from an organism, finding an alignment with an optimal score between a first read of the pair and a node in a directed acyclic data structure (the data structure has nodes representing RNA sequences such as exons or transcripts and edges connecting pairs of nodes), identifying candidate paths that include the node connected to a downstream node by a path having a length substantially similar to an insert length of the pair of paired-end reads, and aligning the paired-end reads to the candidate paths to determine an optimal-scoring alignment.
The invention generally provides systems and methods for analysis of RNA-Seq reads in which an annotated reference is represented as a directed acyclic graph (DAG) or similar data structure. Features such as exons and introns from the reference provide nodes in the DAG and those features are linked as pairs in their canonical genomic order by edges. The DAG can scale to any size and can in fact be populated in the first instance by import from an extrinsic annotated reference.
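A minimal sketch of populating such a graph from an annotation follows. It assumes the annotation has already been reduced to named features with half-open coordinates, and that an edge is drawn from each feature to every non-overlapping feature downstream of it. The `Feature` class, the `build_feature_dag` function and that edge rule are illustrative assumptions, not details taken from the abstract.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Feature:
    name: str
    start: int  # 0-based, inclusive
    end: int    # exclusive


def build_feature_dag(features: List[Feature]) -> Dict[str, List[str]]:
    """Return an adjacency dict linking features in genomic order.

    Assumed rule: an edge goes from `up` to `down` whenever `up` ends at or
    before the position where `down` starts, so edges always point
    downstream and the result is acyclic.
    """
    ordered = sorted(features, key=lambda f: (f.start, f.end))
    dag: Dict[str, List[str]] = {f.name: [] for f in ordered}
    for i, up in enumerate(ordered):
        for down in ordered[i + 1:]:
            if up.end <= down.start:
                dag[up.name].append(down.name)
    return dag


# Hypothetical annotation records, deliberately supplied out of order.
features = [Feature("exon3", 500, 620), Feature("exon1", 0, 120), Feature("exon2", 300, 410)]
dag = build_feature_dag(features)
print(dag)
```

Because every upstream exon is linked to every downstream one, a path through this toy graph can skip `exon2`, which is one way to represent an isoform that omits that exon.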
G06F 19/10 - Bioinformatics, i.e. methods or systems for genetic or protein-related data processing in computational molecular biology (in silico methods of screening virtual chemical libraries C40B 30/02; in silico or mathematical methods of creating virtual chemical libraries C40B 50/02)
95.
SYSTEMS AND METHODS FOR USING PAIRED-END DATA IN DIRECTED ACYCLIC STRUCTURE
Methods of analyzing a transcriptome that involve obtaining at least one pair of paired-end reads from a transcriptome from an organism, finding an alignment with an optimal score between a first read of the pair and a node in a directed acyclic data structure (the data structure has nodes representing RNA sequences such as exons or transcripts and edges connecting pairs of nodes), identifying candidate paths that include the node connected to a downstream node by a path having a length substantially similar to an insert length of the pair of paired-end reads, and aligning the paired-end reads to the candidate paths to determine an optimal-scoring alignment.
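The candidate-path step can be sketched as a bounded depth-first search. This is a hedged illustration only: it assumes the first read has already been placed on a node, measures path length as the summed length of every node on the path (both ends included), and treats "substantially similar" as falling within a fixed tolerance. The function name `paths_matching_insert`, the tolerance and the exon lengths are hypothetical.

```python
from typing import Dict, List


def paths_matching_insert(lengths: Dict[str, int],
                          edges: Dict[str, List[str]],
                          start: str,
                          insert_len: int,
                          tol: int) -> List[List[str]]:
    """Paths from `start` whose total length is within `tol` of `insert_len`.

    Node lengths are assumed to be positive, so a path that has already
    reached insert_len + tol cannot become a candidate by growing further.
    """
    results: List[List[str]] = []
    stack = [([start], lengths[start])]
    while stack:
        path, total = stack.pop()
        if len(path) > 1 and abs(total - insert_len) <= tol:
            results.append(path)
        if total >= insert_len + tol:
            continue  # any extension would overshoot the tolerance window
        for nxt in edges.get(path[-1], []):
            stack.append((path + [nxt], total + lengths[nxt]))
    return results


# Hypothetical exon lengths and splice edges.
lengths = {"e1": 100, "e2": 150, "e3": 80, "e4": 200}
edges = {"e1": ["e2", "e3"], "e2": ["e3", "e4"], "e3": ["e4"]}
candidates = paths_matching_insert(lengths, edges, "e1", insert_len=300, tol=40)
print(candidates)
```

Only the path e1, e2, e3 (total length 330) falls inside the 260 to 340 window, so it would be the single candidate to which both reads of the pair are then aligned.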
G06F 19/22 - for sequence comparison involving nucleotides or amino acids, e.g. homology search, motif or single nucleotide polymorphism [SNP] discovery or sequence alignment
96.
METHODS AND SYSTEMS FOR GENOTYPING GENETIC SAMPLES
The invention provides methods and systems for making specific base calls at specific loci using a reference sequence construct, e.g., a directed acyclic graph (DAG) that represents known variants at each locus of the genome. Because the sequence reads are aligned directly to the DAG, the subsequent step of comparing a mutation, vis-à-vis the reference genome, to a table of known mutations can be eliminated. The disclosed methods and systems are notably efficient in dealing with structural variations within a genome or mutations that are within a structural variation.
The invention includes methods and systems for identifying disease-induced mutations by producing multi-dimensional reference sequence constructs that account for variations between individuals, different diseases, and different stages of those diseases. Once constructed, these reference sequence constructs can be used to align sequence reads corresponding to genetic samples from patients suspected of having a disease, or who have had the disease and are in suspected remission. The reference sequence constructs also provide insight into the genetic progression of the disease.
C12Q 1/6809 - Methods for determination or identification of nucleic acids involving differential detection
G16B 20/00 - ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
G16B 30/00 - ICT specially adapted for sequence analysis involving nucleotides or amino acids
C12Q 1/68 - Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
98.
METHODS AND SYSTEMS FOR GENOTYPING GENETIC SAMPLES
The invention provides methods and systems for making specific base calls at specific loci using a reference sequence construct, e.g., a directed acyclic graph (DAG) that represents known variants at each locus of the genome. Because the sequence reads are aligned directly to the DAG, the subsequent step of comparing a mutation, vis-à-vis the reference genome, to a table of known mutations can be eliminated. The disclosed methods and systems are notably efficient in dealing with structural variations within a genome or mutations that are within a structural variation.
The invention includes methods for aligning reads (e.g., nucleic acid reads, amino acid reads) to a reference sequence construct, methods for building the reference sequence construct, and systems that use the alignment methods and constructs to produce sequences. The invention also includes methods and systems for evaluating the quality of the alignment between the reads and the reference sequence construct. The method is scalable, and can be used to align millions of reads to a construct thousands of bases or amino acids long. The invention additionally includes methods for identifying a disease or a genotype based upon alignment of nucleic acid reads to a location in the construct.
C12Q 1/68 - Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
G06F 19/10 - Bioinformatics, i.e. methods or systems for genetic or protein-related data processing in computational molecular biology (in silico methods of screening virtual chemical libraries C40B 30/02; in silico or mathematical methods of creating virtual chemical libraries C40B 50/02)
100.
METHODS AND SYSTEMS FOR IDENTIFYING DISEASE-INDUCED MUTATIONS
The invention includes methods and systems for identifying disease-induced mutations by producing multi-dimensional reference sequence constructs that account for variations between individuals, different diseases, and different stages of those diseases. Once constructed, these reference sequence constructs can be used to align sequence reads corresponding to genetic samples from patients suspected of having a disease, or who have had the disease and are in suspected remission. The reference sequence constructs also provide insight into the genetic progression of the disease.
C12Q 1/68 - Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids