The invention provides methods for comparing one set of genetic sequences to another without discarding any information within either set. A set of genetic sequences is represented using a directed acyclic graph (DAG), avoiding any unwarranted reduction to a linear data structure. The invention provides a way to align one sequence DAG to another to produce an alignment that can itself be stored as a DAG. DAG-to-DAG alignment is a natural choice wherever a set of genomic information consisting of more than one string needs to be compared to any non-linear reference. For example, a subpopulation DAG could be compared to a population DAG in order to compare the genetic features of that subpopulation to those of the population.
C12Q 1/6883 - Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
G16B 30/00 - ICT specially adapted for sequence analysis involving nucleotides or amino acids
Systems and methods for analyzing genomic information can include obtaining a sequence read including genetic information; identifying, within a graph representing a reference genome, a plurality of candidate mapping positions that relate to the genetic information, the graph comprising nodes representing genetic sequences and edges connecting pairs of nodes; determining, by means of a computer system, whether an alignment with the graph surrounding each of the plurality of candidate mapping positions is advanced or basic; and performing for each candidate mapping position, by means of the computer system, a local alignment based on whether the local alignment is advanced or basic. The advanced local alignment can include a first-local-alignment algorithm, and the basic local alignment includes a second-local-alignment algorithm. Based on the local alignments, the mapped position of the sequence read can be identified within the genome.
The invention provides methods for identifying rare variants near a structural variation in a genetic sequence, for example, in a nucleic acid sample taken from a subject. The invention additionally includes methods for aligning reads (e.g., nucleic acid reads) to a reference sequence construct accounting for the structural variation, methods for building a reference sequence construct accounting for the structural variation or the structural variation and the rare variant, and systems that use the alignment methods to identify rare variants. The method is scalable, and can be used to align millions of reads to a construct thousands of bases long, or longer.
The invention provides systems and methods for determining patterns of modification to a genome of a subject by representing the genome using a graph, such as a directed acyclic graph (DAG) with divergent paths for regions that are potentially subject to modification, profiling segments of the genome for evidence of epigenetic modification, and aligning the profiled segments to the DAG to determine locations and patterns of the epigenetic modification within the genome.
The invention provides systems and methods for analyzing viruses by representing viral genetic diversity with a directed acyclic graph (DAG), which allows genetic sequencing technology to detect rare variations and represent otherwise difficult-to-document diversity within a sample. Additionally, a host-specific sequence DAG can be used to effectively segregate viral nucleic acid sequence reads from host sequence reads when a sample from a host is subject to sequencing. Known viral genomes can be represented using a viral reference DAG and the viral sequence reads from the sample can be compared to the viral DAG to identify viral species or strains from which the reads were derived. Where the viral sequence reads indicate great genetic diversity in the virus that was infecting the host, those reads can be assembled into a DAG that itself properly represents that diversity.
G01N 33/48 - Biological material, e.g. blood, urine; Haemocytometers
C12Q 1/6809 - Methods for determination or identification of nucleic acids involving differential detection
C12Q 1/70 - Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving virus or bacteriophage
G16B 30/00 - ICT specially adapted for sequence analysis involving nucleotides or amino acids
The invention provides methods of analyzing an individual's mtDNA by transforming available reference sequences into a directed graph that compactly represents all the information without duplication and comparing sequence reads from the mtDNA to the graph to identify the individual or describe their mtDNA. A directed graph can represent all of the genetic variation found among the mitochondrial genomes across all of a number of reference organisms while providing a single article to which sequence reads can be aligned or compared. Thus any sequence read or other sequence fragment can be compared, in a single operation, to the article that represents all of the reference mitochondrial sequences.
The invention provides oncogenomic methods for detecting tumors by identifying circulating tumor DNA. A patient-specific reference directed acyclic graph (DAG) represents known human genomic sequences and non-tumor DNA from the patient as well as known tumor-associated mutations. Sequence reads from cell-free plasma DNA from the patient are mapped to the patient-specific genomic reference graph. Any of the known tumor-associated mutations found in the reads and any de novo mutations found in the reads are reported as the patient’s tumor mutation burden.
C12Q 1/6886 - Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material for cancer
G16B 5/00 - ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
G16B 30/00 - ICT specially adapted for sequence analysis involving nucleotides or amino acids
The invention provides methods for identifying rare variants near a structural variation in a genetic sequence, for example, in a nucleic acid sample taken from a subject. The invention additionally includes methods for aligning reads (e.g., nucleic acid reads) to a reference sequence construct accounting for the structural variation, methods for building a reference sequence construct accounting for the structural variation or the structural variation and the rare variant, and systems that use the alignment methods to identify rare variants. The method is scalable, and can be used to align millions of reads to a construct thousands of bases long, or longer.
The invention includes methods and systems for identifying disease-induced mutations by producing multi-dimensional reference sequence constructs that account for variations between individuals, different diseases, and different stages of those diseases. Once constructed, these reference sequence constructs can be used to align sequence reads corresponding to genetic samples from patients suspected of having a disease, or who have had the disease and are in suspected remission. The reference sequence constructs also provide insight into the genetic progression of the disease.
C12Q 1/6886 - Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material for cancer
C12Q 1/6883 - Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
G16B 20/20 - Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
G16B 20/00 - ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
Techniques for generating a graph reference construct. The techniques include: obtaining a plurality of variants associated with a reference sequence construct; generating the graph reference construct using the plurality of variants and the reference sequence construct; and outputting the generated graph reference construct. Generating the graph reference construct includes: filtering the plurality of variants to obtain a filtered set of variants, the filtering including a first filtering stage and a second filtering stage, and generating the graph reference construct using the filtered set of variants. The first filtering stage includes identifying a first subset of variants at least in part by excluding one or more structural variants from the plurality of variants. The second filtering stage includes identifying the filtered set of variants at least in part by excluding one or more multiply-alignable variants from the first subset of variants.
Techniques for generating a graph reference construct. The techniques include: obtaining a plurality of variants associated with a reference sequence construct; generating the graph reference construct using the plurality of variants and the reference sequence construct; and outputting the generated graph reference construct. Generating the graph reference construct includes: filtering the plurality of variants to obtain a filtered set of variants, the filtering including a first filtering stage and a second filtering stage, and generating the graph reference construct using the filtered set of variants. The first filtering stage includes identifying a first subset of variants at least in part by excluding one or more structural variants from the plurality of variants. The second filtering stage includes identifying the filtered set of variants at least in part by excluding one or more multiply-alignable variants from the first subset of variants.
Techniques for generating a graph reference construct. The techniques include: obtaining a plurality of variants associated with a reference sequence construct; generating the graph reference construct using the plurality of variants and the reference sequence construct; and outputting the generated graph reference construct. Generating the graph reference construct includes: filtering the plurality of variants to obtain a filtered set of variants, the filtering including a first filtering stage and a second filtering stage, and generating the graph reference construct using the filtered set of variants. The first filtering stage includes identifying a first subset of variants at least in part by excluding one or more structural variants from the plurality of variants. The second filtering stage includes identifying the filtered set of variants at least in part by excluding one or more multiply-alignable variants from the first subset of variants.
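The two filtering stages described above can be illustrated with a minimal sketch. The record layout (`name`, `pos`, `ref`, `alt`), the 50-base length cutoff for structural variants, and the test for multiply-alignable variants (the alternate allele with its flanks already occurs in the linear reference) are all assumptions made for illustration; the abstract does not specify them.

```python
def first_stage(variants, sv_min_len=50):
    """Stage 1: exclude structural variants (assumed here: any allele >= sv_min_len bases)."""
    return [v for v in variants
            if max(len(v["ref"]), len(v["alt"])) < sv_min_len]

def second_stage(variants, reference, flank=3):
    """Stage 2: exclude variants whose alternate allele, with flanking bases,
    already occurs somewhere in the reference (assumed proxy for 'multiply-alignable')."""
    kept = []
    for v in variants:
        pos, ref_len = v["pos"], len(v["ref"])
        left = reference[max(0, pos - flank):pos]
        right = reference[pos + ref_len:pos + ref_len + flank]
        if (left + v["alt"] + right) not in reference:
            kept.append(v)
    return kept

reference = "AACCGGTTAACCGATT"
variants = [
    {"name": "A", "pos": 13, "ref": "A", "alt": "G"},            # alt context CCGGTT occurs at offset 2
    {"name": "B", "pos": 6, "ref": "T", "alt": "C"},             # alt context is unique
    {"name": "C", "pos": 2, "ref": "C", "alt": "C" + "T" * 60},  # long insertion
]
filtered = second_stage(first_stage(variants), reference)
```

Only the filtered set would then be passed on to graph construction.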
Methods of the invention include representing biological data in a memory subsystem within a computer system with a data structure that is particular to a location in the memory subsystem and serializing the data structure into a stream of bytes that can be deserialized into a clone of the data structure. In a preferred genomic embodiment, the biological data comprises genomic sequences and the data structure comprises a genomic directed acyclic graph (DAG) in which objects have adjacency lists of pointers that indicate the location of any object adjacent to that object. After serialization and deserialization, the clone genomic DAG has the same structure as the original to represent the same sequences and relationships among them as the original.
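As a rough illustration of the idea, the sketch below replaces in-memory references with position-independent indices before writing bytes, then rebuilds the references on load. The `Node` class, the JSON encoding, and the helper names are assumptions chosen for brevity, not the format described in the source.

```python
import json
from dataclasses import dataclass, field

@dataclass
class Node:
    seq: str
    out: list = field(default_factory=list)  # adjacency list of references to other Node objects

def serialize(nodes):
    """Turn memory-specific references into indices and emit a byte stream."""
    index = {id(n): i for i, n in enumerate(nodes)}
    records = [{"seq": n.seq, "out": [index[id(m)] for m in n.out]} for n in nodes]
    return json.dumps(records).encode("utf-8")

def deserialize(data):
    """Rebuild a clone whose adjacency lists point at the new objects."""
    records = json.loads(data.decode("utf-8"))
    clones = [Node(r["seq"]) for r in records]
    for clone, r in zip(clones, records):
        clone.out = [clones[i] for i in r["out"]]
    return clones

def paths(node):
    """Spell every sequence reachable from node to a sink."""
    if not node.out:
        return [node.seq]
    return [node.seq + tail for nxt in node.out for tail in paths(nxt)]

# ACG -> (T | C) -> GA
a, t, c, g = Node("ACG"), Node("T"), Node("C"), Node("GA")
a.out, t.out, c.out = [t, c], [g], [g]
clone = deserialize(serialize([a, t, c, g]))
```

The clone spells the same sequences as the original while sharing no objects with it.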
A method for screening for disease in a genomic sample includes receiving a representation of a reference genome comprising a sequence of symbols. The presence of a predicted mutational event is identified in a location of the reference genome. An alternate path is created in the reference genome representing the predicted mutational event. A plurality of sequence reads are obtained from a genomic sample, wherein at least one sequence read comprises at least a portion of the predicted mutational event. The at least one sequence read is then mapped to the reference genome and a location is determined corresponding to the predicted mutational event. The predicted mutational event is then identified as present in the genomic sample. The method may be used to detect evidence of non-allelic homologous recombination (NAHR) occurring in genomic samples.
G16B 20/20 - Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
G16H 50/50 - ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for simulation or modelling of medical disorders
G16B 20/00 - ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
G16B 30/00 - ICT specially adapted for sequence analysis involving nucleotides or amino acids
Computer-implemented methods and systems for performing a local assembly of a genomic region of interest include the de novo or assisted creation of a directed graph, such as a directed acyclic graph (DAG), from a plurality of obtained nucleotide sequence reads. First and second sequence reads are aligned to each other to define at least one node of the DAG. Successive alignments of the remaining sequence reads to the then-defined DAG are performed to extend nodes and/or add nodes to the DAG. Graph-aware alignment techniques that produce alignment scores or indicators are employed in defining the nodes of the DAG from the sequence reads. The created DAG represents and describes in detail the genomic region of interest and can be used to perform variant calls.
The invention includes methods for aligning reads (e.g., nucleic acid reads) comprising repeating sequences, methods for building reference sequence constructs comprising repeating sequences, and systems that can be used to align reads comprising repeating sequences. The method is scalable, and can be used to align millions of reads to a construct thousands of bases long. The methods and systems can additionally account for variability within a repeating sequence, or near to a repeating sequence, due to genetic mutation.
The invention includes methods for aligning reads (e.g., nucleic acid reads, amino acid reads) to a reference sequence construct, methods for building the reference sequence construct, and systems that use the alignment methods and constructs to produce sequences. The invention also includes methods and systems for evaluating the quality of the alignment between the reads and the reference sequence construct. The method is scalable, and can be used to align millions of reads to a construct thousands of bases or amino acids long. The invention additionally includes methods for identifying a disease or a genotype based upon alignment of nucleic acid reads to a location in the construct.
The invention provides methods for comparing one set of genetic sequences to another without discarding any information within either set. A set of genetic sequences is represented using a directed acyclic graph (DAG) avoiding any unwarranted reduction to a linear data structure. The invention provides a way to align one sequence DAG to another to produce an alignment that can itself be stored as a DAG. DAG-to-DAG alignment is a natural choice wherever a set of genomic information consisting of more than one string needs to be compared to any non-linear reference. For example, a subpopulation DAG could be compared to a population DAG in order to compare the genetic features of that subpopulation to those of the population.
C12Q 1/6883 - Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
The invention generally relates to genomic studies and specifically to improved methods for read mapping using identified nucleotides at known locations. The invention provides methods of using identified nucleotides at known places in a genome to guide the analysis of sequence reads from that genome by excluding potential mappings or assemblies that are not congruent with the identified nucleotides. Information about a plurality of SNPs in the subject's genome is used to identify candidate paths through a genomic directed acyclic graph (DAG). Sequence reads are mapped to the candidate paths.
A method for stream-processing biomedical data includes receiving, by a file system on a computing device, a first request for access to at least a first portion of a file stored on a remotely located storage device. The method includes receiving, by the file system, a second request for access to at least a second portion of the file. The method includes determining, by a pre-fetching component executing on the computing device, whether the first request and the second request are associated with a sequential read operation. The method includes automatically retrieving, by the pre-fetching component, a third portion of the requested file, before receiving a third request for access to at least the third portion of the file, based on a determination that the first request and the second request are associated with the sequential read operation.
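A minimal sketch of the sequential-read check follows. The `Prefetcher` class, the rule that "sequential" means the second request starts exactly where the first ended, and the synchronous prefetch are illustrative assumptions; a real file system component would likely fetch in the background.

```python
class Prefetcher:
    """Wraps a remote read function and prefetches after two sequential requests."""

    def __init__(self, fetch):
        self.fetch = fetch      # hypothetical callable(offset, size) -> bytes
        self.last_end = None    # end offset of the previous request
        self.cache = {}         # (offset, size) -> prefetched bytes

    def read(self, offset, size):
        cached = self.cache.pop((offset, size), None)
        data = cached if cached is not None else self.fetch(offset, size)
        # Treat two requests as sequential when the second starts
        # exactly where the first one ended.
        sequential = self.last_end is not None and offset == self.last_end
        self.last_end = offset + size
        if sequential:
            nxt = (self.last_end, size)
            self.cache[nxt] = self.fetch(*nxt)  # retrieve the next portion early
        return data

blob = bytes(range(100))   # stands in for the remote file
calls = []                 # offsets actually requested from "remote" storage

def remote_fetch(offset, size):
    calls.append(offset)
    return blob[offset:offset + size]

p = Prefetcher(remote_fetch)
first = p.read(0, 10)
second = p.read(10, 10)
third = p.read(20, 10)     # served from the prefetch cache
```

In this run the third portion is requested from storage during the second call, before the third request arrives.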
G06F 12/0862 - Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches with prefetch
G16B 50/00 - ICT programming tools or database systems specially adapted for bioinformatics
G16B 50/30 - Data warehousing; Computing architectures
H04L 67/1097 - Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
Genomic data is written to disk in a compact format by dividing the data into segments and encoding each segment with the smallest number of bits per character necessary for whatever alphabet of characters appears in that segment. A computer system dynamically chooses the segment boundaries for maximum space savings. A first one of the segments may use a different number of bits per character than a second one of the segments. In one embodiment, dividing the data into segments comprises scanning the data and keeping track of a number of unique characters, noting positions in the sequence where the number increases to a power of two, calculating a compression that would be obtained by dividing the genomic data into one of the plurality of segments at ones of the noted positions, and dividing the genomic data into the plurality of segments at the positions that yield the best compression.
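The boundary search in that embodiment can be sketched as follows. This is a simplified illustration that evaluates a single split only; the 16-bit per-segment header and all function names are assumptions, not details from the source.

```python
def bits_per_char(segment):
    """Smallest whole number of bits that can index the segment's alphabet (at least 1)."""
    return max(1, (len(set(segment)) - 1).bit_length())

def encoded_bits(segment, header_bits=16):
    """Cost of one segment: assumed fixed header plus fixed-width symbols."""
    return header_bits + bits_per_char(segment) * len(segment)

def candidate_boundaries(data):
    """Positions where the running count of distinct characters reaches 2, 4, 8, ..."""
    seen, positions = set(), []
    for i, ch in enumerate(data):
        if ch not in seen:
            seen.add(ch)
            n = len(seen)
            if n > 1 and (n & (n - 1)) == 0:  # n is a power of two
                positions.append(i)
    return positions

def best_split(data, header_bits=16):
    """Compare the unsplit cost with a split at each noted position."""
    best_pos, best_cost = None, encoded_bits(data, header_bits)
    for pos in candidate_boundaries(data):
        cost = (encoded_bits(data[:pos], header_bits)
                + encoded_bits(data[pos:], header_bits))
        if cost < best_cost:
            best_pos, best_cost = pos, cost
    return best_pos, best_cost

data = "A" * 100 + "ACGT" * 10
split_at, total_bits = best_split(data)
```

Here the long run of a single character is cheaper to store at one bit per character, so splitting just before the alphabet grows beats encoding the whole string at two bits per character.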
The invention provides systems and methods for determining patterns of modification to a genome of a subject by representing the genome using a graph, such as a directed acyclic graph (DAG) with divergent paths for regions that are potentially subject to modification, profiling segments of the genome for evidence of epigenetic modification, and aligning the profiled segments to the DAG to determine locations and patterns of the epigenetic modification within the genome.
The invention provides systems and methods for analyzing viruses by representing viral genetic diversity with a directed acyclic graph (DAG), which allows genetic sequencing technology to detect rare variations and represent otherwise difficult-to-document diversity within a sample. Additionally, a host-specific sequence DAG can be used to effectively segregate viral nucleic acid sequence reads from host sequence reads when a sample from a host is subject to sequencing. Known viral genomes can be represented using a viral reference DAG and the viral sequence reads from the sample can be compared to the viral DAG to identify viral species or strains from which the reads were derived. Where the viral sequence reads indicate great genetic diversity in the virus that was infecting the host, those reads can be assembled into a DAG that itself properly represents that diversity.
G01N 33/48 - Biological material, e.g. blood, urine; Haemocytometers
C12Q 1/70 - Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving virus or bacteriophage
G16B 30/00 - ICT specially adapted for sequence analysis involving nucleotides or amino acids
C12Q 1/6809 - Methods for determination or identification of nucleic acids involving differential detection
25.
System and method for dynamic control of workflow execution
Some embodiments relate to systems for processing one or more computational workflows. In one embodiment, a description of a computational workflow comprises a plurality of applications, in which applications are represented as nodes and edges connecting the nodes indicate the flow of data elements between applications. A task execution module is configured to create and execute tasks. An application programming interface (API) is in communication with the task execution module and comprises a plurality of function calls for controlling at least one function of the task execution module. An API script includes instructions to the API to create and execute a plurality of tasks corresponding to the execution of the computational workflow for a plurality of samples. A graphical user interface (GUI) is in communication with the task execution module and configured to receive input from an end user to initiate execution of the API script.
In one embodiment, a method of processing a computational workflow comprises receiving a description of a computational workflow. The description comprises a plurality of steps, in which each step has at least one input and at least one output, and further wherein an input from a second step depends on an output from a first step. The description is translated into a static workflow graph stored in a memory, the static workflow graph comprising a plurality of nodes having input ports and output ports, wherein dependencies between inputs and outputs are specified as edges between input ports and output ports. Information about a first set of nodes is then extracted from the static workflow graph and placed into a dynamic graph. A first actionable job is identified from the dynamic graph and executed.
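One loose way to picture the static and dynamic graphs is sketched below. The dictionary representation, the step names, and the rule for choosing among ready jobs are hypothetical; ports are collapsed into whole-step dependencies to keep the example short.

```python
# Static workflow graph: step -> set of steps whose outputs it consumes (hypothetical steps).
static_graph = {
    "align": set(),
    "sort": {"align"},
    "call": {"sort"},
    "qc": {"align"},
}

def execute_workflow(static_graph, run_step):
    done, order = set(), []
    dynamic = {}  # nodes extracted so far that are ready to run: step -> dependencies
    while len(done) < len(static_graph):
        # Extract only those nodes whose upstream outputs already exist.
        for step, deps in static_graph.items():
            if step not in done and step not in dynamic and deps <= done:
                dynamic[step] = deps
        if not dynamic:
            raise ValueError("cycle or unsatisfiable dependency")
        step = sorted(dynamic)[0]  # pick one actionable job deterministically
        del dynamic[step]
        run_step(step)
        done.add(step)
        order.append(step)
    return order

executed = []
order = execute_workflow(static_graph, executed.append)
```

Only part of the static graph is materialized at any time, and each completed job can make further nodes eligible for extraction.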
The invention provides methods of analyzing an individual's mtDNA by transforming available reference sequences into a directed graph that compactly represents all the information without duplication and comparing sequence reads from the mtDNA to the graph to identify the individual or describe their mtDNA. A directed graph can represent all of the genetic variation found among the mitochondrial genomes across all of a number of reference organisms while providing a single article to which sequence reads can be aligned or compared. Thus any sequence read or other sequence fragment can be compared, in a single operation, to the article that represents all of the reference mitochondrial sequences.
Embodiments of the invention utilize a graph-based approach for simulating genomic datasets from large scale populations. Genomic data may be represented as a directed acyclic graph (DAG) that incorporates individual sample data including variant type, position, and zygosity. A simulator may operate on the DAG to generate variant datasets based on probabilistic traversal of the DAG. This probabilistic traversal reflects genomic variant types associated with the subpopulation used to build the DAG, and as a result, the generated variant datasets maintain statistical fidelity to the original sample data.
G01N 33/48 - Biological material, e.g. blood, urine; Haemocytometers
G16B 5/00 - ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
G16B 20/00 - ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
G16B 20/20 - Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
G16B 45/00 - ICT specially adapted for bioinformatics-related data visualisation, e.g. displaying of maps or networks
G16B 50/00 - ICT programming tools or database systems specially adapted for bioinformatics
G16B 50/30 - Data warehousing; Computing architectures
29.
Hashing data-processing steps in workflow environments
Various approaches for data storage and retrieval for a computer memory include processing a computational workflow having multiple data-processing steps, generating and storing a first hash value associated with a first step of the data-processing steps based on an input to the first step, generating and storing a second hash value associated with a second step of the data-processing steps based on the generated first hash value, and reconstructing a computational state of the workflow based on the second hash value, thereby avoiding re-execution of a portion of the workflow corresponding to the second hash value.
G06F 9/48 - Program initiating; Program switching, e.g. by interrupt
G06Q 10/06 - Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
H04L 9/06 - Arrangements for secret or secure communications; Network security protocols the encryption apparatus using shift registers or memories for blockwise coding, e.g. D.E.S. systems
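The chained hashing can be sketched roughly as follows. SHA-256, the in-memory dictionary cache, and the function names are assumptions for illustration; the abstract does not name a hash function or storage layer.

```python
import hashlib

def step_hash(step_name, upstream_hash):
    """Hash of a step, derived from the hash of whatever feeds it."""
    return hashlib.sha256(f"{step_name}|{upstream_hash}".encode()).hexdigest()

def run(steps, initial_input, cache, log):
    """Run steps in order, reusing cached outputs whose hash is already stored."""
    value = initial_input
    h = hashlib.sha256(repr(initial_input).encode()).hexdigest()
    for name, fn in steps:
        h = step_hash(name, h)      # first hash depends on the input, later ones on the prior hash
        if h in cache:
            value = cache[h]        # reconstruct state without re-executing the step
        else:
            value = fn(value)
            cache[h] = value
            log.append(name)        # record which steps were actually executed
    return value

steps = [("double", lambda x: x * 2), ("inc", lambda x: x + 1)]
cache, log = {}, []
first = run(steps, 5, cache, log)
second = run(steps, 5, cache, log)   # identical input: both steps come from the cache
third = run(steps, 6, cache, log)    # new input: hashes differ, so both steps run again
```

Because each hash folds in the previous one, a change to the input or to any upstream step changes every downstream hash and invalidates exactly the affected portion of the workflow.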
The invention provides oncogenomic methods for detecting tumors by identifying circulating tumor DNA. A patient-specific reference directed acyclic graph (DAG) represents known human genomic sequences and non-tumor DNA from the patient as well as known tumor-associated mutations. Sequence reads from cell-free plasma DNA from the patient are mapped to the patient-specific genomic reference graph. Any of the known tumor-associated mutations found in the reads and any de novo mutations found in the reads are reported as the patient's tumor mutation burden.
G01N 33/48 - Biological material, e.g. blood, urine; Haemocytometers
G01N 33/50 - Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing
C12Q 1/6886 - Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material for cancer
G16B 5/00 - ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
G16B 30/00 - ICT specially adapted for sequence analysis involving nucleotides or amino acids
The invention provides methods for identifying rare variants near a structural variation in a genetic sequence, for example, in a nucleic acid sample taken from a subject. The invention additionally includes methods for aligning reads (e.g., nucleic acid reads) to a reference sequence construct accounting for the structural variation, methods for building a reference sequence construct accounting for the structural variation or the structural variation and the rare variant, and systems that use the alignment methods to identify rare variants. The method is scalable, and can be used to align millions of reads to a construct thousands of bases long, or longer.
Various embodiments of the disclosure relate to systems and methods for aligning a sequence read to a graph reference. In one embodiment, the method comprises selecting a first node from a graph reference, the graph reference comprising a plurality of nodes connected by a plurality of directed edges, at least one node of the plurality of nodes having a nucleotide sequence. The method further comprises traversing the graph reference according to a depth-first search, and comparing a sequence read to nucleotide sequences generated from the traversal of the graph reference. The traversal of the graph is then modified in response to a determination that each and every node associated with a given nucleotide sequence was previously evaluated.
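A small sketch of the depth-first comparison is given below, under simplifying assumptions: the read must match exactly, the graph is given as plain dictionaries, and the pruning step based on previously evaluated nodes is not modeled. All names are hypothetical.

```python
def dfs_find(edges, seqs, start, read):
    """Depth-first traversal that spells path sequences and reports node paths containing the read."""
    hits = []
    stack = [(start, [start], seqs[start])]
    while stack:
        node, path, spelled = stack.pop()
        if read in spelled:
            hits.append(path)
            continue  # stop extending a path once the read has been found on it
        for nxt in edges.get(node, []):
            stack.append((nxt, path + [nxt], spelled + seqs[nxt]))
    return hits

# Graph reference: ACG -> (T | C) -> GA
seqs = {0: "ACG", 1: "T", 2: "C", 3: "GA"}
edges = {0: [1, 2], 1: [3], 2: [3]}
hits = dfs_find(edges, seqs, 0, "GCG")
```

The read "GCG" only appears on the path through the node carrying "C", so only that branch is reported.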
G06K 9/00 - Methods or arrangements for reading or recognising printed or written characters or for recognising patterns, e.g. fingerprints
G06K 9/68 - Methods or arrangements for recognition using electronic means using sequential comparisons of the image signals with a plurality of reference, e.g. addressable memory
33.
Systems and methods for adaptive local alignment for graph genomes
Systems and methods for analyzing genomic information can include obtaining a sequence read including genetic information; identifying, within a graph representing a reference genome, a plurality of candidate mapping positions that relate to the genetic information, the graph comprising nodes representing genetic sequences and edges connecting pairs of nodes; determining, by means of a computer system, whether an alignment with the graph surrounding each of the plurality of candidate mapping positions is advanced or basic; and performing for each candidate mapping position, by means of the computer system, a local alignment based on whether the local alignment is advanced or basic. The advanced local alignment can include a first-local-alignment algorithm, and the basic local alignment includes a second-local-alignment algorithm. Based on the local alignments, the mapped position of the sequence read can be identified within the genome.
The invention provides methods and systems for making specific base calls at specific loci using a reference sequence construct, e.g., a directed acyclic graph (DAG) that represents known variants at each locus of the genome. Because the sequence reads are aligned to the DAG during alignment, the subsequent step of comparing a mutation, vis-a-vis the reference genome, to a table of known mutations can be eliminated. The disclosed methods and systems are notably efficient in dealing with structural variations within a genome or mutations that are within a structural variation.
In one embodiment, a method for identifying candidate sequences for genotyping a genomic sample comprises obtaining a plurality of sequence reads mapping to a genomic region of interest. The plurality of sequence reads are assembled into a directed acyclic graph (DAG) comprising a plurality of branch sites representing variation present in the set of sequence reads, each branch site comprising two or more branches. A path through the DAG comprises a set of successive branches over two or more branch sites and represents a possible candidate sequence of the genomic sample. One or more paths through the DAG are ranked by calculating scores for one or more branch sites, wherein the calculated score comprises a number of sequence reads that span multiple branch sites in a given path. At least one path is selected as a candidate sequence based at least in part on its rank.
The invention includes methods and systems for identifying disease-induced mutations by producing multi-dimensional reference sequence constructs that account for variations between individuals, different diseases, and different stages of those diseases. Once constructed, these reference sequence constructs can be used to align sequence reads corresponding to genetic samples from patients suspected of having a disease, or who have had the disease and are in suspected remission. The reference sequence constructs also provide insight into the genetic progression of the disease.
C12Q 1/6886 - Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material for cancer
C12Q 1/6883 - Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
37.
System and method for dynamic control of workflow execution
Some embodiments relate to systems for processing one or more computational workflows. In one embodiment, a description of a computational workflow comprises a plurality of applications, in which applications are represented as nodes and edges connecting the nodes indicate the flow of data elements between applications. A task execution module is configured to create and execute tasks. An application programming interface (API) is in communication with the task execution module and comprises a plurality of function calls for controlling at least one function of the task execution module. An API script includes instructions to the API to create and execute a plurality of tasks corresponding to the execution of the computational workflow for a plurality of samples. A graphical user interface (GUI) is in communication with the task execution module and configured to receive input from an end user to initiate execution of the API script.
The invention includes methods for aligning reads (e.g., nucleic acid reads, amino acid reads) to a reference sequence construct, methods for building the reference sequence construct, and systems that use the alignment methods and constructs to produce sequences. The method is scalable, and can be used to align millions of reads to a construct thousands of bases or amino acids long. The invention additionally includes methods for identifying a disease or a genotype based upon alignment of nucleic acid reads to a location in the construct.
In one aspect, a method for scheduling jobs in a computational workflow includes identifying, from a computational workflow by a workflow execution engine executing on a processor, a plurality of jobs ready for execution. The method includes sorting, based on computational resource requirements associated with each identified job, the identified jobs into a prioritized queue. The method includes provisioning one or more computational instances based on the computational resource requirements of the identified jobs in the prioritized queue, wherein at least one computational instance is provisioned based on a highest priority job in the queue. The method includes submitting the prioritized jobs for execution to the one or more computational instances.
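The scheduling method above (sort ready jobs into a prioritized queue, provision instances from the head of the queue, then submit) might be sketched as follows. The largest-first priority rule, the tuple layouts, and the function name `schedule` are assumptions made for illustration; the abstract does not specify them.

```python
def schedule(jobs, instance_types):
    """Assign ready jobs to provisioned instances.

    jobs: list of (name, cpus, mem_gb) tuples that are ready to run.
    instance_types: list of (type_name, cpus, mem_gb) tuples that can be provisioned.
    Returns a list of instance dicts, each with the jobs submitted to it.
    """
    # Prioritized queue: most demanding job first (an assumed priority rule).
    queue = sorted(jobs, key=lambda j: (-j[1], -j[2]))
    instances = []
    for name, cpus, mem in queue:
        # Reuse an already provisioned instance if it still has room.
        target = next(
            (i for i in instances if i["free_cpus"] >= cpus and i["free_mem"] >= mem),
            None,
        )
        if target is None:
            # Otherwise provision the smallest instance type that fits the job.
            fitting = [t for t in instance_types if t[1] >= cpus and t[2] >= mem]
            if not fitting:
                raise ValueError(f"no instance type fits job {name!r}")
            type_name, t_cpus, t_mem = min(fitting, key=lambda t: (t[1], t[2]))
            target = {"type": type_name, "free_cpus": t_cpus, "free_mem": t_mem, "jobs": []}
            instances.append(target)
        target["free_cpus"] -= cpus
        target["free_mem"] -= mem
        target["jobs"].append(name)
    return instances

plan = schedule(
    [("align", 8, 32), ("index", 2, 4), ("qc", 1, 2)],
    [("small", 2, 8), ("large", 8, 32), ("xlarge", 16, 64)],
)
```

Because the queue is processed largest-first, the first instance provisioned is always sized for the highest-priority job, which mirrors the wording of the abstract.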
The invention provides methods for identifying rare variants near a structural variation in a genetic sequence, for example, in a nucleic acid sample taken from a subject. The invention additionally includes methods for aligning reads (e.g., nucleic acid reads) to a reference sequence construct accounting for the structural variation, methods for building a reference sequence construct accounting for the structural variation or the structural variation and the rare variant, and systems that use the alignment methods to identify rare variants. The method is scalable, and can be used to align millions of reads to a construct thousands of bases long, or longer.
G06F 19/22 - for sequence comparison involving nucleotides or amino acids, e.g. homology search, motif or Single-Nucleotide Polymorphism [SNP] discovery or sequence alignment
G16B 30/00 - ICT specially adapted for sequence analysis involving nucleotides or amino acids
G16B 50/00 - ICT programming tools or database systems specially adapted for bioinformatics
41.
Watermarking for data security in bioinformatic sequence analysis
Embodiments of the invention protect information stored in graph-based sequence references by “watermarking” the graph with uniquely identifiable information. The watermark identifies the graph or version thereof in a detectable but nonintrusive manner. In one embodiment, insertions and/or deletions are introduced into regions of the graph.
G16B 5/00 - ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
G16B 30/00 - ICT specially adapted for sequence analysis involving nucleotides or amino acids
G16B 50/00 - ICT programming tools or database systems specially adapted for bioinformatics
Systems and methods for protecting information stored in private references that are available to be queried—e.g., graph-based sequence references that users query through an interface, providing short reads to obtain the results of an alignment against the reference sequence—analyze the query and/or alignment results to determine whether the query represents an attack. The analysis may be performed before returning results to a user, and in some cases before performing the alignment.
A method of aligning a data sequence to one or more reference sequences represented as a sequence variation graph (SVG) is disclosed. The method can comprise receiving one or more alignment candidate regions and corresponding ordered seeding information. For each of the received alignment candidate regions, a current seed is determined, the current seed being a next-in-order unprocessed seed based on the ordered seeding information. Data paths in the alignment candidate region are then traversed to identify potential next seeds relative to the current seed. If at least one potential next seed is found, a next seed is selected and alignment results are generated by applying a local alignment procedure to align query data in portions of the query data sequence between the current seed and the next seed with reference data in portions of the alignment candidate region located between the current seed and the next seed.
G06F 19/22 - for sequence comparison involving nucleotides or amino acids, e.g. homology search, motif or Single-Nucleotide Polymorphism [SNP] discovery or sequence alignment
G06F 19/16 - for molecular structure, e.g. structure alignment, structural or functional relations, protein folding, domain topologies, drug targeting using structure data, involving two-dimensional or three-dimensional structures
The invention provides methods for comparing one set of genetic sequences to another without discarding any information within either set. A set of genetic sequences is represented using a directed acyclic graph (DAG) avoiding any unwarranted reduction to a linear data structure. The invention provides a way to align one sequence DAG to another to produce an alignment that can itself be stored as a DAG. DAG-to-DAG alignment is a natural choice wherever a set of genomic information consisting of more than one string needs to be compared to any non-linear reference. For example, a subpopulation DAG could be compared to a population DAG in order to compare the genetic features of that subpopulation to those of the population.
C12Q 1/6883 - Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
09 - Scientific and electric apparatus and instruments
Goods & Services
Computer software for use in software development, namely, the development, testing, and execution of computational workflows; and computer software for use in storage, analysis, and manipulation of documents, content, media, and data, namely, genetic, genomic, biological, biochemical, biomedical, clinical, scientific, engineering, business, and operations data
Embodiments of the invention utilize a graph-based approach for simulating genomic datasets from large scale populations. Genomic data may be represented as a directed acyclic graph (DAG) that incorporates individual sample data including variant type, position, and zygosity. A simulator may operate on the DAG to generate variant datasets based on probabilistic traversal of the DAG. This probabilistic traversal reflects genomic variant types associated with the subpopulation used to build the DAG, and as a result, the generated variant datasets maintain statistical fidelity to the original sample data.
G16B 45/00 - ICT specially adapted for bioinformatics-related data visualisation, e.g. displaying of maps or networks
G16B 20/00 - ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
G16B 50/00 - ICT programming tools or database systems specially adapted for bioinformatics
G16B 5/00 - ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
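As a rough illustration of the probabilistic traversal described in the simulation abstract above, the sketch below walks a DAG from a start node and picks each outgoing edge with a probability proportional to its weight. The weights stand in for variant frequencies in the source subpopulation; the graph layout and the name `simulate_path` are hypothetical.

```python
import random

def simulate_path(graph, start, rng):
    """Walk a weighted DAG from `start` until a node with no successors is reached.

    graph: dict mapping node -> list of (next_node, weight) pairs.
    rng: a random.Random instance, passed in so runs can be made reproducible.
    """
    path = [start]
    while graph.get(path[-1]):
        successors, weights = zip(*graph[path[-1]])
        # choose the next node in proportion to its edge weight
        path.append(rng.choices(successors, weights=weights, k=1)[0])
    return path

# A single biallelic site: the reference allele is far more common than the SNP.
graph = {
    "start": [("ref", 0.9), ("snp", 0.1)],
    "ref": [("end", 1.0)],
    "snp": [("end", 1.0)],
    "end": [],
}
sample = simulate_path(graph, "start", random.Random(0))
```

Repeating the walk many times yields a simulated cohort whose allele frequencies approach the edge weights, which is the sense in which such a simulator could preserve the statistics of the original data.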
47.
SYSTEMS AND METHODS FOR ALIGNING SEQUENCES TO PERSONALIZED REFERENCES
Techniques for generating a personalized reference sequence construct for an individual to align sequence reads obtained for the individual. The techniques include: obtaining a plurality of sequence reads for an individual; obtaining information identifying a plurality of locations; genotyping the plurality of sequence reads for the plurality of locations to obtain a first set of variants for the individual for at least some of the plurality of locations; identifying a second set of variants associated with the first set of variants; generating a personalized reference sequence construct using the second set of variants; and aligning the plurality of sequence reads to the personalized reference sequence construct.
G06F 19/22 - for sequence comparison involving nucleotides or amino acids, e.g. homology search, motif or Single-Nucleotide Polymorphism [SNP] discovery or sequence alignment
G06F 19/18 - for functional genomics or proteomics, e.g. genotype-phenotype associations, linkage disequilibrium, population genetics, binding site identification, mutagenesis, genotyping or genome annotation, protein-protein interactions or protein-nucleic acid interactions
G06F 19/24 - for machine learning, data mining or biostatistics, e.g. pattern finding, knowledge discovery, rule extraction, correlation, clustering or classification
48.
Systems and methods for aligning sequences to graph references
Various embodiments of the disclosure relate to systems and methods for aligning a sequence read to a graph reference. In one embodiment, the method comprises selecting a first node from a graph reference, the graph reference comprising a plurality of nodes connected by a plurality of directed edges, at least one node of the plurality of nodes having a nucleotide sequence. The method further comprises traversing the graph reference according to a depth-first search, and comparing a sequence read to nucleotide sequences generated from the traversal of the graph reference. The traversal of the graph is then modified in response to a determination that each and every node associated with a given nucleotide sequence was previously evaluated.
G16B 45/00 - ICT specially adapted for bioinformatics-related data visualisation, e.g. displaying of maps or networks
G06K 9/00 - Methods or arrangements for reading or recognising printed or written characters or for recognising patterns, e.g. fingerprints
G06K 9/68 - Methods or arrangements for recognition using electronic means using sequential comparisons of the image signals with a plurality of reference, e.g. addressable memory
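The depth-first comparison of a read against a graph reference, as described in entry 48 above, can be sketched in a simplified form. This version only finds exact matches that begin at the first base of a chosen start node and prunes a branch on the first mismatch; it does not implement the traversal modification for previously evaluated nodes mentioned in the abstract. The names `dfs_match`, `nodes`, and `edges` are illustrative assumptions.

```python
def dfs_match(nodes, edges, start, read):
    """Find node paths from `start` whose concatenated sequence begins with `read`.

    nodes: dict mapping node id -> nucleotide sequence.
    edges: dict mapping node id -> list of successor node ids (directed, acyclic).
    Returns a list of matching paths, each a list of node ids.
    """
    matches = []
    stack = [(start, 0, [start])]  # (node, read bases consumed so far, path)
    while stack:
        node, offset, path = stack.pop()
        seq = nodes[node]
        chunk = read[offset:offset + len(seq)]
        if seq[:len(chunk)] != chunk:
            continue  # mismatch: abandon this branch
        offset += len(chunk)
        if offset == len(read):
            matches.append(path)  # the whole read has been matched
            continue
        for nxt in edges.get(node, []):
            stack.append((nxt, offset, path + [nxt]))
    return matches

# A SNP bubble: node 1 is followed by either T (node 2) or C (node 3), then node 4.
nodes = {1: "ACG", 2: "T", 3: "C", 4: "GGA"}
edges = {1: [2, 3], 2: [4], 3: [4]}
hits = dfs_match(nodes, edges, 1, "ACGTGG")
```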
49.
Systems and methods for sequence encoding, storage, and compression
Genomic data is written to disk in a compact format by dividing the data into segments and encoding each segment with the smallest number of bits per character necessary for whatever alphabet of characters appears in that segment. A computer system dynamically chooses the segment boundaries for maximum space savings. A first one of the segments may use a different number of bits per character than a second one of the segments. In one embodiment, dividing the data into segments comprises scanning the data and keeping track of a number of unique characters, noting positions in the sequence where the number increases to a power of two, calculating a compression that would be obtained by dividing the genomic data into one of the plurality of segments at ones of the noted positions, and dividing the genomic data into the plurality of segments at the positions that yield the best compression.
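A minimal sketch of the segmentation idea in entry 49 follows. It makes several assumptions that are not in the source: each segment carries a fixed-size header (`header_bits`), candidate boundaries are the positions where a newly seen character forces one more bit per character, and only a single split is considered. All function names are hypothetical.

```python
def bits_per_char(alphabet_size):
    """Smallest number of bits that can distinguish `alphabet_size` symbols (at least 1)."""
    return max(1, (alphabet_size - 1).bit_length())

def segment_cost(segment, header_bits):
    """Encoded size in bits of one segment, including its (assumed) header."""
    return header_bits + len(segment) * bits_per_char(len(set(segment)))

def noted_positions(data):
    """Positions where a new character raises the bit width needed so far."""
    seen, positions = set(), []
    for i, ch in enumerate(data):
        if ch in seen:
            continue
        width_before = bits_per_char(max(1, len(seen)))
        seen.add(ch)
        if bits_per_char(len(seen)) > width_before:
            positions.append(i)
    return positions

def best_split(data, header_bits=8):
    """Return (cost_in_bits, split_position); the position is None if no split helps."""
    best = (segment_cost(data, header_bits), None)
    for pos in noted_positions(data):
        cost = segment_cost(data[:pos], header_bits) + segment_cost(data[pos:], header_bits)
        if cost < best[0]:
            best = (cost, pos)
    return best

# 64 bases over {A, C, G, T} followed by a run of 32 unknown bases.
data = "ACGT" * 16 + "N" * 32
cost, position = best_split(data)
```

For this input a single segment needs 3 bits per character, whereas splitting at the start of the `N` run lets the first segment use 2 bits and the second use 1 bit per character.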
Systems and methods for data storage and retrieval for a computer memory include processing a computational workflow having multiple data-processing steps, generating and storing a first hash value associated with a first step of the data-processing steps based on an input to the first step, generating and storing a second hash value associated with a second step of the data-processing steps based on the generated first hash value, and reconstructing a computational state of the workflow based on the second hash value, thereby avoiding re-execution of a portion of the workflow corresponding to the second hash value.
G06F 9/48 - Program initiating; Program switching, e.g. by interrupt
G06Q 10/06 - Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
H04L 9/06 - Arrangements for secret or secure communications; Network security protocols the encryption apparatus using shift registers or memories for blockwise coding, e.g. D.E.S. systems
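The chained-hash scheme in the workflow storage abstract above can be sketched as follows: the first step's hash is derived from the workflow input, each later step's hash is derived from the previous step's hash, and a stored result is reused whenever its hash is already known. The use of SHA-256, the in-memory `cache` dictionary, and the function names are assumptions made for illustration.

```python
import hashlib

def step_hash(step_name, upstream_hash):
    """Hash identifying a step together with everything upstream of it."""
    return hashlib.sha256(f"{step_name}\n{upstream_hash}".encode()).hexdigest()

def run_workflow(steps, workflow_input, cache):
    """Run `steps` in order, skipping any step whose chained hash is in `cache`.

    steps: list of (name, function) pairs, each function taking one value.
    cache: dict mapping chained hash -> stored step output.
    Returns (final_value, names_of_steps_actually_executed).
    """
    upstream = hashlib.sha256(repr(workflow_input).encode()).hexdigest()
    value, executed = workflow_input, []
    for name, func in steps:
        h = step_hash(name, upstream)
        if h in cache:
            value = cache[h]  # restore state instead of recomputing
        else:
            value = func(value)
            cache[h] = value
            executed.append(name)
        upstream = h  # the next step's hash depends on this one
    return value, executed

steps = [("double", lambda x: x * 2), ("inc", lambda x: x + 1)]
cache = {}
first = run_workflow(steps, 3, cache)   # both steps run
second = run_workflow(steps, 3, cache)  # both steps restored from the cache
```

Because each hash folds in the one before it, changing the input or any earlier step changes every downstream hash, so stale results are never reused.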
A method for screening for disease in a genomic sample includes receiving a representation of a reference genome comprising a sequence of symbols. The presence of a predicted mutational event is identified in a location of the reference genome. An alternate path is created in the reference genome representing the predicted mutational event. A plurality of sequence reads are obtained from a genomic sample, wherein at least one sequence read comprises at least a portion of the predicted mutational event. The at least one sequence read is then mapped to the reference genome and a location is determined corresponding to the predicted mutational event. The predicted mutational event is then identified as present in the genomic sample. The method may be used to detect evidence of non-allelic homologous recombination (NAHR) occurring in genomic samples.
G01N 33/48 - Biological material, e.g. blood, urine; Haemocytometers
G16B 20/20 - Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
G16H 50/50 - ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for simulation or modelling of medical disorders
G16B 20/00 - ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
G16B 30/00 - ICT specially adapted for sequence analysis involving nucleotides or amino acids
In one embodiment, a method of processing a computational workflow comprises receiving a description of a computational workflow. The description comprises a plurality of steps, in which each step has at least one input and at least one output, and further wherein an input from a second step depends on an output from a first step. The description is translated into a static workflow graph stored in a memory, the static workflow graph comprising a plurality of nodes having input ports and output ports, wherein dependencies between inputs and outputs are specified as edges between input ports and output ports. Information about a first set of nodes is then extracted from the static workflow graph and placed into a dynamic graph. A first actionable job is identified from the dynamic graph and executed.
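The idea of identifying actionable jobs from step dependencies, as in the workflow-processing abstract above, might be sketched in a much reduced form. Ports and the separate dynamic graph are omitted here: the workflow is just a mapping from each step to the steps whose outputs it consumes, and a step is actionable once all of those have finished. The names `ready_jobs` and `execution_order` are hypothetical.

```python
def ready_jobs(dependencies, done):
    """Steps that have not run yet and whose upstream steps are all done."""
    return sorted(
        step for step, upstream in dependencies.items()
        if step not in done and upstream <= done
    )

def execution_order(dependencies):
    """Repeatedly pick actionable steps until the whole workflow has run."""
    done, order = set(), []
    while len(done) < len(dependencies):
        ready = ready_jobs(dependencies, done)
        if not ready:
            raise ValueError("no actionable job: the workflow has a cycle")
        for step in ready:
            order.append(step)  # a real engine would execute the step here
            done.add(step)
    return order

# step -> set of steps it depends on
workflow = {"align": set(), "sort": {"align"}, "call": {"sort"}, "qc": {"align"}}
order = execution_order(workflow)
```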
Computer-implemented methods and systems for performing a local assembly of a genomic region of interest include the de novo or assisted creation of a directed graph, such as a directed acyclic graph (DAG), from a plurality of obtained nucleotide sequence reads. First and second sequence reads are aligned to each other to define at least one node of the DAG. Successive alignments of the remaining sequence reads to the then-defined DAG are performed to extend nodes and/or add nodes to the DAG. Graph-aware alignment techniques that produce alignment scores or indicators are employed in defining the nodes of the DAG from the sequence reads. The created DAG represents and describes in detail the genomic region of interest and can be used to perform variant calls.
Techniques for identifying variations in sequence data relative to reference sequence data. The techniques include accessing information specifying multiple sets of variants in the sequence data relative to reference sequence data, each of the multiple sets of variants being generated by using a respective variant identification technique; and determining, using the information specifying the multiple sets of variants in the sequence data, a reconciled set of variants in the sequence data relative to the reference sequence data, the determining comprising: determining whether a first variant is present at a first position in the sequence data based, at least in part, on one or more variants at one or more other positions in the sequence data.
A method for stream-processing biomedical data includes receiving, by a file system on a computing device, a first request for access to at least a first portion of a file stored on a remotely located storage device. The method includes receiving, by the file system, a second request for access to at least a second portion of the file. The method includes determining, by a pre-fetching component executing on the computing device, whether the first request and the second request are associated with a sequential read operation. The method includes automatically retrieving, by the pre-fetching component, a third portion of the requested file, before receiving a third request for access to at least the third portion of the file, based on a determination that the first request and the second request are associated with the sequential read operation.
G06F 12/0862 - Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches with prefetch
G16B 50/00 - ICT programming tools or database systems specially adapted for bioinformatics
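The sequential-read detection and pre-fetching described in the stream-processing abstract above can be sketched with fixed-size chunks. The sketch assumes a caller-supplied `fetch_chunk` function that stands in for the remote storage request, fetches synchronously rather than in the background, and does not handle the end of the file; the class name is hypothetical.

```python
class PrefetchingReader:
    """Serve chunk reads and fetch the next chunk early when access looks sequential."""

    def __init__(self, fetch_chunk):
        self.fetch_chunk = fetch_chunk  # callable: chunk index -> data (remote fetch)
        self.cache = {}                 # pre-fetched chunks keyed by index
        self.last_index = None          # index of the previous request

    def read_chunk(self, index):
        data = self.cache.pop(index, None)
        if data is None:
            data = self.fetch_chunk(index)  # not pre-fetched: go to remote storage
        # Two consecutive indices are treated as a sequential read operation,
        # so the following chunk is retrieved before it is requested.
        if self.last_index is not None and index == self.last_index + 1:
            self.cache[index + 1] = self.fetch_chunk(index + 1)
        self.last_index = index
        return data

remote_requests = []

def fetch(index):
    remote_requests.append(index)  # record every trip to remote storage
    return f"chunk{index}"

reader = PrefetchingReader(fetch)
reader.read_chunk(0)
reader.read_chunk(1)          # second sequential request triggers a pre-fetch of chunk 2
third = reader.read_chunk(2)  # served from the pre-fetch cache
```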
A method for stream-processing biomedical data includes receiving, by a file system on a computing device, a first request for access to at least a first portion of a file stored on a remotely located storage device. The method includes receiving, by the file system, a second request for access to at least a second portion of the file. The method includes determining, by a pre-fetching component executing on the computing device, whether the first request and the second request are associated with a sequential read operation. The method includes automatically retrieving, by the pre-fetching component, a third portion of the requested file, before receiving a third request for access to at least the third portion of the file, based on a determination that the first request and the second request are associated with the sequential read operation.
Genomic references are structured as a reference graph that represents diploid genotypes in organisms. A path through a series of connected nodes and edges represents a genetic sequence. Genetic variation within a diploid organism is represented by multiple paths through the reference graph. The graph may be transformed into a traversal graph in which a path represents a diploid genotype. Genetic analysis using the traversal graph allows an organism's diploid genotype to be elucidated, e.g., by mapping sequence reads to the reference graph and scoring paths in the traversal graph based on the mapping to determine the path through the traversal graph that best fits the sequence reads.
G06F 19/18 - for functional genomics or proteomics, e.g. genotype-phenotype associations, linkage disequilibrium, population genetics, binding site identification, mutagenesis, genotyping or genome annotation, protein-protein interactions or protein-nucleic acid interactions
58.
Systems and methods for genotyping with graph reference
Genomic references are structured as a reference graph that represents diploid genotypes in organisms. A path through a series of connected nodes and edges represents a genetic sequence. Genetic variation within a diploid organism is represented by multiple paths through the reference graph. The graph may be transformed into a traversal graph in which a path represents a diploid genotype. Genetic analysis using the traversal graph allows an organism's diploid genotype to be elucidated, e.g., by mapping sequence reads to the reference graph and scoring paths in the traversal graph based on the mapping to determine the path through the traversal graph that best fits the sequence reads.
G01N 33/48 - Biological material, e.g. blood, urine; Haemocytometers
G01N 33/50 - Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing
G06F 19/18 - for functional genomics or proteomics, e.g. genotype-phenotype associations, linkage disequilibrium, population genetics, binding site identification, mutagenesis, genotyping or genome annotation, protein-protein interactions or protein-nucleic acid interactions
G06F 19/26 - for data visualisation, e.g. graphics generation, display of maps or networks or other visual representations
59.
SYSTEMS AND METHODS FOR ENCODING GENETIC VARIATION FOR A POPULATION
In one embodiment, a method of encoding variation data for a population comprises receiving, by a variant encoding engine executing on a processor, information describing genetic variation of a population of individuals. The information comprises a plurality of variable sites within the reference genome of the population and the genotypes of a plurality of individuals in the population with respect to those variable sites. The method further comprises selecting an encoding strategy for the information based on the characteristics of the genetic variation across the population, and encoding the information according to the selected encoding strategy. In certain embodiments, selecting an encoding strategy may comprise determining the variability of a variable site within the population, and encoding information associated with the variable site based on the variability.
G06F 19/18 - for functional genomics or proteomics, e.g. genotype-phenotype associations, linkage disequilibrium, population genetics, binding site identification, mutagenesis, genotyping or genome annotation, protein-protein interactions or protein-nucleic acid interactions
G06F 19/28 - for programming tools or database systems, e.g. ontologies, heterogeneous data integration, data warehousing or computing architectures
60.
Systems and methods for encoding genetic variation for a population
In one embodiment, a method of encoding variation data for a population comprises receiving, by a variant encoding engine executing on a processor, information describing genetic variation of a population of individuals. The information comprises a plurality of variable sites within the reference genome of the population and the genotypes of a plurality of individuals in the population with respect to those variable sites. The method further comprises selecting an encoding strategy for the information based on the characteristics of the genetic variation across the population, and encoding the information according to the selected encoding strategy. In certain embodiments, selecting an encoding strategy may comprise determining the variability of a variable site within the population, and encoding information associated with the variable site based on the variability.
G16B 35/00 - ICT specially adapted for in silico combinatorial libraries of nucleic acids, proteins or peptides
H03M 7/30 - Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
G16H 10/60 - ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
G16B 20/00 - ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
A method for generating a query of a genomic data store includes receiving, by a query generator executing on a computing device, from a graphical user interface, an identification of a first entity of a first entity class for inclusion in a resource description framework (RDF) query. The method includes receiving from the graphical user interface, an identification of a second entity of the first entity class, the second entity having a bi-directional relationship with the first entity. The method includes automatically generating an RDF query based upon the received identification of the first entity and the received identification of the second entity. The method includes executing the RDF query to select, from a plurality of genomic data sets, at least one genomic data set for at least one patient cohort. The method includes providing a listing of genomic data sets resulting from executing the RDF query.
The invention provides oncogenomic methods for detecting tumors by identifying circulating tumor DNA. A patient-specific reference directed acyclic graph (DAG) represents known human genomic sequences and non-tumor DNA from the patient as well as known tumor-associated mutations. Sequence reads from cell-free plasma DNA from the patient are mapped to the patient-specific genomic reference graph. Any of the known tumor-associated mutations found in the reads and any de novo mutations found in the reads are reported as the patient's tumor mutation burden.
C12Q 1/68 - Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
G01N 33/50 - Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing
G01N 33/574 - Immunoassay; Biospecific binding assay; Materials therefor for cancer
G06F 19/22 - for sequence comparison involving nucleotides or amino acids, e.g. homology search, motif or Single-Nucleotide Polymorphism [SNP] discovery or sequence alignment
63.
SYSTEMS AND METHODS FOR ADAPTIVE LOCAL ALIGNMENT FOR GRAPH GENOMES
Systems and methods for analyzing genomic information can include obtaining a sequence read including genetic information; identifying, within a graph representing a reference genome, a plurality of candidate mapping positions that relate to the genetic information, the graph comprising nodes representing genetic sequences and edges connecting pairs of nodes; determining, by means of a computer system, whether an alignment with the graph surrounding each of the plurality of candidate mapping positions is advanced or basic; and performing for each candidate mapping position, by means of the computer system, a local alignment based on whether the local alignment is advanced or basic. The advanced local alignment can include a first-local-alignment algorithm, and the basic local alignment includes a second-local-alignment algorithm. Based on the local alignments, the mapped position of the sequence read can be identified within the genome.
G06F 19/22 - for sequence comparison involving nucleotides or amino acids, e.g. homology search, motif or Single-Nucleotide Polymorphism [SNP] discovery or sequence alignment
Methods of the invention include representing biological data in a memory subsystem within a computer system with a data structure that is particular to a location in the memory subsystem and serializing the data structure into a stream of bytes that can be deserialized into a clone of the data structure. In a preferred genomic embodiment, the biological data comprises genomic sequences and the data structure comprises a genomic directed acyclic graph (DAG) in which objects have adjacency lists of pointers that indicate the location of any object adjacent to that object. After serialization and deserialization, the clone genomic DAG has the same structure as the original to represent the same sequences and relationships among them as the original.
Methods of the invention include representing biological data in a memory subsystem within a computer system with a data structure that is particular to a location in the memory subsystem and serializing the data structure into a stream of bytes that can be deserialized into a clone of the data structure. In a preferred genomic embodiment, the biological data comprises genomic sequences and the data structure comprises a genomic directed acyclic graph (DAG) in which objects have adjacency lists of pointers that indicate the location of any object adjacent to that object. After serialization and deserialization, the clone genomic DAG has the same structure as the original to represent the same sequences and relationships among them as the original.
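The serialization round trip described above can be illustrated with a small sketch. In-memory references between node objects play the role of location-specific pointers; serialization replaces them with position-independent indices, and deserialization rebuilds a clone with fresh references. JSON is used as the byte format purely for convenience, and the `Node` class and function names are assumptions.

```python
import json

class Node:
    """A graph node holding a sequence and an adjacency list of direct references."""

    def __init__(self, seq):
        self.seq = seq
        self.next = []  # references to successor Node objects

def serialize(nodes):
    """Turn a list of linked nodes into bytes, replacing references with indices."""
    index = {id(n): i for i, n in enumerate(nodes)}
    payload = [{"seq": n.seq, "next": [index[id(m)] for m in n.next]} for n in nodes]
    return json.dumps(payload).encode("utf-8")

def deserialize(blob):
    """Rebuild a clone of the graph: same sequences and edges, new objects."""
    payload = json.loads(blob.decode("utf-8"))
    clones = [Node(p["seq"]) for p in payload]
    for clone, p in zip(clones, payload):
        clone.next = [clones[i] for i in p["next"]]
    return clones

# A small bubble: a -> (b | c) -> d
a, b, c, d = Node("ACG"), Node("T"), Node("C"), Node("GA")
a.next, b.next, c.next = [b, c], [d], [d]
clone = deserialize(serialize([a, b, c, d]))
```

The clone shares no objects with the original, yet both branches of the bubble still converge on a single shared node, so the structure and the sequences it represents are preserved.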
The invention provides systems and methods for determining patterns of modification to a genome of a subject by representing the genome using a graph, such as a directed acyclic graph (DAG) with divergent paths for regions that are potentially subject to modification, profiling segments of the genome for evidence of epigenetic modification, and aligning the profiled segments to the DAG to determine locations and patterns of the epigenetic modification within the genome.
G01N 33/48 - Biological material, e.g. blood, urine; Haemocytometers
G01N 31/00 - Investigating or analysing non-biological materials by the use of the chemical methods specified in the subgroups; Apparatus specially adapted for such methods
C12Q 1/6806 - Preparing nucleic acids for analysis, e.g. for polymerase chain reaction [PCR] assay
C12Q 1/6874 - Methods for sequencing involving nucleic acid arrays, e.g. sequencing by hybridisation [SBH]
G16B 30/00 - ICT specially adapted for sequence analysis involving nucleotides or amino acids
The invention provides methods of analyzing an individual's mtDNA by transforming available reference sequences into a directed graph that compactly represents all the information without duplication and comparing sequence reads from the mtDNA to the graph to identify the individual or describe their mtDNA. A directed graph can represent all of the genetic variation found among the mitochondrial genomes across all of a number of reference organisms while providing a single article to which sequence reads can be aligned or compared. Thus any sequence read or other sequence fragment can be compared, in a single operation, to the article that represents all of the reference mitochondrial sequences.
The invention provides systems and methods for analyzing viruses by representing viral genetic diversity with a directed acyclic graph (DAG), which allows genetic sequencing technology to detect rare variations and represent otherwise difficult-to-document diversity within a sample. Additionally, a host-specific sequence DAG can be used to effectively segregate viral nucleic acid sequence reads from host sequence reads when a sample from a host is subject to sequencing. Known viral genomes can be represented using a viral reference DAG and the viral sequence reads from the sample can be compared to the viral DAG to identify viral species or strains from which the reads were derived. Where the viral sequence reads indicate great genetic diversity in the virus that was infecting the host, those reads can be assembled into a DAG that itself properly represents that diversity.
G01N 33/48 - Biological material, e.g. blood, urine; Haemocytometers
C12Q 1/70 - Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving virus or bacteriophage
G16B 30/00 - ICT specially adapted for sequence analysis involving nucleotides or amino acids
C12Q 1/6809 - Methods for determination or identification of nucleic acids involving differential detection
69.
SYSTEMS AND METHODS FOR IDENTIFYING MICROORGANISMS
The invention provides methods for identifying a microorganism by aligning sequence reads to a graph, such as a directed acyclic graph (DAG), that contains condensed sequence information of a conserved region from multiple known microorganisms. The DAG can be constructed by obtaining sequence information of known reference microorganisms. The DAG also includes the identities of the known microorganisms that correspond to particular paths. Sequence reads obtained from an unknown sample can thus be aligned to paths in the DAG using an alignment algorithm, and the identity of a microorganism in the sample can be determined based on the path in the DAG to which the sequence reads align best.
C12Q 1/68 - Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
G06F 19/10 - Bioinformatics, i.e. methods or systems for genetic or protein-related data processing in computational molecular biology (in silico methods of screening virtual chemical libraries C40B 30/02; in silico or mathematical methods of creating virtual chemical libraries C40B 50/02)
G06F 19/22 - for sequence comparison involving nucleotides or amino acids, e.g. homology search, motif or Single-Nucleotide Polymorphism [SNP] discovery or sequence alignment
G06F 19/28 - for programming tools or database systems, e.g. ontologies, heterogeneous data integration, data warehousing or computing architectures
The invention relates to methods for determining a haplotype for an organism by using a system for transforming SNP alleles found in sequence fragments into vertices in a graph with edges connecting vertices for alleles that appear together in a sequence fragment. A community detection operation is used to infer the haplotype from the graph. The system may produce a report that includes the haplotype of the SNPs found in the genome of that organism.
C12Q 1/68 - Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
G06F 19/18 - for functional genomics or proteomics, e.g. genotype-phenotype associations, linkage disequilibrium, population genetics, binding site identification, mutagenesis, genotyping or genome annotation, protein-protein interactions or protein-nucleic acid interactions
G06F 19/24 - for machine learning, data mining or biostatistics, e.g. pattern finding, knowledge discovery, rule extraction, correlation, clustering or classification
The invention relates to methods for determining a haplotype for an organism by using a system for transforming SNP alleles found in sequence fragments into vertices in a graph with edges connecting vertices for alleles that appear together in a sequence fragment. A community detection operation can be used to infer the haplotype from the graph. The system may produce a report that includes the haplotype of the SNPs found in the genome of that organism.
G01N 33/48 - Biological material, e.g. blood, urine; Haemocytometers
G01N 33/50 - Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing
G06F 19/18 - for functional genomics or proteomics, e.g. genotype-phenotype associations, linkage disequilibrium, population genetics, binding site identification, mutagenesis, genotyping or genome annotation, protein-protein interactions or protein-nucleic acid interactions
C12Q 1/6827 - Hybridisation assays for detection of mutation or polymorphism
G06F 19/24 - for machine learning, data mining or biostatistics, e.g. pattern finding, knowledge discovery, rule extraction, correlation, clustering or classification
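The graph construction described in the haplotype abstracts above can be illustrated with a minimal Python sketch. It is not the patented method: the fragment data are invented, the function names `build_allele_graph` and `communities` are hypothetical, and the community detection step is replaced by a much simpler stand-in (connected components over edges seen at least `min_support` times).

```python
from collections import defaultdict
from itertools import combinations

def build_allele_graph(fragments):
    """Count how often two SNP alleles appear together in one fragment.

    Each fragment maps a SNP position to the base observed there; each
    (position, base) pair is a vertex, and co-occurrence adds edge weight.
    """
    weights = defaultdict(int)
    for fragment in fragments:
        for a, b in combinations(sorted(fragment.items()), 2):
            weights[(a, b)] += 1
    return weights

def communities(weights, min_support=2):
    """Stand-in for community detection: connected components over
    edges whose weight is at least min_support (an assumed threshold)."""
    adjacency = defaultdict(set)
    for (a, b), w in weights.items():
        if w >= min_support:
            adjacency[a].add(b)
            adjacency[b].add(a)
    seen, groups = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            v = stack.pop()
            if v in seen:
                continue
            seen.add(v)
            group.add(v)
            stack.extend(adjacency[v] - seen)
        groups.append(sorted(group))
    return groups

# Invented fragments covering three SNP positions on two haplotypes.
fragments = [
    {100: "A", 250: "G"}, {100: "A", 250: "G"},
    {250: "G", 400: "T"}, {250: "G", 400: "T"},
    {100: "C", 250: "T"}, {100: "C", 250: "T"},
    {250: "T", 400: "C"}, {250: "T", 400: "C"},
    {100: "A", 250: "T"},  # a single conflicting fragment, below min_support
]
haplotypes = communities(build_allele_graph(fragments))
```

With this toy input the two groups correspond to the two phased allele sets. A real community detection operation would weigh conflicting evidence rather than simply discarding weak edges.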
72.
Methods and systems for detecting sequence variants
The invention provides methods for identifying rare variants near a structural variation in a genetic sequence, for example, in a nucleic acid sample taken from a subject. The invention additionally includes methods for aligning reads (e.g., nucleic acid reads) to a reference sequence construct accounting for the structural variation, methods for building a reference sequence construct accounting for the structural variation or the structural variation and the rare variant, and systems that use the alignment methods to identify rare variants. The method is scalable, and can be used to align millions of reads to a construct thousands of bases long, or longer.
G06F 19/22 - for sequence comparison involving nucleotides or amino acids, e.g. homology search, motif or Single-Nucleotide Polymorphism [SNP] discovery or sequence alignment
G06F 19/28 - for programming tools or database systems, e.g. ontologies, heterogeneous data integration, data warehousing or computing architectures
The invention provides methods for analyzing sequence data in which a large amount and variety of reference data are efficiently modeled as a reference graph, such as a directed acyclic graph (DAG). The method includes determining positions of k-mers within a reference graph that represents a genomic sequence and known variation, storing the positions of each k-mer in a table entry indexed by a hash of that k-mer, and identifying a region within the reference graph that includes a threshold number of the k-mers by reading from the table entries indexed by hashes of substrings of a subject sequence. The subject sequence may subsequently be mapped to the candidate region.
The invention provides methods for analyzing sequence data in which a large amount and variety of reference data are efficiently modeled as a reference graph, such as a directed acyclic graph (DAG). The method includes determining positions of k-mers within a reference graph that represents a genomic sequence and known variation, storing the positions of each k-mer in a table entry indexed by a hash of that k-mer, and identifying a region within the reference graph that includes a threshold number of the k-mers by reading from the table entries indexed by hashes of substrings of a subject sequence. The subject sequence may subsequently be mapped to the candidate region.
G01N 33/48 - Biological material, e.g. blood, urine; Haemocytometers
G01N 33/50 - Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing
G06F 19/16 - for molecular structure, e.g. structure alignment, structural or functional relations, protein folding, domain topologies, drug targeting using structure data, involving two-dimensional or three-dimensional structures
G06F 19/22 - for sequence comparison involving nucleotides or amino acids, e.g. homology search, motif or Single-Nucleotide Polymorphism [SNP] discovery or sequence alignment
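The k-mer indexing steps in the two abstracts above can be sketched as follows. This is an illustrative simplification under stated assumptions: the reference graph is reduced to a dictionary of node sequences, only k-mers lying entirely inside one node are indexed, a "region" is a single node, and the names `build_index` and `candidate_nodes` are hypothetical.

```python
from collections import defaultdict

def kmers(seq, k):
    """Yield (offset, k-mer) for every length-k substring of seq."""
    for i in range(len(seq) - k + 1):
        yield i, seq[i:i + k]

def build_index(nodes, k):
    """Store the positions of each k-mer in a table entry keyed by its hash.

    The k-mer itself is kept alongside each position so that hash
    collisions can be rejected at lookup time.
    """
    index = defaultdict(list)
    for node_id, seq in nodes.items():
        for offset, kmer in kmers(seq, k):
            index[hash(kmer)].append((kmer, node_id, offset))
    return index

def candidate_nodes(index, read, k, threshold):
    """Return the nodes hit by at least `threshold` k-mers of the read."""
    hits = defaultdict(int)
    for _, kmer in kmers(read, k):
        for stored, node_id, _ in index.get(hash(kmer), ()):
            if stored == kmer:
                hits[node_id] += 1
    return {node for node, count in hits.items() if count >= threshold}

# Invented node sequences standing in for a reference graph with variation.
nodes = {"ref1": "ACGTACGGA", "alt_T": "TTGACCA", "ref2": "GGATCCAT"}
index = build_index(nodes, k=4)
regions = candidate_nodes(index, "GTACGGA", k=4, threshold=3)
```

Here all four 4-mers of the read fall in `ref1`, so `ref1` is the only candidate region. A full implementation would also index k-mers that span edges between nodes.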
75.
SYSTEMS AND METHODS FOR SMART TOOLS IN SEQUENCE PIPELINES
The invention relates to bioinformatics pipelines and wrapper scripts that call executables in those pipelines and that also identify beneficial changes to the pipelines. A tool in a pipeline has a smart wrapper that can cause the tool to analyze the sequence data it receives but that can also select a change to the pipeline when circumstances warrant. In certain aspects, the invention provides a system for genomic analysis. The system includes a processor coupled to a non-transitory memory. The system is operable to present to a user a plurality of genomic tools organized into a pipeline. At least a first one of the tools comprises an executable and a wrapper script. The system can receive, from the user, sequence data and instructions that call for the sequence data to be analyzed by the pipeline, and can select, using the wrapper script, a change to the pipeline.
G06F 19/22 - for sequence comparison involving nucleotides or amino acids, e.g. homology search, motif or Single-Nucleotide Polymorphism [SNP] discovery or sequence alignment
76.
SYSTEMS AND METHODS FOR SMART TOOLS IN SEQUENCE PIPELINES
The invention relates to bioinformatics pipelines and wrapper scripts that call executables in those pipelines and that also identify beneficial changes to the pipelines. A tool in a pipeline has a smart wrapper that can cause the tool to analyze the sequence data it receives but that can also select a change to the pipeline when circumstances warrant. In certain aspects, the invention provides a system for genomic analysis. The system includes a processor coupled to a non-transitory memory. The system is operable to present to a user a plurality of genomic tools organized into a pipeline. At least a first one of the tools comprises an executable and a wrapper script. The system can receive, from the user, sequence data and instructions that call for the sequence data to be analyzed by the pipeline, and can select, using the wrapper script, a change to the pipeline.
G16B 50/00 - ICT programming tools or database systems specially adapted for bioinformatics
G16B 20/00 - ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
G16B 30/00 - ICT specially adapted for sequence analysis involving nucleotides or amino acids
G16B 40/00 - ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
C12Q 1/68 - Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
77.
Systems and methods for smart tools in sequence pipelines
A tool in a bioinformatics pipeline can include a smart wrapper and an executable. The smart wrapper can cause the executable to analyze the sequence data it receives and can also selectively make a change to the pipeline when circumstances warrant. In certain aspects, a system for genomic analysis includes a processor coupled to a non-transitory memory. The system is operable to present to a user a plurality of genomic tools organized into a pipeline. At least a first one of the tools comprises an executable and a wrapper script. The system can receive, from the user, sequence data and instructions that call for the sequence data to be analyzed by the pipeline, and can select, using the wrapper script, a change to the pipeline.
G06F 9/44 - Arrangements for executing specific programs
G06F 19/18 - for functional genomics or proteomics, e.g. genotype-phenotype associations, linkage disequilibrium, population genetics, binding site identification, mutagenesis, genotyping or genome annotation, protein-protein interactions or protein-nucleic acid interactions
G06F 19/28 - for programming tools or database systems, e.g. ontologies, heterogeneous data integration, data warehousing or computing architectures
78.
VARIANT-CALLING DATA FROM AMPLICON-BASED SEQUENCING METHODS
The invention provides systems and methods for calling variants in data from amplicon-based sequencing methods by aligning and assembling reads, associating the reads with their source amplicons, treating each amplicon as a separate sample or file, and calling variants on the reads. A portion of each read is aligned to the primer binding site of the associated amplicon. Variants called at sites in the mapped portions of each read are discarded. The remaining variant calls are merged to provide a set of variant calls across the original target region.
G06F 19/22 - for sequence comparison involving nucleotides or amino acids, e.g. homology search, motif or Single-Nucleotide Polymorphism [SNP] discovery or sequence alignment
G06F 19/10 - Bioinformatics, i.e. methods or systems for genetic or protein-related data processing in computational molecular biology (in silico methods of screening virtual chemical libraries C40B 30/02;in silico or mathematical methods of creating virtual chemical libraries C40B 50/02)
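The discard-and-merge step described in the amplicon abstract above might look like the following Python sketch. The coordinates, the data layout, and the name `filter_primer_variants` are assumptions made for illustration. Variant calls are grouped by source amplicon and primer binding sites are given as half-open intervals.

```python
def filter_primer_variants(calls, amplicons):
    """Drop calls that fall in a primer binding site of their own
    amplicon, then merge what remains across all amplicons."""
    merged = set()
    for amplicon_id, variants in calls.items():
        primers = amplicons[amplicon_id]["primers"]  # [(start, end), ...]
        for pos, ref, alt in variants:
            in_primer = any(start <= pos < end for start, end in primers)
            if not in_primer:
                merged.add((pos, ref, alt))
    return sorted(merged)

# Two overlapping amplicons; all positions and alleles are invented.
amplicons = {
    "amp1": {"primers": [(100, 120), (280, 300)]},
    "amp2": {"primers": [(250, 270), (430, 450)]},
}
calls = {
    "amp1": [(110, "A", "G"), (200, "C", "T"), (260, "G", "A")],
    "amp2": [(260, "G", "A"), (290, "T", "C"), (400, "A", "C")],
}
variants = filter_primer_variants(calls, amplicons)
```

The call at position 260 is discarded for `amp2`, where it lies under a primer, but survives through `amp1`, whose reads cover that site with genuine sample sequence. Treating each amplicon separately makes this possible.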
79.
Methods and systems for detecting sequence variants
The invention provides methods for identifying rare variants near a structural variation in a genetic sequence, for example, in a nucleic acid sample taken from a subject. The invention additionally includes methods for aligning reads (e.g., nucleic acid reads) to a reference sequence construct accounting for the structural variation, methods for building a reference sequence construct accounting for the structural variation or the structural variation and the rare variant, and systems that use the alignment methods to identify rare variants. The method is scalable, and can be used to align millions of reads to a construct thousands of bases long, or longer.
G06F 17/00 - Digital computing or data processing equipment or methods, specially adapted for specific functions
G06F 7/00 - Methods or arrangements for processing data by operating upon the order or content of the data handled
G06F 19/22 - for sequence comparison involving nucleotides or amino acids, e.g. homology search, motif or Single-Nucleotide Polymorphism [SNP] discovery or sequence alignment
G06F 19/28 - for programming tools or database systems, e.g. ontologies, heterogeneous data integration, data warehousing or computing architectures
80.
Systems and methods for using paired-end data in directed acyclic structure
Methods of analyzing a transcriptome involve obtaining at least one pair of paired-end reads from a transcriptome from an organism, finding an alignment with an optimal score between a first read of the pair and a node in a directed acyclic data structure (the data structure has nodes representing RNA sequences such as exons or transcripts and edges connecting pairs of nodes), identifying candidate paths that include the node connected to a downstream node by a path having a length substantially similar to an insert length of the pair of paired-end reads, and aligning the paired-end reads to the candidate paths to determine an optimal-scoring alignment.
G06F 19/10 - Bioinformatics, i.e. methods or systems for genetic or protein-related data processing in computational molecular biology (in silico methods of screening virtual chemical libraries C40B 30/02;in silico or mathematical methods of creating virtual chemical libraries C40B 50/02)
G06F 19/22 - for sequence comparison involving nucleotides or amino acids, e.g. homology search, motif or Single-Nucleotide Polymorphism [SNP] discovery or sequence alignment
G06F 19/28 - for programming tools or database systems, e.g. ontologies, heterogeneous data integration, data warehousing or computing architectures
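The candidate-path step in the paired-end abstract above can be sketched in Python. This is a rough illustration rather than the claimed method. The span of a path is approximated as the summed length of its nodes, `tolerance` is an assumed stand-in for "substantially similar", and the exon lengths, edges, and the name `candidate_paths` are invented.

```python
def candidate_paths(lengths, edges, start, insert_len, tolerance):
    """Enumerate paths from `start` to a downstream node whose span is
    within `tolerance` of the insert length."""
    results = []

    def walk(path, span):
        if len(path) > 1 and abs(span - insert_len) <= tolerance:
            results.append(list(path))
        if span > insert_len + tolerance:
            return  # extending the path can only make it longer
        for nxt in edges.get(path[-1], ()):
            path.append(nxt)
            walk(path, span + lengths[nxt])
            path.pop()

    walk([start], lengths[start])
    return results

# Toy directed acyclic structure of exons (lengths in bases).
lengths = {"e1": 120, "e2": 80, "e3": 200, "e4": 150}
edges = {"e1": ["e2", "e3"], "e2": ["e3", "e4"], "e3": ["e4"]}
paths = candidate_paths(lengths, edges, "e1", insert_len=350, tolerance=40)
```

Here `e1` plays the role of the node to which the first read aligned best. Only the paths returned would then be passed to the more expensive paired-end alignment.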
The invention generally provides systems and methods for analysis of RNA-Seq reads in which an annotated reference is represented as a directed acyclic graph (DAG) or similar data structure. Features such as exons and introns from the reference provide nodes in the DAG and those features are linked as pairs in their canonical genomic order by edges. The DAG can scale to any size and can in fact be populated in the first instance by import from an extrinsic annotated reference.
G06F 19/22 - for sequence comparison involving nucleotides or amino acids, e.g. homology search, motif or Single-Nucleotide Polymorphism [SNP] discovery or sequence alignment
G06F 19/28 - for programming tools or database systems, e.g. ontologies, heterogeneous data integration, data warehousing or computing architectures
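One way to picture the construction in the RNA-Seq abstract above is the following Python sketch, which populates a DAG from an annotated feature list. The feature names and coordinates are invented, `build_feature_dag` is a hypothetical name, and linking every non-overlapping pair in genomic order is one reading of "linked as pairs in their canonical genomic order".

```python
from itertools import combinations

def build_feature_dag(features):
    """Create one node per feature and an edge for every pair of
    non-overlapping features, directed in 5'-to-3' genomic order.

    Each feature is (name, start, end) with half-open coordinates.
    """
    ordered = sorted(features, key=lambda f: (f[1], f[2]))
    edges = {name: [] for name, _, _ in ordered}
    for (a, _, a_end), (b, b_start, _) in combinations(ordered, 2):
        if a_end <= b_start:
            edges[a].append(b)
    return edges

# Features as they might be imported from an annotated reference.
features = [
    ("exon2", 300, 380),
    ("exon1", 100, 200),
    ("intron1", 200, 300),
    ("exon3", 500, 620),
]
dag = build_feature_dag(features)
```

Because edges always point from a lower to a higher coordinate, the result is acyclic by construction. An edge such as `exon1` to `exon3` represents an isoform that skips `exon2`.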
The invention provides methods for comparing one set of genetic sequences to another without discarding any information within either set. A set of genetic sequences is represented using a directed acyclic graph (DAG) avoiding any unwarranted reduction to a linear data structure. The invention provides a way to align one sequence DAG to another to produce an alignment that can itself be stored as a DAG. DAG-to-DAG alignment is a natural choice wherever a set of genomic information consisting of more than one string needs to be compared to any non-linear reference. For example, a subpopulation DAG could be compared to a population DAG in order to compare the genetic features of that subpopulation to those of the population.
C12Q 1/68 - Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
G06F 19/22 - for sequence comparison involving nucleotides or amino acids, e.g. homology search, motif or Single-Nucleotide Polymorphism [SNP] discovery or sequence alignment
G06F 19/10 - Bioinformatics, i.e. methods or systems for genetic or protein-related data processing in computational molecular biology (in silico methods of screening virtual chemical libraries C40B 30/02;in silico or mathematical methods of creating virtual chemical libraries C40B 50/02)
G06F 17/30 - Information retrieval; Database structures therefor
The invention provides methods for comparing one set of genetic sequences to another without discarding any information within either set. A set of genetic sequences is represented using a directed acyclic graph (DAG) avoiding any unwarranted reduction to a linear data structure. The invention provides a way to align one sequence DAG to another to produce an alignment that can itself be stored as a DAG. DAG-to-DAG alignment is a natural choice wherever a set of genomic information consisting of more than one string needs to be compared to any non-linear reference. For example, a subpopulation DAG could be compared to a population DAG in order to compare the genetic features of that subpopulation to those of the population.
G01N 33/48 - Biological material, e.g. blood, urine; Haemocytometers
G06F 19/22 - for sequence comparison involving nucleotides or amino acids, e.g. homology search, motif or Single-Nucleotide Polymorphism [SNP] discovery or sequence alignment
G06F 19/26 - for data visualisation, e.g. graphics generation, display of maps or networks or other visual representations
C12Q 1/68 - Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
84.
Methods and systems for genotyping genetic samples
The invention provides methods and systems for making specific base calls at specific loci using a reference sequence construct, e.g., a directed acyclic graph (DAG) that represents known variants at each locus of the genome. Because the sequence reads are aligned directly to the DAG, the subsequent step of comparing a mutation, vis-à-vis the reference genome, to a table of known mutations can be eliminated. The disclosed methods and systems are notably efficient in dealing with structural variations within a genome or mutations that are within a structural variation.
G06F 19/22 - for sequence comparison involving nucleotides or amino acids, e.g. homology search, motif or Single-Nucleotide Polymorphism [SNP] discovery or sequence alignment
The invention includes methods and systems for identifying disease-induced mutations by producing multi-dimensional reference sequence constructs that account for variations between individuals, different diseases, and different stages of those diseases. Once constructed, these reference sequence constructs can be used to align sequence reads corresponding to genetic samples from patients suspected of having a disease, or who have had the disease and are in suspected remission. The reference sequence constructs also provide insight into the genetic progression of the disease.
G06F 19/00 - Digital computing or data processing equipment or methods, specially adapted for specific applications (specially adapted for specific functions G06F 17/00;data processing systems or methods specially adapted for administrative, commercial, financial, managerial, supervisory or forecasting purposes G06Q;healthcare informatics G16H)
C12Q 1/6886 - Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material for cancer
C12Q 1/6883 - Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
86.
SYSTEMS AND METHODS FOR USE OF KNOWN ALLELES IN READ MAPPING
The invention generally relates to genomic studies and specifically to improved methods for read mapping using identified nucleotides at known locations. The invention provides methods of using identified nucleotides at known places in a genome to guide the analysis of sequence reads from that genome by excluding potential mappings or assemblies that are not congruent with the identified nucleotides. Information about a plurality of SNPs in the subject's genome is used to identify candidate paths through a genomic directed acyclic graph (DAG). Sequence reads are mapped to the candidate paths.
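The use of known alleles to narrow the search, as described in the abstract above, can be illustrated with a small Python sketch. Everything here is an assumption made for the example: the node layout, the placeholder SNP identifiers `rs_demo1` and `rs_demo2`, and the name `consistent_paths`. It enumerates only those paths through a toy genomic DAG that agree with the subject's identified nucleotides.

```python
def consistent_paths(nodes, edges, source, known):
    """Enumerate source-to-sink paths, excluding any node that is not
    congruent with the known alleles.

    nodes maps node_id -> (sequence, snp_id or None);
    known maps snp_id -> the allele identified in the subject.
    """
    def allowed(node_id):
        seq, snp = nodes[node_id]
        return snp is None or snp not in known or known[snp] == seq

    paths = []

    def walk(path):
        successors = edges.get(path[-1], [])
        if not successors:  # reached a sink: record the spelled sequence
            paths.append("".join(nodes[n][0] for n in path))
            return
        for nxt in successors:
            if allowed(nxt):
                walk(path + [nxt])

    if allowed(source):
        walk([source])
    return paths

# Toy DAG with two biallelic SNP sites.
nodes = {
    "n1": ("ACG", None),
    "n2": ("A", "rs_demo1"), "n3": ("T", "rs_demo1"),
    "n4": ("GG", None),
    "n5": ("C", "rs_demo2"), "n6": ("G", "rs_demo2"),
    "n7": ("TA", None),
}
edges = {"n1": ["n2", "n3"], "n2": ["n4"], "n3": ["n4"],
         "n4": ["n5", "n6"], "n5": ["n7"], "n6": ["n7"]}
paths = consistent_paths(nodes, edges, "n1", known={"rs_demo1": "T"})
```

Knowing one allele halves the number of candidate paths in this toy graph. Sequence reads would then be mapped only to the surviving paths.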
The invention includes methods for aligning reads (e.g., nucleic acid reads, amino acid reads) to a reference sequence construct, methods for building the reference sequence construct, and systems that use the alignment methods and constructs to produce sequences. The invention also includes methods and systems for evaluating the quality of the alignment between the reads and the reference sequence construct. The method is scalable, and can be used to align millions of reads to a construct thousands of bases or amino acids long. The invention additionally includes methods for identifying a disease or a genotype based upon alignment of nucleic acid reads to a location in the construct.
The invention includes methods for aligning reads (e.g., nucleic acid reads) comprising repeating sequences, methods for building reference sequence constructs comprising repeating sequences, and systems that can be used to align reads comprising repeating sequences. The method is scalable, and can be used to align millions of reads to a construct thousands of bases long. The methods and systems can additionally account for variability within a repeating sequence, or near to a repeating sequence, due to genetic mutation.
Methods of analyzing a transcriptome involve obtaining at least one pair of paired-end reads from a transcriptome from an organism, finding an alignment with an optimal score between a first read of the pair and a node in a directed acyclic data structure (the data structure has nodes representing RNA sequences such as exons or transcripts and edges connecting pairs of nodes), identifying candidate paths that include the node connected to a downstream node by a path having a length substantially similar to an insert length of the pair of paired-end reads, and aligning the paired-end reads to the candidate paths to determine an optimal-scoring alignment.
The invention generally provides systems and methods for analysis of RNA-Seq reads in which an annotated reference is represented as a directed acyclic graph (DAG) or similar data structure. Features such as exons and introns from the reference provide nodes in the DAG and those features are linked as pairs in their canonical genomic order by edges. The DAG can scale to any size and can in fact be populated in the first instance by import from an extrinsic annotated reference.
G06F 19/10 - Bioinformatics, i.e. methods or systems for genetic or protein-related data processing in computational molecular biology (in silico methods of screening virtual chemical libraries C40B 30/02;in silico or mathematical methods of creating virtual chemical libraries C40B 50/02)
93.
SYSTEMS AND METHODS FOR USING PAIRED-END DATA IN DIRECTED ACYCLIC STRUCTURE
Methods of analyzing a transcriptome involve obtaining at least one pair of paired-end reads from a transcriptome from an organism, finding an alignment with an optimal score between a first read of the pair and a node in a directed acyclic data structure (the data structure has nodes representing RNA sequences such as exons or transcripts and edges connecting pairs of nodes), identifying candidate paths that include the node connected to a downstream node by a path having a length substantially similar to an insert length of the pair of paired-end reads, and aligning the paired-end reads to the candidate paths to determine an optimal-scoring alignment.
G06F 19/22 - for sequence comparison involving nucleotides or amino acids, e.g. homology search, motif or Single-Nucleotide Polymorphism [SNP] discovery or sequence alignment
94.
METHODS AND SYSTEMS FOR GENOTYPING GENETIC SAMPLES
The invention provides methods and systems for making specific base calls at specific loci using a reference sequence construct, e.g., a directed acyclic graph (DAG) that represents known variants at each locus of the genome. Because the sequence reads are aligned directly to the DAG, the subsequent step of comparing a mutation, vis-à-vis the reference genome, to a table of known mutations can be eliminated. The disclosed methods and systems are notably efficient in dealing with structural variations within a genome or mutations that are within a structural variation.
The invention includes methods and systems for identifying disease-induced mutations by producing multi-dimensional reference sequence constructs that account for variations between individuals, different diseases, and different stages of those diseases. Once constructed, these reference sequence constructs can be used to align sequence reads corresponding to genetic samples from patients suspected of having a disease, or who have had the disease and are in suspected remission. The reference sequence constructs also provide insight into the genetic progression of the disease.
The invention provides methods and systems for making specific base calls at specific loci using a reference sequence construct, e.g., a directed acyclic graph (DAG) that represents known variants at each locus of the genome. Because the sequence reads are aligned directly to the DAG, the subsequent step of comparing a mutation, vis-à-vis the reference genome, to a table of known mutations can be eliminated. The disclosed methods and systems are notably efficient in dealing with structural variations within a genome or mutations that are within a structural variation.
The invention includes methods for aligning reads (e.g., nucleic acid reads, amino acid reads) to a reference sequence construct, methods for building the reference sequence construct, and systems that use the alignment methods and constructs to produce sequences. The invention also includes methods and systems for evaluating the quality of the alignment between the reads and the reference sequence construct. The method is scalable, and can be used to align millions of reads to a construct thousands of bases or amino acids long. The invention additionally includes methods for identifying a disease or a genotype based upon alignment of nucleic acid reads to a location in the construct.
C12Q 1/68 - Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
G06F 19/10 - Bioinformatics, i.e. methods or systems for genetic or protein-related data processing in computational molecular biology (in silico methods of screening virtual chemical libraries C40B 30/02;in silico or mathematical methods of creating virtual chemical libraries C40B 50/02)
98.
METHODS AND SYSTEMS FOR IDENTIFYING DISEASE-INDUCED MUTATIONS
The invention includes methods and systems for identifying disease-induced mutations by producing multi-dimensional reference sequence constructs that account for variations between individuals, different diseases, and different stages of those diseases. Once constructed, these reference sequence constructs can be used to align sequence reads corresponding to genetic samples from patients suspected of having a disease, or who have had the disease and are in suspected remission. The reference sequence constructs also provide insight into the genetic progression of the disease.
The invention generally provides systems and methods for analysis of RNA-Seq reads in which an annotated reference is represented as a directed acyclic graph (DAG) or similar data structure. Features such as exons and introns from the reference provide nodes in the DAG and those features are linked as pairs in their canonical genomic order by edges. The DAG can scale to any size and can in fact be populated in the first instance by import from an extrinsic annotated reference.
G06F 19/10 - Bioinformatics, i.e. methods or systems for genetic or protein-related data processing in computational molecular biology (in silico methods of screening virtual chemical libraries C40B 30/02;in silico or mathematical methods of creating virtual chemical libraries C40B 50/02)
G06F 19/22 - for sequence comparison involving nucleotides or amino acids, e.g. homology search, motif or Single-Nucleotide Polymorphism [SNP] discovery or sequence alignment
G06F 19/28 - for programming tools or database systems, e.g. ontologies, heterogeneous data integration, data warehousing or computing architectures
100.
Systems and methods for using paired-end data in directed acyclic structure
Methods of analyzing a transcriptome involve obtaining at least one pair of paired-end reads from a transcriptome from an organism, finding an alignment with an optimal score between a first read of the pair and a node in a directed acyclic data structure (the data structure has nodes representing RNA sequences such as exons or transcripts and edges connecting pairs of nodes), identifying candidate paths that include the node connected to a downstream node by a path having a length substantially similar to an insert length of the pair of paired-end reads, and aligning the paired-end reads to the candidate paths to determine an optimal-scoring alignment.
G06F 19/10 - Bioinformatics, i.e. methods or systems for genetic or protein-related data processing in computational molecular biology (in silico methods of screening virtual chemical libraries C40B 30/02;in silico or mathematical methods of creating virtual chemical libraries C40B 50/02)
G06F 19/22 - for sequence comparison involving nucleotides or amino acids, e.g. homology search, motif or Single-Nucleotide Polymorphism [SNP] discovery or sequence alignment
G06F 19/28 - for programming tools or database systems, e.g. ontologies, heterogeneous data integration, data warehousing or computing architectures