A system is disclosed comprising a memory containing instructions and one or more computer processors. When the instructions are executed, the system performs an operation to configure a Domain Name System (DNS) proxy, executing in a node of a cloud data platform associated with a first account, to perform hostname resolution of an Account Host Identifier (AHID) of the first account. The DNS proxy receives a DNS request from a process executing in a pod of the node, and the system fails to resolve the DNS request if the name in the DNS request differs from the AHID of the first account. The system returns an Internet Protocol (IP) address if the name in the DNS request matches the AHID. The process executing in the pod of the node is configured to send data to data storage of the cloud data platform using the returned IP address.
H04L 61/4511 - Network directories; Name-to-address mapping using standardised directory access protocols using domain name system [DNS]
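The resolution rule in the abstract above can be sketched as follows; the AHID string, the IP mapping, and the function name are illustrative assumptions, not the platform's actual implementation.

```python
# Hypothetical sketch of the DNS proxy's resolution rule: the proxy only
# resolves names that match the first account's AHID; all others fail.
ACCOUNT_AHID = "acct-123.data.example.com"   # assumed identifier format
AHID_TO_IP = {ACCOUNT_AHID: "10.20.30.40"}   # assumed internal mapping

def resolve(name: str):
    """Return the IP address for the account's AHID, or None to fail
    resolution for any other hostname."""
    if name != ACCOUNT_AHID:
        return None          # request for a non-matching name is refused
    return AHID_TO_IP[name]
```

A process in the pod would then use only the returned IP address to reach the platform's data storage.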
Described is a system that receives data from a variety of external data repositories and identifies unstructured data within the received content. The unstructured data is processed to generate textual representations. A chat message is displayed in a user interface, prompting the first user to submit a query. Upon receiving the user's query, the system generates a modified version of the query and identifies portions of the textual representations. A content block is then generated from these portions and input into a machine learning model trained to generate responses using content blocks. The system generates a response to the user's query and displays the response within the user interface.
H04L 51/02 - User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail using automatic reactions or user delegation, e.g. automatic replies or chatbot-generated messages
A top K query directed at a table is received. Run-time pruning is performed during execution of the top K query on the table. The run-time pruning comprises determining, by a top K node, a current boundary based on a set of values identified by a table scan node in scanning the table and applying, by the table scan node, the current boundary to prune data during the scanning of the table. The applying of the current boundary comprises reducing scanning ranges of the table scan node based on the top K column being a key column of the table and filtering values scanned by the table scan node based on the top K column being a non-key column of the table. The result set is returned responsive to the top K query based on the run-time pruning performed during execution of the top K query on the table.
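The run-time pruning for a non-key top K column can be sketched as a boundary-maintaining scan; this is a minimal illustration assuming a smallest-K query, with hypothetical names.

```python
import heapq

def top_k_scan(rows, k):
    """Scan values, keeping the k smallest, while using the current k-th
    boundary to skip values that cannot enter the result (a sketch of the
    run-time pruning described above)."""
    heap = []    # max-heap of the k smallest values seen so far (negated)
    pruned = 0
    for v in rows:
        if len(heap) < k:
            heapq.heappush(heap, -v)
        elif v < -heap[0]:          # current boundary = k-th smallest so far
            heapq.heapreplace(heap, -v)
        else:
            pruned += 1             # boundary filters this value out
    return sorted(-x for x in heap), pruned
```

For a key column, the same boundary could instead shrink the scan range rather than filter individual values.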
Various example embodiments described herein provide for systems, methods, devices, instructions, and the like for suffix-based speculative token decoding for an artificial intelligence model, such as a language model (e.g., large language model). In particular, some example embodiments provide an AI model system with hybrid speculative token decoding, which combines suffix-based speculative token decoding with a draft AI model approach to speculative token decoding. With this hybrid decoding approach, various example embodiments can accelerate inference throughput while adapting to different types of workloads, particularly agentic applications that exhibit repetitive token generation patterns.
A client system of a computing environment including a data platform is provided that optimizes a database query. The client system creates a logical plan tree for a Structured Query Language (SQL) query, with the logical plan tree comprising a set of nodes. The client system identifies a set of duplicate nodes in the set of nodes of the logical plan tree and identifies a duplicate subtree in the logical plan tree by determining a set of root nodes of the duplicate subtree using parent-child relationships of the set of nodes. The client system generates an optimized query by replacing instances of subqueries represented by the duplicate subtree using a set of optimized subqueries. The client system communicates the optimized query to the data platform for execution.
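Duplicate-subtree identification of the kind described can be sketched by grouping serialized subtrees; the (label, children) node encoding is an assumption for illustration.

```python
def _serialize(node, seen):
    """Serialize a plan-tree node as (label, child keys) and record which
    nodes share the same serialized form."""
    label, children = node
    key = (label, tuple(_serialize(c, seen) for c in children))
    seen.setdefault(key, []).append(node)
    return key

def find_duplicate_subtrees(root):
    """Return groups of nodes whose subtrees are identical; each group's
    nodes are candidate roots of a duplicate subtree (illustrative sketch)."""
    seen = {}
    _serialize(root, seen)
    return [nodes for nodes in seen.values() if len(nodes) > 1]
```

Each duplicate group could then be replaced by a single optimized subquery in the rewritten plan.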
Embodiments of the present disclosure provide techniques for classification with automated model selection, tuning, and training. A processing device receives, from a client, a data query referencing an input data set of a database associated with a virtual warehouse. The processing device allocates an amount of memory of the virtual warehouse to be used to train a machine learning (ML) model based on the input data set and a peak memory estimate, where the peak memory estimate is based on a heuristic. The processing device trains, based on the input data set and the data query, the ML model in the virtual warehouse using the amount of memory.
Example methods include providing a data platform for participants to assign roles of various rights to access data objects on the data platform. A first participant acting in a role as an owner may create a code entity on the data platform to interact with the data objects. A second participant acting in a role as an administrator may define a security boundary over the code entity created by the first participant. A third participant acting as a caller may request to interact with the data objects using the code entity. A processing device of the database system provides the third participant access to the code entity created by the first participant based on the security boundary defined by the second participant.
Methods, systems, and computer programs are presented for generating and managing query plan representations using a unified framework based on a common query plan configuration. The system includes a memory with instructions and one or more computer processors. The instructions, when executed, cause the system to provide a user interface (UI) for a development tool to create a configuration for a plan builder. The UI offers options to create a plurality of query plans, including a logical query plan, a physical query plan, a query plan hash, and a query plan signature. The system receives parameters to create the configuration and select a query plan from the plurality of query plans. The system generates a first query plan based on the parameters and causes the presentation of the first query plan on a display.
Various example embodiments described herein provide for systems, methods, devices, instructions, and the like for optimizing a query plan to execute a query by using early row flush and pushing down top-k information in a query plan that includes an inner join, which can be used within a data platform environment. In particular, various example embodiments use early flush operations by one or more aggregation operators to eventually enable information from a top-k operator of a query plan to be pushed down the query plan and through one or more inner join operators of the query plan to one or more select operators (e.g., aggregation operators and table scan operators) positioned below the one or more inner join operators.
The described system aims to reduce or eliminate inaccuracies and hallucinations in responses generated by a machine learning model when processing user queries. The data platform parses and categorizes the text within data files to create structured textual representations. The user submits multiple prompts which are collectively assessed to refine and modify the initial queries.
The modified query is used to identify segments of data files that are most relevant to the query. These relevant portions are then compiled into a Retrieval-Augmented Generation (RAG) context block. This RAG context block is fed into a prompt response machine learning model, which processes the enriched information to generate a well-informed and accurate response to the user's query. Finally, this response is displayed back to the user through the chat interface, completing a cycle that enhances the reliability and relevance of machine-generated answers.
Techniques for managing states of virtual warehouses in a multi-tenant network-based data system are described. A “resolver” may be provided in each warehouse scheduling service thread. The resolver may maintain a current state of the virtual warehouse and may generate a target state of the virtual warehouse based on an operation request, such as a resume operation, a suspend operation, resize operation, etc. The resolver may generate an action plan to converge the current state to the target state.
Questionnaire completion systems and methodologies for a data platform. The data platform receives from a consumer an unstructured questionnaire to be completed based on structured database objects, semi-structured database objects, and unstructured database objects stored on the data platform by a provider. The data platform generates a secured completion of the unstructured questionnaire based on a questionnaire completion model and the unstructured questionnaire. The data platform determines a confidence score for the completion and in response to determining the confidence score does not exceed a threshold value, the data platform generates a structured query based on the unstructured questionnaire and a structured query model, and generates the secured completion based on querying the structured database objects using the structured query. The data platform applies a security function to the secured completion to generate a completion of the unstructured questionnaire and provides the completion to the consumer.
A query engine can use partition-granular level statistics to optimize query performance. A query can reference a table with a plurality of partitions and include a predicate. A partition-granular selectivity estimate for the predicate can be generated based on statistics stored regarding the plurality of partitions of the table. A query plan can be generated based on the partition-granular selectivity estimate to optimize query processing.
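A partition-granular selectivity estimate of the kind described can be sketched from per-partition (min, max, row count) statistics; assuming values are uniformly distributed within each partition is an illustrative simplification.

```python
def estimate_selectivity(partitions, lo, hi):
    """Estimate the fraction of rows matching lo <= col <= hi from
    per-partition (min, max, row_count) statistics, assuming a uniform
    value distribution within each partition (illustrative sketch)."""
    matched = total = 0
    for pmin, pmax, rows in partitions:
        total += rows
        overlap = min(hi, pmax) - max(lo, pmin)
        if overlap < 0:
            continue                      # partition fully outside the range
        width = pmax - pmin
        frac = 1.0 if width == 0 else min(1.0, overlap / width)
        matched += rows * frac
    return matched / total if total else 0.0
```

The optimizer could then use this estimate to order joins or choose access paths.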
Methods, systems, and computer programs are presented for providing performance metrics in an online performance analysis system employing customer objects, such as database tables. A plurality of metric source data associated with a plurality of objects is accessed and a subset of the plurality of objects is determined that satisfies stableness criteria based on the plurality of metric source data to identify a set of stable objects. A set of metrics is generated based on the subset of the plurality of objects that satisfies the stableness criteria.
Antagonistic queries can have a high resource and time footprint, triggering a range of issues such as compilation performance degradation of other queries and machine failures. Described herein are techniques for automatically identifying antagonistic queries and redirecting them to dedicated resources. This redirecting can help better balance the workload on different work clusters and isolate antagonistic workloads from impacting the compilation and execution performance of other queries.
Various embodiments described herein provide for systems, methods, devices, instructions, and the like for generating synthetic data. According to various embodiments, synthetic data generation comprises receiving input specifying one or more source tables and join key columns, and generating synthetic data that preserves statistical similarity and referential integrity among columns of the source data.
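Preserving referential integrity in synthetic data can be sketched by drawing child join keys only from generated parent keys; this is a simplified illustration, not the described generator.

```python
import random

def synthesize_child_rows(parent_keys, n_rows, value_pool, seed=0):
    """Generate synthetic child rows whose join-key column draws only from
    keys that exist in the synthetic parent table, so every foreign key
    resolves (a simplified sketch of referential-integrity preservation)."""
    rng = random.Random(seed)
    return [(rng.choice(parent_keys), rng.choice(value_pool))
            for _ in range(n_rows)]
```

Statistical similarity (value distributions, correlations) would require fitting the value pool to the source columns, which this sketch omits.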
A system is disclosed for recovering historical table data in a database environment. The system includes at least one hardware processor and at least one memory. The memory stores instructions that, when executed, cause the system to receive a request to recover historical table data of a source table. The historical table data includes multiple partition files, and each partition file includes a deleted file designation. The system performs a recovery process on the partition files by determining a recoverable time range for the source table based on lifecycle information and restoring the partition files based on the recoverable time range. The system retrieves a schema associated with the historical table data and generates metadata corresponding to the schema. The metadata is associated with the recovered partition files to reconstruct the historical table data. This approach allows efficient and reliable recovery of deleted or lost table data.
G06F 11/14 - Error detection or correction of the data by redundancy in operation, e.g. by using different operation sequences leading to the same result
A data platform including an error handling framework for loading of input data. The data platform generates input data columns based on an input file and generates result data columns based on the input data columns and evaluating expressions. The data platform detects projection errors during the generating of the result data columns and stores result error indicators in error indicator arrays of the result data columns based on the projection errors. The data platform generates filtered result data columns based on the result data columns and the result error indicator arrays of the result data columns and stores the filtered result data columns in a database of the data platform.
Various example embodiments described herein provide for systems, methods, devices, instructions, and the like for structured language parsing to execute a differentially private query on a database system. According to some example embodiments, a user (e.g., an analyst) submits to a data system (e.g., data platform) a differentially private query using a structured language interface (e.g., SQL interface), which causes the calling of one or more stored procedures on the data system, where the one or more stored procedures encapsulate or facilitate use of a differential privacy engine, which can execute the differentially private query and generate a differentially private query result.
A system is disclosed that includes one or more hardware processors and at least one memory storing instructions. The system receives a first query directed towards a shared dataset and accesses a first set of data from a first table in the shared dataset. The system determines that an aggregation constraint policy is attached to the first table, which restricts output of data values stored in the table. The system performs a uniqueness check on join keys for a join operation associated with the first table, verifying that at least one row from the first table is not amplified in the result. The system enforces the aggregation constraint policy on the first query based on this verification. The system generates an output to the first query based on the first set of data. This approach helps control data aggregation and ensures privacy when accessing shared datasets.
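The uniqueness check and policy enforcement can be sketched as follows; the minimum-group-size rule shown is one hypothetical form an aggregation constraint policy could take, not the disclosed policy.

```python
def join_preserves_rows(right_keys):
    """Check that joining on these keys cannot amplify rows from the
    constrained table: true when the other side's join keys are unique
    (a sketch of the uniqueness check described above)."""
    return len(right_keys) == len(set(right_keys))

def enforce_min_group_size(groups, min_rows):
    """Release aggregates only for groups with at least min_rows rows;
    an illustrative, assumed form of an aggregation constraint policy."""
    return {k: sum(v) for k, v in groups.items() if len(v) >= min_rows}
```

If the uniqueness check fails, the query could be rejected or the constraint applied to the post-join result instead.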
Systems and methods are provided for creating a secure database execution environment. The system generates, by a database system executing on a secure enclave, attestation information. The system transmits the attestation information to a remote entity. The system obtains, by the database system executing on the secure enclave, one or more encryption keys in response to the remote entity authenticating the attestation information. The system performs, by the database system executing on the secure enclave, one or more database operations on encrypted data stored on the database system using the one or more encryption keys.
Described is a system for join constraints in query processing: receiving a first query directed towards a shared dataset in a data clean room; assessing the first query to identify that its one or more functions include at least a join function; determining that the first query is configured to join a first set of data from the shared dataset with a second set of data using the join function; determining that a join constraint policy is to be enforced in relation to the first query; and generating an output to the first query based on the execution of the one or more functions, the output omitting data values stored in a portion of the first set of data based on determining that the join constraint policy is to be enforced in relation to the first query.
The subject technology sends, from a child rowset operator (RSO) instance, a first request for performing a user defined aggregate function (UDAF) to a user defined function (UDF) server to initialize an aggregate state for a set of aggregation groups and update aggregated states for each aggregation group from the set of aggregation groups, the first request including a set of input rows. The subject technology receives information comprising a computation status of the UDAF. The subject technology sends, by the child RSO instance, a second request to the UDF server to update the aggregated states for each aggregation group from the set of aggregation groups, the second request including a second set of input rows. The subject technology receives an aggregate states vector with one entry per aggregation group. The subject technology sends, by the child RSO instance, the aggregate states vector to a parent RSO instance.
A join decision manager (JDM) generates a data processing pipeline. The data processing pipeline includes at least one join operation associated with build-side row data and probe-side row data. The JDM determines a maximum cardinality associated with the probe-side row data. The JDM determines a size of the build-side row data at a decision node of the at least one join operation. The JDM configures execution of the at least one join operation as either a broadcast join or a hash-hash join based on the size of the build-side row data and the maximum cardinality.
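The broadcast versus hash-hash decision can be sketched with two thresholds; the threshold values and the exact decision rule are assumptions for illustration, not the JDM's actual logic.

```python
def choose_join_strategy(build_bytes, probe_max_cardinality,
                         broadcast_limit=64 << 20,     # assumed 64 MiB cap
                         cardinality_limit=1_000_000): # assumed probe size
    """Pick a broadcast join when the build side is small enough to copy
    to every worker and the probe side is large enough to benefit from
    avoiding repartitioning; otherwise fall back to a hash-hash
    (partitioned) join. An illustrative sketch of the decision node."""
    if build_bytes <= broadcast_limit and probe_max_cardinality >= cardinality_limit:
        return "broadcast"
    return "hash-hash"
```

Because the decision runs at a node inside the pipeline, it can use observed build-side sizes rather than compile-time estimates.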
Systems and methods are provided for controlling the deletion of data in a database system. The system receives input comprising a deletion criterion for a database system. The system applies the deletion criterion to a set of tables of the database system. The system determines that an individual portion of the set of tables satisfies the deletion criterion. In response to determining that the individual portion of the set of tables satisfies the deletion criterion, the system transfers the individual portion of the set of tables to a temporary storage system.
Disclosed are techniques for routing and filtering telemetry data based on custom telemetry definitions provided by a user. A telemetry filter definition comprising rules for routing and filtering telemetry data may be converted into a common expression language (CEL) abstract syntax tree (AST). The CEL AST may be provided to a filtering component, which may compile the CEL AST into a CEL filter program comprising the rules for routing and filtering telemetry data. In response to receiving telemetry data, filtering, by the filtering component, the received telemetry data based on the CEL filter program to generate filtered telemetry data.
Disclosed is an execution information sharing system that writes execution information to a provider target (and other targets) in a secure manner. Execution information generated by an application may be written to a consumer stage, wherein the application is shared by a provider account of a data exchange with a consumer account that executes the application. A consumer exchange service (ES) of the data exchange may send a request to a copy service of the data exchange to copy the execution information from the consumer stage to the provider stage, wherein the consumer ES is a part of the data exchange and is protected from actions of the consumer account. A copy operation may be executed to copy the execution information from the consumer stage to the provider stage using the copy service of the data exchange. The execution information is ingested from the provider stage to a provider table.
Disclosed are techniques for using an application control framework to build, share and manage access to and usage of applications via a data sharing platform. An application control framework may provide a number of predefined controls and may receive values for certain predefined controls as well as custom control definitions and corresponding values from a provider. The application control framework may also receive application logic and may build an application package comprising the application logic and a set of controls including the predefined and custom controls to manage access to and usage of the application. In response to a consumer of the data sharing platform importing the application package, the application control framework may call the set of install scripts to install an instance of the application in the consumer account using the application logic and manage access to the application instance by the consumer using the set of controls.
A hot server is identified from a plurality of servers based on one or more server metrics associated with the hot server. A hot data range stored by the hot server is identified based on one or more read density metrics. The hot data range comprises a range of data values with a higher volume of access requests compared to other data values stored by the hot server. The hot data range is replicated across a number of additional servers.
G06F 16/27 - Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
G06F 9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
G06F 16/25 - Integrating or interfacing systems involving database management systems
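Identifying a hot data range from access metrics can be sketched by bucketing accessed keys; the fixed-width bucketing and the request-volume threshold are illustrative assumptions.

```python
def find_hot_ranges(access_log, bucket_size, threshold):
    """Bucket accessed key values into fixed-width ranges and return the
    (lo, hi) bounds of every bucket whose request volume meets the
    threshold, i.e. the hot data ranges (an illustrative sketch of
    read-density-based detection)."""
    counts = {}
    for key in access_log:
        b = key // bucket_size
        counts[b] = counts.get(b, 0) + 1
    hot = sorted(b for b, c in counts.items() if c >= threshold)
    return [(b * bucket_size, (b + 1) * bucket_size - 1) for b in hot]
```

Each returned range would then be replicated to additional servers to spread the read load.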
REPLICATION OF UNSTRUCTURED STAGED DATA BETWEEN DATABASE DEPLOYMENTS
Systems and methods for replicating unstructured staged data between remote database deployments are disclosed. The system includes at least one hardware processor and memory storing instructions that identify staged data at a first database deployment for replication to a second, remote database deployment. The staged data includes unstructured data items stored in a storage resource associated with the first database deployment. The system replicates a directory from the first database deployment to the second, where the directory includes information identifying the unstructured data items. Metadata is also replicated, including references to the locations of the unstructured data items in the storage resource. The second database deployment is enabled to access one or more unstructured data items from the storage resource of the first database deployment using the directory and references, without duplicating the data. Incremental replication of additional staged data is facilitated based on a comparison of directories between deployments.
G06F 16/27 - Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
G06F 16/25 - Integrating or interfacing systems involving database management systems
Embodiments of the present disclosure provide techniques for efficient computation over a wide table. A processing device determines that a first number of columns of a first table is greater than a threshold number of columns. The processing device transforms the first table into a second table based on the determination, where the second table includes a second number of columns that is less than the first number of columns, and where the second table includes a first column that includes first fields that identify columns of the first table, a second column that includes second fields that identify data types of fields of the first table, and a third column that includes third fields that include data of the fields of the first table. The processing device executes a UDTF on the second table.
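The wide-to-tall transformation can be sketched as an unpivot into the three described columns (column identifier, data type, field value); the threshold value and the use of Python type names are illustrative.

```python
def unpivot_wide(rows, columns, threshold=1000):
    """Transform a wide table into a tall three-column form when it has
    more columns than the threshold, mirroring the first/second/third
    columns described above (illustrative sketch). Returns None when the
    original layout is kept."""
    if len(columns) <= threshold:
        return None
    tall = []
    for row in rows:
        for col, val in zip(columns, row):
            # (column name, data type of the field, field value)
            tall.append((col, type(val).__name__, val))
    return tall
```

A UDTF can then run over the fixed three-column layout regardless of how wide the original table was.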
Cloning operations can be used to generate snapshots of tables at specified times. The snapshot objects can be stored in a first-tier storage with the table, where the cloned versions of the tables and the table may share files, such as micro-partition files, to conserve storage resources. After a first expiration time, snapshot objects can be transferred from the first-tier storage to a second-tier storage to further save on storage costs. After a second expiration time (e.g., full retention period), the snapshot objects can be deleted from the second-tier storage as well.
A data platform is provided that uses a replication cache to replicate data. The data platform is designed to receive a replication request from a secondary deployment that includes a request for a data transfer of data files from a primary deployment. The data platform analyzes metadata of a replication cache and the primary deployment to identify the data files for replication. Based on this metadata, the data platform determines whether to route the data transfer through the replication cache or directly from the primary deployment to the secondary deployment. The data transfer is then routed accordingly, and the receipt of the data transfer at the secondary deployment is verified.
G06F 16/27 - Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
G06F 11/34 - Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation
Provided herein are systems and methods for configuring dynamic tables with externally-managed Iceberg source tables. An example method for updating a dynamic table using data from an Iceberg source table includes generating, for each row in an Iceberg source table, a row identifier derived from immutable metadata associated with a physical storage location of the row and a position of the row within the physical storage location. The method further includes generating, for each of a first version and a second version of the Iceberg source table, a set of the row identifiers by computing the row identifier for each row present in the respective version. The sets of the row identifiers are compared between the first version and the second version of the Iceberg source table to identify changes at a row level. A dynamic table associated with the Iceberg source table is updated based on the identified changes.
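The row-identifier comparison can be sketched with (file path, position) pairs standing in for identifiers derived from immutable metadata; representing each version as a file-to-row-count mapping is an assumption for illustration.

```python
def row_ids(version_files):
    """Derive stable row identifiers as (file path, row position) pairs
    from a mapping of data files to their row counts (a sketch of
    identifiers built from immutable storage metadata)."""
    return {(path, pos)
            for path, n_rows in version_files.items()
            for pos in range(n_rows)}

def diff_versions(v1_files, v2_files):
    """Compare row-identifier sets between two table versions to find
    row-level changes; returns (added, removed) identifiers."""
    a, b = row_ids(v1_files), row_ids(v2_files)
    return sorted(b - a), sorted(a - b)
```

The dynamic table refresh would then apply only the identified added and removed rows.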
To provide outbound private link support for a multi-tenant data system with tenant isolation, a separate, dedicated virtual network is provided, referred to as private link (PL) virtual network. The PL virtual network may host a plurality of host interface endpoints and resource endpoints. A core virtual network and the PL virtual network may be peered together to work in conjunction. The private endpoints in the PL virtual network may then be connected to external systems using a private link without exposure to the public internet.
A system includes at least one hardware processor and memory storing instructions. The processor generates a query plan for a received query. The query plan includes multiple hash-join-build and hash-join-probe operations. A primary decision node is configured in the query plan. The primary decision node receives build-side data information from the hash-join-build operations. For each hash-join-build operation, a memory amount for performing a broadcast is determined. A subset of hash-join-build operations is selected for broadcast join distribution by comparing the memory amount to a broadcast memory threshold. The system selects a broadcast join distribution for the subset and a hash-hash join distribution for the remaining hash-join-build operations. The query plan is executed using the broadcast join distribution for the selected subset and the hash-hash join distribution for the remaining operations. This approach optimizes memory usage and join distribution during query execution.
A tag propagator may obtain a SQL statement. As a result of obtaining the SQL statement, object dependencies between objects referenced in the SQL statement may be determined. Tags associated with the determined object dependencies may be further determined. The tags may be propagated.
As described herein, an N-Gram index may be created and a search may be conducted using the index, which can lead to faster search results. The N-Gram index may also include partial N-Gram components to capture more relevant data. Moreover, the search may also take into account recent log data that has not yet been indexed. Techniques for building an index store using log data and efficiently searching the index store and log data to process search requests are described herein.
G06F 7/14 - Merging, i.e. combining at least two sets of record carriers each arranged in the same ordered sequence to produce a single set having the same ordered sequence
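An N-Gram index with verified search can be sketched as follows; the trigram size and in-memory posting lists are illustrative choices, and unindexed recent log data could be scanned with the same substring check.

```python
def build_ngram_index(docs, n=3):
    """Map each n-gram to the set of document ids containing it
    (a sketch of the index described above; n=3 is an assumption)."""
    index = {}
    for doc_id, text in docs.items():
        for i in range(len(text) - n + 1):
            index.setdefault(text[i:i + n], set()).add(doc_id)
    return index

def search(index, docs, query, n=3):
    """Intersect posting lists for the query's n-grams, then verify each
    candidate against the raw text to drop false positives; queries
    shorter than n fall back to a direct scan."""
    grams = [query[i:i + n] for i in range(len(query) - n + 1)]
    if not grams:
        return sorted(d for d, t in docs.items() if query in t)
    candidates = set.intersection(*(index.get(g, set()) for g in grams))
    return sorted(d for d in candidates if query in docs[d])
```

Recent, not-yet-indexed log entries could be appended to `docs` and checked by the same verification step.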
The subject technology receives a first query plan corresponding to a query, the first query plan comprising a set of predicates. The subject technology receives, during execution of a first portion of the first query plan, a set of rowsets. The subject technology determines a set of metrics for a first number of rows from a plurality of rows, the first number of rows corresponding to a first predicate order. The subject technology determines, using a heuristic, a second predicate order based at least in part on the set of metrics. The subject technology processes, during execution of the first portion of the first query plan using the second predicate order, a second set of rowsets, the second set of rowsets comprising a second plurality of rows that correspond to the first portion of the first query plan that has been executed based on the second predicate order.
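Adaptive predicate reordering can be sketched by measuring each predicate's selectivity on rows already processed; sorting by observed selectivity is one illustrative heuristic, not necessarily the disclosed one.

```python
def reorder_predicates(predicates, processed_rows):
    """Measure each predicate's selectivity (fraction of rows passing) on
    rows already processed, then reorder so the most selective predicate
    runs first and eliminates rows earliest (a heuristic sketch of the
    adaptive reordering described above)."""
    def selectivity(pred):
        if not processed_rows:
            return 1.0
        return sum(1 for r in processed_rows if pred(r)) / len(processed_rows)
    return sorted(predicates, key=selectivity)
```

Later rowsets of the same plan fragment would then be evaluated with the reordered predicate list.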
Provided herein are systems and methods for configuring managed dynamic Iceberg tables. An example method includes parsing, by at least one hardware processor, a table definition to determine a lag duration value, an external volume indicator, and a location indicator. A dynamic table (DT) manager generates a dynamic Iceberg table based on the table definition. The generating is based on selecting an external storage volume of a network-based database system based on the external volume indicator and the location indicator. The DT manager stores a base Iceberg table at a storage location associated with the external storage volume. The DT manager configures the base Iceberg table as the dynamic Iceberg table based on the lag duration value. The lag duration value indicates a maximum time period that a result of a prior refresh of the dynamic Iceberg table lags behind a current time instance.
Various embodiments described herein provide for systems, methods, devices, instructions, and the like for swapping artificial intelligence models, such as large language models (LLMs), based on inference request monitoring. In particular, some embodiments monitor inference requests submitted to various inference engines (where each inference engine comprises a group of software containers sharing assigned computing resources) and, based on analysis of inference request data, available models, currently loaded models, or a combination thereof, determine whether to swap out a set of AI models currently active on a select inference engine with another set of AI models available on the select inference engine.
The subject technology receives a query, the query including a set of statements, the set of statements including a function call, the function call including a declaration of a vector data type as an argument of the function call. The subject technology processes the query, the processing including invoking the function call. The subject technology provides a set of query results from processing the query, the set of query results including a vector data structure corresponding to the vector data type, the vector data structure including a set of elements, each element comprising a numerical data type.
Provided herein are systems and methods for configuring automatic evolution of dynamic tables. An example method includes parsing, by at least one hardware processor, a query associated with a dynamic table to determine a current base object dependency of the dynamic table on at least a first base object. A prior base object dependency of the dynamic table on at least a second base object is retrieved. A delta between data stored by the at least first base object and data stored by the at least second base object is determined. The dynamic table is updated based on the delta.
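The delta-and-update step above can be sketched as set operations over row snapshots. This is a minimal sketch under the assumption that rows are hashable tuples; the helper names are hypothetical:

```python
def table_delta(prior_rows: set, current_rows: set):
    """Delta between the prior and current base-object snapshots:
    rows to insert and rows to delete."""
    inserts = current_rows - prior_rows
    deletes = prior_rows - current_rows
    return inserts, deletes

def apply_delta(dynamic_table: set, inserts: set, deletes: set) -> set:
    """Update the dynamic table incrementally rather than rebuilding it."""
    return (dynamic_table - deletes) | inserts

prior = {("a", 1), ("b", 2)}
current = {("b", 2), ("c", 3)}
ins, dels = table_delta(prior, current)
updated = apply_delta({("a", 1), ("b", 2)}, ins, dels)
```

The point of the delta is that only changed rows flow into the update, which is what makes an automatic evolution of the dependent table cheap relative to a full recomputation.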
Embodiments of the present disclosure provide techniques for mountless querying of listing data. A processing device obtains a query that includes a universal listing identifier of a database, wherein the universal listing identifier is different from an identifier for the database. The processing device activates, at runtime, at least one role for accessing the database and shared objects based on the universal listing identifier. The processing device generates, based on the universal listing identifier and the at least one activated role, an in-memory placeholder object associated with the database. The processing device provides access to data of the database based on the in-memory placeholder object and the query.
A data platform is provided. The data platform is configured to receive a request from a client device of a user to run a web application within a computing environment. It initiates an execution of the web application and determines the availability of a cached user interface state of the web application. Upon determining that the cached user interface state is available, the data platform fetches the cached user interface state from the datastore and communicates it to the client device. This allows for displaying an initial user interface to a user by the client device using the cached user interface state while continuing to initialize the web application as the initial user interface is displayed.
G06F 16/957 - Browsing optimisation, e.g. caching or content distillation
G06F 9/451 - Execution arrangements for user interfaces
G06F 16/953 - Querying, e.g. by the use of web search engines
G06F 21/53 - Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems during program execution, e.g. stack integrity, buffer overflow or preventing unwanted data erasure by executing in a restricted environment, e.g. sandbox or secure virtual machine
46.
CONFIGURING INTERACTIONS BETWEEN PYTHON AND SQL CELLS IN A NOTEBOOK
Provided herein are systems and methods for configuring interactions between Python and SQL cells in a notebook. An example method includes detecting a run cell message received from a notebook UI application. The run cell message specifies a set of cells of a notebook. At least a first cell of the set of cells is configured as an SQL cell within the notebook. A query within at least one SQL statement associated with the SQL cell is executed to generate cell results. The cell results of the SQL cell are stored in a global namespace of the notebook. Access to the cell results in the global namespace is configured for at least a second cell of the set of cells.
Data replication can be used to copy database data from a primary deployment to a secondary deployment in a network-based data system. Logical representation of the clone tables in the secondary deployment can be used to reduce data transfer and storage costs. In response to a refresh request, the data system may clone from existing tables stored in the secondary deployment by applying a difference operation on the existing tables instead of copying entire cloned tables for each refresh request.
G06F 16/27 - Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
G06F 16/11 - File system administration, e.g. details of archiving or snapshots
G06F 16/174 - Redundancy elimination performed by the file system
48.
DISCRETE WORKLOAD PROCESSING USING A PROCESSING PIPELINE
Provided herein are systems and methods for discrete workload processing using a file processing service. An example method includes retrieving a manifest file from a work queue. The manifest file includes metadata associated with a plurality of workloads. A plurality of processing configurations corresponding to the plurality of workloads is generated. A processing configuration of the plurality of processing configurations is associated with scheduling execution of one or more tasks for a workload of the plurality of workloads. A processing pipeline definition of the manifest file is generated. The processing pipeline definition includes the plurality of processing configurations. The processing pipeline definition is registered with a pipeline definition registry of a network-based database system to generate a definition registration.
Provided herein are systems and methods for source monitoring associated with discrete workload processing. An example method includes generating a processing pipeline definition comprising a plurality of configurations associated with a corresponding plurality of notification fetching jobs. A source monitor definition is generated based on the processing pipeline definition. A source monitor definition instance is instantiated based on the source monitor definition. One or more notifications associated with a data source are fetched based on executing at least one notification fetching job of the plurality of notification fetching jobs configured in the source monitor definition instance.
Provided herein are systems and methods for data table auto-refresh. An example method includes configuring a first processing pipeline definition comprising a first plurality of configurations associated with a corresponding plurality of notification fetching jobs for metadata of a database table. A second processing pipeline definition is configured to include a second plurality of configurations associated with the metadata. A source monitor pipeline is instantiated based on the first processing pipeline definition to fetch a manifest file based on the first plurality of configurations. A refresh pipeline is instantiated based on the second processing pipeline definition to perform a refresh operation of the metadata and generate refreshed metadata based on the second plurality of configurations.
Replication of data is disclosed. A method includes replicating the data stored in a primary deployment hosted by a first cloud storage provider such that the data is further stored in a secondary deployment hosted by a second cloud storage provider. The method includes determining that the primary deployment transitioned from an available state to an unavailable state. The method includes executing one or more transactions on the data at the secondary deployment to cause a change to the data in response to determining that the primary deployment is unavailable.
G06F 16/27 - Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
H04L 67/1097 - Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
52.
DATA CONSISTENCY SERVICE FOR INTERNAL AND EXTERNAL VOLUMES
A network-based database system that performs consistency checks on data files to which the network-based database system does not have write access is provided. The network-based database system monitors a data file stored in a read-only storage system for changes. Upon detecting a change, the network-based database system performs a data consistency check using the content of the data file and its first metadata. If an inconsistency between the content and the first metadata is detected, the network-based database system sets a flag in second metadata, which is stored in a writable storage system, indicating the detected inconsistency. The network-based database system detects this flag during the execution of a query against a data object of the data file and executes the query without query performance tuning based on the detection of the flag, ensuring accurate query results.
The subject technology receives blob metadata from a key-value store. The subject technology retrieves a blob file from a blob store based on the blob metadata; the blob file comprises at least one of a snapshot file or a delta file. The subject technology transforms the blob file from a first format to a column file format, the transformation comprising converting data from the blob file to rowsets and writing the rowsets into a file in the column file format. The subject technology stores the file in a local cache.
Various embodiments described herein provide for systems, methods, devices, instructions, and the like for generating a structured language data query based on a natural language request and context data relating to a schema of a data store (e.g., database or the like). In particular, some embodiments use a set of large language models to generate, based on the natural language request and the context data, a response that comprises a structured language data query for a data store and a natural language explanation of the structured language data query.
A data platform for managing an application as a first-class database object. The data platform includes at least one processor and a memory storing instructions that cause the at least one processor to perform operations including detecting a data request from a browser for a data object located on the data platform, executing a stored procedure, the stored procedure containing instructions that cause the at least one processor to perform additional operations including instantiating a User Defined Function (UDF) server, an application engine, and the application within a security context of the data platform based on a security policy determined by an owner of the data object. The data platform then communicates with the browser using the application engine as a proxy server.
A system includes one or more hardware processors and at least one memory storing instructions. The hardware processors receive a packages policy for a cloud data platform account, the packages policy including at least one allowlist and at least one blocklist. The hardware processors receive a request to generate a report associated with the packages policy. In response, the hardware processors generate a report identifying, for the account, packages or versions of packages allowed by the allowlist and packages or versions of packages blocked by the blocklist, at a specified time or over a specified period. The hardware processors generate a notification to a user when a package is added to or removed from the allowlist or blocklist, the notification including a summary of changes and a reference to access an updated version of the report.
G06F 21/53 - Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems during program execution, e.g. stack integrity, buffer overflow or preventing unwanted data erasure by executing in a restricted environment, e.g. sandbox or secure virtual machine
57.
COLUMNAR DATA ANONYMIZATION USING SEMANTIC AND PRIVACY CATEGORY-BASED EXPANSION AND TRANSFORMATION
An approach is disclosed that retrieves data from a data set organized in multiple columns, where a first column includes both a first and a second data type. The approach expands the first column into a second column for the first data type and a third column for the second data type; determines a semantic category for each data type; and assigns a privacy category to each semantic category. The approach then anonymizes the second column using a first anonymization technique based on the first privacy category, and anonymizes the third column using a second anonymization technique based on the second privacy category. In turn, the approach generates an anonymized view of the data set using the anonymized data.
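The expand-then-anonymize flow above can be illustrated in Python. The split rules, the category names, and the two anonymization techniques (hashing identifiers, bucketing quasi-identifiers) are assumptions chosen for the sketch, not the disclosed system's actual mappings:

```python
import hashlib

def split_mixed_column(values):
    """Expand a column holding two data types into two typed columns."""
    nums, texts = [], []
    for v in values:
        if isinstance(v, (int, float)):
            nums.append(v); texts.append(None)
        else:
            nums.append(None); texts.append(v)
    return nums, texts

def anonymize(value, privacy_category):
    """Apply an anonymization technique chosen by privacy category."""
    if value is None:
        return None
    if privacy_category == "identifier":        # irreversible hash
        return hashlib.sha256(str(value).encode()).hexdigest()[:8]
    if privacy_category == "quasi_identifier":  # generalize to a bucket
        return (int(value) // 10) * 10
    return value

nums, texts = split_mixed_column([34, "alice@example.com", 27])
masked_texts = [anonymize(t, "identifier") for t in texts]
bucketed_nums = [anonymize(n, "quasi_identifier") for n in nums]
```

An anonymized view would then be assembled from `masked_texts` and `bucketed_nums` in place of the original mixed column.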
Provided herein are systems and methods for dynamic table replication. A method includes configuring a first DT within a first failover group. The method further includes causing replication of the first DT from a primary deployment of a network-based database system to a second DT in a secondary deployment of the network-based database system. The method further includes configuring the second DT as a primary DT in the secondary deployment based on detecting a failover event in the primary deployment. The method further includes performing an automatic refresh of the primary DT in the secondary deployment based on a scheduling state of the first DT in the primary deployment prior to the failover event.
G06F 16/27 - Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
59.
QUERY GENERATION BASED ON NATURAL LANGUAGE QUESTION AND SEMANTIC DATA
Various embodiments described herein provide for systems, methods, devices, instructions, and the like for generating a structured language data query based on a natural language question and semantic data associated with a schema of a data store (e.g., database or the like). In particular, some embodiments use a set of large language models to generate a structured language data query for a data store based on semantic data and the natural language question, determine whether the structured language data query is valid, cause the structured language data query to be performed on the data store in response to determining that it is valid, and generate a response that comprises a query result from the data store.
Disclosed are techniques for selectively sharing, with a provider account of a data exchange, events generated by an application shared by the provider account. A set of telemetry definitions may be defined for a data listing via which an application is shared by a provider account of a data sharing platform. Each of the set of telemetry definitions specifies a type of event generated by the application and a corresponding sharing requirement for the type of event. The set of telemetry definitions are persisted as metadata associated with the data listing. The application may be installed in a consumer account of the data exchange. In response to the application generating a plurality of events, a subset of the plurality of events may be shared with the provider account, wherein the subset of the plurality of events that is shared is based in part on the set of telemetry definitions.
An entity-level privacy system receives a query directed towards a shared dataset, the shared dataset comprising one or more data entries associated with one or more distinct entities, each entity of the one or more distinct entities being identifiable by one or more unique entity identifiers. The entity-level privacy system implements an entity-level privacy constraint, the entity-level privacy constraint comprising a dynamic aggregation constraint based on the one or more unique entity identifiers. The entity-level privacy system determines that the one or more unique entity identifiers satisfy a threshold condition comprising a minimum number of entities. The entity-level privacy system enforces the entity-level privacy constraint on the query and generates an output to the query based on the entity-level privacy constraint and the dynamic aggregation constraint while maintaining entity-level privacy associated with the one or more distinct entities.
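The dynamic aggregation constraint above amounts to suppressing a result unless enough distinct entities contribute. A minimal sketch, assuming rows carry an `entity_id` key and a sum aggregate (both illustrative choices):

```python
def aggregate_with_entity_privacy(rows, min_entities=5):
    """Answer an aggregate query only when at least `min_entities`
    distinct entities contribute, enforcing an entity-level (rather than
    row-level) threshold condition."""
    entities = {r["entity_id"] for r in rows}
    if len(entities) < min_entities:
        return None  # suppress the result rather than risk re-identification
    return sum(r["value"] for r in rows)

rows = [{"entity_id": e, "value": 10} for e in ("a", "b", "c", "a")]
suppressed = aggregate_with_entity_privacy(rows, min_entities=5)
allowed = aggregate_with_entity_privacy(rows, min_entities=3)
```

Note that the threshold counts distinct entity identifiers, not rows: four rows from three entities fail a five-entity minimum.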
Some embodiments include information retrieval through query history insights by accessing query history of a first user, processing the query history of the first user using a first machine learning model to identify naming characteristics of the query history specific for the first user, and enriching a database comprising data associated with the first user with the identified naming characteristics of the query history. The system receives a new search query in natural language from the first user, processes the new search query in the natural language using a second machine learning model to identify embeddings within the new search query, identifies one or more recommended tables and corresponding columns, and causes display of the recommended tables and corresponding columns for each of the recommended tables by a user device of the first user.
Provided herein are systems and methods for a zero-copy clone of a DT. A method includes performing a clone operation on a dynamic table (DT) to generate a cloned DT. The DT is based on a query applied on a base table. The cloned DT is based on the query applied on a cloned base table corresponding to the base table. A first delta is determined based on at least one change in the base table between a first version of the base table used by the DT at a time of the clone operation and a second version of the base table generated prior to the clone operation. A first refresh operation of the cloned DT is performed based on the first delta.
G06F 16/27 - Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
A data platform that performs a file existence check is provided. The data platform creates a bounded page and selects a set of selected metadata files from a set of metadata files, where each selected metadata file includes a set of data file metadata files. Each member of the set of data file metadata files includes a file name of a respective data file. The data platform stores the set of data file metadata files of each selected metadata file in a first sorted list in the bounded page. The data platform retrieves a second sorted list of file names of a set of data files stored on a data storage system. The data platform determines the existence of each respective data file of each member of the set of data file metadata files on the data storage system by comparing the first sorted list to the second sorted list.
G06F 7/08 - Sorting, i.e. grouping record carriers in numerical or other ordered sequence according to the classification of at least some of the information they carry
G06F 16/174 - Redundancy elimination performed by the file system
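The existence check in the file-existence entry above compares two sorted lists, which permits a single-pass merge instead of repeated lookups. An illustrative sketch (the function name and the decision to report missing files are assumptions):

```python
def check_existence(expected_sorted, actual_sorted):
    """Single-pass merge over two sorted file-name lists; returns the
    expected file names that are absent from the storage listing."""
    missing, i, j = [], 0, 0
    while i < len(expected_sorted):
        if j >= len(actual_sorted) or expected_sorted[i] < actual_sorted[j]:
            missing.append(expected_sorted[i])  # not present in storage
            i += 1
        elif expected_sorted[i] == actual_sorted[j]:
            i += 1; j += 1                      # file exists
        else:
            j += 1                              # extra file in storage; skip
    return missing

missing = check_existence(["a.parquet", "b.parquet", "d.parquet"],
                          ["a.parquet", "c.parquet", "d.parquet"])
```

Because both lists are sorted, the comparison is linear in the combined list length, which is why the platform materializes the metadata-derived names as a sorted list in the bounded page first.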
65.
INTEGRATE SQL DATABASE WITH CONTAINER EXECUTION MANAGEMENT
The subject technology receives a first set of statements, the first set of statements comprising at least a first statement to create a particular service associated with a container service. Further, the subject technology instantiates the particular service associated with the container service, in response to the first statement, at a first cluster, the first cluster including a first set of worker nodes.
The subject technology receives a first request to create a container service, the request indicating a service specification for creating the container service. The subject technology generates a set of endpoints based on the service specification. The subject technology generates a set of roles based on the service specification. The subject technology stores service metadata related to the set of endpoints and the set of roles in a metadata database. The subject technology instantiates the container service at a container services cluster, the container services cluster including a set of worker nodes, the container service being deployed on a worker node from the set of worker nodes, and enforces security policies based on the roles and service metadata. The subject technology coordinates with Role Based Access Control (RBAC) and network policies of the subject database system and transparently enforces the same policies in the subject container system.
Techniques for providing an interface for viewing real-time metadata stored in different locations and in different formats are described. A monitoring schema may process queries related to user metadata using the techniques described below in further detail. The monitoring schema may also provide a single interface with fine-grain access control for viewing metadata based on role-based access control with limitless retention using different storage locations.
An assignment of a resource for a service to a compute node in a compute cluster is evaluated. The evaluating of the assignment includes determining one or more capacity consumption metrics associated with compute capacity consumed by the resource and determining one or more available capacity metrics associated with the compute node. The one or more capacity consumption metrics are compared with the one or more available capacity metrics to determine whether the compute node has available capacity for the assignment of the resource. A determination whether to confirm the assignment of the resource to the compute node is made based on the evaluating.
Systems and methods are provided for classifying data. The systems and methods access an automatic classification profile comprising one or more conditions for triggering data classification and access a classification scope that identifies one or more tables to be classified. The systems and methods determine that a set of attributes of the one or more tables identified by the classification scope corresponds to the one or more conditions of the automatic classification profile. The systems and methods, in response to determining that the set of attributes of the one or more tables identified by the classification scope corresponds to the one or more conditions of the automatic classification profile, automatically classify data stored in one or more columns of the one or more tables.
The subject technology initiates a reinstallation process of a key-value storage device and locks the key-value storage device. The subject technology performs a bootstrap process for a blob manager and a blob worker. The subject technology performs a restoration process of a storage server. The subject technology applies a set of mutation logs to the storage server. The subject technology unlocks the key-value storage device and enables network traffic for the key-value storage device.
G06F 11/14 - Error detection or correction of the data by redundancy in operation, e.g. by using different operation sequences leading to the same result
A data platform that upgrades applications having containerized services across multiple consumer user accounts when the data platform receives a new version from a provider user. For each consumer account utilizing the application, the data platform performs a series of upgrade operations. The operations include identifying the relevant set of services linked to the application and executing an upgrade command for each service to transition to the new version. The data platform actively monitors the health and version status of each service, ensuring they meet the upgrade criteria. The upgrade is deemed successful and confirmed by the data platform once all services are verified to be healthy and aligned with the new version, thus ensuring a seamless and efficient upgrade experience.
Systems and methods for an organization-level account for an organization on a data platform, users of which can possess administrative or management privileges with respect to the organization and across one or more other accounts of the organization.
Row-level security (RLS) may provide fine-grained access control based on flexible, user-defined access policies to databases, tables, objects, and other data structures. A RLS policy may be an entity or object that defines rules for row access. A RLS policy may be decoupled or independent from any specific table. This allows more robust and flexible control. A RLS policy may then be attached to one or more tables. The RLS policy may include a Boolean-valued expression.
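The decoupling described above, in which a policy is a standalone object later attached to tables, can be sketched as follows. The class and function names are illustrative, and the predicate signature is an assumption:

```python
class RowAccessPolicy:
    """A row access policy decoupled from any specific table: a named,
    Boolean-valued predicate evaluated per row for the current user."""
    def __init__(self, name, predicate):
        self.name = name
        self.predicate = predicate  # (user, row) -> bool

def attach_and_filter(rows, policy, user):
    """Attaching the policy to a table means reads are filtered through
    its predicate; only rows for which it returns True are visible."""
    return [row for row in rows if policy.predicate(user, row)]

region_policy = RowAccessPolicy(
    "region_match", lambda user, row: user["region"] == row["region"])
rows = [{"id": 1, "region": "EU"}, {"id": 2, "region": "US"}]
visible = attach_and_filter(rows, region_policy, {"region": "EU"})
```

Because `region_policy` exists independently of any table, the same policy object can be attached to many tables, which is the flexibility the decoupling buys.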
Some embodiments include receiving a first query directed towards a shared dataset, the first query invoking one or more functions, accessing a first set of data from the shared dataset to perform the one or more functions, determining that a row access policy is to be enforced in relation to the first query, and generating an output to the first query based on an execution of the one or more functions.
Disclosed are techniques for anomaly detection in time series data using an ML model. An untrained time series forecasting machine learning (ML) model may be provided as part of a class that includes an anomaly detection function, a features module, and a target transform module. In response to the class being invoked, an instance of the time series forecasting ML model may be trained using training time series data specified in the invocation of the class. The trained instance of the forecasting ML model may be persisted in an anomaly detection object along with instances of the anomaly detection function, the features module, and the target transform module. In response to receiving a call to the anomaly detection object, anomaly detection is performed on time series data specified in the call using at least the trained instance of the forecasting ML model and the instance of the anomaly detection function.
Disclosed is a method of detecting anomalies in time series data. The method includes computing a first bound for a first window of the time series data and a second bound for a second window of the time series data, wherein the second window includes more samples of the time series data than the first window. The method also includes generating a first outlier status that indicates whether a current value of the time series data exceeds the first bound, and generating a second outlier status that indicates whether the current value of the time series data exceeds the second bound. The method also includes determining, by a processing device, whether an anomaly is detected in the time series data based on values of the first outlier status and the second outlier status. The method also includes generating an alert in response to determining that the anomaly is detected and sending the alert to a notification system.
H04L 41/0604 - Management of faults, events, alarms or notifications using filtering, e.g. reduction of information by using priority, element types, position or time
H04L 41/16 - Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks using machine learning or artificial intelligence
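The two-window scheme in the anomaly-detection entry above can be sketched in Python. The mean-plus-k-sigma bound and the rule that an anomaly requires both outlier statuses to be set are assumptions for illustration; the disclosed method only requires that the decision be based on the two statuses:

```python
def bound(window, k=3.0):
    """Upper bound mean + k * stddev computed over a window of samples."""
    mean = sum(window) / len(window)
    var = sum((x - mean) ** 2 for x in window) / len(window)
    return mean + k * var ** 0.5

def detect_anomaly(series, short_n=5, long_n=20, k=3.0):
    """Compare the current value against bounds from a short window and a
    longer window of the preceding history; here, flag an anomaly only
    when the value exceeds both bounds."""
    current, history = series[-1], series[:-1]
    short_outlier = current > bound(history[-short_n:], k)
    long_outlier = current > bound(history[-long_n:], k)
    return short_outlier and long_outlier

data = [10.0] * 19 + [100.0]
anomalous = detect_anomaly(data)
```

Requiring agreement between a short and a long window is one way to suppress alerts on values that are unusual only relative to a brief local fluctuation.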
77.
LARGE LANGUAGE MODEL-BASED COMMUNICATION CONTENT GENERATION
Various embodiments described herein provide for systems, methods, devices, instructions, and the like for generating communication content using one or more large language models (LLMs). In particular, some embodiments provide a communication content generation system that generates content for a communication to a target organization using one or more LLMs and information regarding the target organization provided by an organization database, which can comprise curated organization-intelligence data.
Systems and methods are provided for processing a query with one or more predicates in a database system. The systems and methods receive a query comprising one or more predicates. The systems and methods process metadata associated with a database comprising a plurality of files to identify a set of fully-matched files and a set of partially-matched files, the set of fully-matched files comprising a first group of files in which all rows of each file match each of the one or more predicates of the query, the set of partially-matched files comprising a second group of files having rows that possibly match the one or more predicates of the query. The systems and methods perform, based on the query, one or more database operations on the set of fully-matched files prior to processing the set of partially-matched files.
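The fully-matched versus partially-matched classification above can be illustrated with per-file min/max metadata and a range predicate. The metadata shape and a `lo <= x <= hi` predicate are assumptions for the sketch:

```python
def classify_files(file_metadata, lo, hi):
    """Classify files for a range predicate lo <= x <= hi using per-file
    (min, max) metadata: fully-matched files need no row-level filtering."""
    fully, partial = [], []
    for name, (fmin, fmax) in file_metadata.items():
        if fmax < lo or fmin > hi:
            continue  # no row can match: prune the file entirely
        if lo <= fmin and fmax <= hi:
            fully.append(name)    # every row matches the predicate
        else:
            partial.append(name)  # some rows may match: scan required
    return sorted(fully), sorted(partial)

meta = {"f1": (5, 9), "f2": (0, 20), "f3": (50, 60)}
fully, partial = classify_files(meta, 0, 10)
```

Operating on fully-matched files first lets results start flowing from metadata alone while the row-level scans of partially-matched files are still running.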
A network egress request is received from a container service within a cloud data platform. A cryptographically signed egress policy associated with the network egress request is received by a trusted service controller of the cloud data platform. The network egress request is validated against the cryptographically signed egress policy. Based on the validation, a determination of whether the network egress request complies with the cryptographically signed egress policy is established. Upon validation, the network egress request is granted or denied based on the determination.
Provided herein are systems and methods for hash-join broadcast decision making. For example, a method includes generating a query plan for a received query. The query plan includes a plurality of join operations with a plurality of hash-join-build (HJB) operations and a plurality of hash-join-probe (HJP) operations. A decision node of a plurality of decision nodes of the query plan is configured as a primary decision node. Build-side data information associated with build-side data and received from the plurality of HJB operations is decoded by the primary decision node. A data distribution method is determined by the primary decision node for each HJB operation of the plurality of HJB operations based on the build-side data information. The query plan is executed based on distributing the build-side data to the plurality of HJP operations using the data distribution method for each HJB operation of the plurality of HJB operations.
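The per-HJB distribution decision above typically reduces to choosing between broadcasting a small build side and hash-partitioning a large one. A minimal sketch; the size threshold and names are assumptions, and a real decision node would weigh more build-side statistics than byte size:

```python
def choose_distribution(build_side_bytes, max_broadcast_bytes=100 * 2**20):
    """Pick a data distribution method for one hash-join-build operation:
    broadcast a small build side to every probe worker, otherwise
    hash-partition it across the probe operations."""
    if build_side_bytes <= max_broadcast_bytes:
        return "broadcast"
    return "hash_partition"

small = choose_distribution(64 * 1024)   # 64 KiB build side
large = choose_distribution(8 * 2**30)   # 8 GiB build side
```

Centralizing this choice in a primary decision node, as the method describes, ensures every HJB operation of the plan uses a consistent view of the decoded build-side information.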
A data platform for executing containers is provided. In some examples, the data platform receives an application from an application package of a provider account, the application including a setup script and a manifest of a service. The data platform activates access roles based on the manifest and creates the service and a compute pool using the setup script and a specification file accessed from the application package using an access role. The service is executed in the compute pool, accessing objects of the application package and of the data platform using the access roles.
G06F 16/28 - Databases characterised by their database models, e.g. relational or object models
G06F 21/57 - Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
82.
LIVE METRIC AUTO-SCALING LEVERAGING TELEMETRY SERVICES
Autoscaling techniques can optimize usage of computing resources in a data system while also quickly reacting to change in workloads. The computing resources are arranged in different clusters. Autoscaling can be partitioned into two separate, independent autoscaling phases: a slow autoscaler and a fast autoscaler.
The subject technology receives, by one or more hardware processors, a request to execute a user-defined function (UDF) within a sandbox process. The subject technology establishes a secure egress path for the UDF using an overlay network, where the overlay network includes a dedicated DNS resolver at a proxy service. The subject technology receives, from the UDF, a DNS request to resolve a hostname. The subject technology validates, by the proxy service, that the hostname is included in an allowed host list associated with the UDF. The subject technology resolves, by the dedicated DNS resolver, the hostname to an IP address using a UDP listener configured to handle DNS protocol traffic on a designated port of the proxy service. The subject technology enables the UDF to communicate with a host at the resolved IP address via the secure egress path.
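The allowed-host validation step above can be sketched as follows. This is a simplified model: the in-memory `host_table` stands in for a real DNS lookup by the dedicated resolver, and the function name is hypothetical:

```python
def resolve_for_udf(hostname, allowed_hosts, host_table):
    """Resolve only hostnames on the UDF's allowed host list; any other
    name fails, so the sandboxed UDF cannot reach arbitrary endpoints."""
    if hostname not in allowed_hosts:
        raise PermissionError(f"host not allowed: {hostname}")
    return host_table[hostname]

table = {"api.example.com": "203.0.113.7"}
ip = resolve_for_udf("api.example.com", {"api.example.com"}, table)
try:
    resolve_for_udf("evil.example.net", {"api.example.com"}, table)
    blocked = False
except PermissionError:
    blocked = True
```

Refusing resolution, rather than filtering packets after the fact, means a non-allowlisted destination never yields an IP address the UDF could connect to.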
The subject technology receives a first semi-structured object. The subject technology iterates through a list of fields specified by a target object type. The subject technology, for each field, determines whether a field with a same name is present in the first semi-structured object. The subject technology, in response to the field being found in the first semi-structured object, converts a value of the field to a target field type according to defined type conversion rules. The subject technology stores the converted value in a unified representation comprising a data structure that stores both structured and semi-structured data types. The subject technology processes a query using the unified representation.
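The field-by-field conversion can be sketched like this; the field names, conversion rules, and shape of the unified representation are illustrative assumptions.

```python
# Sketch of converting a semi-structured object toward a target object type.


def to_unified(obj: dict, target_fields: dict) -> dict:
    """Walk the target type's field list; convert matching fields, keep the rest."""
    unified = {}
    for name, target_type in target_fields.items():
        if name in obj:                      # field with the same name is present
            # stand-in for the defined type conversion rules
            unified[name] = target_type(obj[name])
    # fields not named by the target type remain semi-structured
    extras = {k: v for k, v in obj.items() if k not in target_fields}
    return {"structured": unified, "semi_structured": extras}
```

A query engine can then read typed fields from the structured part while still reaching the untyped extras in the same representation.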
Techniques for creating and modifying database entities, such as tables, tasks, etc., using declarative statements are described. Declarative statements specify a target state of the entity without specifying specific actions. The techniques described herein apply changes to the database entity atomically and incrementally.
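The incremental step can be sketched as a diff between the current state and the declared target state; the column-set model below is a simplification, not the patented mechanism.

```python
# Illustrative sketch: compute the incremental actions that move a table's
# current column set to a declaratively specified target state.


def plan_changes(current: dict, target: dict) -> list:
    """Return (action, column, type) tuples moving `current` to `target`."""
    actions = []
    for col, col_type in target.items():
        if col not in current:
            actions.append(("ADD", col, col_type))
        elif current[col] != col_type:
            actions.append(("ALTER", col, col_type))
    for col in current:
        if col not in target:
            actions.append(("DROP", col, None))
    return actions
```

Applying the resulting action list in one transaction gives the atomic, incremental behavior the abstract describes: nothing already in the target state is touched.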
A method includes retrieving, by at least one hardware processor in a database system, a database table. The database table includes a plurality of partitions. A plurality of batches is generated for the database table based on a file selection task of the database system. Each batch of the plurality of batches includes a partition subset of the plurality of partitions. A plurality of execution jobs is configured based on an execution management task of the database system. Each execution job of the plurality of execution jobs includes a batch subset of the plurality of batches, and the skew of batch sizes for the batch subset is below a threshold skew. Concurrent execution of the plurality of execution jobs is performed to cluster the partition subset associated with each of the plurality of execution jobs.
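One way to keep batch-size skew below a threshold is greedy balanced assignment; the scheme below is a sketch under that assumption, not the patented selection logic.

```python
# Sketch: assign batches to execution jobs so job totals stay balanced.
import heapq


def assign_batches(batch_sizes: list[int], num_jobs: int) -> list[list[int]]:
    """Greedily place each batch (largest first) on the currently smallest job."""
    heap = [(0, j) for j in range(num_jobs)]   # (total size, job index)
    heapq.heapify(heap)
    jobs = [[] for _ in range(num_jobs)]
    for size in sorted(batch_sizes, reverse=True):
        total, j = heapq.heappop(heap)
        jobs[j].append(size)
        heapq.heappush(heap, (total + size, j))
    return jobs


def skew(jobs: list[list[int]]) -> float:
    """Relative spread of job totals; 0.0 means perfectly balanced."""
    totals = [sum(j) for j in jobs]
    return (max(totals) - min(totals)) / max(totals)
```

Jobs whose computed skew exceeds the threshold can be re-partitioned before the concurrent clustering execution begins.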
Techniques for continuous ingestion of files using custom file formats are described. A custom file format may include formats not natively supported by a data system. Unstructured files (e.g., images) may also be considered custom file formats. A custom file format may be set using a user defined table function and scanner options.
Various embodiments provide for managing differential privacy on a database system using one or more differential privacy policies and one or more differential privacy budgets associated with the one or more differential privacy policies.
Various embodiments provide for using one or more differential privacy domains on a database system to execute a differentially private query on the database system.
42 - Scientific, technological and industrial services, research and design
Goods & Services
Data migration services; Cloud computing featuring software for use in data migration, data backup, and data retrieval; Software as a service (SAAS) services featuring software for use in data migration, data backup, and data retrieval; Providing temporary use of non-downloadable software for data extraction, data integration, data management, data consolidation, data migration, data configuration, data unification and data loading; Platform as a service (PAAS) featuring computer software platforms for use by others in connection with data migration; Computer consulting services in the field of data management, data migration, data analysis, and data reporting
Embodiments of the present disclosure describe systems, methods, and computer program products for redacting sensitive data within a database. An example method can include receiving a data query referencing unredacted data of a database, where the data query comprises a value identifying a type of sensitive data to be redacted from the unredacted data. Responsive to the data query, a processing device executes a redaction operation to identify candidate sensitive data that matches the type of sensitive data to be redacted within the unredacted data of the database, and returns a redacted data set in which the candidate sensitive data that is provided is based on an authentication level utilized for execution of the redaction operation.
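A regex-based stand-in illustrates the shape of such a redaction operation; the type names and patterns are assumptions for the sketch, not the disclosed implementation.

```python
# Hypothetical redaction sketch: match a named sensitive-data type and
# replace candidates with a marker.
import re

PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def redact(rows: list[str], sensitive_type: str) -> list[str]:
    """Replace every match of the requested sensitive-data type with a marker."""
    pattern = PATTERNS[sensitive_type]
    return [pattern.sub("[REDACTED]", row) for row in rows]
```

An authentication check could gate which types a caller may request, matching the abstract's dependence on the execution's authentication level.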
A low-code web application testing platform is provided. The platform automates the testing process of web applications. It executes a script that simulates the frontend of a web application, capturing output messages that detail the UI elements, and interprets these messages to construct a navigable structure representing the application's UI. To emulate user interactions, the platform performs test actions within this structure, then reruns the script with these interactions to capture additional output messages that reflect the application's response. Finally, the platform generates a test report based on the application's reaction to the emulated interactions, providing a comprehensive assessment of the application's functionality and user experience.
Systems and methods are provided for controlling the deletion of data in a database system. The system receives input comprising a deletion criterion for a database system. The system applies the deletion criterion to a set of tables of the database system. The system determines that an individual portion of the set of tables satisfies the deletion criterion. In response to determining that the individual portion of the set of tables satisfies the deletion criterion, the system transfers the individual portion of the set of tables to a temporary storage system.
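The criterion-driven transfer can be sketched as follows; the row-level table model and criterion shape are illustrative assumptions.

```python
# Sketch: apply a deletion criterion to tables and move matching portions
# into a temporary storage area instead of deleting them outright.


def apply_deletion_criterion(tables: dict, criterion) -> tuple[dict, dict]:
    """Move rows satisfying the criterion from each table into temp storage."""
    temporary = {}
    for name, rows in tables.items():
        matched = [r for r in rows if criterion(r)]
        if matched:
            temporary[name] = matched                       # held, not destroyed
            tables[name] = [r for r in rows if not criterion(r)]
    return tables, temporary
```

Holding the matched portions in temporary storage leaves room for a retention window or audit before permanent deletion.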
A method implementing a fault-tolerant data warehouse using availability zones includes allocating a plurality of processing units to a data warehouse, the processing units located in different availability zones, an availability zone comprising one or more data centers. The method further includes routing a query to a processing unit within the data warehouse, the query having a common session identifier with a query previously provided to the processing unit, the processing unit determined to be caching a data segment associated with a cloud storage resource independent of the plurality of processing units. The method further includes, as a result of monitoring a number of queries running at an input degree of parallelism, determining that the processing capacity of the processing units has reached a threshold; and changing a total number of processing units using the input degree of parallelism and the number of queries.
H04L 67/1097 - Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
G06F 9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
G06F 16/28 - Databases characterised by their database models, e.g. relational or object models
H04L 41/0896 - Bandwidth or capacity management, i.e. automatically increasing or decreasing capacities
H04L 41/5025 - Ensuring fulfilment of SLA by proactively reacting to service quality change, e.g. by reconfiguration after service quality degradation or upgrade
H04L 43/0817 - Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability by checking functioning
H04L 67/1008 - Server selection for load balancing based on parameters of servers, e.g. available memory or workload
97.
Distributed in-database vectorized operations using user defined table functions
The subject technology determines a set of shards of rows from a data set based on a number of rows and a number of execution nodes to execute a request for determining a correlation. For each shard from the set of shards, the subject technology sends a particular user defined table function (UDTF), including a particular shard of rows, to a different execution node to perform a set of operations for determining the correlation. The subject technology provides a set of output values of each particular UDTF corresponding to each shard from the set of shards in a second UDTF. The subject technology sends the second UDTF to a particular execution node to perform an aggregate operation using the set of output values of each particular UDTF. The subject technology receives a value of the correlation from the particular execution node based on the aggregate operation.
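The shard-then-aggregate pattern works because Pearson correlation decomposes into per-shard partial sums. The sketch below uses plain functions as hypothetical stand-ins for the per-shard UDTF and the aggregating second UDTF.

```python
# Sketch: each per-shard "UDTF" emits partial sums; a second stage combines
# them into the final correlation value.
import math


def shard_udtf(xs, ys):
    """Per-shard partial sums sufficient to compute Pearson correlation."""
    n = len(xs)
    return (n, sum(xs), sum(ys),
            sum(x * x for x in xs), sum(y * y for y in ys),
            sum(x * y for x, y in zip(xs, ys)))


def aggregate_udtf(partials):
    """Combine partial sums from all shards into the correlation value."""
    n, sx, sy, sxx, syy, sxy = (sum(p[i] for p in partials) for i in range(6))
    cov = n * sxy - sx * sy
    var = (n * sxx - sx * sx) * (n * syy - sy * sy)
    return cov / math.sqrt(var)
```

Because the partial sums are associative, each shard can run on a different execution node and only six numbers per shard travel to the aggregating node.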
An advanced search system leverages a pre-trained large language model to enhance user query responses. The system, equipped with hardware processors, receives a search query via an interface and accesses a pre-trained large language model designed to respond to the search query. The system fine-tunes the model to generate a task-specific generative model, employs the task-specific generative model to generate a search result for the search query, and analyzes the search result based on a performance metric associated with the task-specific generative model. The system refines the task-specific generative model based on the analysis of the search result.
Techniques for generating and storing application telemetry data are described. An example method includes generating, by an application, a unit of telemetry data comprising metrics related to a runtime state of the application. The method also includes generating a character string comprising the metrics. The method also includes writing, by a processing device executing the application, a zero-byte file to a storage system using the character string as a file name of the zero-byte file.
Techniques for providing adaptive warehouses in a multi-tenant data system are described. The workloads for the account can be multiplexed in the adaptive warehouse environment. Warehouse endpoints in a warehouse layer can be defined for an account in the multi-tenant data system. A compute layer for the account can be divided into workload regions, where each workload region corresponds to a different workload type.