Cache circuitry may be configured to receive a first message to downgrade a permission associated with data stored in a current level cache. For example, the current level cache could be a level two (L2) cache. The cache circuitry could receive the first message from a processor core having a level one (L1) cache. The cache circuitry may forward the first message to a higher level cache. For example, the higher level cache could be a level three (L3) cache. The cache circuitry may downgrade the permission associated with data stored in the current level cache based on receiving a second message from the higher level cache. The cache circuitry may forward the first message before receiving the second message and downgrading the permission. The second message may cause downgrade of the permission in multiple caches (e.g., the L1, L2, and L3 caches).
A system can include a memory and a processing device, operatively coupled to the memory, configured to perform operations including receiving, from a client device using a camera, two-dimensional (2D) image data representing a scene including a subject, providing, to a camera pose identification model, an input including information identifying a set of attributes of the camera, obtaining, from the camera pose identification model, an output including information identifying at least one camera pose parameter, and performing at least one task based on the output. The set of attributes of the camera includes at least one orientation angle of the camera about at least one axis. Performing the at least one task can include generating a three-dimensional (3D) representation of the subject depicted in the 2D image data.
A system may include a processor having a pipeline, a plurality of counters, and trigger circuitry. The plurality of counters may count events associated with processing instructions in the pipeline. Counters of the plurality of counters may count different events. The trigger circuitry may trigger a performance measurement for a first instruction after counters of the plurality of counters meet predefined values. Triggering the performance measurement may cause the plurality of counters to reset and then count events associated with processing the first instruction. In some implementations, the trigger circuitry may trigger the performance measurement based on an AND selection and/or an OR selection of multiple counters of the plurality of counters meeting predefined values.
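The AND/OR trigger selection described above can be illustrated with a minimal Python sketch. The class and counter names are invented for illustration; the abstract does not specify them.

```python
class TriggerCircuitry:
    def __init__(self, thresholds, mode="AND"):
        # thresholds maps a counter name to its predefined trigger value
        self.thresholds = thresholds
        self.mode = mode  # "AND": all counters must meet values; "OR": any one

    def should_trigger(self, counters):
        met = [counters.get(name, 0) >= value
               for name, value in self.thresholds.items()]
        return all(met) if self.mode == "AND" else any(met)

# On a trigger, the hardware would reset the counters and begin measuring
# the first instruction; here only the AND/OR selection is modeled.
counters = {"cache_misses": 120, "branch_mispredicts": 3}
and_trigger = TriggerCircuitry({"cache_misses": 100, "branch_mispredicts": 10}, "AND")
or_trigger = TriggerCircuitry({"cache_misses": 100, "branch_mispredicts": 10}, "OR")
```

With the sample values, only the OR selection fires, since one of the two counters has met its predefined value.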
Described are methods, logic, and circuitry which prevent translation lookaside buffer probing. Reporting a privilege violation fault is delayed for a defined period of time. The defined period of time can be a time frame needed to perform a long page table walk, which can be at least hundreds of clock cycles. A counter or a forced page table walk corresponding to the defined period of time can be used to implement the delay.
G06F 12/1027 - Address translation using associative or pseudo-associative address translation means, e.g. translation lookaside buffer [TLB]
G06F 12/1009 - Address translation using page tables, e.g. page table structures
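The delayed-fault idea above can be sketched as a simple cycle-based model. The class name and the 400-cycle figure are assumptions for illustration; the abstract only says the delay matches a long page table walk of at least hundreds of clock cycles.

```python
class DelayedFaultReporter:
    """Delays reporting a privilege-violation fault for a defined period,
    so fault timing cannot be used to probe the TLB."""

    def __init__(self, delay_cycles=400):  # assumed page-table-walk latency
        self.delay_cycles = delay_cycles
        self.pending = None  # cycle at which a fault was detected

    def detect_fault(self, cycle):
        if self.pending is None:
            self.pending = cycle

    def tick(self, cycle):
        # Report only after the defined period has elapsed, making a fast
        # TLB-detected fault indistinguishable from a full page table walk.
        if self.pending is not None and cycle - self.pending >= self.delay_cycles:
            self.pending = None
            return "privilege_violation_fault"
        return None
```

A counter-based implementation would decrement toward zero instead of comparing cycle stamps; the observable behavior is the same.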
An integrated circuit design may be generated for an integrated circuit. The integrated circuit design may include an instance of a module description that describes a functional operation of a module. The integrated circuit design may be encoded in an intermediate representation (IR) data structure. A parameter comprising metadata associated with the instance may be received. A compiler may compile the IR data structure to produce a register-transfer level (RTL) data structure. The RTL data structure may encode a logic description associated with the instance. Compiling the IR data structure may include associating the parameter comprising metadata with the logic description.
G06F 30/327 - Logic synthesis; Behavioural synthesis, e.g. mapping logic, hardware description language [HDL] to netlist, high-level language to register-transfer level [RTL] or netlist
Described are systems and methods for clock gating components on a system-on-chip. A processing system includes one or more cores, each core including a clock gating enable bit register which is set by software when an expected idle period of the core meets or exceeds a clock gating threshold, and a power management unit connected to the one or more cores. The power management unit is configured to receive an idle notification from a core of the one or more cores and initiate clock gating of a clock associated with the core when the core and additional logic are quiescent and the clock gating enable bit register is set. The clock gating threshold is a defined magnitude greater than a clock wake-up time.
Described are methods, data structures, logic, and circuitry which enable a cache replacement policy to track a number of cache states greater than a number of ways in a cache. A cache replacement policy state structure provides a number of cache states greater than the N ways per cache index in the cache. The additional states can enable different handling and prioritization of data that has been prefetched (prefetch cache states) or loaded with a non-temporal hint from software (non-temporal cache states). The number of prefetch states and non-temporal states are each less than the number of ways. The non-temporal states can only be promoted to certain of the cache states.
G06F 12/123 - Replacement control using replacement algorithms with age lists, e.g. queue, most recently used [MRU] list or least recently used [LRU] list
G06F 12/0862 - Addressing of a memory level in which the access to the desired data or data blocks requires associative addressing means, e.g. caches, with prefetch
G06F 12/126 - Replacement control using replacement algorithms with special data handling, e.g. priority of data or instructions, handling errors or pinning
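The extra-state scheme above can be sketched in Python. The specific state ordering and the promotion rule for non-temporal data are assumptions for illustration; the abstract only says non-temporal states can be promoted to certain cache states.

```python
WAYS = 4
# Ordinary age states are 0..WAYS-1 (0 = next eviction candidate), plus
# two extra states beyond the number of ways:
PREFETCH = "P"       # data brought in by the prefetcher
NON_TEMPORAL = "NT"  # data loaded with a non-temporal software hint

def promote(state):
    """Promote a way's replacement state on a demand hit (assumed rule)."""
    if state == PREFETCH:
        return WAYS - 1        # a demand hit confirms the prefetch: make MRU
    if state == NON_TEMPORAL:
        return 1               # NT data may only reach a low age state
    return min(state + 1, WAYS - 1)
```

Because prefetch and non-temporal lines carry their own states, they can be prioritized differently from demand-fetched lines without widening the per-way age field beyond what the state encoding allows.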
8.
INTEGRATED CIRCUIT DESIGN VERIFICATION WITH MODULE SWAPPING
An integrated circuit design may be generated for an integrated circuit. The integrated circuit design may include an instance of a module description that describes a functional operation of a module. The instance may include an input and an output. The integrated circuit design may be encoded in an intermediate representation (IR) data structure. Parameters may be received indicating that the instance should be replaced with a simulation model. The parameters may include a first parameter pointing to the instance and a second parameter pointing to the simulation model.
Described are systems and methods for power management. A processing system includes one or more cores and a connected power management unit (PMU). The PMU is selected from one of: a first level PMU which can power scale a core; a second level PMU which can independently control power from a shared cluster power supply to each core of two or more cores in a cluster; a third level PMU where each core includes a power monitor which can track power performance metrics of an associated core; and a fourth level PMU when a complex includes multiple clusters and each cluster includes a set of the one or more cores, the fourth level PMU including a complex PMU and a cluster PMU for each of the multiple clusters, where the complex PMU and cluster PMUs provide two-tier power management. Higher level PMUs include the power management functionality of lower level PMUs.
A first circuitry may have a first interface. A response circuitry having a system interface may connect to the first circuitry. The response circuitry may receive an input selection that determines an error handling mode used to respond to an error (e.g., in a lockstep system, the error could be a difference between an output at the first interface and a second output at a second interface identified by comparing the first output to the second output). In some implementations, the error handling mode may cause the response circuitry to provide the output to the system interface and send an indication to software based on detecting the error. In some implementations, the error handling mode may cause the response circuitry to contain at least a portion of the output by disabling at least a portion of the system interface for multiple clock cycles based on detecting the error.
Apparatus and methods for dependency tracking, chaining, and/or fusing for vector instructions. A system, processor, or integrated circuit includes a renamer to generate a valid bit mask for each micro-operation decoded from a first vector instruction, where the valid bit mask indicates what portion of a mask register to write, and to generate a dependency bit mask for each micro-operation decoded from a second vector instruction, where the dependency bit mask is based on a relationship between the first vector instruction and the second vector instruction, and an issue queue configured to issue for execution each micro-operation from the second vector instruction when an associated dependency bit mask is cleared based on execution of appropriate micro-operations from the first vector instruction.
Apparatus and methods which bundle micro-operations with respect to a vector instruction, dynamically allocate register blocks for a vector instruction, and track the registers using valid bits. A method includes decoding, by a decoder, a vector instruction having a length multiplier of at least two into a number of micro-operations less than the length multiplier, allocating, by an issue queue, an issue queue entry to each of the number of micro-operations and executing, by the issue queue with execution units, each of the number of micro-operations a number of times from the issue queue entry to collectively match the length multiplier.
Described is a system and method for implementing a prefetcher with an out-of-order filtered prefetcher training queue. A processing system includes a prefetcher and a prefetcher training queue connected to the prefetcher. The prefetcher training queue is configured to receive one or more demand requests from one or more load-store units, allocate a prefetcher training queue entry for a non-duplicative demand request, and send, to the prefetcher, a stored demand request together with a hit or miss indicator, wherein the prefetcher training queue sends stored demand requests without regard to program order.
G06F 12/0862 - Addressing of a memory level in which the access to the desired data or data blocks requires associative addressing means, e.g. caches, with prefetch
14.
Debug In System On A Chip With Securely Partitioned Memory Space
Systems and methods are disclosed for debug in a system on a chip with a securely partitioned memory space. For example, an integrated circuit (e.g., a processor) for executing instructions includes a processor core configured to execute instructions, including a data store configured to store a first world identifier; an outer memory system configured to store instructions and data; a data store configured to store a debug world list that specifies which world identifiers supported by the integrated circuit are authorized for debugging; and a debug enable circuitry configured to generate a debug enable signal based on the first world identifier and the debug world list, wherein the processor core is configured to jump to debug handler instructions in response to a debug exception or ignore the debug exception depending on the debug enable signal.
Systems and methods are disclosed for page table entry caches with multiple tag lengths. For example, an integrated circuit (e.g., a processor) includes a page table walk circuitry including a page table entry cache, in which the page table walk circuitry is configured to access a multi-level page table, and in which a first entry of the page table entry cache combines a first number of multiple levels and a second entry of the page table entry cache combines a second number of multiple levels that is different from the first number of multiple levels.
G06F 12/1045 - Address translation using associative or pseudo-associative address translation means, e.g. translation lookaside buffer [TLB], associated with a data cache
16.
INTEGRATED CIRCUIT GENERATION WITH IMPROVED INTERCONNECT
Disclosed are systems and methods that include accessing design parameters to configure an integrated circuit design. The integrated circuit design may include a transaction source or processing node to be included in an integrated circuit. The transaction source or processing node may be configured to transmit memory transactions to memory addresses. A compiler may compile the integrated circuit design with the transaction source or processing node to generate a design output. The design output may be configured to route memory transactions based on whether they target cacheable or non-cacheable memory addresses. The design output may be used to manufacture an integrated circuit.
G06F 13/16 - Handling requests for interconnection or transfer for access to memory bus
G06F 12/0888 - Addressing of a memory level in which the access to the desired data or data blocks requires associative addressing means, e.g. caches, using selective caching, e.g. cache purging
17.
CONFIGURING A PREFETCHER ASSOCIATED WITH A PROCESSOR CORE
Disclosed are systems and methods for configuring a prefetcher. A process may reconfigure a prefetcher associated with a processor core responsive to a context switch. The context switch may comprise the processor core changing from executing a first process to a second process. In some implementations, reconfiguring the prefetcher may include updating a register controlling an operation of the prefetcher from a first set of parameters associated with the first process to a second set of parameters associated with the second process. In some implementations, the second set of parameters may be based on input from a process executed in a user mode.
Disclosed are systems and methods that include integrated circuit generation with composable interconnect. In some implementations, a system may access a design parameters data structure that specifies an interconnect topology to be included in an integrated circuit. The system may invoke an integrated circuit design generator that applies the design parameters data structure, including with the interconnect topology. In some implementations, the design parameters data structure may specify a definition for a hardware object (e.g., the interconnect topology) and instances of the hardware object. The definition and the instances may each be modifiable. The system may invoke the generator to apply the design parameters data structure to generate the design.
Systems and methods are disclosed for store-to-load forwarding for processor pipelines. For example, an integrated circuit (e.g., a processor) for executing instructions includes a processor pipeline; a store queue that has entries associated with respective store instructions that are being executed, wherein an entry of the store queue includes a tag that is determined based on a virtual address of a target of the associated store instruction; and store-to-load forwarding circuitry that is configured to: compare a first virtual address of a target of a first load instruction being executed by the load unit to respective tags of one or more entries in the store queue; select an entry of the store queue based on a match between the first virtual address and the tag of the selected entry; and forward data of the selected entry in the store queue to be returned by the first load instruction.
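The store-to-load forwarding flow above can be sketched as a simple model. Tagging entries with the full virtual address and scanning youngest-first are simplifying assumptions; real hardware would use a partial tag and age-based priority logic.

```python
class StoreQueue:
    """Queue of in-flight stores, tagged by virtual address (assumed tag)."""

    def __init__(self):
        self.entries = []  # (virtual_address_tag, data), oldest first

    def push(self, vaddr, data):
        self.entries.append((vaddr, data))

    def forward(self, load_vaddr):
        # Scan youngest-first so the load observes the most recent store
        # to the matching virtual address; return None on no match.
        for tag, data in reversed(self.entries):
            if tag == load_vaddr:
                return data
        return None
```

A load that misses in the queue would proceed to the cache as usual; a hit lets the load return data without waiting for the store to commit.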
Systems and methods are disclosed for cycle accurate tracing of vector instructions. For example, a system may include a vector unit in communication with a scalar core. The vector unit may include a vector instruction queue that receives vector instructions from the scalar core. The vector unit may also include a vector execution unit that executes vector instructions from the vector instruction queue. The system may also include checkpoints in the vector unit including a first checkpoint including circuitry that sets a first bit for a first clock cycle in which a first vector instruction exits the vector instruction queue, and a second checkpoint including circuitry that sets a second bit for a second clock cycle in which a second vector instruction exits the vector execution unit.
G06F 11/14 - Error detection or correction of the data by redundancy in operation, e.g. by using different operation sequences leading to the same result
G06F 9/30 - Arrangements for executing machine instructions, e.g. instruction decode
G06F 11/36 - Preventing errors by analysis, debugging or testing of software
Systems and methods are disclosed for implementing a quad narrowing operation. The quad narrowing operation converts the output of a 32-bit floating-point operation to an 8-bit integer format by rounding the 32-bit floating-point output and clamping the rounded value between an 8-bit lower bound and an 8-bit upper bound, which are defined in a 16-bit scalar register, to generate the fixed-point output. The 8-bit lower bound is defined by the 8 most significant bits of the 16-bit scalar register and the 8-bit upper bound is defined by the 8 least significant bits of the 16-bit scalar register.
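The round-then-clamp behavior can be sketched in Python. Treating the packed 8-bit bounds as signed values is an assumption here; the abstract does not state their signedness.

```python
def quad_narrow(value, bounds16):
    """Round a 32-bit FP result and clamp it to bounds packed in a
    16-bit scalar register: 8 MSBs = lower bound, 8 LSBs = upper bound."""
    lower = (bounds16 >> 8) & 0xFF
    upper = bounds16 & 0xFF
    # Interpret the packed bounds as signed 8-bit integers (assumption).
    if lower >= 128:
        lower -= 256
    if upper >= 128:
        upper -= 256
    rounded = round(value)  # Python rounds half-to-even, like default FP rounding
    return max(lower, min(upper, rounded))
```

For example, with the register holding 0x807F the bounds are -128 and 127, the full signed 8-bit range, so the operation behaves like a saturating float-to-int8 conversion.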
Systems and methods are disclosed for using hybrid floating-point and fixed-point computations for improved neural network accuracy. A neural network is defined using fixed-point computational units. Certain of the fixed-point computational units are identified based on replacement criteria. The identified fixed-point computational units are replaced with floating-point computational units to increase computational accuracy with minimal computational cost.
G06F 7/483 - Computations with numbers represented by a non-linear combination of denominational numbers, e.g. rational numbers, logarithmic number system or floating-point numbers
Systems and methods are disclosed for processor crash analysis using register sampling. For example, an integrated circuit may include a processor core configured to execute instructions, wherein the processor core includes a program counter register and an exception program counter register (e.g., a machine exception program counter register) that is configured to store a program counter value that was current when an exception occurred; a data store connected to the exception program counter register via an ingress port that is configured to store a copy of an exception program counter value responsive to retirement of an instruction; and an exception program counter capture register configured to store an exception program counter value from the data store responsive to a reset signal for the processor core.
Systems and methods are disclosed for debug event tracing. For example, an integrated circuit (e.g., a processor) for executing instructions includes a processor core; a data store configured to store a code indicating a cause of an interrupt; a trace buffer configured to store a sequence of debug trace messages; and a debug trace circuitry that is configured to: responsive to a first interrupt to the processor core, generate a first debug trace message including a timestamp and a code from the data store that indicates a cause of the first interrupt; and store the first debug trace message in the trace buffer. In some implementations, the timestamp is generated using a Gray code counter of the integrated circuit.
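The Gray code counter mentioned for the timestamp can be sketched directly: successive Gray code values differ in exactly one bit, which makes the counter safer to sample asynchronously. These are the standard binary-to-Gray conversions, shown for illustration.

```python
def to_gray(n):
    """Convert a binary count to its Gray code representation."""
    return n ^ (n >> 1)

def from_gray(g):
    """Recover the binary count from a Gray code value."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n
```

Because only one bit flips per increment, a timestamp sampled mid-transition is off by at most one count rather than being arbitrarily corrupted.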
A program sequence, comprising an inner loop nested in an outer loop, may be converted to multiple statements of the inner loop with a statement of the multiple statements changing by an index of the outer loop. A memory access for a first statement of the multiple statements may be combined with a memory access for a second statement of the multiple statements via a vector instruction (e.g., a segmented-strided vector instruction). The vector instruction may be configured to access sets of N data elements, where N is a segment size of data elements in consecutive locations in memory, and where sets of N data elements are spaced at a constant distance in memory.
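The access pattern of the segmented-strided vector instruction above can be sketched as follows. The function name and argument layout are illustrative assumptions, not the instruction's actual encoding.

```python
def segmented_strided_load(memory, base, segment_size, stride, num_segments):
    """Gather sets of N consecutive elements (segments of size N), with
    successive segments spaced at a constant stride in memory."""
    out = []
    for s in range(num_segments):
        start = base + s * stride
        out.append(memory[start:start + segment_size])
    return out
```

In the loop-conversion example from the text, each segment would hold the elements touched by the unrolled inner-loop statements for one outer-loop index, so one vector instruction replaces several scalar memory accesses.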
Systems and methods are disclosed for error management in a system on a chip with a securely partitioned memory space. For example, an integrated circuit (e.g., a processor) for executing instructions includes a world identifier checker circuitry configured to check memory requests for one or more memory mapped resources that are received via the bus that have been tagged with a world identifier to determine whether to allow or reject access based on the tagged world identifier; a world identifier checker circuitry configured to compare the tagged world identifier to a world list for a resource that specifies which world identifiers supported by the integrated circuit are authorized for access to the resource; and a data store configured to store world error data, including the tagged world identifier of a memory request that has been rejected by the world identifier checker circuitry.
Apparatus and methods for cracking and processing vector instructions in a vector pipeline after decoding of a single or a first micro-operation in a main or primary pipeline are described. An integrated circuit includes a primary pipeline to decode a micro-operation from an instruction, create a reorder buffer entry in a reorder buffer for the micro-operation, responsive to a determination that the instruction is a vector instruction, send the micro-operation to a vector pipeline, and responsive to a determination that the instruction is a multiple register vector instruction, signal the vector pipeline to decode the remaining micro-operations from the instruction; and the vector pipeline to process the micro-operation, and to process the remaining micro-operations when the instruction is the multiple register vector instruction.
Systems and methods are disclosed for address range encoding in a system on a chip with a securely partitioned memory space. For example, methods may include receiving, via a bus from a processor core, a memory request for a memory mapped resource; comparing an address included in the memory request to an address range, determined based on an address field and an address range configuration field, for a resource; comparing a first hardware security identifier from the memory request to a hardware security list associated with the resource when the address of the memory request is within the address range for the resource; and, based on the comparison of the first hardware security identifier with the hardware security list, determining whether to allow or reject access to the resource for the memory request.
Systems and methods are disclosed for serial wire timer distribution in an SoC. For example, some methods may include adding an offset to a current timestamp to obtain an offset timestamp; serially transmitting the offset timestamp on a first conductor according to a functional-logic clock signal; receiving the offset timestamp via the first conductor; and writing the offset timestamp to a timestamp register at a time according to an edge of a fixed-frequency reference clock signal on a second conductor.
Systems and methods are disclosed for macro-op fusion in pipelined architectures. For example, some methods include detecting a sequence of macro-ops stored in an instruction decode buffer, the sequence of macro-ops including a first macro-op, followed by one or more intervening macro-ops, followed by a last macro-op; determining a micro-op that is equivalent to the first macro-op combined with the last macro-op; and forwarding the micro-op to one or more execution resource circuitries for execution.
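The detect-and-fuse step above can be sketched in Python. The fusible pair chosen here, a `lui` followed later by an `addi` to the same register forming one 32-bit immediate load, is a common RISC-V fusion idiom used purely as an illustrative assumption; the abstract does not name a specific pair.

```python
def try_fuse(buffer):
    """buffer: list of (opcode, dest, operand) macro-ops in decode order.
    Fuse the first and last macro-ops into one equivalent micro-op when
    they form a recognized pair; intervening macro-ops pass through."""
    if (len(buffer) >= 2
            and buffer[0][0] == "lui" and buffer[-1][0] == "addi"
            and buffer[0][1] == buffer[-1][1]):
        upper, lower = buffer[0][2], buffer[-1][2]
        fused = ("load_imm32", buffer[0][1], (upper << 12) + lower)
        return [fused] + buffer[1:-1]
    return buffer
```

A real implementation would also verify that the intervening macro-ops neither write nor read the fused destination register; that check is omitted from this sketch.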
An integrated circuit for translating and reverse-translating the address included in a memory request is disclosed. The integrated circuit may comprise a processor, a first boundary function, a second boundary function, and a component device. The processor is configured to transmit a memory request to a target module over a bus of the integrated circuit. The memory request requests access to one or more memory mapped resources and includes a physical address. The first boundary function is configured to translate the physical address to a relative address which operates in or applies to a different address space than the address space that the physical address operates in or applies to. The second boundary function is configured to translate the relative address back to the physical address. The component device is configured to utilize the physical address transmitted by the second boundary function.
Systems and methods are disclosed for partitioning a cache for application of a replacement policy. For example, some methods may include partitioning entries of a set in a cache into two or more subsets; receiving a message that will cause a cache block replacement; responsive to the message, selecting a way of the cache by applying a replacement policy to entries of the cache from only a first subset of the two or more subsets; and responsive to the message, evicting an entry of the cache in the first subset and in the selected way.
G06F 12/0811 - Multiuser, multiprocessor or multiprocessing cache systems with multilevel cache hierarchies
G06F 12/123 - Replacement control using replacement algorithms with age lists, e.g. queue, most recently used [MRU] list or least recently used [LRU] list
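The partitioned replacement step above can be sketched as follows. Using last-access timestamps as the LRU bookkeeping is a simplifying assumption; the point is that victim selection considers only the eligible subset of ways.

```python
def select_victim(set_entries, subset_ways):
    """set_entries: {way: last_access_time} for one cache set.
    subset_ways: the ways eligible for replacement under the partition.
    Applies an LRU policy restricted to the eligible subset."""
    eligible = {w: t for w, t in set_entries.items() if w in subset_ways}
    return min(eligible, key=eligible.get)  # least recently used eligible way
```

Ways outside the subset are never considered, so data kept in the other partition survives replacements triggered by this class of messages.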
33.
Processing for vector load or store micro-operation with inactive mask elements
Apparatus and methods are described for processing a vector load or store micro-operation with mask information as a no-operation (no-op) when a mask vector for the vector load or store micro-operation has all inactive mask elements, or for processing vector load or store sub-micro-operation(s) with active mask element(s). An integrated circuit includes a load store unit configured to receive load or store micro-operations cracked from a vector load or store operation, determine that a mask vector for the vector load or store micro-operation is fully inactive, and process the vector load or store micro-operation as a no-operation. If the mask vector is not fully inactive, the vector load or store micro-operation is unrolled into vector load or store sub-micro-operation(s) which have active mask element(s). Vector load or store sub-micro-operation(s) which have inactive mask element(s) are ignored.
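The mask handling above reduces to a simple rule, sketched here. The function name and tuple format are invented for illustration.

```python
def process_masked_memop(mask, elements):
    """mask: list of 0/1 flags, one per vector element.
    A fully inactive mask turns the micro-op into a no-op; otherwise it
    is unrolled into sub-micro-ops only for the active elements."""
    if not any(mask):
        return "no-op"
    return [("sub-uop", i, elements[i]) for i, m in enumerate(mask) if m]
```

Elements with inactive mask bits never generate a sub-micro-op, so no memory traffic is issued for them.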
Apparatus and methods for tracking sub-micro-operations and groups thereof are described. An integrated circuit includes a load store unit configured to receive store micro-operations cracked from a vector store instruction. The load store unit is configured to unroll multiple store sub-micro-operations from each of the store micro-operations. The load store unit includes an issue status vector to track issuance of each sub-micro-operation, an unroll status vector to track unrolling of each sub-micro-operation associated with a group of sub-micro-operations, and a replay status vector to track a replayability of sub-micro-operations associated with the group of sub-micro-operations.
A system may generate an annotation based on an attribute determined in connection with logic. In some implementations, the logic may be between a first function (e.g., a first point of logic, such as an encoder) and a second function (e.g., a second point of logic, such as a decoder) in a first level circuit representation. The attribute may indicate, for example, fault protection using error correction code, parity, or Gray code, or a power level, frequency domain, or clock domain. The system may then identify circuitry in a second level circuit representation corresponding to the annotated logic in the first level circuit representation. The second level circuit representation may be generated based on the first level circuit representation. The system may then mark the identified circuitry in the second level circuit representation having the attribute. In some implementations, the system may determine a fault profile of an integrated circuit design based on the marking.
G06F 30/3323 - Design verification, e.g. functional simulation or model checking, using formal methods, e.g. equivalence checking or property checking
A system may provide a placeholder for a component in a block of a first-level integrated circuit design without wiring at least one port of the component. The system may determine a mapping to a provider interface in the block. The system may invoke an integrated circuit generator to generate a second-level integrated circuit design based on the first-level integrated circuit design. The generator when executed replaces the provider interface with the component including wiring ports of the component, including the at least one port, in the second-level integrated circuit design according to the mapping. In some implementations, an application program interface may enable a provider to communicate with the generator. The provider can utilize the API to instantiate the provider interface and determine the mapping.
G06F 30/327 - Logic synthesis; Behavioural synthesis, e.g. mapping logic, hardware description language [HDL] to netlist, high-level language to register-transfer level [RTL] or netlist
37.
Concurrent support for multiple cache inclusivity schemes using low priority evict operations
Systems and methods are disclosed for concurrent support for multiple cache inclusivity schemes using low priority evict operations. For example, some methods may include receiving a first eviction message having a lower priority than probe messages from a first inner cache; receiving a second eviction message having a higher priority than probe messages from a second inner cache; transmitting a third eviction message, determined based on the first eviction message, having the lower priority than probe messages to a circuitry that is closer to memory in a cache hierarchy; and transmitting a fourth eviction message, determined based on the second eviction message, having the lower priority than probe messages to the circuitry that is closer to memory in the cache hierarchy.
Systems and methods are disclosed for debug path profiling. For example, a processor pipeline may execute instructions. A debug trace circuitry may, responsive to an indication of a non-sequential execution of an instruction by the processor pipeline, generate a record including an address pair and one or more counter values. The address pair may include a first address corresponding to a first instruction before the non-sequential execution and a second address corresponding to a second instruction resulting in the non-sequential execution. The one or more counter values may indicate, for example, a count of instructions executed, a type of instruction executed, cache misses, cycles consumed by cache misses, translation lookaside buffer misses, cycles consumed by translation lookaside buffer misses, and/or processor stalls.
Systems and methods are described for a flexible and selectable power management interface. The flexible and selectable power management interface can provide multiple power management interfaces which are selectable based on a selected processor IP core, a selected power management controller, and a variety of factors. The flexible and selectable power management interface can be a direct handshake hardware interface, a memory-mapped bus interface, or a combination of the direct handshake hardware interface and the memory-mapped bus interface.
Described herein is a bit pattern matching hardware prefetcher which captures complex repeating patterns, allows out-of-order (OOO) training, and allows OOO confirmations. The prefetcher includes a plurality of prefetch engines. Each prefetch engine is associated with a zone, each zone has a plurality of subzones, and each subzone has a plurality of cache lines. The prefetcher includes an access map for each subzone, in which each bit position represents a cache line in the plurality of cache lines. The prefetcher determines whether a demand request matches one of the plurality of prefetch engines, updates, with respect to the demand request, a bit position in an access map for a subzone in a matching prefetch engine, determines a pattern from an access map for a subzone when a defined number of demand requests have been matched to the subzone, and generates a prefetch request based on at least the determined pattern.
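As a rough software illustration of the access-map idea above (the subzone size, training threshold, and candidate-generation rule below are invented for the sketch; the abstract does not fix them), a per-subzone bitmap can be trained out of order because setting bits is order-independent:

```python
# Toy sketch of a bit-pattern prefetch access map. All sizes and the
# training rule are assumptions of this sketch, not the actual design.
LINES_PER_SUBZONE = 8   # assumed: one access-map bit per cache line
TRAIN_THRESHOLD = 3     # assumed: demands needed before a pattern is used

class SubzoneMap:
    def __init__(self):
        self.bitmap = 0      # bit i set => line i of the subzone was demanded
        self.matches = 0     # demand requests matched to this subzone

    def record_demand(self, line_index):
        # Setting a bit is order-independent, which is what permits
        # out-of-order training in this model.
        self.bitmap |= 1 << line_index
        self.matches += 1

    def pattern(self):
        # Once enough demands have matched, the bitmap itself is the pattern.
        if self.matches < TRAIN_THRESHOLD:
            return None
        return self.bitmap

def prefetch_candidates(pattern, exclude_bitmap):
    # Generate prefetches for pattern lines not already demanded
    # (a simplifying assumption of this sketch).
    if pattern is None:
        return []
    want = pattern & ~exclude_bitmap
    return [i for i in range(LINES_PER_SUBZONE) if want & (1 << i)]
```

In this model, a strided access such as lines 0, 2, 4 produces the pattern `0b10101`, from which the remaining odd-even pattern lines can be requested.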
Systems and methods are disclosed for transferring an operand between a vector pipeline and a scalar pipeline. For example, some methods may include transferring an operand from a scalar pipeline to a scalar-to-vector buffer responsive to the scalar pipeline executing a first micro-op, wherein the scalar-to-vector buffer includes an entry having a width equal to a width of a scalar register of the scalar pipeline and a data store configured to store an indication mapping the entry to the first micro-op; updating the data store to include the indication mapping the entry to the first micro-op; identifying, by the vector pipeline in response to execution of a second micro-op and in dependence on the indication mapping the entry to the first micro-op, the entry storing the operand; and transferring the operand from the entry in the scalar-to-vector buffer to the vector pipeline responsive to the vector pipeline executing the second micro-op.
Systems and methods are disclosed for transferring data between a memory system and a vector register file. For example, a system may include a vector pipeline including a vector physical register file; a load store unit; one or more pipeline stages configured to decode a vector memory instruction to obtain a macro-operation and dispatch the macro-operation to both the load store unit and the vector pipeline; and a baler circuitry including a buffer with entries. The vector pipeline is configured to crack the macro-operation into multiple micro-operations. The baler circuitry is configured to implement the multiple micro-operations to transfer data between one or more selected entries of the buffer and respective registers of the vector physical register file. The load store unit is configured to implement the macro-operation to transfer data between one or more addresses in a memory system and the one or more selected entries of the buffer.
Systems and methods are disclosed for accelerated vector-reduction operations. Some systems may include a vector register file configured to store register values of an instruction set architecture in physical registers; and an execution circuitry configured to, responsive to a folding micro-op: read a vector from a physical register of the vector register file or from bypass circuitry; partition the elements of the vector into a first subset of elements with even indices and a second subset of elements with odd indices; and apply a reduction operation to combine elements from the second subset of elements with corresponding elements from the first subset of elements to obtain a set of reduced elements.
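A minimal software model of the folding step described above (the particular reduction operations and vector widths here are chosen for the sketch; micro-op sequencing and register allocation are abstracted away):

```python
# Sketch of one folding micro-op: split the vector into even- and
# odd-indexed subsets and combine them element-wise, halving the length.
def fold_step(vec, op):
    evens = vec[0::2]   # first subset: elements with even indices
    odds = vec[1::2]    # second subset: elements with odd indices
    return [op(e, o) for e, o in zip(evens, odds)]

# Repeated folding reduces a power-of-two-length vector to one element,
# in log2(n) steps instead of a serial n-step reduction.
def vector_reduce(vec, op):
    while len(vec) > 1:
        vec = fold_step(vec, op)
    return vec[0]
```

For an associative, commutative operation such as addition or max, the folded result equals the serial reduction while exposing element-level parallelism at each step.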
Systems and methods are disclosed for data storage in a non-inclusive cache. For example, an integrated circuit may include a cache that includes a databank with multiple entries configured to store respective cache lines; and an array of cache tags, wherein each cache tag includes a data pointer that points to an entry in the databank. For example, methods may include allocating the entry in the databank to the cache including the array of cache tags from amongst multiple caches in the integrated circuit by writing the data pointer to the cache tag in the array of cache tags.
Systems and methods are disclosed for variable depth pipelines for error correction. For example, some methods may include changing the depth of a pipeline in response to an error signal from a stage of the pipeline. Changing the depth of the pipeline may include routing signals from the stage of the pipeline that resulted in the error signal through an error correction stage of the pipeline to a next stage of the pipeline that previously received output signals from the stage of the pipeline that resulted in the error signal. The methods may include continuing to route signals through the error correction stage of the pipeline to the next stage of the pipeline for multiple clock cycles until a pipeline bubble event is detected.
Prefetch circuitry may be configured to transmit a message to cancel a prefetch of one or more cache blocks of a group. The message may correspond to a prefetch message by indicating an address for the group and a bit field for the one or more cache blocks of the group to cancel. In some implementations, the message may target a higher level cache to cancel prefetching the one or more cache blocks, and the message may be transmitted to the higher level cache via a lower level cache. In some implementations, the message may target a higher level cache to cancel prefetching the one or more cache blocks, the message may be transmitted to a lower level cache via a first command bus, and the lower level cache may forward the message to the higher level cache via a second command bus.
G06F 12/0862 - Addressing of a memory level in which the access to the desired data or data blocks requires associative addressing means, e.g. caches, with prefetch
G06F 9/30 - Arrangements for executing machine instructions, e.g. instruction decode
G06F 12/0897 - Caches characterised by their organisation or structure with multiple cache hierarchy levels
47.
Prefetching Cache Blocks Based on an Address for a Group and a Bit Field
Prefetch circuitry may be configured to transmit a message to prefetch one or more cache blocks of a group. The message may indicate an address for the group of cache blocks and a bit field that indicates the one or more cache blocks of the group to prefetch. In some implementations, the message may target a higher level cache to prefetch the one or more cache blocks, and the message may be transmitted to the higher level cache via a lower level cache. In some implementations, the message may target a higher level cache to prefetch the one or more cache blocks, the message may be transmitted to a lower level cache via a first command bus, and the lower level cache may forward the message to the higher level cache via a second command bus.
G06F 12/0862 - Addressing of a memory level in which the access to the desired data or data blocks requires associative addressing means, e.g. caches, with prefetch
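The address-plus-bit-field message in the abstract above can be sketched as a simple encode/decode pair (the group size of 8 blocks and 64-byte cache blocks are assumptions of this sketch):

```python
# Sketch of a group-based prefetch message: one address for the group
# plus a bit field selecting which cache blocks to prefetch.
BLOCKS_PER_GROUP = 8   # assumed
BLOCK_SIZE = 64        # bytes, assumed

def encode_prefetch(group_base, block_indices):
    # A single message covers several blocks, saving command-bus traffic
    # compared with one message per block.
    bit_field = 0
    for i in block_indices:
        bit_field |= 1 << i
    return (group_base, bit_field)

def decode_prefetch(message):
    # Expand the message back into individual cache-block addresses.
    group_base, bit_field = message
    return [group_base + i * BLOCK_SIZE
            for i in range(BLOCKS_PER_GROUP) if bit_field & (1 << i)]
```

The companion cancel message described earlier uses the same shape: the group address plus a bit field naming the blocks whose prefetch should be cancelled.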
48.
Re-triggering wake-up to handle time skew between scalar and vector sides
A method for re-triggering wakeup to handle time skew between a scalar operation and a vector operation is provided. The method includes: initiating, before a Load-Store (LST) pipeline completes an execution of a load operation corresponding to a vector micro-operation (uop) dispatched to a baler issue queue, a respective load operation in a Load (LD) pipeline corresponding to the vector uop; triggering a speculative wakeup from the LD pipeline during an execution of the respective load operation; triggering a second wakeup corresponding to the speculative wakeup from the LD pipeline; and waking up, based on the second wakeup, the vector micro-operation in the baler issue queue of the baler unit.
A method and apparatus for a speculative request indicator is described. A method includes providing, for a cache hierarchy, a messaging protocol used for transfer operations among agents in the cache hierarchy, the messaging protocol indicating acceptable cache coherency states for a cache block indicated in a request message and providing, in the messaging protocol for selection by an agent, a speculative request indicator when sending the request message, wherein the speculative request indicator differentiates between a demand request and a speculative request with respect to the cache block.
G06F 12/0862 - Addressing of a memory level in which the access to the desired data or data blocks requires associative addressing means, e.g. caches, with prefetch
G06F 12/0811 - Multiuser, multiprocessor or multiprocessing cache systems with multilevel cache hierarchies
G06F 12/0815 - Cache consistency protocols
50.
Eviction operations based on eviction message types of different priorities
An agent may be configured to invoke a first eviction operation that is interruptible by probe operations when receiving a first type of eviction message and invoke a second eviction operation in which probe operations are interruptible by the second eviction operation when receiving a second type of eviction message. In some implementations, the agent may maintain a data storage that is inclusive of at least one of unique or dirty cache blocks in a cache maintained by an agent that transmits the second type of eviction message. In some implementations, the agent may prevent a cache block from transitioning from a modified state to an exclusive state when the agent invokes the second eviction operation to evict the cache block. In some implementations, the agent may convert from the second eviction operation to the first eviction operation when receiving the second type of eviction message.
Cache circuitry may be configured to receive a first message to downgrade a permission associated with data stored in a current level cache. For example, the current level cache could be a level two (L2) cache. The cache circuitry could receive the first message from a processor core having a level one (L1) cache. The cache circuitry may forward the first message to a higher level cache. For example, the higher level cache could be a level three (L3) cache. The cache circuitry may downgrade the permission associated with data stored in the current level cache based on receiving a second message from the higher level cache. The cache circuitry may forward the first message before receiving the second message and downgrading the permission. The second message may cause downgrade of the permission in multiple caches (e.g., the L1, L2, and L3 caches).
A method for managing orders of operations between one or more clients and one or more servers is disclosed. The method includes partitioning addressable regions of logical servers on or within an interconnect link into multiple regions, including a first orderable region, and providing a logical client the ability to push ordering responsibility within the first orderable region to a server. Over the first orderable region, two request messages for access to memory-mapped resources, comprising two respective operations, are transmitted, and the two request messages originate from a same logical client. The ordering responsibility can include a first rule for the order of operations between the two request messages.
Systems and methods are described to collect latency data of transactions traversing a processor using a performance monitor, which uses a bucket timer based on a granule value in a configurable granule counter. The monitor can determine a transaction time for a transaction, which can then be compared to enumerated buckets (where the bucket size is based on the granule value), determine the appropriate bucket, and increment a latency counter associated with the bucket. The monitor can include a saturation mechanism to account for overflow or saturation. The collected data can be read by an external device to generate a histogram to identify potential problems in the processor or processor pipeline.
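The granule-based bucketing described above can be modeled in a few lines (the bucket count, the saturation limit, and the exact mapping from granule value to bucket size are assumptions of this sketch):

```python
# Sketch of a latency performance monitor: transaction times fall into
# enumerated buckets whose size is set by a configurable granule, with
# saturating counters to avoid overflow.
NUM_BUCKETS = 8      # assumed
COUNTER_MAX = 255    # assumed saturation limit

class LatencyMonitor:
    def __init__(self, granule):
        self.granule = granule              # bucket size in cycles (assumed)
        self.counters = [0] * NUM_BUCKETS

    def record(self, transaction_cycles):
        # The last bucket catches anything beyond the enumerated range,
        # and each counter saturates rather than wrapping.
        bucket = min(transaction_cycles // self.granule, NUM_BUCKETS - 1)
        if self.counters[bucket] < COUNTER_MAX:
            self.counters[bucket] += 1
```

Reading out `counters` after a run yields exactly the histogram an external device would build to spot latency outliers in the pipeline.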
A data responder may determine a selection between granting a request for a priority byte to be prioritized for transmission ahead of other bytes via a bus and ignoring the request. Granting the request may include transferring a block of bytes of data across multiple clock cycles with the priority byte transferred in a first clock cycle before other clock cycles of the multiple clock cycles. Ignoring the request may include transferring the block across multiple clock cycles with the priority byte transferred in a clock cycle after the first clock cycle. The data responder may receive the request from a data requestor. The data responder may assert a signal on a wire, connected to the data requestor, to indicate a grant of the request and a transfer of the priority byte in the first clock cycle.
Systems and methods are disclosed for stateful vector group permutation with storage reuse. For example, some methods may include expanding an element index from a vector of indices to obtain byte indices for respective bytes of a corresponding element; storing the byte indices in corresponding bytes of an intermediate result operand buffer; updating bits in a completion flags buffer to indicate that the corresponding bytes of the intermediate result operand buffer store indices; identifying bytes of an element of the vector of source data pointed to by the element index based on the byte indices stored in the intermediate result operand buffer; overwriting the byte indices in the intermediate result operand buffer with the identified bytes; and, responsive to overwriting the byte indices, updating the corresponding bits in the completion flags buffer to indicate that the corresponding bytes store data to be written to the destination vector.
Systems and techniques are disclosed for relative age tracking for entries in a buffer. For example, some techniques may include pre-computing age matrix entries of an age matrix corresponding to invalid entries of a data buffer based on a validity indication (e.g., a valid bit mask), wherein the validity indication identifies valid entries in the data buffer and the age matrix tracks relative ages of the entries in the data buffer; responsive to data being received for storage in the data buffer, selecting an entry corresponding to an index value in the data buffer from among a set of invalid entries of the data buffer; storing the data in the entry corresponding to the index value; and updating the validity indication to indicate that the entry corresponding to the index value is valid.
G06F 12/08 - Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
G06F 12/0802 - Addressing of a memory level in which the access to the desired data or data blocks requires associative addressing means, e.g. caches
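A software model of the precomputation idea in the age-tracking abstract above (the matrix convention, bit widths, and allocation policy are assumptions of this sketch): because a freshly allocated entry is younger than exactly the entries that are currently valid, its age-matrix row depends only on the valid bit mask and can be computed before the data arrives.

```python
# Sketch of relative-age tracking. Convention (assumed): rows[i] bit j
# set means entry i is younger than entry j.
class AgeMatrix:
    def __init__(self, n):
        self.n = n
        self.valid = 0            # valid bit mask: bit i => entry i valid
        self.rows = [0] * n
        self._next_row = 0

    def precompute_invalid_rows(self):
        # Pre-compute from the valid mask alone: the next allocated entry
        # will be younger than exactly the currently valid entries. This
        # takes the matrix update off the allocation critical path.
        self._next_row = self.valid

    def allocate(self, data_store, data):
        # Select an invalid entry, store the data, install the
        # precomputed row, and mark the entry valid.
        i = next(k for k in range(self.n) if not (self.valid >> k) & 1)
        data_store[i] = data
        self.rows[i] = self._next_row
        for j in range(self.n):          # no one is younger than the new
            self.rows[j] &= ~(1 << i)    # entry, so clear its column
        self.valid |= 1 << i
        return i

    def oldest(self):
        # Oldest valid entry: younger than no other valid entry.
        for i in range(self.n):
            if (self.valid >> i) & 1 and \
                    (self.rows[i] & self.valid & ~(1 << i)) == 0:
                return i
```

Here `precompute_invalid_rows` is called while the buffer is idle, so `allocate` only installs a ready-made row when data actually arrives.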
57.
Tracking of Data Readiness for Load and Store Operations
A method for tracking of data readiness for load and store operations is disclosed. The method includes establishing a Load Transfer Buffer (LTB) entry, initializing a write counter configured to track a number or progress of write data that are expected to update the LTB entry, and tracking the number or progress of the write data that are expected to update the LTB entry. The tracking can include setting, in the write counter, the number of the write data that are expected to update the LTB entry, maintaining, until a next respective write data arrives to the LTB entry, a current value corresponding to a respective number of write data left to update the LTB entry, and, after receiving the next respective write data, adjusting the write counter to reflect a respective number of write data left to update the LTB entry.
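The write-counter bookkeeping above amounts to a small countdown per entry; a toy model (class and method names are invented for the sketch):

```python
# Toy model of an LTB entry's write counter: set it to the number of
# expected write-data beats, then count down as each beat lands.
class LTBEntry:
    def __init__(self, expected_writes):
        self.writes_left = expected_writes   # counter set at establishment

    def write_data_arrived(self):
        # Adjust the counter for each write-data arrival; the entry's
        # data is ready exactly when the counter drains to zero.
        assert self.writes_left > 0, "unexpected write beat"
        self.writes_left -= 1
        return self.writes_left == 0   # True => entry data is ready
```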
A method for executing a vector iota (viota) operation is disclosed. The method includes fetching a viota instruction, decoding the viota instruction into multiple viota micro-operations (uops), computing a first element viota value of a respective viota uop, determining a respective last element viota value of the respective viota uop based on the first element viota value of the respective uop, and writing the respective last element viota value of the respective viota uop to an allocated physical register. Each viota uop of the multiple viota uops has multiple elements, and each element has a viota value corresponding to a sum of active mask bits of preceding elements of the viota uops. The multiple elements of each viota uop comprise at least a first element that has a starting bit position of a respective uop and a last element that has an ending bit position of the respective uop.
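A reference model of the viota semantics above (the per-uop element count is an assumption of the sketch): each element receives the count of active mask bits among the preceding elements, and each uop's contribution starts from the previous uop's last-element value plus its final mask bit.

```python
# Reference model of viota: an exclusive prefix count of set mask bits.
def viota(mask_bits):
    out, count = [], 0
    for bit in mask_bits:
        out.append(count)   # element i sees bits of elements 0..i-1
        count += bit
    return out

# Crack into uops of a fixed element width (width assumed): each uop's
# values continue from the previous uop's last-element viota value,
# mirroring the first/last-element carry described in the abstract.
def viota_uops(mask_bits, elems_per_uop):
    result, carry = [], 0
    for start in range(0, len(mask_bits), elems_per_uop):
        chunk = mask_bits[start:start + elems_per_uop]
        vals = [carry + v for v in viota(chunk)]
        carry = vals[-1] + chunk[-1]   # last-element value carries forward
        result.append(vals)
    return result
```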
A method and apparatus for a cache coherency state request vector is described. A method includes selecting, by a first agent, one or more bits in a cache coherency state request vector, where a selected bit in the cache coherency state request vector indicates an acceptable cache coherency state for a cache block indicated in a request message, transmitting, by the first agent to a second agent, the request message for the cache block, the request message including the cache coherency state request vector, and receiving, by the first agent from the second agent, a response message with a cache coherency response state, wherein the cache coherency response state indicates a cache coherency state responsive to the cache coherency state request vector.
First agent circuitry may receive from a second agent a first request and a first set of one or more bits. The first request may be part of a data operation. The first agent circuitry may transmit to the second agent a message including a first response to the first request, the first set of one or more bits, a second request, and a second set of one or more bits. The second set of one or more bits may be generated by the first agent circuitry to transmit state information about the second request. In some implementations, a set of one or more wires may be generated for transmission of the second set of one or more bits. The first agent circuitry may receive from the second agent a second response to the second request and the second set of one or more bits.
A method for renaming an architectural register, such as a control and status register (CSR), is disclosed. The method includes decoding one or more instructions for updating the CSR, updating the CSR based on a respective instruction of the one or more instructions, allocating a unique tag to the instruction in the pipeline, and writing the tag into a mapping table for renaming the CSR. The tag can identify the CSR included in the instruction or the updated values of the CSR. The tag can be associated with a unique value. Moreover, the method can employ a First In, First Out (FIFO) queuing technique and virtual bits.
A method for performing segmented vector store operations is disclosed. The method includes decoding one or more segmented vector store micro-operations (uops) of a segmented vector store instruction in a pipeline, allocating, based on each of the one or more segmented vector store uops, one or more respective store buffer entries in First-in, First-out (FIFO) order, stalling the pipeline until store buffer entries are allocated by all of the segmented vector store uops of the segmented vector store instruction, and writing, based on each of the one or more segmented vector store uops, manipulated vector data into multiple store buffer entries.
Systems and methods are disclosed for a configurable interconnect address remapper with event detection. For example, an integrated circuit can include a processor core configured to execute instructions. The processor core includes region registers defined by a From Address range and a To Address, a register storing the number of regions defined in the integrated circuit, interrupt enable registers associated with each pair of region registers, and event flags associated with each pair of region registers; an interconnection system handling transactions from the processor core; an interconnect address remapper translating an address associated with a transaction using one or more pairs of region registers; and an interrupt controller receiving an interrupt signal from the interconnect address remapper when the interrupt enable registers are enabled and at least one event flag is raised, which occurs when at least one of the pairs of region registers matches the transaction address.
Prediction circuitry may generate a vector length prediction associated with a configuration instruction prior to completion of execution of the configuration instruction. The vector length prediction may indicate a number of data elements on which a vector instruction will operate. In some implementations, the prediction circuitry can generate the vector length prediction by detecting a repeating pattern of vector lengths associated with earlier executions of the configuration instruction. In some implementations, the prediction circuitry can generate the vector length prediction by detecting decrements of an application vector length by a constant value.
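The two prediction heuristics above can be sketched in software (the history depth and the exact matching rules are assumptions of this sketch, not the circuit's actual policy):

```python
# Sketch of vector length prediction from a history of observed lengths.

def predict_repeating(history):
    # Detect a repeating pattern: if the history is one period repeated,
    # predict the next element of that period.
    n = len(history)
    for period in range(1, n // 2 + 1):
        if all(history[i] == history[i % period] for i in range(n)):
            return history[n % period]
    return None   # no repeating pattern found

def predict_decrement(history):
    # Detect an application vector length decremented by a constant
    # value on each execution (e.g. a strip-mined loop).
    if len(history) < 3:
        return None
    deltas = {b - a for a, b in zip(history, history[1:])}
    if len(deltas) == 1:
        d = deltas.pop()
        if d < 0:
            return max(history[-1] + d, 0)   # clamp at zero
    return None
```

Either heuristic lets dependent vector instructions dispatch before the configuration instruction actually completes, at the cost of a flush on misprediction.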
Systems and methods are disclosed for event recording and notification. For example, an integrated circuit (e.g., a processor) for executing instructions includes multiple event reporting circuitries that are each configured to store context data for one or more detected events, including an event identifier; an event map circuitry configured to control one or more notification channels based on an input event identifier; and an event summarization circuitry configured to: receive event identifiers from a plurality of child nodes, wherein each child node is one of the reporting circuitries or another event summarization circuitry configured to output an event identifier from one of the event reporting circuitries; select a highest priority event identifier from a set of event identifiers currently output by the plurality of child nodes; output the selected event identifier to the event map circuitry; and store a pointer to the child node that output the selected event identifier.
A first circuitry may be configured to generate a first output based on an input and send an indication responsive to detection of an error while the first output is generated. A second circuitry may be configured to generate a second output based on the input. A wrapper circuitry may be configured to compare the first output and the second output to check correctness by default, and responsive to receipt of the indication, select the second output as a system output without checking for correctness by comparing the first output and the second output. In some implementations, the second circuitry may be configured to send a second indication responsive to detection of a second error while the second output is generated. The wrapper circuitry may be configured to, responsive to receipt of the second indication, ignore the check for correctness.
Apparatus and methods for vector instruction cracking after scalar dispatch are described. An integrated circuit includes a primary pipeline and a vector pipeline. The primary pipeline is configured to determine a type of instruction, responsive to a determination that the instruction is a vector instruction, create a reorder buffer entry in a reorder buffer for the vector instruction prior to out-of-order processing in the primary pipeline, and send the vector instruction to a vector pipeline. The vector pipeline is configured to process the vector instruction.
Systems and methods are disclosed for atomic memory operations for address translation. For example, an integrated circuit (e.g., a processor) for executing instructions includes a memory system including random access memory; a bus connected to the memory system; and an atomic memory operation circuitry configured to receive a request from the bus to access an entry in a page table stored in the memory system, wherein the request includes an indication of whether an instruction that references an address being translated using the entry is a store instruction; access the entry in the page table; responsive to the indication indicating that the instruction is a store instruction, set a dirty bit of the entry in the page table; and transmit contents of the entry on the bus in response to the request.
A processor core may include circuitry that fetches a first instruction followed by a second instruction. The first instruction may be configured to cause a first memory request, and the second instruction may be configured to cause a second memory request. The circuitry may determine that the first memory request is a candidate for combination with the second memory request. Responsive to the determination, the circuitry may send an indication, from the processor core via a bus, that the first memory request is a candidate for combination.
Systems and methods are disclosed for supporting multiple vector lengths with a configurable vector register file. For example, an integrated circuit (e.g., a processor) includes a data store configured to store a vector length parameter; a processor core including a vector register, wherein the processor core is configured to: while a first value of the vector length parameter is stored in the data store, store a single architectural register of an instruction set architecture in the vector register; and, while a second value of the vector length parameter is stored in the data store, store multiple architectural registers of the instruction set architecture in respective disjoint portions of the vector register. For example, the integrated circuit may be used to emulate a processor with smaller vector registers for the purpose of migrating a thread to a processor core of the integrated circuit for continued execution.
Systems and methods are disclosed for fusion with destructive instructions. For example, an integrated circuit (e.g., a processor) for executing instructions includes a fusion circuitry that is configured to detect a sequence of macro-ops stored in a processor pipeline of the processor core, the sequence of macro-ops including a first macro-op identifying a first register as a destination register followed by a second macro-op identifying the first register as both a source register and a destination register, wherein one or more intervening macro-ops occur between the first macro-op and the second macro-op in program order; determine a micro-op that is equivalent to the first macro-op followed by the second macro-op; and forward the micro-op to at least one of one or more execution resource circuitries for execution. For example, the sequence of macro-ops may be detected in a vector dispatch stage of a processor pipeline.
A trace circuitry may be configured to receive a selection of one or more types of events possible in a processor core among multiple types of events. The trace circuitry may generate a message including trace information when an event corresponding to the selection occurs in the processor core. The trace information may include an address associated with the event and an indication of the type of event and/or cause for why the event occurred. In some implementations, the trace circuitry may use an event filter to pass events corresponding to the one or more types of events that are selected and block events corresponding to one or more types of events that are not selected. In some implementations, the trace circuitry may generate timestamps based on events to enable measurements between the events.
Disclosed herein are systems and methods for processing masked memory accesses including handling fault exceptions and checking memory attributes of memory region(s) to be accessed. Implementations perform a two-level memory protection violation scheme for masked vector memory instructions. The first level memory check ignores mask information associated with a masked vector memory instruction and operates on a memory footprint associated with the masked vector memory instruction. If a memory protection violation is detected or speculative access is denied with respect to the memory footprint, a second level memory check evaluates mask information at a vector element level to determine whether a fault exception should be raised. If a mask bit for a vector element is set and a memory violation is detected, then a fault exception is raised for the masked vector memory instruction. If a mask bit is not set, execution of the masked vector memory instruction can continue.
G06F 21/78 - Protecting specific internal or peripheral components, in which the protection of a component leads to protection of the entire computer, to assure secure storage of data
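The two-level scheme in the masked-memory-access abstract above can be modeled directly (the permission callback and return values are inventions of this sketch):

```python
# Sketch of the two-level memory protection check for masked vector
# memory instructions.
def check_masked_access(addrs, mask, allowed):
    """addrs[i] is element i's address, mask[i] its mask bit, and
    allowed(a) reports whether address a passes memory protection
    (a stand-in for the real attribute/permission check)."""
    # Level 1: ignore the mask and check the whole memory footprint.
    if all(allowed(a) for a in addrs):
        return "ok"            # fast path: footprint is clean
    # Level 2: a footprint violation triggers a per-element recheck;
    # only an *active* element touching a bad address faults.
    for a, m in zip(addrs, mask):
        if m and not allowed(a):
            return "fault"
    return "ok"                # violations occurred only under mask=0
```

The fast path keeps the common case cheap; the per-element pass runs only when the coarse footprint check fails, exactly mirroring the first-level/second-level split described above.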
74.
TRANSLATION LOOKASIDE BUFFER (TLB) PREFETCHER WITH MULTI-LEVEL TLB PREFETCHES AND FEEDBACK ARCHITECTURE
Described is a translation lookaside buffer (TLB) prefetcher with multi-level TLB prefetches and a feedback architecture. A processing system includes two or more translation lookaside buffer (TLB) levels, each TLB level including a miss queue, and a TLB prefetcher connected to each of the two or more TLB levels. The TLB prefetcher is configured to receive feedback from the miss queue at each TLB level for previously sent TLB prefetches and to control the number of TLB prefetches sent for a trained TLB entry to each TLB level of the two or more TLB levels based on the feedback.
G06F 12/1036 - Address translation using associative or pseudo-associative address translation means, e.g. translation lookaside buffer [TLB], for multiple virtual address spaces, e.g. segmentation
Systems and methods are disclosed for load-store pipeline selection for vectors. For example, an integrated circuit (e.g., a processor) for executing instructions includes an L1 cache that provides an interface to a memory system; an L2 cache connected to the L1 cache that implements a cache coherency protocol with the L1 cache; a first store unit configured to write data to the memory system via the L1 cache; a second store unit configured to bypass the L1 cache and write data to the memory system via the L2 cache; and a store pipeline selection circuitry configured to: identify an address associated with a first beat of a store instruction with a vector argument; select between the first store unit and the second store unit based on the address associated with the first beat of the store instruction; and dispatch the store instruction to the selected store unit.
G06F 12/0855 - Overlapped cache accessing, e.g. pipeline
G06F 12/0815 - Cache consistency protocols
G06F 12/0875 - Addressing of a memory level in which the access to the desired data or data blocks requires associative addressing means, e.g. caches, with dedicated cache, e.g. instruction or stack
G06F 12/0897 - Caches characterised by their organisation or structure with multiple cache hierarchy levels
Systems and methods are disclosed for vector gather with a narrow datapath. For example, some methods may include reading b bits of a vector of indices into a first operand buffer; reading b bits of a vector of source data into a second operand buffer, including an element indexed by a first index stored in the first operand buffer; checking whether other indices stored in the first operand buffer point to elements of the vector of source data stored in the second operand buffer; during a single clock cycle, copying a plurality of elements stored in the second operand buffer that are pointed to by indices stored in the first operand buffer to a third operand buffer; and updating flags in a completion flags buffer corresponding to those indices to indicate that handling of those indices has completed.
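One pass of the narrow-datapath gather above can be modeled as follows (window widths, buffer shapes, and function names are inventions of this sketch); each call corresponds to one b-bit window of source data resident in the second operand buffer:

```python
# Sketch of a single gather pass over one source-data window.
def gather_window(indices, src_window, window_base, dest, done_flags):
    """Copy, in one pass, every source element that (a) is pointed to by
    a still-pending index and (b) lies inside the window currently held
    in the second operand buffer. done_flags plays the role of the
    completion flags buffer."""
    window = range(window_base, window_base + len(src_window))
    for slot, idx in enumerate(indices):
        if not done_flags[slot] and idx in window:
            dest[slot] = src_window[idx - window_base]
            done_flags[slot] = True   # handling of this index is complete
    return dest, done_flags
```

Indices whose targets fall outside the current window stay pending; subsequent passes with later windows complete them, so an arbitrary gather finishes after the source vector has streamed through the narrow datapath once.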
One or more zero counters may be configured to count trailing zeros of significands of operands to produce a trailing zero count from the first operand and the second operand. A round-bit position circuit may be configured to determine a predicted round-bit position of an expected significand multiplier result. The predicted round-bit position may be used to determine a trailing bit count indicating a number of bits that are less significant than a round-bit in the predicted round-bit position. A compare circuit may be configured to compare the trailing zero count to the trailing bit count to determine whether the expected significand multiplier result will cause a tie when rounding the expected significand multiplier result.
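The tie prediction can be modeled arithmetically: trailing zeros add under multiplication (tz(a·b) = tz(a) + tz(b) for nonzero integers), so the product's low-order zero run is known before the product itself. A round-to-nearest tie requires the round bit to be set and every less-significant bit to be zero, i.e. exactly `trailing_bit_count` trailing zeros. The sketch below rests on that assumption and omits zero significands and the one-position uncertainty in the predicted round-bit position.

```python
def trailing_zeros(x: int) -> int:
    """Count trailing zero bits of a nonzero significand."""
    if x == 0:
        return 0  # assumption: zero significands handled separately
    return (x & -x).bit_length() - 1


def predicts_tie(sig_a: int, sig_b: int, trailing_bit_count: int) -> bool:
    """Predict a round-to-nearest tie without computing the product.

    `trailing_bit_count` is the number of bits less significant than
    the predicted round bit. A tie occurs exactly when the product's
    lowest set bit lands on the round-bit position.
    """
    tzc = trailing_zeros(sig_a) + trailing_zeros(sig_b)
    return tzc == trailing_bit_count
```

For example, significands 12 (two trailing zeros) and 10 (one trailing zero) give a product with exactly three trailing zeros, so a tie is predicted when three bits lie below the round bit.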
A first operating system process may be identified. The first operating system process may have instructions configured to be executed by a processor core. A first set of parameters may be determined based on an attribute of the first operating system process. For example, the first set of parameters may be determined based on an address space identifier, an address space stored in a page table base register, a virtual machine identifier, or a combination thereof. A component of the processor core may be configured using the first set of parameters. For example, one or more components, such as a branch predictor, a prefetcher, a dispatch unit, a vector unit, a clock controller, and the like, may be configured using the first set of parameters.
G06F 11/34 - Recording or statistical evaluation of computer activity, e.g. of interruptions or input/output operations
G06F 21/53 - Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems, during program execution, e.g. stack integrity, buffer overflow or preventing unwanted data erasure, by executing in a restricted environment, e.g. sandbox or secure virtual machine
Systems and methods are disclosed for memory protection for vector operations. For example, a method includes fetching a vector memory instruction using a processor core including a pipeline configured to execute instructions, including constant-stride vector memory instructions; partitioning a vector that is identified by the vector memory instruction into a subvector of a maximum length, greater than one, and one or more additional subvectors with lengths less than or equal to the maximum length; checking, using a memory protection circuit, whether accessing elements of the subvector will cause a memory protection violation; and accessing the elements of the subvector before checking, using the memory protection circuit, whether accessing elements of one of the one or more additional subvectors will cause a memory protection violation.
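The partition-and-check flow above can be sketched as follows. The model partitions a vector into a first subvector of the maximum length plus shorter remainders, and accesses each subvector before the next one is checked, so a protection fault in a later subvector does not block earlier accesses. The function names and the check/access callbacks are illustrative assumptions.

```python
def partition_vector(length: int, max_len: int):
    """Split a vector of `length` elements into a first subvector of the
    maximum length and additional subvectors no longer than the maximum.

    Returns a list of (start_element, element_count) pairs.
    """
    out, start = [], 0
    while start < length:
        n = min(max_len, length - start)
        out.append((start, n))
        start += n
    return out


def execute_vector_access(length, max_len, check, access):
    """Check then access one subvector at a time.

    A subvector's elements are accessed before the *next* subvector is
    checked, modeling the ordering described in the abstract.
    """
    for start, n in partition_vector(length, max_len):
        if not check(start, n):
            raise PermissionError(f"protection violation at element {start}")
        access(start, n)
```
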
Systems and methods are disclosed for register renaming. For example, an integrated circuit is described that includes a first cluster including a first set of physical registers and a first execution resource circuit, wherein the inputs for operations of the first execution resource circuit are of a first data type; a second cluster including a second set of physical registers and a second execution resource circuit, wherein the inputs for operations of the second execution resource circuit are of a second data type that is different than the first data type; and a register renaming circuit configured to: determine a data type prediction for a result of a first instruction that will be stored in a first logical register; and, based on the data type prediction matching the first data type, rename the first logical register to be stored in a physical register of the first set of physical registers.
Systems and methods are disclosed for memory protection for gather-scatter operations. For example, an integrated circuit may include a processor core; a memory protection circuit configured to check for memory protection violations with a protection granule; and an index range circuit configured to: memoize a maximum value and a minimum value of a tuple of indices stored in a vector register of the processor core as the tuple of indices is written to the vector register; determine a range of addresses for a gather-scatter memory instruction that takes the vector register as a set of indices based on a base address of a vector in memory, the memoized minimum value, and the memoized maximum value; and check, using the memory protection circuit during a single clock cycle, whether accessing elements of the vector within the range of addresses will cause a memory protection violation.
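The index-range memoization can be sketched as a small tracker: min and max are updated as indices are written to the vector register, so the address range of a later gather/scatter is available without rescanning the register. The class name, byte-addressed range convention, and element-size parameter are illustrative assumptions.

```python
class IndexRangeTracker:
    """Memoize the min and max of indices as they are written to a
    vector register, so a gather-scatter's address range can be checked
    in a single pass (modeling a single clock cycle)."""

    def __init__(self):
        self.min_idx = None
        self.max_idx = None

    def write(self, idx: int):
        """Update memoized bounds as an index is written to the register."""
        if self.min_idx is None or idx < self.min_idx:
            self.min_idx = idx
        if self.max_idx is None or idx > self.max_idx:
            self.max_idx = idx

    def address_range(self, base: int, elem_size: int):
        """Inclusive byte-address range touched by the gather-scatter."""
        lo = base + self.min_idx * elem_size
        hi = base + self.max_idx * elem_size + elem_size - 1
        return lo, hi
```
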
An integrated circuit design is generated for an integrated circuit. The integrated circuit design includes an instance of a module description that describes a functional operation of a module. The integrated circuit design is encoded in an intermediate representation, IR, data structure (702). A parameter comprising metadata associated with the instance is received (704). A compiler compiles the IR data structure to produce a register-transfer level, RTL, data structure (706). The RTL data structure encodes a logic description associated with the instance. Compiling the IR data structure includes associating the parameter comprising metadata with the logic description (708).
G06F 30/327 - Logic synthesis; behavioural synthesis, e.g. mapping logic, hardware description language [HDL] to netlist, high-level language to register-transfer level [RTL] or netlist
G06F 30/323 - Translation or migration, e.g. logic to logic, hardware description language translation or netlist translation
85.
INTEGRATED CIRCUIT DESIGN VERIFICATION WITH SIGNAL FORCING
An integrated circuit design may be generated for an integrated circuit. The integrated circuit design may include an instance of a module description that describes a functional operation of a module (702). The instance may include an input that is internal to the integrated circuit design. The integrated circuit design may be encoded in an intermediate representation, IR, data structure. A parameter may be received indicating that the input should be exposed to a simulator (704). The IR data structure may be compiled to produce a register-transfer level, RTL, data structure (708). The RTL data structure may encode a logic description associated with the instance. The parameter may be used to permit a simulator to access a node in the RTL data structure that is associated with the input (710).
G06F 30/327 - Logic synthesis; behavioural synthesis, e.g. mapping logic, hardware description language [HDL] to netlist, high-level language to register-transfer level [RTL] or netlist
G06F 30/3308 - Design verification, e.g. functional simulation or model checking, using simulation
G06F 30/333 - Design for testability [DFT], e.g. scan chain or built-in self-test [BIST]
86.
INTEGRATED CIRCUIT DESIGN VERIFICATION WITH MODULE SWAPPING
An integrated circuit design is generated for an integrated circuit (702). The integrated circuit design includes an instance of a module description that describes a functional operation of a module (702). The instance includes an input and an output. The integrated circuit design may be encoded in an intermediate representation (IR) data structure. Parameters are received indicating that the instance should be replaced with a simulation model (704). The parameters may include a first parameter pointing to the instance and a second parameter pointing to the simulation model.
G06F 30/3308 - Design verification, e.g. functional simulation or model checking, using simulation
G06F 30/327 - Logic synthesis; behavioural synthesis, e.g. mapping logic, hardware description language [HDL] to netlist, high-level language to register-transfer level [RTL] or netlist
G06F 30/333 - Design for testability [DFT], e.g. scan chain or built-in self-test [BIST]
Systems and methods are disclosed for automated generation of integrated circuit designs and associated data. These allow the design of processors and SoCs by a single non-expert who understands high-level requirements; allow the en masse exploration of the design space through the generation of processors across the design space, via simulation or emulation; allow the easy integration of IP cores from multiple third parties into an SoC; and allow delivery of a multi-tenant service for producing processors and SoCs that are customized while also being pre-verified and delivered with a complete set of developer tools, documentation, and related outputs. Some embodiments provide direct delivery, or delivery into a cloud hosting environment, of finished integrated circuits embodying the processors and SoCs.
Systems and methods are disclosed for generation of dynamic design flows for integrated circuits. For example, a method may include accessing a design flow configuration data structure, wherein the design flow configuration data structure is encoded in a tool control language; based on the design flow configuration data structure, selecting multiple flowmodules from a set of flowmodules, wherein each flowmodule provides an application programming interface, in the tool control language, to a respective electronic design automation tool; based on the design flow configuration data structure, generating a design flow as a directed acyclic graph including the selected flowmodules as vertices; and generating an output integrated circuit design data structure, based on one or more input integrated circuit design data structures, using the design flow to control the respective electronic design automation tools of the selected flowmodules.
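The DAG construction can be illustrated with a topological sort over the selected flowmodules. The abstract specifies a tool control language for the real flow; this Python sketch models only the graph ordering, and the flowmodule names and dependency map are illustrative assumptions.

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+


def build_flow(config: dict) -> list:
    """Build a design flow as a DAG and return a valid execution order.

    `config` maps each selected flowmodule to the set of flowmodules it
    depends on; a real flow would invoke each module's EDA-tool API in
    this order. Cycles raise graphlib.CycleError, enforcing acyclicity.
    """
    return list(TopologicalSorter(config).static_order())
```

For example, a hypothetical synthesis-place-route chain:

```python
order = build_flow({"synthesis": set(),
                    "place": {"synthesis"},
                    "route": {"place"}})
# yields the modules in dependency order
```
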
Described is a system and method for implementing a prefetcher with an out-of-order filtered prefetcher training queue. A processing system includes a prefetcher and a prefetcher training queue connected to the prefetcher. The prefetcher training queue is configured to receive one or more demand requests from one or more load-store units, allocate a prefetcher training queue entry for a non-duplicative demand request, and send, to the prefetcher, a stored demand request together with a hit or miss indicator, wherein the prefetcher training queue sends stored demand requests without regard to program order.
G06F 12/0862 - Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches, with prefetch
Systems and methods are disclosed for cycle accurate tracing of vector instructions. For example, a system may include a vector unit in communication with a scalar core. The vector unit may include a vector instruction queue that receives vector instructions from the scalar core. The vector unit may also include a vector execution unit that executes vector instructions from the vector instruction queue. The system may also include checkpoints in the vector unit including a first checkpoint including circuitry that sets a first bit for a first clock cycle in which a first vector instruction exits the vector instruction queue, and a second checkpoint including circuitry that sets a second bit for a second clock cycle in which a second vector instruction exits the vector execution unit.
Described is a data cache with prediction hints for a cache hit. The data cache includes a plurality of cache lines, where a cache line includes a data field, a tag field, and a prediction hint field. The prediction hint field is configured to store a prediction hint which directs alternate behavior for a cache hit against the cache line. The prediction hint field is integrated with the tag field or is integrated with a way predictor field.
G06F 12/0888 - Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches, using selective caching, e.g. cache flushing
G06F 12/0862 - Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches, with prefetch
Systems and methods are disclosed for store-to-load forwarding for processor pipelines. For example, an integrated circuit (e.g., a processor) for executing instructions includes a processor pipeline; a store queue that has entries associated with respective store instructions that are being executed, wherein an entry of the store queue includes a tag that is determined based on a virtual address of a target of the associated store instruction; and store-to-load forwarding circuitry that is configured to: compare a first virtual address of a target of a first load instruction being executed by the load unit to respective tags of one or more entries in the store queue; select an entry of the store queue based on a match between the first virtual address and the tag of the selected entry; and forward data of the selected entry in the store queue to be returned by the first load instruction.
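The store-queue lookup above can be sketched as a youngest-first tag match. This is a simplified model under stated assumptions: it compares only full virtual-address tags and ignores byte overlap and instruction age checks that a real load unit would also perform; the entry layout is hypothetical.

```python
def forward_from_store_queue(load_vaddr: int, store_queue: list):
    """Scan the store queue for an entry whose virtual-address tag
    matches the load's address and forward its data.

    Entries are appended in program order, so scanning in reverse
    makes the youngest matching store win.
    """
    for entry in reversed(store_queue):
        if entry["tag"] == load_vaddr:
            return entry["data"]
    return None  # no match: the load reads from the cache instead
```
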
Systems and methods are disclosed for page table entry caches with multiple tag lengths. For example, an integrated circuit (e.g., a processor) includes a page table walk circuitry including a page table entry cache, in which the page table walk circuitry is configured to access a multi-level page table, and in which a first entry of the page table entry cache combines a first number of multiple levels and a second entry of the page table entry cache combines a second number of multiple levels that is different from the first number of multiple levels.
Disclosed are systems and methods for configuring a prefetcher. A process may reconfigure a prefetcher associated with a processor core responsive to a context switch. The context switch may comprise the processor core changing from executing a first process to a second process. In some implementations, reconfiguring the prefetcher may include updating a register controlling an operation of the prefetcher from a first set of parameters associated with the first process to a second set of parameters associated with the second process. In some implementations, the second set of parameters may be based on input from a process executed in a user mode.
G06F 12/0862 - Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches, with prefetch
95.
INTEGRATED CIRCUIT GENERATION WITH IMPROVED INTERCONNECT
Disclosed are systems and methods that include accessing design parameters to configure an integrated circuit design. The integrated circuit design may include a transaction source or processing node to be included in an integrated circuit. The transaction source or processing node may be configured to transmit memory transactions to memory addresses. A compiler may compile the integrated circuit design with the transaction source or processing node to generate a design output. The design output may be configured to route memory transactions based on whether they target cacheable or non-cacheable memory addresses. The design output may be used to manufacture an integrated circuit.
G06F 30/32 - Circuit design at the digital level
G06F 12/0888 - Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches, using selective caching, e.g. cache flushing
G06F 115/02 - System-on-chip [SoC] design
Disclosed are systems and methods that include integrated circuit generation with composable interconnect. A system accesses a design parameters data structure that specifies an interconnect topology to be included in an integrated circuit (302). The system invokes an integrated circuit design generator that applies the design parameters data structure, including with the interconnect topology (304). The design parameters data structure specifies a definition for a hardware object (e.g., the interconnect topology) and instances of the hardware object. The definition and the instances may each be modifiable. The system invokes the generator to apply the design parameters data structure to generate the design (306).
G06F 30/323 - Translation or migration, e.g. logic to logic, hardware description language translation or netlist translation
G06F 30/327 - Logic synthesis; behavioural synthesis, e.g. mapping logic, hardware description language [HDL] to netlist, high-level language to register-transfer level [RTL] or netlist
97.
Integrated Circuit Generation Using an Integrated Circuit Shell
Systems and methods are disclosed for integrated circuit design using integrated circuit shells. For example, a system may generate an integrated circuit core design expressed in a hardware description language. The integrated circuit core design may express circuitry that describes one or more functions to be included in an application specific integrated circuit (ASIC). The one or more functions may have connection points providing first inputs and outputs to the one or more functions. The system may query an integrated circuit shell expressed in a hardware description language. The integrated circuit shell may express circuitry that describes a limited set of pads to be implemented in the ASIC. The limited set of pads may provide second inputs and outputs to the integrated circuit. The query may determine availability of pads of the limited set of pads to connect to the connection points of the one or more functions.
G06F 30/31 - Design entry, e.g. editors specifically adapted for circuit design
G06F 30/327 - Logic synthesis; behavioural synthesis, e.g. mapping logic, hardware description language [HDL] to netlist, high-level language to register-transfer level [RTL] or netlist
98.
Logging Guest Physical Address for Memory Access Faults
Systems and methods are disclosed for logging guest physical address for memory access faults. For example, a method for logging guest physical address includes receiving a first address translation request from a processor pipeline at a translation lookaside buffer for a first guest virtual address; identifying a hit with a fault condition corresponding to the first guest virtual address; responsive to the fault condition, invoking a single-stage page table walk with the first guest virtual address to obtain a first guest physical address; and storing the first guest physical address with the first guest virtual address in a data store, wherein the data store is separate from an entry in the translation lookaside buffer that includes a tag that includes the first guest virtual address and data that includes a physical address.
G06F 12/1009 - Address translation with page tables, e.g. page table structures
G06F 12/1027 - Address translation using associative or pseudo-associative address translation means, e.g. translation lookaside buffer [TLB]
Described are systems and methods for clock gating components on a system-on-chip. A processing system includes one or more cores, each core including a clock gating enable bit register which is set by software when an expected idle period of the core meets or exceeds a clock gating threshold, and a power management unit connected to the one or more cores. The power management unit is configured to receive an idle notification from a core of the one or more cores and initiate clock gating of a clock associated with the core when the core and additional logic are quiescent and the clock gating enable bit register is set. The clock gating threshold is a defined magnitude greater than a clock wake-up time.
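The gating decision can be condensed into a predicate, as a minimal sketch: the threshold exceeds the wake-up time by a defined margin, software sets the enable bit based on the expected idle period, and the power management unit gates only when the logic is also quiescent. The parameter names and cycle units are illustrative assumptions.

```python
def should_clock_gate(expected_idle_cycles: int,
                      wake_up_cycles: int,
                      margin_cycles: int,
                      enable_bit: bool,
                      quiescent: bool) -> bool:
    """Decide whether to gate a core's clock.

    The clock gating threshold is modeled as the wake-up time plus a
    defined margin; gating requires the software-set enable bit, a
    quiescent core, and an idle period meeting the threshold.
    """
    threshold = wake_up_cycles + margin_cycles
    return enable_bit and quiescent and expected_idle_cycles >= threshold
```
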
Systems and methods are disclosed for implementing a quad narrowing operation. The quad narrowing operation converts the output of a 32-bit floating-point operation to an 8-bit integer format by rounding the 32-bit floating-point result and clamping the rounded value between an 8-bit lower bound and an 8-bit upper bound, which are defined in a 16-bit scalar register, to generate the fixed-point output. The 8-bit lower bound is defined by the 8 most significant bits of the 16-bit scalar register, and the 8-bit upper bound is defined by the 8 least significant bits of the 16-bit scalar register.
G06N 3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons, using electronic means
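The quad narrowing operation can be sketched in Python: round the floating-point value, then clamp it between the signed bounds unpacked from a 16-bit register (lower bound in the top byte, upper bound in the bottom byte). Python's `round` uses round-half-to-even, matching the common floating-point default, though the abstract does not name a rounding mode; that choice is an assumption.

```python
def quad_narrow(x: float, bounds_reg: int) -> int:
    """Round a floating-point result and clamp it to the 8-bit range
    encoded in a 16-bit scalar register.

    The lower bound occupies the 8 most significant bits of the
    register, the upper bound the 8 least significant bits; both are
    interpreted as signed bytes (an assumption for this sketch).
    """
    def as_int8(b: int) -> int:
        # Reinterpret an unsigned byte as a signed 8-bit integer.
        return b - 256 if b >= 128 else b

    lo = as_int8((bounds_reg >> 8) & 0xFF)
    hi = as_int8(bounds_reg & 0xFF)
    r = round(x)  # round-half-to-even, the typical FP default
    return max(lo, min(hi, r))
```

For example, with bounds of -100 and 100 the register holds `0x9C64` (`-100 & 0xFF == 0x9C`, `100 == 0x64`), and any rounded result outside that interval is clamped to it.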