Techniques for data manipulation using integer matrix multiplication using pipelining are disclosed. A first integer matrix with dimensions m×k and a second integer matrix with dimensions k×n are obtained for matrix multiplication within a processor. The first and second integer matrices employ a two's complement variable radix point data representation. The first and second integer matrices are distilled into (j×j) submatrices. A first variable radix point format and an initial value for an accumulator register are configured dynamically. A first variable radix point format is configured dynamically for the first integer matrix and a second variable radix point format is configured dynamically for the second integer matrix. Multiply-accumulate operations are executed in a pipelined fashion on the (j×j) submatrices of the first integer matrix and the second integer matrix, where a third variable radix point format is configured for the result.
Techniques for data manipulation using processor cluster address generation are disclosed. One or more processor clusters capable of executing software-initiated work requests are accessed. A plurality of dimensions from a tensor is flattened into a single dimension. A work request address field is parsed, where the address field contains unique address space descriptors for each of the plurality of dimensions, along with a common address space descriptor. A direct memory access (DMA) engine coupled to the one or more processor clusters is configured. Addresses are generated based on the unique address space descriptors and the common address space descriptor. The plurality of dimensions can be summed to generate a single address. Memory is accessed using two or more of the addresses that were generated. The addresses are used to enable DMA access.
G06F 12/06 - Adressage d'un bloc physique de transfert, p. ex. par adresse de base, adressage de modules, extension de l'espace d'adresse, spécialisation de mémoire
G06F 13/28 - Gestion de demandes d'interconnexion ou de transfert pour l'accès au bus d'entrée/sortie utilisant le transfert par rafale, p. ex. acces direct à la mémoire, vol de cycle
3.
PROCESSOR GRAPH EXECUTION USING INTERRUPT CONSERVATION
Techniques for data manipulation using processor graph execution using interrupt conservation are disclosed. Processing elements are configured to implement a data flow graph. The processing elements comprise a multilayer graph execution engine. A data engine is loaded with computational parameters for the multilayer graph execution engine. The data engine is coupled to the multilayer graph execution engine, and the computational parameters supply layer-by-layer execution data to the multilayer graph execution engine for data flow graph execution. A first command FIFO is used for loading the data engine with computational parameters, and a second command FIFO is used for loading the multilayer graph execution engine with layer definition data. An input image is provided for a first layer of the multilayer graph execution engine. The data flow graph is executed using the input image and the computational parameters. The executing is controlled by interrupts only when an uncertainty exists within the data flow graph.
Techniques for data manipulation using integer matrix multiplication using pipelining are disclosed. A first integer matrix with dimensions m×k and a second integer matrix with dimensions k×n are obtained for matrix multiplication within a processor. The first and second integer matrices employ a two's complement variable radix point data representation. The first and second integer matrices are distilled into (j×j) submatrices. A first variable radix point format and an initial value for an accumulator register are configured dynamically. A first variable radix point format is configured dynamically for the first integer matrix and a second variable radix point format is configured dynamically for the second integer matrix. Multiply-accumulate operations are executed in a pipelined fashion on the (j×j) submatrices of the first integer matrix and the second integer matrix, where a third variable radix point format is configured for the result.
Techniques for data manipulation using processor graph execution using interrupt conservation are disclosed. Processing elements are configured to implement a data flow graph. The processing elements comprise a multilayer graph execution engine. A data engine is loaded with computational parameters for the multilayer graph execution engine. The data engine is coupled to the multilayer graph execution engine, and the computational parameters supply layer-by-layer execution data to the multilayer graph execution engine for data flow graph execution. A first command FIFO is used for loading the data engine with computational parameters, and a second command FIFO is used for loading the multilayer graph execution engine with layer definition data. An input image is provided for a first layer of the multilayer graph execution engine. The data flow graph is executed using the input image and the computational parameters. The executing is controlled by interrupts only when an uncertainty exists within the data flow graph.
Techniques for data manipulation using processor cluster address generation are disclosed. One or more processor clusters capable of executing software-initiated work requests are accessed. A plurality of dimensions from a tensor is flattened into a single dimension. A work request address field is parsed, where the address field contains unique address space descriptors for each of the plurality of dimensions, along with a common address space descriptor. A direct memory access (DMA) engine coupled to the one or more processor clusters is configured. Addresses are generated based on the unique address space descriptors and the common address space descriptor. The plurality of dimensions can be summed to generate a single address. Memory is accessed using two or more of the addresses that were generated. The addresses are used to enable DMA access.
G06F 12/06 - Adressage d'un bloc physique de transfert, p. ex. par adresse de base, adressage de modules, extension de l'espace d'adresse, spécialisation de mémoire
G06F 13/28 - Gestion de demandes d'interconnexion ou de transfert pour l'accès au bus d'entrée/sortie utilisant le transfert par rafale, p. ex. acces direct à la mémoire, vol de cycle
7.
Integer matrix multiplication engine using pipelining
Techniques for data manipulation using integer matrix multiplication using pipelining are disclosed. A first integer matrix with dimensions m×k and a second integer matrix with dimensions k×n are obtained for matrix multiplication within a processor. The first and second integer matrices employ a two's complement variable radix point data representation. The first and second integer matrices are distilled into (j×j) submatrices. A first variable radix point format and an initial value for an accumulator register are configured dynamically. A first variable radix point format is configured dynamically for the first integer matrix and a second variable radix point format is configured dynamically for the second integer matrix. Multiply-accumulate operations are executed in a pipelined fashion on the (j×j) submatrices of the first integer matrix and the second integer matrix, where a third variable radix point format is configured for the result.
Techniques for data manipulation using processor cluster address generation are disclosed. One or more processor clusters capable of executing software-initiated work requests are accessed. A direct memory access (DMA) engine, coupled to the one or more processor clusters, is configured, wherein the DMA engine employs address generation across a plurality of tensor dimensions. A work request address field is parsed, where the address field contains unique address space descriptors for each of the plurality of dimensions, along with a common address space descriptor. DMA addresses are generated based on the unique address space descriptors and the common address space descriptor. Memory using two or more of the DMA addresses that were generated is accessed, where the two or more DMA addresses enable processing within the one or more processor clusters.
G06F 13/28 - Gestion de demandes d'interconnexion ou de transfert pour l'accès au bus d'entrée/sortie utilisant le transfert par rafale, p. ex. acces direct à la mémoire, vol de cycle
Techniques for data manipulation using a matrix multiplication engine using pipelining are disclosed. A first and a second matrix are obtained for matrix multiplication. A first matrix multiply-accumulate (MAC) unit is configured, where a first matrix element and a second matrix element are presented to the MAC unit on a first cycle. A second MAC unit is configured in pipelined fashion, where the first element of the first matrix and a second element of the second matrix are presented to the second MAC unit on a second cycle, and where a second element of the first matrix and the first element of the second matrix are presented to the first MAC unit on the second cycle. Additional MAC units are further configured within the processor in pipelined fashion. Multiply-accumulate operations are executed in pipelined fashion on each of n MAC units over additional k sets of m cycles.
Disclosed embodiments provide an interface circuit for the transfer of data from a synchronous circuit to an asynchronous circuit. Data from the synchronous circuit is received into a memory in the interface circuit. The data in the memory is then sent to the asynchronous circuit based on an instruction in a circular buffer that is part of the interface circuit. Processing elements within the interface circuit execute instructions contained within the circular buffer. The circular buffer rotates to provide new instructions to the processing elements. Flow control paces the data from the synchronous circuit to the asynchronous circuit.
H04L 7/02 - Commande de vitesse ou de phase au moyen des signaux de code reçus, les signaux ne contenant aucune information de synchronisation particulière
H04L 5/24 - Dispositions destinées à permettre l'usage multiple de la voie de transmission utilisant le multiplex à division de temps avec des convertisseurs synchrones marche-arrêt
H04L 12/801 - Commande de flux ou commande de congestion
H04L 12/933 - Cœur de commutateur, p.ex. barres croisées, mémoire partagée ou support partagé
H04L 12/863 - Ordonnancement de file d’attente, p.ex. ordonnancement circulaire
Techniques for a neural network output layer for machine learning are disclosed. A plurality of processing elements within a reconfigurable fabric is configured to implement a data flow graph, where the data flow graph implements a neural network. The data flow graph can include machine learning or deep learning. A layer is implemented, within the neural network, that maps a first vector of real values to a second vector of real values bounded by zero and one, where the second vector sums to a value of one using fixed-point calculations. The layer can include a final layer within the neural network. The layer that maps the first vector includes a Softmax function. Results of the neural network are classified based on a value of the second vector. The classifying can include part of a machine learning or a deep learning process.
Techniques are disclosed for data manipulation within a reconfigurable computing environment for data flow graph computation using exceptions. Processing elements are configured within a reconfigurable fabric to implement a data flow graph. The processing elements are loaded with process agents. Valid data is executed by a first process agent on a first processing element, where the first process agent corresponds to a starting node of the data flow graph. A second processing element detects that an error exception has occurred, where a second process agent is running on the second processing element. A done signal to a third process agent is withheld by the second process agent, where the third process agent is running on a third processing element. The second process agent raises an interrupt request, where the interrupt request is based on the detecting that an error exception has occurred.
Techniques are disclosed for data flow graph node parallel update for machine learning. A first plurality of processing elements is configured to implement a portion of a data flow graph. The nodes include at least one variable node and implement part of a neural network. A second plurality of processing elements is configured to implement a second portion of the data flow graph. These nodes include at least one additional variable node and implement an additional part of the neural network. Training data is issued to the first plurality of processing elements. The training data is used to update variables within the at least one variable node. Additional variables are updated within the at least one additional variable node. The updating includes forwarding training data from the first plurality to the second plurality. The neural network is trained based on the variables that were updated and the additional variables.
An interface circuit is disclosed for the transfer of data from a synchronous circuit, with multiple source elements, to an asynchronous circuit. Data from the synchronous circuit is received into a memory in the interface circuit. The data in the memory is then sent to the asynchronous circuit based on an instruction in a circular buffer that is part of the interface circuit. Processing elements within the interface circuit execute instructions contained within the circular buffer. The circular buffer rotates to provide new instructions to the processing elements. Flow control paces the data from the synchronous circuit to the asynchronous circuit.
H04L 12/28 - Réseaux de données à commutation caractérisés par la configuration des liaisons, p. ex. réseaux locaux [LAN Local Area Networks] ou réseaux étendus [WAN Wide Area Networks]
H04L 12/861 - Mise en mémoire tampon de paquets ou mise en file d’attente; Ordonnancement de file d’attente
H04L 12/863 - Ordonnancement de file d’attente, p.ex. ordonnancement circulaire
H04L 12/801 - Commande de flux ou commande de congestion
H04L 12/859 - Actions liées à la commande de flux basée sur la nature de l’application, p.ex. contrôle de navigation sur l’Internet ou contrôle du trafic de courrier électronique
Techniques are disclosed for power conservation. A plurality of processing elements and a plurality of instructions are configured. The plurality of processing elements is controlled by instructions contained in a plurality of circular buffers. The plurality of processing elements can comprise a data flow processor. A first processing element, from the plurality of interconnected processing elements, is set into a sleep state by a first instruction from the plurality of instructions. The first processing element is woken from the sleep state as a result of valid data being presented to the first processing element. A subsection of the plurality of interconnected processing elements is also set into a sleep state based on the first processing element being set into a sleep state. At least one circular buffer from the plurality of circular buffers remains awake while the first processing element is in the sleep state, and the at least one circular buffer provides for data steering through a reconfigurable fabric.
H03K 19/177 - Circuits logiques, c.-à-d. ayant au moins deux entrées agissant sur une sortieCircuits d'inversion utilisant des éléments spécifiés utilisant des circuits logiques élémentaires comme composants disposés sous forme matricielle
H01L 25/00 - Ensembles consistant en une pluralité de dispositifs à semi-conducteurs ou d'autres dispositifs à l'état solide
G06F 5/08 - Procédés ou dispositions pour la conversion de données, sans modification de l'ordre ou du contenu des données maniées pour modifier la vitesse de débit des données, c.-à-d. régularisation de la vitesse ayant une séquence d'emplacements d'emmagasinage, les emplacements intermédiaires n'étant pas accessibles pour des opérations soit de mise en file d'attente, soit de retrait de file d'attente, p. ex. utilisant un registre à décalage
G06F 1/3287 - Économie d’énergie caractérisée par l'action entreprise par la mise hors tension d’une unité fonctionnelle individuelle dans un ordinateur
Techniques are disclosed for designing a reconfigurable fabric. The reconfigurable fabric is designed using logical elements, configurable connections between and among the logical elements, and rotating circular buffers. The circular buffers contain configuration instructions. The configuration instructions control connections between and among logical elements. The logical elements change operation based on the instructions that rotate through the circular buffers. Clusters of logical elements are interconnected by a switching fabric. Each cluster contains processing elements, storage elements, and switching elements. A circular buffer within a cluster contains multiple switching instructions to control the flow of data throughout the switching fabric. The circular buffer provides a pipelined execution of switching instructions for the implementation of multiple functions. Each cluster contains multiple processing elements, and each cluster further comprises an additional circular buffer for each processing element. Logical operations are controlled by the circular buffers.
H03K 19/0175 - Dispositions pour le couplageDispositions pour l'interface
H03K 19/177 - Circuits logiques, c.-à-d. ayant au moins deux entrées agissant sur une sortieCircuits d'inversion utilisant des éléments spécifiés utilisant des circuits logiques élémentaires comme composants disposés sous forme matricielle
17.
Branchless instruction paging in reconfigurable fabric
Circular buffers containing instructions that enable the execution of operations on logical elements are described where data in the circular buffers is swapped to storage. The instructions comprise a branchless instruction set. Data stored in circular buffers is paged in and out to a second level memory. State information for each logical element is also saved and restored using paging memory. Instructions are provided to logical elements, such as processing elements, via circular buffers. The instructions enable a group of processing elements to perform operations implementing a desired functionality. That functionality is changed by updating the circular buffers with new instructions that are transferred from paging memory. The previous instructions can be saved off in paging memory before the new instructions are copied over to the circular buffers. This enables the hardware to be rapidly reconfigured amongst multiple functions.
G06F 12/00 - Accès à, adressage ou affectation dans des systèmes ou des architectures de mémoires
G06F 12/0875 - Adressage d’un niveau de mémoire dans lequel l’accès aux données ou aux blocs de données désirés nécessite des moyens d’adressage associatif, p. ex. mémoires cache avec mémoire cache dédiée, p. ex. instruction ou pile
G06F 9/38 - Exécution simultanée d'instructions, p. ex. pipeline ou lecture en mémoire
G06F 15/78 - Architectures de calculateurs universels à programmes enregistrés comprenant une seule unité centrale
G06F 3/06 - Entrée numérique à partir de, ou sortie numérique vers des supports d'enregistrement
G06F 12/08 - Adressage ou affectationRéadressage dans des systèmes de mémoires hiérarchiques, p. ex. des systèmes de mémoire virtuelle
G06F 13/28 - Gestion de demandes d'interconnexion ou de transfert pour l'accès au bus d'entrée/sortie utilisant le transfert par rafale, p. ex. acces direct à la mémoire, vol de cycle
G06N 3/04 - Architecture, p. ex. topologie d'interconnexion
G06N 3/063 - Réalisation physique, c.-à-d. mise en œuvre matérielle de réseaux neuronaux, de neurones ou de parties de neurone utilisant des moyens électroniques
G06F 9/50 - Allocation de ressources, p. ex. de l'unité centrale de traitement [UCT]
Techniques are disclosed for managing data within a reconfigurable computing environment. In a multiple processing element environment, such as a mesh network or other suitable topology, there is an inherent need to pass data between processing elements. Subtasks are divided among multiple processing elements. The output resulting from the subtasks is then merged by a downstream processing element. In such cases, a join operation can be used to combine data from multiple upstream processing elements. A control agent executes on each processing element. A memory buffer is disposed between upstream processing elements and the downstream processing element. The downstream processing element is configured to automatically perform an operation based on the availability of valid data from the upstream processing elements.
H04L 12/861 - Mise en mémoire tampon de paquets ou mise en file d’attente; Ordonnancement de file d’attente
H04L 12/863 - Ordonnancement de file d’attente, p.ex. ordonnancement circulaire
H04L 12/801 - Commande de flux ou commande de congestion
H04L 12/859 - Actions liées à la commande de flux basée sur la nature de l’application, p.ex. contrôle de navigation sur l’Internet ou contrôle du trafic de courrier électronique
H04L 12/879 - Opérations simples sur la mémoire-tampon, p.ex. pointeurs de mémoire-tampon ou descripteurs de mémoire-tampon
H04L 12/939 - Dispositions pour la commutation redondante, p.ex. utilisant des plans de commutation parallèles
H04L 12/933 - Cœur de commutateur, p.ex. barres croisées, mémoire partagée ou support partagé
19.
Reconfigurable processor fabric implementation using satisfiability analysis
Disclosed techniques utilize a satisfiability solver for allocation and/or configuration of resources in a reconfigurable fabric of processing elements. A dataflow graph is an input provided to a toolchain that includes a satisfiability solver. The satisfiability solver operates on subsets of interconnected nodes within a dataflow graph to derive a solution. The solution is trimmed by removing artifacts and unnecessary parts. The solutions of subsets are then used as an input to additional subsets of nodes within the dataflow graph in an iterative process to derive a complete solution. The satisfiability solver technique uses adaptive windowing in both the time dimension and the spatial dimensions of the dataflow graph. Processing elements and routing elements within the reconfigurable fabric are configured based on the complete solution. Data computation is performed based on the dataflow graph using the processing elements and the routing resources.
An apparatus for mathematical manipulation is described allowing the selective combination of shifters to shift binary numbers of various widths. Selective combination allows on-the-fly adjustment of shifters from independent to coordinated shifting operations. Selective combination allows adjustable hardware-based shifting while saving space and resources. Multiple eight-bit shifters can be configured for a variety of operand widths, such as a 32-bit width, a 24-bit width, a 16-bit width, or an eight-bit width. Multiplexers route the appropriate input data to the appropriate shifters. Bidirectional shifting is configured through a selector tree, including both shift left and shift right operations. Opcodes configure the shifters for the desired type of shift and a shifted result is generated.
G06F 5/01 - Procédés ou dispositions pour la conversion de données, sans modification de l'ordre ou du contenu des données maniées pour le décalage, p. ex. la justification, le changement d'échelle, la normalisation
21.
Reconfigurable fabric direct memory access with multiple read or write elements
Techniques are disclosed for data manipulation. Data is obtained from a first switching element where the first switching element is controlled by a first circular buffer. Data is sent to a second switching element where the second switching element is controlled by a second circular buffer. Data is controlled by a third switching element that is controlled by a third circular buffer. The third switching element hierarchically controls the first switching element and the second switching element. Data is routed through a fourth switching element that is controlled by a fourth circular buffer. The circular buffers are statically scheduled. The obtaining data from a first switching element and the sending the data to a second switching element includes a direct memory access (DMA). The switching elements can operate as a master controller or as a slave device. The switching elements can comprise clusters within an asynchronous reconfigurable fabric.
G06F 13/28 - Gestion de demandes d'interconnexion ou de transfert pour l'accès au bus d'entrée/sortie utilisant le transfert par rafale, p. ex. acces direct à la mémoire, vol de cycle
G06F 13/16 - Gestion de demandes d'interconnexion ou de transfert pour l'accès au bus de mémoire
G06F 13/42 - Protocole de transfert pour bus, p. ex. liaisonSynchronisation
22.
Communication between dataflow processing units and memories
A combination of memory units and dataflow processing units is disclosed for computation. A first memory unit is interposed between a first dataflow processing unit and a second dataflow processing unit. Operations for a dataflow graph are allocated across the first dataflow processing unit and the second dataflow processing unit. The first memory unit passes data between the first dataflow processing unit and the second dataflow processing unit to execute the dataflow graph. The first memory unit is a high bandwidth, shared memory device including a hybrid memory cube. The first dataflow processing unit and second dataflow processing unit include a plurality of circular buffers containing instructions for controlling data transfer between the first dataflow processing unit and second dataflow processing unit. Additional dataflow processing units and additional memory units are included for additional functionality and efficiency.
G06F 5/10 - Procédés ou dispositions pour la conversion de données, sans modification de l'ordre ou du contenu des données maniées pour modifier la vitesse de débit des données, c.-à-d. régularisation de la vitesse ayant une séquence d'emplacements d'emmagasinage, chacun étant individuellement accessible à la fois pour des opérations de mise en file d'attente et pour des opérations de retrait de file d'attente, p. ex. utilisant une mémoire à accès aléatoire
G06F 13/16 - Gestion de demandes d'interconnexion ou de transfert pour l'accès au bus de mémoire
G06F 15/82 - Architectures de calculateurs universels à programmes enregistrés commandés par des données ou à la demande
G11C 19/00 - Mémoires numériques dans lesquelles l'information est déplacée par échelons, p. ex. registres à décalage
Methods and systems for timing analysis and optimization of asynchronous circuit designs are disclosed. Registration stages are placed between combinational logic circuits. For timing purposes, the registration stages are modified to have a duplicate set of pins. New paths are formed in the circuit for the purposes of timing analysis. The paths are analyzable by timing tools. Once the timing analysis is complete, the paths are reverted to original paths, and new devices are selected for the circuit design based on results of the timing analysis. An updated design is sent for manufacture, based on the timing analysis and optimization of the asynchronous circuit.
Techniques are disclosed for power conservation. A plurality of processing elements and a plurality of instructions are configured. The plurality of processing elements is controlled by instructions contained in a plurality of circular buffers. The plurality of processing elements can comprise a dataflow processor. A first processing element, from the plurality of interconnected processing elements, is set into a sleep state by a first instruction from the plurality of instructions. The first processing element is woken from the sleep state as a result of valid data being presented to the first processing element. A subsection of the plurality of interconnected processing elements is also set into a sleep state based on the first processing element being set into a sleep state. At least one circular buffer from the plurality of circular buffers remains awake while the first processing element is in the sleep state, and the at least one circular buffer provides for data steering through a reconfigurable fabric.
H03K 19/177 - Circuits logiques, c.-à-d. ayant au moins deux entrées agissant sur une sortieCircuits d'inversion utilisant des éléments spécifiés utilisant des circuits logiques élémentaires comme composants disposés sous forme matricielle
H01L 25/00 - Ensembles consistant en une pluralité de dispositifs à semi-conducteurs ou d'autres dispositifs à l'état solide
G06F 7/38 - Méthodes ou dispositions pour effectuer des calculs en utilisant exclusivement une représentation numérique codée, p. ex. en utilisant une représentation binaire, ternaire, décimale
G06F 5/08 - Procédés ou dispositions pour la conversion de données, sans modification de l'ordre ou du contenu des données maniées pour modifier la vitesse de débit des données, c.-à-d. régularisation de la vitesse ayant une séquence d'emplacements d'emmagasinage, les emplacements intermédiaires n'étant pas accessibles pour des opérations soit de mise en file d'attente, soit de retrait de file d'attente, p. ex. utilisant un registre à décalage
Disclosed embodiments select a proper hum frequency reference by utilizing one or more functional logic circuits within a cluster. The slowest logic circuit is determined, and an instance of that logic circuit is used in timing circuitry for the cluster. Multiple logic circuits with similar characteristics are incorporated into the timing circuit. Each cluster is interconnected to a second level timing circuit. Each cluster inputs timing information into the second level timing circuit. The second level timing circuit then determines when the next cycle, or tic, of the self-generated clock starts, and the process repeats, providing a self-generated clock signal.
H03K 19/20 - Circuits logiques, c.-à-d. ayant au moins deux entrées agissant sur une sortieCircuits d'inversion caractérisés par la fonction logique, p. ex. circuits ET, OU, NI, NON
H03K 5/135 - Dispositions ayant une sortie unique et transformant les signaux d'entrée en impulsions délivrées à des intervalles de temps désirés par l'utilisation de signaux de référence de temps, p. ex. des signaux d'horloge
26.
Logical elements with switchable connections for multifunction operation
Clusters of logical elements are interconnected by a switching fabric. Each cluster contains processing elements, storage elements, and switching elements. A circular buffer within a cluster contains multiple switching instructions to control the flow of data throughout the switching fabric. The circular buffer provides a pipelined execution of switching instructions for the implementation of multiple functions. Each cluster contains multiple processing elements, and each cluster further comprises an additional circular buffer for each processing element. Logical operations are controlled by the circular buffers.
H03K 19/0175 - Dispositions pour le couplageDispositions pour l'interface
H03K 19/177 - Circuits logiques, c.-à-d. ayant au moins deux entrées agissant sur une sortieCircuits d'inversion utilisant des éléments spécifiés utilisant des circuits logiques élémentaires comme composants disposés sous forme matricielle
A plurality of software programmable processors is disclosed. The software programmable processors are controlled by rotating circular buffers. A first processor and a second processor within the plurality of software programmable processors are individually programmable. The first processor within the plurality of software programmable processors is coupled to neighbor processors within the plurality of software programmable processors. The first processor sends and receives data from the neighbor processors. The first processor and the second processor are configured to operate on a common instruction cycle. An output of the first processor from a first instruction cycle is an input to the second processor on a subsequent instruction cycle.
G06F 9/38 - Exécution simultanée d'instructions, p. ex. pipeline ou lecture en mémoire
G06F 15/173 - Communication entre processeurs utilisant un réseau d'interconnexion, p. ex. matriciel, de réarrangement, pyramidal, en étoile ou ramifié
G06F 15/82 - Architectures de calculateurs universels à programmes enregistrés commandés par des données ou à la demande
G06F 1/324 - Économie d’énergie caractérisée par l'action entreprise par réduction de la fréquence d’horloge
Circular buffers containing instructions that enable the execution of operations on logical elements are described where data in the circular buffers is swapped to storage. Data stored in circular buffers is paged in and out to a second level memory. State information for each logical element is also saved and restored using paging memory. Logical elements such as processing elements are provided instructions via circular buffers. The instructions enable a group of processing elements to perform operations implementing a desired functionality. That functionality is changed by updating the circular buffers with new instructions that are transferred from paging memory. The previous instructions can be saved off in paging memory before the new instructions are copied over to the circular buffers. This enables the hardware to be rapidly reconfigured amongst multiple functions.
G06F 12/00 - Accès à, adressage ou affectation dans des systèmes ou des architectures de mémoires
G06F 12/0802 - Adressage d’un niveau de mémoire dans lequel l’accès aux données ou aux blocs de données désirés nécessite des moyens d’adressage associatif, p. ex. mémoires cache
G06F 12/0868 - Transfert de données entre une mémoire cache et d'autres sous-systèmes, p. ex. des dispositifs de stockage ou des systèmes hôtes
29.
Compact logic evaluation gates using null convention
Compact logic evaluation gates are built using null convention logic (NCL) circuits. The inputs to a null convention circuit include a NCL true input and a NCL complement input. The NCL circuit includes a gate coupled to the pair of inputs, where the gate comprises a plurality of transistors. The transistors allow for logical signal capture, provide a pair of cross-coupled inverters for data storage, and include a first and second pull-down device. The first pull-down device causes a first side of the pair of cross-coupled inverters to go to a “0” state when a “1” is applied to the NCL true input, and the second pull-down device causes a second side of the pair of cross-coupled inverters to go to a “0” state when a “1” is applied to the NCL complement input.
H03K 19/00 - Circuits logiques, c.-à-d. ayant au moins deux entrées agissant sur une sortieCircuits d'inversion
H03K 19/094 - Circuits logiques, c.-à-d. ayant au moins deux entrées agissant sur une sortieCircuits d'inversion utilisant des éléments spécifiés utilisant des dispositifs à semi-conducteurs utilisant des transistors à effet de champ
30.
Computing resource allocation based on flow graph translation
Systems and methods are disclosed for computing resource allocation based on flow graph translation. First, a high-level description of logic circuitry is obtained and translated to generate a flow graph representing sequential operations. Using the flow graph, similar processing elements in an array are interchangeably allocated to perform computational, communication, and storage tasks as needed. The sequential operations are executed using the array of interchangeable processing elements. Data is provided from the storage elements through the communication elements to the computational elements. Computational results are stored in the storage elements. Outputs from some of the computational elements provide inputs to other computational elements. Execution of the instructions can be controlled with time stepping. The processors are reallocated as needed, based on changes to the flow graph.
Multi-threshold flash Null Convention Logic (NCL) includes one or more high threshold voltage transistors within a flash NCL gate to reduce power consumption due to current leakage by transistors of the NCL gate. High-threshold voltage transistors may be added and/or may be used in place of one or more lower voltage threshold transistors of the NCL gate. A high-Vt device is included in the pull-up path to reduce power when the flash NCL logic gate is in the null state.
H03K 19/00 - Circuits logiques, c.-à-d. ayant au moins deux entrées agissant sur une sortieCircuits d'inversion
H03K 19/094 - Circuits logiques, c.-à-d. ayant au moins deux entrées agissant sur une sortieCircuits d'inversion utilisant des éléments spécifiés utilisant des dispositifs à semi-conducteurs utilisant des transistors à effet de champ
H03K 19/23 - Circuits de majorité ou de minorité, c.-à-d. donnant un signal de sortie dont l'état est celui de la majorité ou de la minorité des signaux d'entrée
H03K 23/40 - Signaux d'ouverture de porte ou d'horloge appliqués à tous les étages, c.-à-d. compteurs synchrones
H03K 23/58 - Signaux d'ouverture de porte ou d'horloge non appliqués à tous les étages, c.-à-d. compteurs asynchrones
Clusters of logical elements are interconnected by a switching fabric. Each cluster contains processing elements, storage elements, and switching elements. A circular buffer within a cluster contains multiple switching instructions to control the flow of data throughout the switching fabric. The circular buffer provides a pipelined execution of switching instructions. Each cluster contains multiple processing elements, and each cluster further comprises an additional circular buffer for each processing element. Logical operations are controlled by the circular buffers.
H03K 19/0175 - Dispositions pour le couplageDispositions pour l'interface
H03K 19/177 - Circuits logiques, c.-à-d. ayant au moins deux entrées agissant sur une sortieCircuits d'inversion utilisant des éléments spécifiés utilisant des circuits logiques élémentaires comme composants disposés sous forme matricielle
33.
Multi-threshold circuitry based on silicon-on-insulator technology
Multiple threshold voltage circuitry based on silicon-on-insulator (SOI) technology is disclosed which utilizes N-wells and/or P-wells underneath the insulator in SOI FETs. The well under a FET is biased to influence the threshold voltage of the FET. A PFET and an NFET share a common buried P-well or N-well. Various types of logic can be fabricated in silicon-on-insulator (SOI) technology using multiple threshold voltage FETs. Embodiments provide circuits including the advantageous properties of both low-leakage transistors and high-speed transistors.
H01L 27/10 - Dispositifs consistant en une pluralité de composants semi-conducteurs ou d'autres composants à l'état solide formés dans ou sur un substrat commun comprenant des éléments de circuit passif intégrés avec au moins une barrière de potentiel ou une barrière de surface le substrat étant un corps semi-conducteur comprenant une pluralité de composants individuels dans une configuration répétitive
H03K 19/0948 - Circuits logiques, c.-à-d. ayant au moins deux entrées agissant sur une sortieCircuits d'inversion utilisant des éléments spécifiés utilisant des dispositifs à semi-conducteurs utilisant des transistors à effet de champ utilisant des transistors MOSFET utilisant des dispositifs CMOS
H01L 27/12 - Dispositifs consistant en une pluralité de composants semi-conducteurs ou d'autres composants à l'état solide formés dans ou sur un substrat commun comprenant des éléments de circuit passif intégrés avec au moins une barrière de potentiel ou une barrière de surface le substrat étant autre qu'un corps semi-conducteur, p.ex. un corps isolant
H01L 21/84 - Fabrication ou traitement de dispositifs consistant en une pluralité de composants à l'état solide ou de circuits intégrés formés dans ou sur un substrat commun avec une division ultérieure du substrat en plusieurs dispositifs individuels pour produire des dispositifs, p.ex. des circuits intégrés, consistant chacun en une pluralité de composants le substrat étant autre chose qu'un corps semi-conducteur, p.ex. étant un corps isolant
H03K 19/08 - Circuits logiques, c.-à-d. ayant au moins deux entrées agissant sur une sortieCircuits d'inversion utilisant des éléments spécifiés utilisant des dispositifs à semi-conducteurs
H03K 19/20 - Circuits logiques, c.-à-d. ayant au moins deux entrées agissant sur une sortieCircuits d'inversion caractérisés par la fonction logique, p. ex. circuits ET, OU, NI, NON
34.
Software based application specific integrated circuit
A processing device is provided. A cluster includes a plurality of groups of processing elements. A multi-word device is connected to the processing elements within the groups. Each processing element in a particular group is in communication with all other processing elements within the particular group, and only one of the processing elements within other groups in the cluster. Each processing element is limited to operations in which input bits can be processed and an output obtained without reference to other bits. The multi-word device is configured to cooperate with at least two other processing elements to perform processing that requires reference to other bits to obtain a result.
G06F 9/38 - Exécution simultanée d'instructions, p. ex. pipeline ou lecture en mémoire
G06F 9/30 - Dispositions pour exécuter des instructions machines, p. ex. décodage d'instructions
G06F 15/173 - Communication entre processeurs utilisant un réseau d'interconnexion, p. ex. matriciel, de réarrangement, pyramidal, en étoile ou ramifié
An apparatus for mathematical manipulation is described allowing the selective combination of shifters to shift binary numbers of various widths. Selective combination allows on-the-fly adjustment of shifters from independent to coordinated shifting operations. Selective combination allows adjustable hardware-based shifting while saving space and resources. Multiple eight-bit shifters can be configured for a variety of operand widths, such as a 32-bit width, a 24-bit width, a 16-bit width, or an eight-bit width. Multiplexers route the appropriate input data to the appropriate shifters. Opcodes configure the shifters for the desired type of shift and a shifted result is generated.
G06F 5/01 - Procédés ou dispositions pour la conversion de données, sans modification de l'ordre ou du contenu des données maniées pour le décalage, p. ex. la justification, le changement d'échelle, la normalisation
An extensible iterative multiplier design is provided. Embodiments provide cascaded 8-bit multipliers for simplifying the performance of multi-byte multiplications. Booth encoding is performed in the lowest order multiplier, with the result of the Booth encoding then provided to higher order multipliers. Additionally, multiply-add operations can be performed by initializing a partial product sum register. Configurable connections between the multipliers facilitate a variety of possible multiplication options, including the possibility of varying the width of the operands.
Systems and methods for clock generation and distribution are disclosed. Embodiments include arrangements of synchronization signals implemented using a mesh circuit. The mesh circuit is comprised of a plurality of null convention logic (NCL) gates organized into rings. Each ring shares at least one NCL gate with an adjacent ring. The rings are configured in such a way that each ring in the mesh operates synchronously with the other rings in the mesh.
An implementation method for a fast Null Convention Logic (NCL) data path includes a pipeline that is assembled from gates of various types of NCL. Self-ready flash NCL gates include a one-shot circuit to reset the gates to a null state and prepare the gates for the next wave of asserted data. In one embodiment, the one-shot circuit creates a flash pulse inside a gate in response to a change of a flash input line and ends the flash pulse in response to the gate output being reset to a null state. Conventional logic can be included in the data path as well.
H03K 19/094 - Circuits logiques, c.-à-d. ayant au moins deux entrées agissant sur une sortieCircuits d'inversion utilisant des éléments spécifiés utilisant des dispositifs à semi-conducteurs utilisant des transistors à effet de champ
H03K 19/00 - Circuits logiques, c.-à-d. ayant au moins deux entrées agissant sur une sortieCircuits d'inversion
G06F 9/38 - Exécution simultanée d'instructions, p. ex. pipeline ou lecture en mémoire
Multi-threshold flash Null Convention Logic (NCL) includes one or more high threshold voltage transistors within a flash NCL gate to reduce power consumption due to current leakage by transistors of the NCL gate. High-threshold voltage transistors may be added and/or may be used in place of one or more lower voltage threshold transistors of the NCL gate. A high-Vt device is included in the pull-up path to reduce power when the flash NCL logic gate is in the null state.
H03K 19/00 - Circuits logiques, c.-à-d. ayant au moins deux entrées agissant sur une sortieCircuits d'inversion
H03K 19/094 - Circuits logiques, c.-à-d. ayant au moins deux entrées agissant sur une sortieCircuits d'inversion utilisant des éléments spécifiés utilisant des dispositifs à semi-conducteurs utilisant des transistors à effet de champ
A self-ready flash null Convention Logic (NCL) gate includes a one-shot circuit to create the flash timing to reset the gate to a null state. The one-shot circuit may be any type of circuit to generate a pulse in response to a change of state of an input line. In one embodiment, the one-shot circuit may start the pulse in response to a change of a flash input line and end the pulse in response to the NCL output being reset to a null state.
H03K 19/094 - Circuits logiques, c.-à-d. ayant au moins deux entrées agissant sur une sortieCircuits d'inversion utilisant des éléments spécifiés utilisant des dispositifs à semi-conducteurs utilisant des transistors à effet de champ
41.
Method and apparatus for ensuring data cache coherency
A multithreaded processor can concurrently execute a plurality of threads in a processor core. The threads can access a shared main memory through a memory interface; the threads can generate read and write transactions that cause shared main memory access. An incoherency detection module prevents incoherency by maintaining a record of outstanding global writes, and detecting a conflicting global read. A barrier is sequenced with the conflicting global write. The conflicting global read is allowed to proceed after the sequence of the conflicting global write and the barrier are cleared. The sequence can be maintained by a separate queue for each thread of the plurality.