Techniques are disclosed for data manipulation using pipelined integer matrix multiplication. A first integer matrix with dimensions m×k and a second integer matrix with dimensions k×n are obtained for matrix multiplication within a processor. The first and second integer matrices employ a two's complement variable radix point data representation. The first and second integer matrices are distilled into (j×j) submatrices. An initial value for an accumulator register is configured dynamically. A first variable radix point format is configured dynamically for the first integer matrix, and a second variable radix point format is configured dynamically for the second integer matrix. Multiply-accumulate operations are executed in a pipelined fashion on the (j×j) submatrices of the first integer matrix and the second integer matrix, where a third variable radix point format is configured for the result.
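The tiled, format-aware arithmetic described above can be sketched in a few lines of software. The following is a minimal model under illustrative assumptions: the tile size and radix point positions (J, FRAC_A, FRAC_B, FRAC_OUT) are invented names, not values from the patent. Each (j×j) submatrix is multiply-accumulated into a wide integer accumulator, and a single final shift realizes the third radix point format for the result.

```python
import numpy as np

J = 4           # submatrix tile size (j x j) -- illustrative
FRAC_A = 12     # radix point position for the first matrix (fraction bits)
FRAC_B = 10     # radix point position for the second matrix
FRAC_OUT = 16   # radix point format configured for the result

def to_fixed(x, frac_bits):
    # Quantize floats to two's complement integers with a chosen radix point.
    return np.round(x * (1 << frac_bits)).astype(np.int64)

def fixed_matmul(a_fix, b_fix):
    # Multiply tile by tile; products carry FRAC_A + FRAC_B fraction bits,
    # which the final shift converts to the FRAC_OUT result format.
    m, k = a_fix.shape
    _, n = b_fix.shape
    acc = np.zeros((m, n), dtype=np.int64)   # accumulator, initialized to zero
    for i in range(0, m, J):
        for jj in range(0, n, J):
            for p in range(0, k, J):         # one MAC pass per (j x j) tile
                acc[i:i+J, jj:jj+J] += a_fix[i:i+J, p:p+J] @ b_fix[p:p+J, jj:jj+J]
    return acc >> (FRAC_A + FRAC_B - FRAC_OUT)

a = to_fixed(np.random.rand(8, 8), FRAC_A)
b = to_fixed(np.random.rand(8, 8), FRAC_B)
c = fixed_matmul(a, b) / (1 << FRAC_OUT)     # back to floats for inspection
```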
Techniques for data manipulation using processor cluster address generation are disclosed. One or more processor clusters capable of executing software-initiated work requests are accessed. A plurality of dimensions from a tensor is flattened into a single dimension. A work request address field is parsed, where the address field contains unique address space descriptors for each of the plurality of dimensions, along with a common address space descriptor. A direct memory access (DMA) engine coupled to the one or more processor clusters is configured. Addresses are generated based on the unique address space descriptors and the common address space descriptor. The address contributions of the plurality of dimensions can be summed to generate a single address. Memory is accessed using two or more of the addresses that were generated. The addresses are used to enable DMA access.
G06F 12/06 - Addressing a physical block of locations, e.g. base addressing, module addressing, address space extension, memory dedication
G06F 13/28 - Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access, cycle steal
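The descriptor-driven address generation in the entry above can be modeled directly. The sketch below assumes a hypothetical descriptor layout (a 'base' field in the common descriptor, 'stride' and 'count' fields per dimension); the abstract does not define these fields. Each per-dimension contribution is summed onto the common base, flattening the tensor dimensions into single flat addresses.

```python
from itertools import product

def generate_addresses(common, dims):
    # common: {'base': int}; dims: list of {'stride': int, 'count': int}.
    # One flat address per tensor element: the per-dimension contributions
    # are summed onto the base from the common descriptor.
    for idx in product(*(range(d['count']) for d in dims)):
        offset = sum(i * d['stride'] for i, d in zip(idx, dims))
        yield common['base'] + offset

# A 2 x 3 tile: rows 64 bytes apart, elements 4 bytes apart.
addrs = list(generate_addresses({'base': 0x1000},
                                [{'stride': 64, 'count': 2},
                                 {'stride': 4,  'count': 3}]))
print([hex(a) for a in addrs])
# ['0x1000', '0x1004', '0x1008', '0x1040', '0x1044', '0x1048']
```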
3.
Processor graph execution using interrupt conservation
Techniques are disclosed for data manipulation using processor graph execution with interrupt conservation. Processing elements are configured to implement a data flow graph. The processing elements comprise a multilayer graph execution engine. A data engine is loaded with computational parameters for the multilayer graph execution engine. The data engine is coupled to the multilayer graph execution engine, and the computational parameters supply layer-by-layer execution data to the multilayer graph execution engine for data flow graph execution. A first command FIFO is used for loading the data engine with computational parameters, and a second command FIFO is used for loading the multilayer graph execution engine with layer definition data. An input image is provided for a first layer of the multilayer graph execution engine. The data flow graph is executed using the input image and the computational parameters. The executing is controlled by interrupts only when an uncertainty exists within the data flow graph.
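A minimal model of the two-FIFO control flow is sketched below. The layer format, the operations, and the notion of an 'uncertain' layer are stand-ins invented for illustration; the abstract specifies only that one FIFO carries computational parameters, the other carries layer definitions, and interrupts fire only when an uncertainty exists.

```python
from collections import deque

cmd_fifo_params = deque()    # first command FIFO: computational parameters
cmd_fifo_layers = deque()    # second command FIFO: layer definition data

def run_graph(input_image, raise_interrupt):
    data = input_image
    while cmd_fifo_layers:
        layer = cmd_fifo_layers.popleft()
        params = cmd_fifo_params.popleft()
        data = layer['op'](data, params)      # layer-by-layer execution
        if layer.get('uncertain'):            # interrupt only on uncertainty
            raise_interrupt(layer['name'])
    return data

cmd_fifo_layers.append({'name': 'scale', 'op': lambda d, p: [x * p for x in d]})
cmd_fifo_layers.append({'name': 'bias',  'op': lambda d, p: [x + p for x in d]})
cmd_fifo_params.extend([2, 1])
print(run_graph([1, 2, 3], lambda name: print('interrupt at', name)))
# [3, 5, 7], with no interrupts raised: no layer was marked uncertain
```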
Techniques are disclosed for data manipulation using pipelined integer matrix multiplication. A first integer matrix with dimensions m×k and a second integer matrix with dimensions k×n are obtained for matrix multiplication within a processor. The first and second integer matrices employ a two's complement variable radix point data representation. The first and second integer matrices are distilled into (j×j) submatrices. An initial value for an accumulator register is configured dynamically. A first variable radix point format is configured dynamically for the first integer matrix, and a second variable radix point format is configured dynamically for the second integer matrix. Multiply-accumulate operations are executed in a pipelined fashion on the (j×j) submatrices of the first integer matrix and the second integer matrix, where a third variable radix point format is configured for the result.
Techniques are disclosed for data manipulation using processor graph execution with interrupt conservation. Processing elements are configured to implement a data flow graph. The processing elements comprise a multilayer graph execution engine. A data engine is loaded with computational parameters for the multilayer graph execution engine. The data engine is coupled to the multilayer graph execution engine, and the computational parameters supply layer-by-layer execution data to the multilayer graph execution engine for data flow graph execution. A first command FIFO is used for loading the data engine with computational parameters, and a second command FIFO is used for loading the multilayer graph execution engine with layer definition data. An input image is provided for a first layer of the multilayer graph execution engine. The data flow graph is executed using the input image and the computational parameters. The executing is controlled by interrupts only when an uncertainty exists within the data flow graph.
Techniques for data manipulation using processor cluster address generation are disclosed. One or more processor clusters capable of executing software-initiated work requests are accessed. A plurality of dimensions from a tensor is flattened into a single dimension. A work request address field is parsed, where the address field contains unique address space descriptors for each of the plurality of dimensions, along with a common address space descriptor. A direct memory access (DMA) engine coupled to the one or more processor clusters is configured. Addresses are generated based on the unique address space descriptors and the common address space descriptor. The address contributions of the plurality of dimensions can be summed to generate a single address. Memory is accessed using two or more of the addresses that were generated. The addresses are used to enable DMA access.
G06F 12/06 - Addressing a physical block of locations, e.g. base addressing, module addressing, address space extension, memory dedication
G06F 13/28 - Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access, cycle steal
7.
Integer matrix multiplication engine using pipelining
Techniques are disclosed for data manipulation using pipelined integer matrix multiplication. A first integer matrix with dimensions m×k and a second integer matrix with dimensions k×n are obtained for matrix multiplication within a processor. The first and second integer matrices employ a two's complement variable radix point data representation. The first and second integer matrices are distilled into (j×j) submatrices. An initial value for an accumulator register is configured dynamically. A first variable radix point format is configured dynamically for the first integer matrix, and a second variable radix point format is configured dynamically for the second integer matrix. Multiply-accumulate operations are executed in a pipelined fashion on the (j×j) submatrices of the first integer matrix and the second integer matrix, where a third variable radix point format is configured for the result.
Techniques for data manipulation using processor cluster address generation are disclosed. One or more processor clusters capable of executing software-initiated work requests are accessed. A direct memory access (DMA) engine, coupled to the one or more processor clusters, is configured, wherein the DMA engine employs address generation across a plurality of tensor dimensions. A work request address field is parsed, where the address field contains unique address space descriptors for each of the plurality of dimensions, along with a common address space descriptor. DMA addresses are generated based on the unique address space descriptors and the common address space descriptor. Memory is accessed using two or more of the DMA addresses that were generated, where the two or more DMA addresses enable processing within the one or more processor clusters.
G06F 13/28 - Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access, cycle steal
Techniques are disclosed for data manipulation using a pipelined matrix multiplication engine. A first and a second matrix are obtained for matrix multiplication. A first matrix multiply-accumulate (MAC) unit is configured within a processor, where a first element of the first matrix and a first element of the second matrix are presented to the first MAC unit on a first cycle. A second MAC unit is configured in pipelined fashion, where the first element of the first matrix and a second element of the second matrix are presented to the second MAC unit on a second cycle, and where a second element of the first matrix and the first element of the second matrix are presented to the first MAC unit on the second cycle. Additional MAC units are further configured within the processor in pipelined fashion. Multiply-accumulate operations are executed in pipelined fashion on each of n MAC units over k additional sets of m cycles.
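The staggered presentation schedule reads naturally as a small simulation. In the sketch below, MAC unit u holds one element of the second matrix, and the stream of first-matrix elements reaches unit u one cycle later than unit u-1; the indexing and the single-row/column toy case are illustrative assumptions, not the full m×k by k×n engine.

```python
def pipelined_mac(a, b):
    # MAC unit u holds b[u]; the a-stream reaches unit u one cycle after
    # unit u-1, so on the first cycle only unit 0 sees a[0] and b[0], and
    # on the second cycle unit 1 sees a[0], b[1] while unit 0 sees a[1], b[0].
    n = len(b)                    # one MAC unit per element of b
    acc = [0] * n
    for cycle in range(len(a) + n - 1):
        for u in range(n):        # all units operate in the same cycle
            i = cycle - u         # index of the a-element at unit u
            if 0 <= i < len(a):
                acc[u] += a[i] * b[u]
    return acc

print(pipelined_mac([1, 2, 3], [10, 20]))   # [60, 120]
```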
Disclosed embodiments provide an interface circuit for the transfer of data from a synchronous circuit to an asynchronous circuit. Data from the synchronous circuit is received into a memory in the interface circuit. The data in the memory is then sent to the asynchronous circuit based on an instruction in a circular buffer that is part of the interface circuit. Processing elements within the interface circuit execute instructions contained within the circular buffer. The circular buffer rotates to provide new instructions to the processing elements. Flow control paces the data from the synchronous circuit to the asynchronous circuit.
Techniques for a neural network output layer for machine learning are disclosed. A plurality of processing elements within a reconfigurable fabric is configured to implement a data flow graph, where the data flow graph implements a neural network. The data flow graph can include machine learning or deep learning. A layer is implemented, within the neural network, that maps a first vector of real values to a second vector of real values bounded by zero and one, where the second vector sums to a value of one using fixed-point calculations. The layer can include a final layer within the neural network. The layer that maps the first vector includes a Softmax function. Results of the neural network are classified based on a value of the second vector. The classifying can include part of a machine learning or a deep learning process.
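The mapping described above is the standard Softmax, sketched below with fixed-point outputs. The Q16 format and the max-subtraction step are illustrative assumptions; the abstract does not give the number format.

```python
import math

FRAC = 16                     # Q16 fixed point: 16 fractional bits
ONE = 1 << FRAC

def softmax_fixed(xs):
    m = max(xs)                               # subtract max for stability
    exps = [int(math.exp(x - m) * ONE) for x in xs]
    total = sum(exps)
    # Floor division keeps every value in [0, ONE]; the outputs sum to
    # ONE up to rounding, i.e. the second vector sums to a value of one.
    return [e * ONE // total for e in exps]

probs = softmax_fixed([1.0, 2.0, 3.0])
print([p / ONE for p in probs])   # ~[0.090, 0.245, 0.665]
print(sum(probs) / ONE)           # ~1.0
```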
Techniques are disclosed for data manipulation within a reconfigurable computing environment for data flow graph computation using exceptions. Processing elements are configured within a reconfigurable fabric to implement a data flow graph. The processing elements are loaded with process agents. Valid data is executed by a first process agent on a first processing element, where the first process agent corresponds to a starting node of the data flow graph. A second processing element detects that an error exception has occurred, where a second process agent is running on the second processing element. A done signal to a third process agent is withheld by the second process agent, where the third process agent is running on a third processing element. The second process agent raises an interrupt request, where the interrupt request is based on the detecting that an error exception has occurred.
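The done-withholding protocol can be sketched as below. The agent's computation, the signal names, and the interrupt callback are invented for illustration; only the control flow (withhold the done signal and raise an interrupt when an error exception is detected) follows the abstract.

```python
def second_agent(value, signal_done, raise_interrupt):
    # Process one item of valid data. On an error exception, the done
    # signal to the third agent is withheld and an interrupt is raised.
    try:
        result = 10 // value     # stand-in for the node's computation
    except ZeroDivisionError:    # the error exception is detected here
        raise_interrupt('error exception in second process agent')
        return None              # done withheld: the third agent waits
    signal_done()                # normal path: done fires downstream
    return result

second_agent(5, lambda: print('done -> third agent'), print)  # done fires
second_agent(0, lambda: print('done -> third agent'), print)  # interrupt only
```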
Techniques are disclosed for data flow graph node parallel update for machine learning. A first plurality of processing elements is configured to implement a portion of a data flow graph. The nodes include at least one variable node and implement part of a neural network. A second plurality of processing elements is configured to implement a second portion of the data flow graph. These nodes include at least one additional variable node and implement an additional part of the neural network. Training data is issued to the first plurality of processing elements. The training data is used to update variables within the at least one variable node. Additional variables are updated within the at least one additional variable node. The updating includes forwarding training data from the first plurality to the second plurality. The neural network is trained based on the variables that were updated and the additional variables.
An interface circuit is disclosed for the transfer of data from a synchronous circuit, with multiple source elements, to an asynchronous circuit. Data from the synchronous circuit is received into a memory in the interface circuit. The data in the memory is then sent to the asynchronous circuit based on an instruction in a circular buffer that is part of the interface circuit. Processing elements within the interface circuit execute instructions contained within the circular buffer. The circular buffer rotates to provide new instructions to the processing elements. Flow control paces the data from the synchronous circuit to the asynchronous circuit.
Techniques are disclosed for power conservation. A plurality of processing elements and a plurality of instructions are configured. The plurality of processing elements is controlled by instructions contained in a plurality of circular buffers. The plurality of processing elements can comprise a data flow processor. A first processing element, from the plurality of interconnected processing elements, is set into a sleep state by a first instruction from the plurality of instructions. The first processing element is woken from the sleep state as a result of valid data being presented to the first processing element. A subsection of the plurality of interconnected processing elements is also set into a sleep state based on the first processing element being set into a sleep state. At least one circular buffer from the plurality of circular buffers remains awake while the first processing element is in the sleep state, and the at least one circular buffer provides for data steering through a reconfigurable fabric.
H03K 19/177 - Logic circuits, i.e. having at least two inputs acting on one output; inverting circuits using specified components using elementary logic circuits as components arranged in matrix form
H01L 25/00 - Assemblies consisting of a plurality of individual semiconductor or other solid-state devices
G06F 5/08 - Methods or arrangements for data conversion without changing the order or content of the data handled for changing the speed of data flow, i.e. speed regularising having a sequence of storage locations, the intermediate ones not being accessible for either enqueue or dequeue operations, e.g. using a shift register
G06F 1/3287 - Power saving characterised by the action undertaken by switching off individual functional units in the computer system
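A minimal model of the sleep/wake behavior in the entry above: the element sleeps on a sleep instruction and wakes only when valid data is presented. The instruction name and the stand-in computation are illustrative assumptions.

```python
class ProcessingElement:
    def __init__(self):
        self.asleep = False

    def execute(self, instr):
        if instr == 'SLEEP':         # set into a sleep state by an instruction
            self.asleep = True

    def present_data(self, data, valid):
        if valid and self.asleep:
            self.asleep = False      # valid data wakes the element
        if valid and not self.asleep:
            return data * 2          # stand-in computation
        return None                  # nothing happens while asleep

pe = ProcessingElement()
pe.execute('SLEEP')
print(pe.present_data(21, valid=False))   # None: still asleep
print(pe.present_data(21, valid=True))    # 42: woken by valid data
```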
Techniques are disclosed for designing a reconfigurable fabric. The reconfigurable fabric is designed using logical elements, configurable connections between and among the logical elements, and rotating circular buffers. The circular buffers contain configuration instructions. The configuration instructions control connections between and among logical elements. The logical elements change operation based on the instructions that rotate through the circular buffers. Clusters of logical elements are interconnected by a switching fabric. Each cluster contains processing elements, storage elements, and switching elements. A circular buffer within a cluster contains multiple switching instructions to control the flow of data throughout the switching fabric. The circular buffer provides a pipelined execution of switching instructions for the implementation of multiple functions. Each cluster contains multiple processing elements, and each cluster further comprises an additional circular buffer for each processing element. Logical operations are controlled by the circular buffers.
H03K 19/177 - Logic circuits, i.e. having at least two inputs acting on one output; inverting circuits using specified components using elementary logic circuits as components arranged in matrix form
17.
Branchless instruction paging in reconfigurable fabric
Circular buffers containing instructions that enable the execution of operations on logical elements are described where data in the circular buffers is swapped to storage. The instructions comprise a branchless instruction set. Data stored in circular buffers is paged in and out to a second level memory. State information for each logical element is also saved and restored using paging memory. Instructions are provided to logical elements, such as processing elements, via circular buffers. The instructions enable a group of processing elements to perform operations implementing a desired functionality. That functionality is changed by updating the circular buffers with new instructions that are transferred from paging memory. The previous instructions can be saved off in paging memory before the new instructions are copied over to the circular buffers. This enables the hardware to be rapidly reconfigured amongst multiple functions.
G06F 12/00 - Accessing, addressing or allocating within memory systems or architectures
G06F 12/0875 - Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches with dedicated cache, e.g. instruction or stack
G06F 9/38 - Concurrent instruction execution, e.g. pipeline or look ahead
G06F 15/78 - Architectures of general purpose stored program computers comprising a single central processing unit
G06F 3/06 - Digital input from, or digital output to, record carriers
G06F 12/08 - Addressing or allocation; relocation in hierarchically structured memory systems, e.g. virtual memory systems
G06F 13/28 - Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access, cycle steal
G06N 3/04 - Architecture, e.g. interconnection topology
G06N 3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
G06F 9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
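The paging scheme in the entry above can be sketched as a save/restore of circular buffer contents and element state. The dictionary-based paging memory and the page keys below are invented for illustration.

```python
paging_memory = {}                  # second-level (paging) memory

def swap_function(circ_buf, state, save_key, load_key):
    # Save the current instructions and state before copying new ones in,
    # so the hardware can be rapidly reconfigured among functions.
    paging_memory[save_key] = (circ_buf[:], dict(state))
    new_buf, new_state = paging_memory[load_key]
    circ_buf[:] = new_buf           # circular buffer now holds new function
    state.clear()
    state.update(new_state)         # element state restored from the page

# One saved page: instructions plus state for a hypothetical FIR function.
paging_memory['fir_filter'] = (['MUL', 'ADD', 'SHIFT'], {'acc': 0})
buf, st = ['NOP'], {'acc': 7}
swap_function(buf, st, save_key='idle', load_key='fir_filter')
print(buf, st)   # ['MUL', 'ADD', 'SHIFT'] {'acc': 0}
```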
Techniques are disclosed for managing data within a reconfigurable computing environment. In a multiple processing element environment, such as a mesh network or other suitable topology, there is an inherent need to pass data between processing elements. Subtasks are divided among multiple processing elements. The output resulting from the subtasks is then merged by a downstream processing element. In such cases, a join operation can be used to combine data from multiple upstream processing elements. A control agent executes on each processing element. A memory buffer is disposed between upstream processing elements and the downstream processing element. The downstream processing element is configured to automatically perform an operation based on the availability of valid data from the upstream processing elements.
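A minimal sketch of that join: the downstream operation fires automatically once the interposed buffers hold valid data from every upstream element. The buffer layout and the summing merge are illustrative assumptions.

```python
from collections import deque

upstream_buffers = {'pe0': deque(), 'pe1': deque()}   # interposed memory

def push(pe_name, value):
    # A control agent on an upstream element deposits valid data, then
    # the downstream join fires automatically if every buffer is ready.
    upstream_buffers[pe_name].append(value)
    return try_join()

def try_join():
    if all(upstream_buffers.values()):                # all inputs valid
        inputs = [buf.popleft() for buf in upstream_buffers.values()]
        return sum(inputs)                            # stand-in merge
    return None                                       # keep waiting

print(push('pe0', 3))   # None: pe1 has not produced data yet
print(push('pe1', 4))   # 7: the join fires automatically
```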
Disclosed techniques utilize a satisfiability solver for allocation and/or configuration of resources in a reconfigurable fabric of processing elements. A dataflow graph is an input provided to a toolchain that includes a satisfiability solver. The satisfiability solver operates on subsets of interconnected nodes within a dataflow graph to derive a solution. The solution is trimmed by removing artifacts and unnecessary parts. The solutions of subsets are then used as an input to additional subsets of nodes within the dataflow graph in an iterative process to derive a complete solution. The satisfiability solver technique uses adaptive windowing in both the time dimension and the spatial dimensions of the dataflow graph. Processing elements and routing elements within the reconfigurable fabric are configured based on the complete solution. Data computation is performed based on the dataflow graph using the processing elements and the routing resources.
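The iterative, windowed flow can be sketched as below, with a brute-force search standing in for a real satisfiability solver: each window of nodes is solved against the sites left free by earlier windows, and the partial solutions accumulate into a complete placement. Node and site names are invented for illustration.

```python
from itertools import permutations

def solve_window(nodes, sites, fixed):
    # Toy stand-in for a SAT solve: assign each node in this window to a
    # site left free by the already-fixed partial solution.
    free = [s for s in sites if s not in fixed.values()]
    assignment = next(iter(permutations(free, len(nodes))), None)
    return dict(zip(nodes, assignment)) if assignment else None

graph_nodes = ['n0', 'n1', 'n2', 'n3']
sites = ['pe0', 'pe1', 'pe2', 'pe3']
solution = {}
for start in range(0, len(graph_nodes), 2):   # window of two nodes at a time
    window = graph_nodes[start:start + 2]
    solution.update(solve_window(window, sites, solution))
print(solution)   # complete placement assembled window by window
```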
An apparatus for mathematical manipulation is described that allows the selective combination of shifters to shift binary numbers of various widths. Selective combination allows on-the-fly adjustment of shifters from independent to coordinated shifting operations, providing adjustable hardware-based shifting while saving space and resources. Multiple eight-bit shifters can be configured for a variety of operand widths, such as a 32-bit width, a 24-bit width, a 16-bit width, or an eight-bit width. Multiplexers route the appropriate input data to the appropriate shifters. Bidirectional shifting is configured through a selector tree, including both shift left and shift right operations. Opcodes configure the shifters for the desired type of shift and a shifted result is generated.
G06F 5/01 - Methods or arrangements for data conversion without changing the order or content of the data handled for shifting, e.g. justifying, scaling, normalising
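The coordinated mode in the entry above can be modeled as byte-wide shifters that ripple their carried-out bits to the next higher shifter. The sketch below combines four 8-bit shifters into a 32-bit left shift; the wiring is an illustrative model, not the patented selector tree.

```python
def shift8_left(byte, s):
    # One 8-bit shifter: returns (shifted byte, bits carried out the top).
    full = byte << s
    return full & 0xFF, full >> 8

def shift32_left(value, s):
    assert 0 <= s < 8              # per-stage shift amount for this model
    bytes_in = [(value >> (8 * i)) & 0xFF for i in range(4)]
    out, carry = [], 0
    for b in bytes_in:             # low byte first; carry ripples upward
        shifted, new_carry = shift8_left(b, s)
        out.append(shifted | carry)
        carry = new_carry
    return sum(o << (8 * i) for i, o in enumerate(out))

x = 0x01234567
assert shift32_left(x, 4) == (x << 4) & 0xFFFFFFFF
print(hex(shift32_left(x, 4)))   # 0x12345670
```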
21.
Reconfigurable fabric direct memory access with multiple read or write elements
Techniques are disclosed for data manipulation. Data is obtained from a first switching element where the first switching element is controlled by a first circular buffer. Data is sent to a second switching element where the second switching element is controlled by a second circular buffer. Data is controlled by a third switching element that is controlled by a third circular buffer. The third switching element hierarchically controls the first switching element and the second switching element. Data is routed through a fourth switching element that is controlled by a fourth circular buffer. The circular buffers are statically scheduled. The obtaining data from a first switching element and the sending the data to a second switching element includes a direct memory access (DMA). The switching elements can operate as a master controller or as a slave device. The switching elements can comprise clusters within an asynchronous reconfigurable fabric.
G06F 13/28 - Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access, cycle steal
G06F 13/16 - Handling requests for interconnection or transfer for access to memory bus
G06F 13/42 - Bus transfer protocol, e.g. handshake; synchronisation
22.
Communication between dataflow processing units and memories
A combination of memory units and dataflow processing units is disclosed for computation. A first memory unit is interposed between a first dataflow processing unit and a second dataflow processing unit. Operations for a dataflow graph are allocated across the first dataflow processing unit and the second dataflow processing unit. The first memory unit passes data between the first dataflow processing unit and the second dataflow processing unit to execute the dataflow graph. The first memory unit is a high bandwidth, shared memory device including a hybrid memory cube. The first dataflow processing unit and second dataflow processing unit include a plurality of circular buffers containing instructions for controlling data transfer between the first dataflow processing unit and second dataflow processing unit. Additional dataflow processing units and memory units are included for further functionality and efficiency.
G06F 5/10 - Methods or arrangements for data conversion without changing the order or content of the data handled for changing the speed of data flow, i.e. speed regularising having a sequence of storage locations each being individually accessible for both enqueue and dequeue operations, e.g. using random access memory
G06F 13/16 - Handling requests for interconnection or transfer for access to memory bus
G06F 15/82 - Architectures of general purpose stored program computers data or demand driven
G11C 19/00 - Digital stores in which the information is moved stepwise, e.g. shift registers
Methods and systems for timing analysis and optimization of asynchronous circuit designs are disclosed. Registration stages are placed between combinational logic circuits. For timing purposes, the registration stages are modified to have a duplicate set of pins. New paths are formed in the circuit for the purposes of timing analysis. The paths are analyzable by timing tools. Once the timing analysis is complete, the paths are reverted to original paths, and new devices are selected for the circuit design based on results of the timing analysis. An updated design is sent for manufacture, based on the timing analysis and optimization of the asynchronous circuit.
Techniques are disclosed for power conservation. A plurality of processing elements and a plurality of instructions are configured. The plurality of processing elements is controlled by instructions contained in a plurality of circular buffers. The plurality of processing elements can comprise a dataflow processor. A first processing element, from the plurality of interconnected processing elements, is set into a sleep state by a first instruction from the plurality of instructions. The first processing element is woken from the sleep state as a result of valid data being presented to the first processing element. A subsection of the plurality of interconnected processing elements is also set into a sleep state based on the first processing element being set into a sleep state. At least one circular buffer from the plurality of circular buffers remains awake while the first processing element is in the sleep state, and the at least one circular buffer provides for data steering through a reconfigurable fabric.
H03K 19/177 - Logic circuits, i.e. having at least two inputs acting on one output; inverting circuits using specified components using elementary logic circuits as components arranged in matrix form
H01L 25/00 - Assemblies consisting of a plurality of individual semiconductor or other solid-state devices
G06F 7/38 - Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
G06F 5/08 - Methods or arrangements for data conversion without changing the order or content of the data handled for changing the speed of data flow, i.e. speed regularising having a sequence of storage locations, the intermediate ones not being accessible for either enqueue or dequeue operations, e.g. using a shift register
Disclosed embodiments select a proper hum frequency reference by utilizing one or more functional logic circuits within a cluster. The slowest logic circuit is determined, and an instance of that logic circuit is used in timing circuitry for the cluster. Multiple logic circuits with similar characteristics are incorporated into the timing circuit. Each cluster is interconnected to a second level timing circuit. Each cluster inputs timing information into the second level timing circuit. The second level timing circuit then determines when the next cycle, or tic, of the self-generated clock starts, and the process repeats, providing a self-generated clock signal.
H03K 19/096 - Synchronous circuits, i.e. using clock signals
H03K 19/20 - Logic circuits, i.e. having at least two inputs acting on one output; inverting circuits characterised by logic function, e.g. AND, OR, NOR, NOT circuits
H03K 5/135 - Arrangements having a single output and transforming input signals into pulses delivered at desired time intervals by the use of time reference signals, e.g. clock signals
26.
Logical elements with switchable connections for multifunction operation
Clusters of logical elements are interconnected by a switching fabric. Each cluster contains processing elements, storage elements, and switching elements. A circular buffer within a cluster contains multiple switching instructions to control the flow of data throughout the switching fabric. The circular buffer provides a pipelined execution of switching instructions for the implementation of multiple functions. Each cluster contains multiple processing elements, and each cluster further comprises an additional circular buffer for each processing element. Logical operations are controlled by the circular buffers.
H03K 19/177 - Logic circuits, i.e. having at least two inputs acting on one output; inverting circuits using specified components using elementary logic circuits as components arranged in matrix form
A plurality of software programmable processors is disclosed. The software programmable processors are controlled by rotating circular buffers. A first processor and a second processor within the plurality of software programmable processors are individually programmable. The first processor within the plurality of software programmable processors is coupled to neighbor processors within the plurality of software programmable processors. The first processor sends and receives data from the neighbor processors. The first processor and the second processor are configured to operate on a common instruction cycle. An output of the first processor from a first instruction cycle is an input to the second processor on a subsequent instruction cycle.
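The common-instruction-cycle coupling can be sketched as a one-register pipeline between neighbors: the first processor's output in one cycle is the second processor's input in the next. The operations themselves are placeholders invented for illustration.

```python
def run_cycles(stream, cycles):
    # reg models the first processor's output latched for the second.
    reg = 0
    outputs = []
    for c in range(cycles):
        out1 = stream[c] + 1 if c < len(stream) else 0   # first processor
        out2 = reg * 2                                   # second processor
        reg = out1            # first's output feeds second next cycle
        outputs.append(out2)
    return outputs

print(run_cycles([1, 2, 3], 4))   # [0, 4, 6, 8]: one-cycle handoff visible
```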
Circular buffers containing instructions that enable the execution of operations on logical elements are described where data in the circular buffers is swapped to storage. Data stored in circular buffers is paged in and out to a second level memory. State information for each logical element is also saved and restored using paging memory. Logical elements such as processing elements are provided instructions via circular buffers. The instructions enable a group of processing elements to perform operations implementing a desired functionality. That functionality is changed by updating the circular buffers with new instructions that are transferred from paging memory. The previous instructions can be saved off in paging memory before the new instructions are copied over to the circular buffers. This enables the hardware to be rapidly reconfigured amongst multiple functions.
Compact logic evaluation gates are built using null convention logic (NCL) circuits. The inputs to an NCL circuit include an NCL true input and an NCL complement input. The NCL circuit includes a gate coupled to the pair of inputs, where the gate comprises a plurality of transistors. The transistors allow for logical signal capture, provide a pair of cross-coupled inverters for data storage, and include a first and a second pull-down device. The first pull-down device causes a first side of the pair of cross-coupled inverters to go to a “0” state when a “1” is applied to the NCL true input, and the second pull-down device causes a second side of the pair of cross-coupled inverters to go to a “0” state when a “1” is applied to the NCL complement input.
H03K 19/00 - Logic circuits, i.e. having at least two inputs acting on one output; inverting circuits
H03K 19/094 - Logic circuits, i.e. having at least two inputs acting on one output; inverting circuits using specified components using semiconductor devices using field-effect transistors
30.
Computing resource allocation based on flow graph translation
Systems and methods are disclosed for computing resource allocation based on flow graph translation. First, a high-level description of logic circuitry is obtained and translated to generate a flow graph representing sequential operations. Using the flow graph, similar processing elements in an array are interchangeably allocated to perform computational, communication, and storage tasks as needed. The sequential operations are executed using the array of interchangeable processing elements. Data is provided from the storage elements through the communication elements to the computational elements. Computational results are stored in the storage elements. Outputs from some of the computational elements provide inputs to other computational elements. Execution of the instructions can be controlled with time stepping. The processors are reallocated as needed, based on changes to the flow graph.
Multi-threshold flash Null Convention Logic (NCL) includes one or more high threshold voltage transistors within a flash NCL gate to reduce power consumption due to current leakage by transistors of the NCL gate. High threshold voltage transistors may be added and/or used in place of one or more lower threshold voltage transistors of the NCL gate. A high-Vt device is included in the pull-up path to reduce power when the flash NCL gate is in the null state.
H03K 19/00 - Logic circuits, i.e. having at least two inputs acting on one output; inverting circuits
H03K 19/094 - Logic circuits, i.e. having at least two inputs acting on one output; inverting circuits using specified components using semiconductor devices using field-effect transistors
H03K 19/23 - Majority or minority circuits, i.e. giving output having the state of the majority or the minority of the inputs
H03K 23/40 - Gating or clocking signals applied to all stages, i.e. synchronous counters
H03K 23/58 - Gating or clocking signals not applied to all stages, i.e. asynchronous counters
H03K 19/096 - Synchronous circuits, i.e. using clock signals
Clusters of logical elements are interconnected by a switching fabric. Each cluster contains processing elements, storage elements, and switching elements. A circular buffer within a cluster contains multiple switching instructions to control the flow of data throughout the switching fabric. The circular buffer provides a pipelined execution of switching instructions. Each cluster contains multiple processing elements, and each cluster further comprises an additional circular buffer for each processing element. Logical operations are controlled by the circular buffers.
H03K 19/177 - Logic circuits, i.e. having at least two inputs acting on one output; inverting circuits using specified components using elementary logic circuits as components arranged in matrix form
33.
Multi-threshold circuitry based on silicon-on-insulator technology
Multiple threshold voltage circuitry based on silicon-on-insulator (SOI) technology is disclosed that utilizes N-wells and/or P-wells underneath the insulator in SOI FETs. The well under a FET is biased to influence the threshold voltage of the FET. A PFET and an NFET share a common buried P-well or N-well. Various types of logic can be fabricated in SOI technology using multiple threshold voltage FETs. Embodiments provide circuits that combine the advantageous properties of both low-leakage transistors and high-speed transistors.
H01L 27/10 - Devices consisting of a plurality of semiconductor or other solid-state components formed in or on a common substrate including integrated passive circuit elements with at least one potential-jump barrier or surface barrier the substrate being a semiconductor body including a plurality of individual components in a repetitive configuration
H03K 19/0948 - Logic circuits, i.e. having at least two inputs acting on one output; inverting circuits using specified components using semiconductor devices using field-effect transistors using MOSFET using CMOS
H01L 27/12 - Devices consisting of a plurality of semiconductor or other solid-state components formed in or on a common substrate including integrated passive circuit elements with at least one potential-jump barrier or surface barrier the substrate being other than a semiconductor body, e.g. an insulating body
H01L 21/84 - Manufacture or treatment of devices consisting of a plurality of solid state components or integrated circuits formed in, or on, a common substrate with subsequent division of the substrate into plural individual devices to produce devices, e.g. integrated circuits, each consisting of a plurality of components the substrate being other than a semiconductor body, e.g. being an insulating body
H03K 19/08 - Logic circuits, i.e. having at least two inputs acting on one output; inverting circuits using specified components using semiconductor devices
H03K 19/20 - Logic circuits, i.e. having at least two inputs acting on one output; inverting circuits characterised by logic function, e.g. AND, OR, NOR, NOT circuits
34.
Software based application specific integrated circuit
A processing device is provided. A cluster includes a plurality of groups of processing elements. A multi-word device is connected to the processing elements within the groups. Each processing element in a particular group is in communication with all other processing elements within that group and with only one of the processing elements within each of the other groups in the cluster. Each processing element is limited to operations in which input bits can be processed and an output obtained without reference to other bits. The multi-word device is configured to cooperate with at least two other processing elements to perform processing that requires reference to other bits to obtain a result.
An apparatus for mathematical manipulation is described that allows the selective combination of shifters to shift binary numbers of various widths. Selective combination allows on-the-fly adjustment of shifters from independent to coordinated shifting operations, providing adjustable hardware-based shifting while saving space and resources. Multiple eight-bit shifters can be configured for a variety of operand widths, such as a 32-bit width, a 24-bit width, a 16-bit width, or an eight-bit width. Multiplexers route the appropriate input data to the appropriate shifters. Opcodes configure the shifters for the desired type of shift and a shifted result is generated.
G06F 5/01 - Methods or arrangements for data conversion without changing the order or content of the data handled for shifting, e.g. justifying, scaling, normalising
An extensible iterative multiplier design is provided. Embodiments provide cascaded 8-bit multipliers for simplifying the performance of multi-byte multiplications. Booth encoding is performed in the lowest order multiplier, with the result of the Booth encoding then provided to higher order multipliers. Additionally, multiply-add operations can be performed by initializing a partial product sum register. Configurable connections between the multipliers facilitate a variety of possible multiplication options, including the possibility of varying the width of the operands.
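The cascading can be sketched as byte-wise partial products shifted into position and accumulated; preloading the partial product sum register is what yields a multiply-add. Booth encoding is omitted from this sketch, which only illustrates the cascade of byte-wide multipliers.

```python
def mul8(a, b):
    # Stands in for one 8-bit multiplier stage.
    assert a < 256 and b < 256
    return a * b

def wide_mul(x, y, width_bytes=2):
    xb = [(x >> (8 * i)) & 0xFF for i in range(width_bytes)]
    yb = [(y >> (8 * i)) & 0xFF for i in range(width_bytes)]
    acc = 0                        # partial product sum register;
                                   # preloading it gives multiply-add
    for i, a in enumerate(xb):
        for j, b in enumerate(yb):
            acc += mul8(a, b) << (8 * (i + j))
    return acc

assert wide_mul(0x1234, 0x00FF) == 0x1234 * 0xFF
print(hex(wide_mul(0x1234, 0x5678)))   # 0x6260060
```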
Systems and methods for clock generation and distribution are disclosed. Embodiments include arrangements of synchronization signals implemented using a mesh circuit. The mesh circuit is comprised of a plurality of null convention logic (NCL) gates organized into rings. Each ring shares at least one NCL gate with an adjacent ring. The rings are configured in such a way that each ring in the mesh operates synchronously with the other rings in the mesh.
An implementation method for a fast Null Convention Logic (NCL) data path includes a pipeline assembled from various types of NCL gates. Self-ready flash NCL gates include a one-shot circuit to reset the gates to a null state and prepare the gates for the next wave of asserted data. In one embodiment, the one-shot circuit creates a flash pulse inside a gate in response to a change of a flash input line and ends the flash pulse in response to the gate output being reset to a null state. Conventional logic can be included in the data path as well.
H03K 19/094 - Logic circuits, i.e. having at least two inputs acting on one output; inverting circuits using specified components using semiconductor devices using field-effect transistors
H03K 19/00 - Logic circuits, i.e. having at least two inputs acting on one output; inverting circuits
G06F 9/38 - Concurrent instruction execution, e.g. pipeline or look ahead
Multi-threshold flash Null Convention Logic (NCL) includes one or more high threshold voltage transistors within a flash NCL gate to reduce power consumption due to current leakage by transistors of the NCL gate. High threshold voltage transistors may be added and/or used in place of one or more lower threshold voltage transistors of the NCL gate. A high-Vt device is included in the pull-up path to reduce power when the flash NCL gate is in the null state.
H03K 19/00 - Logic circuits, i.e. having at least two inputs acting on one output; inverting circuits
H03K 19/094 - Logic circuits, i.e. having at least two inputs acting on one output; inverting circuits using specified components using semiconductor devices using field-effect transistors
A self-ready flash Null Convention Logic (NCL) gate includes a one-shot circuit to create the flash timing that resets the gate to a null state. The one-shot circuit may be any type of circuit that generates a pulse in response to a change of state of an input line. In one embodiment, the one-shot circuit may start the pulse in response to a change of a flash input line and end the pulse in response to the NCL output being reset to a null state.
H03K 19/094 - Logic circuits, i.e. having at least two inputs acting on one output; inverting circuits using specified components using semiconductor devices using field-effect transistors
41.
Method and apparatus for ensuring data cache coherency
A multithreaded processor can concurrently execute a plurality of threads in a processor core. The threads can access a shared main memory through a memory interface; the threads can generate read and write transactions that cause shared main memory access. An incoherency detection module prevents incoherency by maintaining a record of outstanding global writes and detecting a conflicting global read. A barrier is sequenced with the conflicting global write. The conflicting global read is allowed to proceed after the conflicting global write and the barrier have cleared. The sequence can be maintained by a separate queue for each thread of the plurality.
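A minimal model of the incoherency detection: per-thread queues record outstanding global writes, a conflicting global read stalls while a barrier is sequenced behind the matching write, and the read proceeds once the queue drains. Class and method names are invented for illustration.

```python
from collections import deque

class IncoherencyDetector:
    def __init__(self, n_threads):
        self.queues = [deque() for _ in range(n_threads)]  # one per thread

    def global_write(self, thread, addr):
        self.queues[thread].append(('write', addr))   # outstanding write

    def global_read(self, thread, addr):
        # Detect a read that conflicts with another thread's outstanding
        # write; sequence a barrier behind that write and stall the read.
        for t, q in enumerate(self.queues):
            if t != thread and ('write', addr) in q:
                q.append(('barrier', addr))
                return 'stall'
        return 'proceed'

    def drain(self, thread):
        self.queues[thread].clear()   # writes and barriers have cleared

d = IncoherencyDetector(2)
d.global_write(0, 0x100)
print(d.global_read(1, 0x100))   # stall: conflicting outstanding write
d.drain(0)
print(d.global_read(1, 0x100))   # proceed: write and barrier cleared
```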