A reduced noise image can be formed from a set of images. One of the images of the set is selected as a reference image, and the other images of the set are transformed such that they are better aligned with the reference image. A measure of the alignment of each image with the reference image is determined. At least some of the transformed images can then be combined using weights which depend on the alignment of the transformed image with the reference image, to thereby form the reduced noise image. By weighting the images according to their alignment with the reference image, the effects of misalignment between the images in the combined image are reduced. Furthermore, motion correction may be applied to the reduced noise image.
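The weighted combination described above can be sketched as follows. This is an illustrative, non-normative sketch: it assumes the alignment transforms are simple integer shifts and uses inverse mean squared error as the alignment measure, both of which are assumptions of the sketch rather than details from the abstract.

```python
import numpy as np

def combine_aligned(reference, images, shifts):
    """Sketch of forming a reduced-noise image: transform each image
    towards the reference, measure alignment, and combine with
    alignment-dependent weights."""
    # Transform each image so it is better aligned with the reference
    # (here: a simple integer shift stands in for a general transform).
    aligned = [np.roll(img, s, axis=(0, 1)) for img, s in zip(images, shifts)]
    # Measure alignment: a lower residual error gives a higher weight.
    errors = np.array([np.mean((a - reference) ** 2) for a in aligned])
    weights = 1.0 / (1.0 + errors)
    weights /= weights.sum()
    # Weighted combination suppresses the contribution of misaligned frames.
    return sum(w * a for w, a in zip(weights, aligned))
```

A frame that aligns poorly with the reference receives a small weight, so its misalignment contributes little to the combined image.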
Ray tracing systems have computation units (“RACs”) adapted to perform ray tracing operations (e.g. intersection testing). There are multiple RACs. A centralized packet unit controls the allocation and testing of rays by the RACs. This allows the RACs to be implemented without Content Addressable Memories (CAMs), which are expensive to implement, while the functionality of CAMs is still achieved by implementing it in the centralized controller.
A method of improving texture fetching by a texturing/shading unit in a GPU pipeline, by performing efficient convolution operations, includes receiving a shader and determining whether the shader is a kernel shader. In response to determining that the shader is a kernel shader, the kernel shader is modified to perform a collective fetch of all texels for a group of output pixels instead of performing independent fetches of texels for each output pixel in the group.
Task building logic builds a plurality of tasks each comprising a group of rays. When a new ray is received into ray storage, if an existing task exists for the new ray, the new ray is added to an existing respective list. The task building logic indicates when any of the tasks is ready for scheduling, and task scheduling logic identifies a task ready for scheduling based on the indication from the task building logic, and in response traverses the respective list in order to schedule at least some of the rays of the respective task for processing in parallel.
Rendering systems that can use combinations of rasterization rendering processes and ray tracing rendering processes are disclosed. In some implementations, these systems perform a rasterization pass to identify visible surfaces of pixels in an image. Some implementations may begin shading processes for visible surfaces, in which rays are emitted, before the geometry is entirely processed. Rays can be culled at various points during processing, based on determining whether the surface from which the ray was emitted is still visible. Rendering systems may implement rendering effects as disclosed.
G09G 5/36 - Control arrangements or circuits for visual indicators common to cathode-ray tube indicators and other visual indicators characterised by the display of individual graphic patterns using a bit-mapped memory
Methods and graphics processing modules for rendering a stereoscopic image including left and right images of a three-dimensional scene. Geometry is processed in the scene to generate left data for use in displaying the left image and right data for use in displaying the right image. Disparity is determined between the left and right data by comparing the generated left data and the generated right data used in displaying the stereoscopic image. In response to identifying at least a portion of the left data and the right data as non-disparate, a corresponding portion of the left image and the right image is commonly processed (e.g. commonly rendered or commonly stored). In response to identifying at least a portion of the left data and the right data as disparate, a corresponding portion of the left image and the right image is separately processed (e.g. separately rendered or separately stored).
A texture filtering unit applies anisotropic filtering using a filter kernel which can be adapted to apply different amounts of anisotropy up to a maximum amount of anisotropy. If it is determined that a received input amount of anisotropy is not above the maximum amount of anisotropy, the filter kernel applies the input amount of anisotropy, and texels of a texture are sampled using the filter kernel to determine a filtered texture value. If it is determined that the input amount of anisotropy is above the maximum amount of anisotropy, the filter kernel applies an amount of anisotropy that is not above the maximum amount of anisotropy, a plurality of sampling operations are performed to sample texels of the texture using the filter kernel to determine a respective plurality of intermediate filtered texture values, and the plurality of intermediate filtered texture values are combined to determine a filtered texture value which has been filtered in accordance with the input amount of anisotropy and the input direction of anisotropy.
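The fallback path above, where an input anisotropy amount beyond the kernel's maximum is realised through multiple sampling operations, can be sketched as a pass-planning step. The equal-split policy and function name are illustrative assumptions; the abstract does not specify how the anisotropy is divided between passes.

```python
import math

def plan_aniso_passes(input_ratio, max_ratio):
    """Sketch: if the requested anisotropy ratio exceeds what the
    filter kernel supports, split the work into several sampling
    operations whose per-pass ratios stay within the maximum; their
    intermediate filtered values are then combined."""
    if input_ratio <= max_ratio:
        return [input_ratio]               # single sampling operation
    n_passes = math.ceil(input_ratio / max_ratio)
    # Equal split across passes (an assumption of this sketch).
    return [input_ratio / n_passes] * n_passes
```

Each returned entry corresponds to one sampling operation producing an intermediate filtered texture value.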
Texture filtering in computer graphics calculates first and second pairs of texture-space basis vectors that correspond to first and second pairs of screen-space basis vectors transformed to texture space under a local approximation of a mapping between screen space and texture space. Based on differences in magnitudes of the vectors of the pairs of texture-space basis vectors, an angular displacement is determined between a selected pair of the first and second pairs of screen-space basis vectors and screen-space principal axes of the local approximation of the mapping that indicate maximum and minimum scale factors of the mapping. The determined angular displacement and the selected pair of screen-space basis vectors are used to generate texture-space principal axes, with a major axis associated with the maximum scale factor of the mapping and a minor axis associated with the minimum scale factor of the mapping. A texture is filtered using the major and minor axes.
A tessellation method uses vertex tessellation factors. For a quad patch, the method involves comparing the vertex tessellation factors for each vertex of the quad patch to a threshold value and if none exceed the threshold, the quad is sub-divided into two or four triangles. If at least one of the four vertex tessellation factors exceeds the threshold, a recursive or iterative method is used which considers each vertex of the quad patch and determines how to further tessellate the patch dependent upon the value of the vertex tessellation factor of the selected vertex or dependent upon values of the vertex tessellation factors of the selected vertex and a neighbor vertex. A similar method is described for a triangle patch.
Compressed image data is received substantially in raster scan order, and for each group of pixels in a row of the compressed image data, a block-based decoding scheme for the group of pixels is identified and the compressed data corresponding to the group of pixels is decoded at decoding hardware using the identified scheme.
H04N 19/436 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation using parallelised computational arrangements
H04N 19/176 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a block, e.g. a macroblock
H04N 19/184 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being bits, e.g. of the compressed video stream
11.
COMPUTER SYSTEM AND METHOD USING A FIRST PAGE TABLE AND A SECOND PAGE TABLE
A computer system includes a physical memory having a first page table and a second page table, and an address translation module. The first page table includes primary page table entries, where each page table entry among the primary page table entries is configured to store a mapping of a virtual memory address to a physical memory address and auxiliary information. The second page table includes secondary page table entries each storing at least one item of further auxiliary information, where each secondary page table entry corresponds to a primary page table entry in the first page table. The address translation module is configured to, in response to receiving a request from a processor, walk through the first page table to identify a primary page table entry and subsequently identify a location of a corresponding secondary page table entry based on a location of the primary page table entry.
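Because each secondary entry corresponds positionally to a primary entry, the secondary entry's location follows from the primary entry's offset within the first page table. A minimal sketch of that address calculation (all parameter names and sizes are illustrative assumptions, not details from the abstract):

```python
def secondary_entry_location(primary_base, primary_entry_loc,
                             primary_entry_size, secondary_base,
                             secondary_entry_size):
    """Sketch: derive the location of the secondary page table entry
    that corresponds to a primary entry identified during a walk of
    the first page table."""
    # Offset of the primary entry gives its index within the table.
    index = (primary_entry_loc - primary_base) // primary_entry_size
    # The corresponding secondary entry sits at the same index in the
    # second page table.
    return secondary_base + index * secondary_entry_size
```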
Methods and systems for wirelessly transmitting data between Wi-Fi stations without requiring the Wi-Fi stations to be fully connected to the Wi-Fi network. A first Wi-Fi station generates the data to be transmitted. The data comprises status data and/or wake-up data. The first Wi-Fi station then inserts the data in a vendor-specific information element of a probe request frame and wirelessly transmits the probe request frame. The probe request frame is then received by a second Wi-Fi station. If the probe request frame contains wake-up data and the second Wi-Fi station is operating in a low-power mode when it receives the probe request frame, the second Wi-Fi station will wake-up from the low-power mode. If the probe request frame contains status data then the second Wi-Fi station may process the probe request frame and/or forward at least a portion of the received probe request frame to another device.
G08B 25/00 - Alarm systems in which the location of the alarm condition is signalled to a central station, e.g. fire or police telegraphic systems
G08B 25/10 - Alarm systems in which the location of the alarm condition is signalled to a central station, e.g. fire or police telegraphic systems characterised by the transmission medium using wireless transmission systems
A memory attribute structure comprises one or more memory address entries, each memory address entry comprising a respective memory address range mapped to a respective priority level. The memory attribute structure is used when processing a memory access transaction through an execution path of a processing system. During said processing, a memory address of the memory access transaction is determined. The memory attribute structure is used to determine a priority level mapped to the determined memory address, and the memory access transaction is processed based on the determined priority level.
A memory attribute structure, which is configurable, is used when processing memory access transactions through an execution path of a processing system. The memory attribute structure includes one or more configurable memory address entries, each comprising a respective memory address range mapped to a respective priority level of a set of priority levels. A central processing unit is configured to use the memory attribute structure to determine respective priority levels mapped to respective memory addresses of respective memory access transactions, and to process the respective memory access transactions based on the respective priority levels.
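The range-to-priority lookup at the heart of both abstracts can be sketched very simply. The entry representation as `(start, end, level)` tuples and the default level for unmatched addresses are assumptions of this sketch.

```python
def lookup_priority(attr_entries, address, default_level=0):
    """Sketch of using the memory attribute structure: each
    configurable entry maps an address range to a priority level; a
    transaction's address is matched against the entries to find the
    priority level at which it should be processed."""
    for start, end, level in attr_entries:
        if start <= address <= end:
            return level
    # Unmatched addresses fall back to a default (sketch assumption).
    return default_level
```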
Interpolation logic described herein provides a good approximation to a bicubic interpolation, which is generally smoother than bilinear interpolation, without performing all the calculations normally needed for a bicubic interpolation. This allows an approximation of smooth bicubic interpolation to be performed on devices (e.g. mobile devices) which have limited processing resources. At each of a set of predetermined interpolation positions within an array of data points, a set of predetermined weights represents a bicubic interpolation which can be applied to the data points. For a plurality of the predetermined interpolation positions which surround the sampling position, the corresponding sets of predetermined weights and the data points are used to determine a plurality of surrounding interpolated values which represent results of performing the bicubic interpolation at the surrounding predetermined interpolation positions. A linear interpolation is then performed on the surrounding interpolated values to determine an interpolated value at the sampling position.
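A one-dimensional sketch of the scheme: cubic weights are precomputed only at a small set of fixed positions, and an arbitrary sampling position is handled by evaluating the cubic at the two surrounding fixed positions and linearly blending the results. The Catmull-Rom kernel and the number of precomputed positions are illustrative assumptions.

```python
import numpy as np

def catmull_rom_weights(t):
    # Weights of a Catmull-Rom cubic at fractional position t in [0, 1]
    # over four consecutive data points (one possible cubic kernel).
    return np.array([
        -0.5 * t**3 + t**2 - 0.5 * t,
         1.5 * t**3 - 2.5 * t**2 + 1.0,
        -1.5 * t**3 + 2.0 * t**2 + 0.5 * t,
         0.5 * t**3 - 0.5 * t**2,
    ])

def approx_cubic(points, t, n_precomputed=8):
    """Sketch: cubic results at the two predetermined positions that
    surround t are linearly blended, avoiding a full cubic evaluation
    at t itself."""
    grid = np.linspace(0.0, 1.0, n_precomputed)
    table = [catmull_rom_weights(g) for g in grid]   # predetermined weights
    i = min(int(t * (n_precomputed - 1)), n_precomputed - 2)
    f = t * (n_precomputed - 1) - i                  # position between the two
    lo = table[i] @ points                           # cubic at lower position
    hi = table[i + 1] @ points                       # cubic at upper position
    return (1 - f) * lo + f * hi                     # linear blend
```

In two dimensions the same idea applies per axis, with a bilinear blend of four surrounding precomputed bicubic results.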
A graphics processing system renders a scene in a rendering space sub-divided into a plurality of tiles, each tile being sub-divided into a plurality of microtiles. A plurality of first hardware elements calculate a respective first output based on coordinates for a pixel of a microtile. A plurality of second hardware elements calculate a respective second output based on coordinates for a subsample within the pixel. Hardware logic generates an edge test output value or depth calculation value based on at least one of the second outputs, and the scene is rendered in the rendering space using the generated edge test output values or depth calculation values.
A graphics processing unit (GPU) processes graphics data using a rendering space which is sub-divided into a plurality of tiles. The GPU comprises cost indication logic configured to obtain a cost indication for each of a plurality of sets of one or more tiles of the rendering space. The cost indication for a set of tile(s) is suggestive of a cost of processing the set of one or more tiles. The GPU controls a rendering complexity with which primitives are rendered in tiles based on the cost indication for those tiles. This allows tiles to be rendered in a manner that is suitable based on the complexity of the graphics data within the tiles. In turn, this allows the rendering to satisfy constraints such as timing constraints even when the complexity of different tiles may vary significantly within an image.
Hardware for implementing a Deep Neural Network (DNN) having a convolution layer, the hardware comprising an input buffer configured to provide data windows to a plurality of convolution engines, each data window comprising a single input plane; and each of the plurality of convolution engines being operable to perform a convolution operation by applying a filter to a data window, each filter comprising a set of weights for combination with respective data values of a data window, and each of the plurality of convolution engines comprising: multiplication logic operable to combine a weight of the filter with a respective data value of the data window provided by the input buffer; and accumulation logic configured to accumulate the results of a plurality of combinations performed by the multiplication logic so as to form an output for a respective convolution operation.
G06N 3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
G06F 7/544 - Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using unspecified devices for evaluating functions by calculation
Methods of rendering a scene in a graphics system identify a draw call within a current render and analyse the last shader in the series of shaders used by the draw call to identify any buffers that are sampled by the last shader and that are to be written by a previous render that has not yet been sent for execution on the GPU. If any such buffers are identified, further analysis is performed to determine whether the last shader samples from the identified buffers using screen space coordinates that correspond to a current fragment location and if this determination is positive, the draw call is added to data relating to the previous render and the last shader is recompiled to replace an instruction that reads data from an identified buffer with an instruction that reads data from an on-chip register.
G09G 5/36 - Control arrangements or circuits for visual indicators common to cathode-ray tube indicators and other visual indicators characterised by the display of individual graphic patterns using a bit-mapped memory
20.
HIERARCHICAL MANTISSA BIT LENGTH SELECTION FOR HARDWARE IMPLEMENTATION OF DEEP NEURAL NETWORK
Hierarchical methods for selecting fixed point number formats with reduced mantissa bit lengths for representing values input to, and/or output from, the layers of a DNN. The methods begin with one or more initial fixed point number formats for each layer. The layers are divided into subsets of layers and the mantissa bit lengths of the fixed point number formats are iteratively reduced from the initial fixed point number formats on a per-subset basis. If a reduction causes the output error of the DNN to exceed an error threshold, then the reduction is discarded and no more reductions are made to the layers of that subset. Otherwise, a further reduction is made to the fixed point number formats for the layers in that subset. Once no further reductions can be made to any of the subsets, the method is repeated for successively increasing numbers of subsets until a predetermined number of layers per subset is reached.
G06N 3/04 - Architecture, e.g. interconnection topology
G06F 7/483 - Computations with numbers represented by a non-linear combination of denominational numbers, e.g. rational numbers, logarithmic number system or floating-point numbers
G06F 7/499 - Denomination or exception handling, e.g. rounding or overflow
G06F 17/11 - Complex mathematical operations for solving equations
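One pass of the iterative reduction described in entry 20 can be sketched as a greedy loop. For simplicity this sketch treats each index as its own subset and models DNN evaluation with a caller-supplied `error_of` function; both are assumptions of the sketch.

```python
def reduce_mantissas(bit_lengths, error_of, threshold):
    """Sketch of one reduction pass: mantissa bit widths are reduced
    one bit at a time per subset; a reduction that pushes the network
    output error over the threshold is discarded and that subset is
    frozen, otherwise the reduction is kept."""
    active = set(range(len(bit_lengths)))
    while active:
        for i in list(active):
            trial = list(bit_lengths)
            trial[i] -= 1
            if trial[i] < 1 or error_of(trial) > threshold:
                active.discard(i)         # discard reduction, freeze subset
            else:
                bit_lengths = trial       # keep the reduction
    return bit_lengths
```

The full hierarchical method would repeat this with progressively finer subset divisions.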
Methods and systems for performing a convolution transpose operation between an input tensor having a plurality of input elements and a filter comprising a plurality of filter weights. The method includes: dividing the filter into a plurality of sub-filters; performing, using hardware logic, a convolution operation between the input tensor and each of the plurality of sub-filters to generate a plurality of sub-output tensors, each sub-output tensor comprising a plurality of output elements; and interleaving, using hardware logic, the output elements of the plurality of sub-output tensors to form a final output tensor for the convolution transpose.
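The sub-filter decomposition above can be demonstrated in one dimension: the filter is split into `stride` sub-filters (one per output phase), an ordinary convolution is run per sub-filter, and the sub-outputs are interleaved. The assumption that the filter length is a multiple of the stride keeps the interleaving indexing simple; it is a simplification of this sketch.

```python
import numpy as np

def conv_transpose_1d(x, w, stride):
    """1-D sketch of convolution transpose via sub-filters: split w
    into `stride` sub-filters, convolve the input with each, then
    interleave the sub-output elements into the final output."""
    s = stride
    assert len(w) % s == 0, "sketch assumes len(w) is a multiple of stride"
    # Sub-filter r holds the weights that land on output phase r.
    sub_outputs = [np.convolve(x, w[r::s]) for r in range(s)]
    y = np.zeros((len(x) + len(w) // s - 1) * s)
    for r, sub in enumerate(sub_outputs):
        y[r::s] = sub                      # interleave by output phase
    return y
```

This matches the direct definition y[n] = Σ_m x[m]·w[n − stride·m], since writing n = stride·q + r reduces each phase r to a plain convolution with sub-filter w[r::stride].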
A SIMD processing unit processes a plurality of tasks which each include up to a predetermined maximum number of work items. The work items of a task are arranged for executing a common sequence of instructions on respective data items. The data items are arranged into blocks, with some of the blocks including at least one invalid data item. Work items which relate to invalid data items are invalid work items. The SIMD processing unit comprises a group of processing lanes configured to execute instructions of work items of a particular task over a plurality of processing cycles. A control module assembles work items into the tasks based on the validity of the work items, so that invalid work items of the particular task are temporally aligned across the processing lanes. In this way the number of wasted processing slots due to invalid work items may be reduced.
G06T 1/20 - Processor architectures; Processor configuration, e.g. pipelining
G06F 9/30 - Arrangements for executing machine instructions, e.g. instruction decode
G06F 9/38 - Concurrent instruction execution, e.g. pipeline or look ahead
G06F 9/48 - Program initiating; Program switching, e.g. by interrupt
G06F 15/80 - Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
A method and system for generating two or three dimensional computer graphics images using multisample antialiasing (MSAA) is provided, which enables memory bandwidth to be conserved. For each of one or more pixels it is determined whether all of a plurality of sample areas of that pixel are located within a particular primitive. For those pixels where this is the case, a value is stored in a multisample memory for fewer than the total number of sample areas of that pixel, and data is stored indicating that all the sample areas of that pixel are located within that primitive.
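The bandwidth saving above can be sketched per pixel: a fully covered pixel stores one value plus an "all covered" flag instead of one value per sample. The tuple-based storage format is purely illustrative.

```python
def store_msaa_pixel(sample_coverage, value):
    """Sketch: when every sample area of a pixel lies within the
    primitive, store the value once with an all-covered flag rather
    than once per sample, conserving multisample memory bandwidth."""
    if all(sample_coverage):
        return [("all_covered", value)]            # one entry, not N
    # Partially covered pixels still store per-sample values.
    return [("sample", value if covered else None)
            for covered in sample_coverage]
```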
Methods and storage unit allocators for allocating one or more portions of a storage unit to a plurality of tasks for storing at least two types of data. The method includes receiving a request for one or more portions of the storage unit to store a particular type of data of the at least two types of data for a task of the plurality of tasks; associating the request with one of a plurality of virtual partitionings of the storage unit based on one or more characteristics of the request, each virtual partitioning allotting none, one, or more than one portion of the storage unit to each of the at least two types of data; and allocating the requested one or more portions of the storage unit to the task from the none, one, or more than one portion of the storage unit allotted to the particular type of data in the virtual partitioning associated with the request.
A method and apparatus are provided for manufacturing integrated circuits performing invariant integer division x/d. A desired rounding mode is provided and an integer triple (a,b,k) for this rounding mode is derived. Furthermore, a set of conditions for the rounding mode is derived. An RTL representation is then derived using the integer triple. From this, a hardware layout can be derived and an integrated circuit manufactured with the derived hardware layout. When the integer triple is derived, a minimum value of k for the desired rounding mode and set of conditions is also derived.
G06F 7/38 - Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
A method of data compression in which the total size of the compressed data is determined and, based on that determination, the bit depth of the input data may be reduced before the data is compressed. The bit depth that is used may be determined by comparing the calculated total size to one or more pre-defined threshold values to generate a mapping parameter. The mapping parameter is then input to a remapping element that is arranged to perform the conversion of the input data and then output the converted data to a data compression element. The value of the mapping parameter may be encoded into the compressed data so that it can be extracted and used when subsequently decompressing the data.
H03M 7/30 - Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
H04N 19/13 - Adaptive entropy coding, e.g. adaptive variable length coding [AVLC] or context adaptive binary arithmetic coding [CABAC]
H04N 19/132 - Sampling, masking or truncation of coding units, e.g. adaptive resampling, frame skipping, frame interpolation or high-frequency transform coefficient masking
H04N 19/176 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a block, e.g. a macroblock
H04N 19/186 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being a colour or a chrominance component
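The threshold comparison and remapping steps of entry 26 can be sketched as follows. The specific policies (mapping parameter = count of thresholds exceeded; each step drops one bit of precision) are assumptions of this sketch, not details from the abstract.

```python
def choose_mapping(total_size, thresholds):
    """Sketch: compare the calculated total compressed size against
    pre-defined thresholds to generate a mapping parameter."""
    return sum(total_size > t for t in thresholds)

def remap(values, mapping_param):
    # Reduce the bit depth of the input data before compression; each
    # mapping step drops one bit of precision (an assumed schedule).
    return [v >> mapping_param for v in values]
```

The decompressor would read the mapping parameter back out of the compressed stream and shift left by the same amount to approximate the original range.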
27.
ACTIVATION ACCELERATOR FOR NEURAL NETWORK ACCELERATOR
An activation accelerator for use in a neural network accelerator includes a look-up table and activation pipelines. The look-up table stores values representing a non-linear activation function. Each activation pipeline comprises a range conversion unit which receives an input value and generates a converted value from the input value, an index generation unit which receives information identifying a first subset of bits of the converted value and a second subset of bits of the converted value, generates an index from the first subset of bits of the converted value, and generates an interpolation value from the second subset of bits of the converted value, a look-up table interface unit which retrieves multiple values from the look-up table based on the index, and an interpolation unit which generates an estimated result of the non-linear activation function for the input value by interpolating between the multiple values retrieved from the look-up table based on the interpolation value.
An activation accelerator for a neural network accelerator includes a look-up table and activation pipelines. The look-up table stores values representing a non-linear activation function. Each activation pipeline comprises a range conversion unit that receives an input value, an input offset, and an absolute value flag, and generates a converted value from the input value by combining the input value and the input offset, and when an absolute value flag is set, generates an absolute value of the combination of the input value and the input offset. An index generation unit generates an index and an interpolation value from the converted value. A look-up table interface retrieves multiple values from the look-up table based on the index. An interpolation unit generates an estimate of a result of the non-linear activation function for the input value from an interpolation output generated by interpolating between the multiple values retrieved from the look-up table based on the interpolation value.
An activation accelerator for a neural network accelerator includes a look-up table that stores a plurality of values representing a non-linear activation function, and activation pipelines. Each activation pipeline comprises a range conversion unit that receives an input value and generates a converted value from the input value, an index generation unit that generates an index and an interpolation value from the converted value, a look-up table interface unit that retrieves multiple values from the look-up table based on the index, and an interpolation unit that receives information identifying a rounding mode of a plurality of rounding modes, and generates an estimate of a result of the non-linear activation function for the input value from an interpolation output generated by interpolating between the multiple values retrieved from the look-up table using the interpolation value and rounding in accordance with the identified rounding mode.
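A software sketch of the pipeline shared by the three abstracts above: range conversion maps the input to a fixed-point code, the high subset of bits indexes the look-up table, and the low subset forms the interpolation value used to blend two adjacent table entries. The sigmoid table, input range, and bit split are all illustrative assumptions.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

# Illustrative look-up table: sigmoid sampled over [-8, 8].
INDEX_BITS, FRAC_BITS = 6, 8
LO, HI = -8.0, 8.0
TABLE = [sigmoid(LO + i * (HI - LO) / (1 << INDEX_BITS))
         for i in range((1 << INDEX_BITS) + 1)]

def lut_activation(x):
    """Sketch of one activation pipeline: range-convert, split the
    code into index and interpolation bits, read two LUT values, and
    interpolate between them."""
    code_max = (1 << (INDEX_BITS + FRAC_BITS)) - 1
    code = round((x - LO) / (HI - LO) * code_max)
    code = max(0, min(code_max, code))            # range conversion
    index = code >> FRAC_BITS                     # first subset of bits
    frac = code & ((1 << FRAC_BITS) - 1)          # second subset of bits
    index = min(index, len(TABLE) - 2)
    a, b = TABLE[index], TABLE[index + 1]         # two LUT reads
    return a + (b - a) * frac / (1 << FRAC_BITS)  # linear interpolation
```

A hardware interpolation unit would additionally round the interpolation output according to a selected rounding mode, as in the third abstract.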
A method of enabling synchronisation of a second clock at a second device with a first clock at a first device, the first device further comprising a third clock. A first message comprising an identifier generated in dependence on the third clock is transmitted to the second device. A timestamp is generated in dependence on the time at which the first message is transmitted from the first device according to the first clock, and a second message comprising the identifier and the generated timestamp is generated. The second message is then transmitted to the second device.
H04L 7/00 - Arrangements for synchronising receiver with transmitter
H04L 69/00 - Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
H04N 21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
H04N 21/43 - Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronizing decoder's clock; Client middleware
H04N 21/8547 - Content authoring involving timestamps for synchronizing content
H04W 4/70 - Services for machine-to-machine communication [M2M] or machine type communication [MTC]
A computer-implemented method for generating a feature descriptor for a location in an image for use in performing descriptor matching in analysing the image, the method comprising determining a set of samples characterising a location in an image by sampling scale-space data representative of the image, the scale-space data comprising data representative of the image at a plurality of length scales; and generating a feature descriptor in dependence on the determined set of samples.
G06V 10/44 - Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
G06V 10/46 - Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
G06V 10/52 - Scale-space analysis, e.g. wavelet analysis
H04N 19/117 - Filters, e.g. for pre-processing or post-processing
H04N 19/132 - Sampling, masking or truncation of coding units, e.g. adaptive resampling, frame skipping, frame interpolation or high-frequency transform coefficient masking
32.
METHODS AND SYSTEMS FOR STORING VARIABLE LENGTH DATA BLOCKS IN MEMORY
A set of two or more variable length data blocks is stored in memory. Each variable length data block has a maximum size of N*B, wherein N is an integer greater than or equal to two, and B is a maximum data size that can be written to the memory using a single memory access request. For each variable length data block of the set, the first P non-overlapping portions of size B of the variable length data block are stored in a chunk of the memory allocated to that variable length data block, wherein P is the minimum of (i) the number of non-overlapping portions of size B of the variable length data block and (ii) X, an integer less than N. Any remaining portions of the variable length data blocks are stored in a remainder section of the memory shared between the variable length data blocks of the set. Information indicating the size of each of the variable length data blocks in the set is stored in a header.
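The layout above can be sketched directly: each block gets a fixed chunk holding up to X portions of size B, anything beyond spills into a shared remainder section, and a header records the block sizes. Representing the memory regions as Python lists is purely illustrative.

```python
def store_blocks(blocks, B, X):
    """Sketch of the storage layout: per-block chunks of up to X
    portions of size B, a shared remainder section for overflow, and
    a header recording each block's size."""
    header, chunks, remainder = [], [], []
    for block in blocks:
        header.append(len(block))
        portions = [block[i:i + B] for i in range(0, len(block), B)]
        chunks.append(portions[:X])          # first P portions, P = min(len, X)
        remainder.extend(portions[X:])       # spill into shared remainder
    return header, chunks, remainder
```

Because every chunk has a fixed capacity, a block's chunk address is computable without reading other blocks; only the overflow requires indirection via the header.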
A method of scheduling a plurality of active GPU drivers in a GPU includes, for one or more of the plurality of active GPU drivers, allocating a portion of a scheduling interval to the active GPU driver and selecting an active GPU driver for execution according to a priority-based scheduling algorithm. In response to an active GPU driver executing within its allocated portion, the priority level of the active GPU driver is increased; in response to the active GPU driver completing its workload within its allocated portion, the priority level is reset; and in response to the active GPU driver executing for its whole allocated portion, the priority level is reduced. The priority levels of each active GPU driver are reset to their initial priority levels at the start of each scheduling interval.
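The per-driver priority rules above can be sketched as a single update function. The precedence between the three conditions, and the baseline `INITIAL` level, are assumptions of this sketch; the abstract does not fix them.

```python
def update_priority(priority, executed, completed, used_whole_portion):
    """Sketch of the priority adjustment within a scheduling interval:
    exhausting the allocated portion lowers priority, completing the
    workload resets it, and merely executing raises it."""
    INITIAL = 0                        # assumed baseline priority level
    if used_whole_portion:
        return priority - 1            # executed for its whole portion
    if completed:
        return INITIAL                 # workload done within the portion
    if executed:
        return priority + 1            # executed within the portion
    return priority
```

At the start of each scheduling interval all drivers would be reset to their initial levels before these updates apply.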
Methods and apparatus for merging tasks in a graphics pipeline in which, subsequent to a trigger to flush a tag buffer, one or more tasks from the flushed tag buffer are generated, each task comprising a reference to a program and a plurality of fragments on which the program is to be executed, wherein a fragment is an element of a primitive at a sample position. It is then determined whether merging criteria are satisfied and, if so, one or more fragments from a next tag buffer flush are added to a last task of the one or more tasks generated from the flushed tag buffer.
Methods and apparatus for compressing image data are described along with corresponding methods and apparatus for decompressing the compressed image data. A decoder unit samples compressed image data including interleaved blocks of data encoding a first image and blocks of data encoding differences between the first image and a second image, the second image being twice the width and height of the first image. A difference decoder decodes a fetched encoded sub-block of the differences between the first and second images and outputs a difference quad and a prediction value for a pixel, and a filter sub-unit generates a reconstruction of the image at a sample position using decoded blocks of the first image, the difference quad and the prediction value.
H04N 19/17 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
H04N 19/172 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a picture, frame or field
H04N 19/34 - Scalability techniques involving progressive bit-plane based encoding of the enhancement layer, e.g. fine granular scalability [FGS]
H04N 19/423 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation characterised by memory arrangements
H04N 19/59 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving spatial sub-sampling or interpolation, e.g. alteration of picture size or resolution
H04N 19/82 - Details of filtering operations specially adapted for video compression, e.g. for pixel interpolation involving filtering within a prediction loop
36.
INTERSECTION TESTING IN A RAY TRACING SYSTEM USING CONVEX POLYGON EDGE SIGNED PARAMETERS
A method and an intersection testing module in a ray tracing system for performing intersection testing for a ray with respect to a plurality of convex polygons, each of which is defined by an ordered set of vertices. The vertices of the convex polygons are projected onto a pair of axes orthogonal to the ray direction. For each edge of a convex polygon defined by two of the projected vertices, a signed parameter is determined, wherein the sign of the signed parameter is indicative of which side of the edge the ray passes on. If the ray is determined to intersect a point on the edge then the sign of the signed parameter is determined using a module which is configured to: take as inputs indications which classify each of the pi, qi, pj and qj coordinates as negative, zero or positive, and output, for valid combinations of classifications of the pi, qi, pj and qj coordinates, an indication of the sign of the signed parameter. It is then determined whether the ray intersects the convex polygon based on the signs of the signed parameters determined for the edges of the convex polygon.
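The edge-orientation idea above can be sketched with a 2D cross product; this omits the abstract's sign-classification module and the tie-breaking needed when a parameter is exactly zero, and the sign convention is an assumption for illustration:

```python
# Sketch of the signed-parameter edge test: after projecting vertices onto a
# pair of axes orthogonal to the ray (ray at the origin), each edge
# (pi,qi)->(pj,qj) yields a signed parameter whose sign indicates which side
# of the edge the ray passes on. Watertight handling of exact zeros, which
# the abstract's classification module provides, is omitted here.

def signed_parameter(pi, qi, pj, qj):
    # 2D cross product of the projected edge endpoints
    return pi * qj - pj * qi

def ray_hits_polygon(projected):
    """projected: list of (p, q) vertex coordinates in the ray-orthogonal
    axes. The ray intersects if every edge has the same sign."""
    n = len(projected)
    signs = []
    for i in range(n):
        (pi, qi), (pj, qj) = projected[i], projected[(i + 1) % n]
        signs.append(signed_parameter(pi, qi, pj, qj))
    return all(s >= 0 for s in signs) or all(s <= 0 for s in signs)
```

A polygon whose projected edges all wind the same way around the origin is hit; mixed signs mean the ray passes outside at least one edge.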
Rendering systems that can use combinations of rasterization rendering processes and ray tracing rendering processes are disclosed. In some implementations, these systems perform a rasterization pass to identify visible surfaces of pixels in an image. Some implementations may begin shading processes for visible surfaces, before the geometry is entirely processed, in which rays are emitted. Rays can be culled at various points during processing, based on determining whether the surface from which the ray was emitted is still visible. Rendering systems may implement rendering effects as disclosed.
A method of performing lossy compression on a block of image data in accordance with a multi-level difference table determines an origin value for the block of image data, and determines a level within the multi-level difference table for the block of image data, by determining a maximum difference between the determined origin value and any one of image element values in the block of image data and selecting from the multi-level difference table the level whose largest entry most closely represents the determined maximum difference. For each image element value in the block, one of the entries at the determined level within the multi-level difference table is selected, and a compressed block of data for the block of image data is formed, the compressed block of data including (i) data representing the determined origin value, (ii) an indication of the determined level, and (iii) for each image element value in the block of image data, an indication of the selected entry for that image element value.
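A behavioural sketch of the scheme above follows; the table entries are invented for illustration (a real codec would use tuned tables), and the minimum-of-block origin rule is an assumption:

```python
# Behavioural sketch of lossy compression with a multi-level difference
# table. The table values below are invented for illustration.

DIFF_TABLE = [            # each level: candidate differences from the origin
    [0, 1, 2, 3],
    [0, 2, 5, 7],
    [0, 8, 16, 24],
    [0, 21, 42, 63],
]

def compress_block(values):
    origin = min(values)                       # assumed origin rule
    max_diff = max(v - origin for v in values)
    # pick the level whose largest entry most closely represents max_diff
    level = min(range(len(DIFF_TABLE)),
                key=lambda l: abs(DIFF_TABLE[l][-1] - max_diff))
    entries = DIFF_TABLE[level]
    # for each element, select the nearest entry at the chosen level
    indices = [min(range(len(entries)),
                   key=lambda i: abs(entries[i] - (v - origin)))
               for v in values]
    return origin, level, indices              # (i), (ii), (iii)

def decompress_block(origin, level, indices):
    return [origin + DIFF_TABLE[level][i] for i in indices]
```

The compressed block carries exactly the three items the abstract lists: the origin value, the level indication, and one entry indication per image element.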
An activation accelerator for use in a neural network accelerator includes a look-up table and one or more activation pipelines. The look-up table is configured to store a plurality of values representing a non-linear activation function, stored in the look-up table in accordance with a mode of a plurality of modes. Each activation pipeline comprises a range conversion unit, an index generation unit, a look-up table interface unit, and an interpolation unit. The range conversion unit receives an input value and generates a converted value from the input value. The index generation unit receives information identifying the mode and generates an index and an interpolation value from the converted value based on the identified mode. The look-up table interface unit retrieves multiple values from the look-up table based on the index. The interpolation unit generates an estimated result of the non-linear activation function for the input value by interpolating between the multiple values retrieved from the look-up table based on the interpolation value.
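The pipeline stages above can be sketched for a single mode; the table size, fixed input range, choice of tanh, and linear interpolation are all assumptions for illustration:

```python
import math

# Sketch of LUT-plus-interpolation evaluation of a non-linear activation
# (tanh here). Table size and input range are assumed example values.

TABLE_SIZE = 64
LO, HI = -4.0, 4.0
LUT = [math.tanh(LO + (HI - LO) * i / (TABLE_SIZE - 1))
       for i in range(TABLE_SIZE)]

def activation(x):
    # range conversion: clamp and map the input onto the table's index space
    t = (min(max(x, LO), HI) - LO) / (HI - LO) * (TABLE_SIZE - 1)
    idx = min(int(t), TABLE_SIZE - 2)   # index generation
    frac = t - idx                      # interpolation value
    a, b = LUT[idx], LUT[idx + 1]       # retrieve two neighbouring entries
    return a + (b - a) * frac           # interpolate between them
```

With 64 entries over [-4, 4], linear interpolation already tracks tanh to within about a hundredth across the range.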
A computer implemented method of compressing a neural network, the method comprising: receiving a neural network comprising a plurality of layers; forming a graph that represents the flow of data through the plurality of layers of the neural network, the graph comprising: a plurality of vertices, each vertex of the plurality of vertices being representative of an output channel of a layer of the plurality of layers of the neural network; and one or more edges, each edge of the one or more edges representing the potential flow of non-zero data between respective output channels represented by a respective pair of vertices; identifying, by traversing the graph, one or more redundant channels comprised by the plurality of layers of the neural network; and outputting a compressed neural network in which the identified one or more redundant channels are not present.
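The graph traversal can be sketched as a reachability check; the pruning criterion used here (a channel is redundant if no non-zero data can flow from it to an output channel) is an assumption for illustration:

```python
from collections import defaultdict

# Sketch: build a digraph over output channels and mark a channel redundant
# when it cannot reach any channel feeding the network output. The exact
# redundancy criterion is an assumption for illustration.

def redundant_channels(vertices, edges, outputs):
    """vertices: channel ids; edges: (src, dst) pairs of potential non-zero
    data flow; outputs: channels feeding the network output."""
    fwd = defaultdict(set)
    for s, d in edges:
        fwd[s].add(d)
    reaches = set(outputs)           # channels that can reach an output
    changed = True
    while changed:                   # fixed-point reachability traversal
        changed = False
        for v in vertices:
            if v not in reaches and fwd[v] & reaches:
                reaches.add(v)
                changed = True
    return [v for v in vertices if v not in reaches]
```

Channels reported as redundant carry no data that can influence the output, so removing them leaves the network's function unchanged.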
A method and apparatus for rendering a computer-generated image using a stencil buffer is described. The method divides an arbitrary closed polygonal contour into first and higher level primitives, where first level primitives correspond to contiguous vertices in the arbitrary closed polygonal contour and higher level primitives correspond to the end vertices of consecutive primitives of the immediately preceding primitive level. The method reduces the level of overdraw when rendering the arbitrary polygonal contour using a stencil buffer compared to other image space methods. A method of producing the primitives in an interleaved order, with second and higher level primitives being produced before the final first level primitives of the contour, is described which improves cache hit rate by reusing more vertices between primitives as they are produced.
Methods and parallel processing units for avoiding inter-pipeline data hazards identified at compile time. For each identified inter-pipeline data hazard the primary instruction and secondary instruction(s) thereof are identified as such and are linked by a counter which is used to track that inter-pipeline data hazard. When a primary instruction is output by the instruction decoder for execution, the value of the counter associated therewith is adjusted to indicate that there is a hazard related to the primary instruction, and when the primary instruction has been resolved by one of multiple parallel processing pipelines, the value of the counter associated therewith is adjusted to indicate that the hazard related to the primary instruction has been resolved. When a secondary instruction is output by the decoder for execution, the secondary instruction is stalled in a queue associated with the appropriate instruction pipeline if at least one counter associated with the primary instructions on which it depends indicates that there is a hazard related to the primary instruction.
Post-processing is performed on data generated by processing an image in accordance with a single-shot detector (SSD) neural network. The data comprises information identifying a plurality of bounding boxes in the image and a confidence score for a class for each bounding box. For each bounding box, (a) determining if the confidence score meets a confidence score threshold, (b) when the confidence score meets the confidence score threshold, determining if less than a maximum number of bounding box entries have been stored, (c) when less than the maximum number of bounding box entries have been stored, adding a new bounding box entry for the bounding box, (d) when the maximum number of bounding box entries have been stored, determining if the confidence score is greater than a lowest confidence score of the bounding box entries, and (e) when the confidence score is greater than the lowest confidence score of the bounding box entries, removing the bounding box entry with the lowest confidence score, and adding a new bounding box entry for the bounding box.
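Steps (a)-(e) amount to keeping the top-K detections above a threshold; a sketch using a min-heap (so the lowest-confidence entry is cheap to find and evict) follows, with the threshold, capacity, and box format assumed for illustration:

```python
import heapq

# Sketch of steps (a)-(e): keep at most MAX_BOXES best-scoring detections.
# CONF_THRESHOLD and MAX_BOXES are assumed example values.

CONF_THRESHOLD = 0.5
MAX_BOXES = 3

def filter_detections(detections):
    """detections: iterable of (confidence, box) pairs."""
    kept = []                                    # min-heap on confidence
    for conf, box in detections:
        if conf < CONF_THRESHOLD:                # (a) threshold test
            continue
        if len(kept) < MAX_BOXES:                # (b)/(c) room left: add
            heapq.heappush(kept, (conf, box))
        elif conf > kept[0][0]:                  # (d) beats current lowest?
            heapq.heapreplace(kept, (conf, box)) # (e) evict lowest, add new
    return sorted(kept, reverse=True)
```

Using a heap rather than a sorted list keeps each update O(log K), which matches the fixed-size entry store the abstract describes.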
G06V 10/25 - Determination of region of interest [ROI] or a volume of interest [VOI]
G06V 10/764 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
G06V 10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
44.
DEDICATED RAY MEMORY FOR RAY TRACING IN GRAPHICS SYSTEMS
A ray tracing unit implemented in a graphics rendering system includes processing logic configured to perform ray tracing operations on rays, a dedicated ray memory coupled to the processing logic and configured to store ray data for rays to be processed by the processing logic, an interface to a memory system, and control logic configured to manage allocation of ray data to either the dedicated ray memory or the memory system. Core ray data for rays to be processed by the processing logic is stored in the dedicated ray memory, and at least some non-core ray data for the rays is stored in the memory system. This allows core ray data for many rays to be stored in the dedicated ray memory without the size of the dedicated ray memory becoming too wasteful when the ray tracing unit is not in use.
A method for grouping primitives into pairs of adjoining triangles for use in a ray tracing process. An input list of edges of triangular primitives is obtained, and an edge bounding volume surface area (BVSA) and an additional edge qualifier are determined for each of the edges. The entries in the input list are sorted by edge BVSA then by edge qualifier, giving a sorted list in which the entries have a sorted order. The list is traversed in the sorted order to seek groups of matched edges within a predetermined window of list entries, each edge in a matched group having a matching edge BVSA and edge qualifier with another edge in the matched group from a different triangular primitive. When a group of matched edges is found, associated triangular primitives are designated as a cluster of adjoining primitives. The cluster of adjoining primitives is processed together as a group.
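The pairing idea can be sketched as follows; the abstract's sort key (edge BVSA plus a qualifier) is replaced here by the canonicalised vertex pair itself, which pairs identical shared edges the same way, so this is an illustrative simplification rather than the patented key:

```python
from collections import defaultdict

# Sketch: group triangles that share an edge by bucketing canonicalised
# edge keys, a simplification of the BVSA-plus-qualifier sort key.

def pair_triangles(triangles):
    """triangles: list of 3-tuples of vertex indices. Returns pairs of
    triangle indices that share an edge (each triangle used at most once)."""
    edges = defaultdict(list)
    for t, (a, b, c) in enumerate(triangles):
        for e in ((a, b), (b, c), (c, a)):
            edges[tuple(sorted(e))].append(t)   # canonical edge key
    pairs, used = [], set()
    for key in sorted(edges):                   # deterministic traversal
        tris = [t for t in edges[key] if t not in used]
        if len(tris) >= 2:                      # matched edges found
            pairs.append((tris[0], tris[1]))    # cluster of adjoining prims
            used.update(tris[:2])
    return pairs
```

Sorting by a per-edge key is what makes matched edges land near each other, so a bounded window over the sorted list suffices in the hardware-friendly version.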
A processor has a register bank to which software writes descriptors specifying tasks to be processed by a hardware pipeline. The register bank includes a plurality of register sets, each for holding the descriptor of a task. The processor includes a first selector operable to connect the execution logic to a selected one of the register sets and thereby enable the software to write successive ones of said descriptors to different ones of said register sets. The processor also includes a second selector operable to connect the hardware pipeline to a selected one of the register sets. The processor further comprises control circuitry configured to control the hardware pipeline to begin processing a current task based on the descriptor in a current one of the register sets while the software is writing the descriptor of another task to another of the register sets.
Conservative rasterization hardware comprises hardware logic arranged to perform an edge test calculation for each edge of a primitive and for two corners of each pixel in a microtile. The two corners that are used are selected based on the gradient of the edge and the edge test result for one corner is the inner coverage result and the edge test result for the other corner is the outer coverage result for the pixel. An overall outer coverage result for the pixel and the primitive is calculated by combining the outer coverage results for the pixel and each of the edges of the primitive in an AND gate. The overall inner coverage result for the pixel is calculated in a similar manner.
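The two-corner trick can be sketched with an edge function E(x, y) = a*x + b*y + c, inside where E >= 0; the sign convention and the software model are assumptions for illustration:

```python
# Sketch of conservative rasterization's corner selection: the pixel corner
# that maximises the edge function gives the outer-coverage test, the
# opposite corner the inner-coverage test. E(x,y) = a*x + b*y + c, with
# "inside" meaning E >= 0 (an assumed convention).

def edge_coverage(a, b, c, x0, y0, x1, y1):
    # pick corners from the signs of the edge gradient (a, b)
    outer_x, inner_x = (x1, x0) if a >= 0 else (x0, x1)
    outer_y, inner_y = (y1, y0) if b >= 0 else (y0, y1)
    outer = a * outer_x + b * outer_y + c >= 0   # any of the pixel inside?
    inner = a * inner_x + b * inner_y + c >= 0   # all of the pixel inside?
    return outer, inner

def primitive_coverage(edges, pixel):
    # combine per-edge results, as the hardware does with an AND gate
    outs, ins = zip(*(edge_coverage(*e, *pixel) for e in edges))
    return all(outs), all(ins)
```

A pixel with outer coverage but no inner coverage is partially covered, which is exactly the case conservative rasterization must not miss.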
G09G 5/36 - Control arrangements or circuits for visual indicators common to cathode-ray tube indicators and other visual indicators characterised by the display of individual graphic patterns using a bit-mapped memory
48.
DETECTING OUT-OF-BOUNDS VIOLATIONS IN A HARDWARE DESIGN USING FORMAL VERIFICATION
A formal verification tool is used with a hardware monitor to verify that a hardware design for an electronic device does not comprise a bug or error that can cause an instantiation of the hardware design to fetch an instruction from an out-of-bounds address. Formal assertions for a hardware design are received, wherein the formal assertions assert a formal property that compares a memory address from which an instruction was fetched by an instantiation of the hardware design to an allowable memory address range or an unallowable memory address range associated with an operating state of the instantiation of the hardware design when the fetch was performed. The tool formally verifies that the formal assertions are true for the hardware design to identify whether the hardware design has a bug or error that causes an out-of-bounds violation.
Hardware implementations of Deep Neural Networks (DNNs) and related methods with a variable output data format. Specifically, in the hardware implementations and methods described herein the hardware implementation is configured to perform one or more hardware passes to implement a DNN wherein during each hardware pass the hardware implementation receives input data for a particular layer, processes that input data in accordance with the particular layer (and optionally one or more subsequent layers), and outputs the processed data in a desired format based on the layer, or layers, that are processed in the particular hardware pass. In particular, when a hardware implementation receives input data to be processed, the hardware implementation also receives information indicating the desired format for the output data of the hardware pass and the hardware implementation is configured to, prior to outputting the processed data, convert the output data to the desired format.
G06F 1/3234 - Power saving characterised by the action undertaken
G06F 7/483 - Computations with numbers represented by a non-linear combination of denominational numbers, e.g. rational numbers, logarithmic number system or floating-point numbers
G06F 7/544 - Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state devices; Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using unspecified devices for evaluating functions by calculation
G06N 3/04 - Architecture, e.g. interconnection topology
A method of selecting, in hardware logic, an ith largest or a pth smallest number from a set of n m-bit numbers is described. The method is performed iteratively and in the rth iteration, the method comprises: summing an (m−r)th bit from each of the m-bit numbers to generate a summation result and comparing the summation result to a threshold value. Depending upon the outcome of the comparison, the rth bit of the selected number is determined and output and additionally the (m−r−1)th bit of each of the m-bit numbers is selectively updated based on the outcome of the comparison and the value of the (m−r)th bit in the m-bit number. In a first iteration, a most significant bit from each of the m-bit numbers is summed and each subsequent iteration sums bits occupying successive bit positions in their respective numbers.
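A behavioural model of the selection follows; the abstract's in-place bit updates are replaced here by explicit candidate filtering with threshold adjustment, which computes the same MSB-first result, so this is a software sketch rather than the hardware trick itself:

```python
# Behavioural model of MSB-first selection of the ith largest of n m-bit
# numbers: at each bit position, sum that bit across the candidates and
# compare the summation result to a threshold to decide the output bit.

def ith_largest(nums, i, m):
    result, candidates, threshold = 0, list(nums), i
    for r in range(m):
        bit = m - 1 - r                      # (m-r)th bit, MSB first
        ones = [x for x in candidates if (x >> bit) & 1]
        if len(ones) >= threshold:           # summation meets the threshold
            result |= 1 << bit               # emit a 1 bit, keep the ones
            candidates = ones
        else:                                # emit a 0 bit, keep the zeros
            threshold -= len(ones)
            candidates = [x for x in candidates
                          if not (x >> bit) & 1]
    return result
```

Each iteration fixes one output bit, so the selection completes in exactly m iterations regardless of n, which is what makes the bit-serial hardware form attractive.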
G06F 7/24 - Sorting, i.e. extracting data from one or more carriers, re-arranging the data in numerical or other ordered sequence, and re-recording the sorted data on the original carrier or on a different carrier or set of carriers
G06F 7/57 - Arithmetic logic units [ALU], i.e. arrangements or devices for performing two or more of the operations covered by groups or for performing logical operations
G06F 7/78 - Arrangements for rearranging, permuting or selecting data according to predetermined rules, independently of the content of the data for changing the order of data flow, e.g. matrix transposition or LIFO buffers; Overflow or underflow handling therefor
51.
COMPRESSING AND DECOMPRESSING IMAGE DATA USING COMPACTED REGION TRANSFORMS
A method of compressing a set of image value data items each representing a position in image-value space so as to define an occupied region thereof. A series of compression transforms is applied to subsets of the image data items to generate a transformed set of image data items occupying a compacted region of value space. A set of one or more reference data items is identified that quantizes the compacted region in value space. For each image data item in the set of image data items, a sequence of decompression transforms is identified that generates an approximation of that image data item when applied to a selected one of the reference data items. Each image data item is encoded as a representation of the identified sequence of decompression transforms for that image data item. The data items and the decompression transforms are stored as compressed image data.
H04N 19/167 - Position within a video image, e.g. region of interest [ROI]
H04N 19/186 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being a colour or a chrominance component
H04N 19/196 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the adaptation method, adaptation tool or adaptation type used for the adaptive coding being specially adapted for the computation of encoding parameters, e.g. by averaging previously computed encoding parameters
H04N 19/426 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation characterised by memory arrangements using memory downsizing methods
H04N 19/46 - Embedding additional information in the video signal during the compression process
H04N 19/60 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding
52.
COMPUTING SYSTEMS AND METHODS FOR PROCESSING GRAPHICS DATA USING COST INDICATIONS FOR SETS OF TILES OF A RENDERING SPACE
A computing system comprises graphics rendering logic and image processing logic. The graphics rendering logic processes graphics data to render an image using a rendering space which is sub-divided into a plurality of tiles. Cost indication logic obtains a cost indication for each of a plurality of sets of one or more tiles of the rendering space, wherein the cost indication for a set of one or more tiles is suggestive of a cost of processing rendered image values for a region of the rendered image corresponding to the set of one or more tiles. The image processing logic processes rendered image values for regions of the rendered image. The computing system causes the image processing logic to process rendered image values for regions of the rendered image in dependence on the cost indications for the corresponding sets of one or more tiles.
H04N 19/14 - Coding unit complexity, e.g. amount of activity or edge presence estimation
H04N 19/174 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a slice, e.g. a line of blocks or a group of blocks
53.
INTERSECTION TESTING IN RAY TRACING SYSTEMS USING HIERARCHICAL ACCELERATION STRUCTURES WITH IMPLICITLY REPRESENTED NODES
Ray tracing systems and methods generate a hierarchical acceleration structure for intersection testing in a ray tracing system. Nodes of the hierarchical acceleration structure are determined, each representing a region in a scene, and being linked to form the hierarchical acceleration structure. Data is stored representing the hierarchical acceleration structure, including data defining the regions represented by a plurality of the nodes. At least one node is an implicitly represented node, wherein data defining a region represented by an implicitly represented node is not explicitly included as part of the stored data but can be inferred from the stored data. Intersection testing in the ray tracing system is performed in which, based on conditions in the ray tracing system, a determination is made as to whether testing of one or more rays for intersection with a region represented by a particular node of a sub-tree is to be skipped.
Hardware tessellation units include a sub-division logic block that comprises hardware logic arranged to perform a sub-division of a patch into two (or more) sub-patches. The hardware tessellation units also include a decision logic block that is configured to determine whether a patch is to be sub-divided or not, and one or more hardware elements that control the order in which tessellation occurs. In various examples, the hardware element is a patch stack that operates a first-in, last-out scheme; in other examples, there are one or more selection logic blocks that are configured to receive patch data for more than one patch or sub-patch and output the patch data for a selected one of the received patches or sub-patches.
A histogram-based method of selecting a fixed point number format for representing a set of values input to, or output from, a layer of a Deep Neural Network (DNN). The method comprises obtaining a histogram that represents an expected distribution of the set of values of the layer, each bin of the histogram is associated with a frequency value and a representative value in a floating point number format; quantising the representative values according to each of a plurality of potential fixed point number formats; estimating, for each of the plurality of potential fixed point number formats, the total quantisation error based on the frequency values of the histogram and a distance value for each bin that is based on the quantisation of the representative value for that bin; and selecting the fixed point number format associated with the smallest estimated total quantisation error as the optimum fixed point number format for representing the set of values of the layer.
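The format search above can be sketched directly; the 8-bit total width, symmetric signed range, and candidate fractional-bit options are assumptions for illustration:

```python
# Sketch of histogram-based fixed point format selection: for each candidate
# number of fractional bits, quantise every bin's representative value and
# accumulate frequency-weighted error; pick the format with the least error.
# Bit widths and the error metric (absolute error) are assumed examples.

def quantise(value, frac_bits, total_bits=8):
    scale = 1 << frac_bits
    top = (1 << (total_bits - 1)) - 1          # symmetric signed range
    q = max(-top - 1, min(top, round(value * scale)))
    return q / scale

def best_format(bins, frac_bit_options=range(0, 8)):
    """bins: list of (representative_value, frequency) pairs."""
    def total_error(fb):
        return sum(freq * abs(rep - quantise(rep, fb))
                   for rep, freq in bins)
    return min(frac_bit_options, key=total_error)
```

Because the error is estimated from the histogram's frequencies rather than the raw values, the search cost depends only on the number of bins, not on the size of the data set.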
G06N 3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
G06F 7/499 - Denomination or exception handling, e.g. rounding or overflow
G06F 7/544 - Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state devices; Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using unspecified devices for evaluating functions by calculation
Hardware for implementing a Deep Neural Network (DNN) having a convolution layer. A plurality of convolution engines are each operable to perform a convolution operation by applying a filter to a data window, each filter comprising a set of weights for combination with respective data values of a data window, and each of the plurality of convolution engines comprising: multiplication logic operable to combine a weight of a filter with a respective data value of a data window; control logic configured to cause the multiplication logic to combine a weight with a respective data value if the weight is non-zero, and otherwise not cause the multiplication logic to combine that weight with that data value; and accumulation logic configured to accumulate the results of a plurality of combinations performed by the multiplication logic so as to form an output for a respective convolution operation.
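The zero-skipping multiply-accumulate can be sketched in a few lines; the flattened window representation is an assumption for illustration:

```python
# Sketch of a convolution engine's multiply-accumulate with zero-weight
# skipping: a multiplication is only issued when the weight is non-zero,
# which is where sparse filters save work.

def convolve_window(weights, window):
    acc = 0
    mults = 0                       # count of multiplications issued
    for w, d in zip(weights, window):
        if w != 0:                  # control logic: skip zero weights
            acc += w * d
            mults += 1
    return acc, mults

acc, mults = convolve_window([0, 2, 0, -1], [5, 3, 7, 4])
```

With half the weights zero, half the multiplications are skipped while the accumulated result is unchanged.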
G06N 3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
G06F 7/544 - Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state devices; Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using unspecified devices for evaluating functions by calculation
Transforming rendered frames in a graphics processing system to obtain enhanced frames with desired characteristics of a set of target images includes selecting a plurality of shaders, each defined by a parametrized mathematical function arranged to replicate a particular visual characteristic. For each shader, parameters of the parametrized mathematical function have been derived in dependence on a set of target images so that the shader is arranged to impose its respective particular visual characteristic in dependence on an extent to which the particular visual characteristic is exhibited in the target images. The plurality of shaders are combined to form a pipeline; one or more rendered frames are obtained; the pipeline is applied to at least a portion of the one or more rendered frames to obtain enhanced frames; and the enhanced frames, which exhibit visual characteristics of the target images, are output for display.
An integrated circuit includes a memory configured to store a plurality of functions; a mapping interface configured to perform a mapping from a received first signal to a first function of the plurality of functions; and a state machine configured to, in response to said mapping, execute the first function; wherein the integrated circuit is arranged to, in dependence on the execution of the first function at the state machine, modify said mapping between the first signal and the first function so as to re-map the first signal to a second function of the plurality of functions such that, on receiving a subsequent first signal, the state machine is configured to execute the second function.
G06F 15/177 - Initialisation or configuration control
G05B 19/04 - Programme control other than numerical control, i.e. in sequence controllers or logic controllers
G05B 19/045 - Programme control other than numerical control, i.e. in sequence controllers or logic controllers using logic state machines, consisting only of a memory or a programmable logic device containing the logic for the controlled machine and in which the state of its outputs is dependent on the state of its inputs or part of its own output states, e.g. binary decision controllers, finite state controllers
A compressed data structure that encodes a set of Haar coefficients for a 2×2 quad of pixels of a block of pixels is decoded. The set of Haar coefficients comprises differential coefficients and an average coefficient. A first portion of the compressed data structure encodes the differential coefficients for the 2×2 quad of pixels. A second portion of the compressed data structure encodes the average coefficient for the 2×2 quad of pixels. The first portion is used to determine signs and exponents of the differential coefficients which are non-zero. The second portion is used to determine a representation of the average coefficient. The result of a weighted sum of the differential coefficients and the average coefficient for the 2×2 quad of pixels is determined using: (i) the determined signs and exponents for the differential coefficients which are non-zero, (ii) the determined representation of the average coefficient, and (iii) respective weights for the differential coefficients. The determined result is used to determine the decoded value.
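The weighted-sum reconstruction amounts to an inverse 2×2 Haar transform; a sketch follows, with one common sign convention assumed for illustration (the encoded-structure details from the abstract are not modelled):

```python
# Sketch of 2x2 Haar analysis/synthesis: an average coefficient a and three
# differential coefficients (h: horizontal, v: vertical, d: diagonal).
# The sign convention below is one common choice, assumed for illustration.

def inverse_haar_quad(a, h, v, d):
    return [
        [a + h + v + d, a - h + v - d],   # top-left,    top-right
        [a + h - v - d, a - h - v + d],   # bottom-left, bottom-right
    ]

def forward_haar_quad(q):
    (p00, p01), (p10, p11) = q
    a = (p00 + p01 + p10 + p11) / 4
    h = (p00 - p01 + p10 - p11) / 4
    v = (p00 + p01 - p10 - p11) / 4
    d = (p00 - p01 - p10 + p11) / 4
    return a, h, v, d
```

Each decoded pixel is exactly the abstract's weighted sum: the average coefficient plus the differentials taken with per-pixel weights of +1 or -1.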
A rendering optimisation identifies a draw call within a current render (which may be the first draw call in the render or a subsequent draw call in the render) and analyses a last shader in the series of shaders used by the draw call to determine whether the last shader samples from one or more buffers at coordinates matching the current fragment location. If this determination is positive, the method recompiles the last shader to replace an instruction that reads data from one of the one or more buffers at coordinates matching the current fragment location with an instruction that reads from coordinates stored in on-chip registers.
G09G 5/36 - Control arrangements or circuits for visual indicators common to cathode-ray tube indicators and other visual indicators characterised by the display of individual graphic patterns using a bit-mapped memory
61.
INTERSECTION TESTING IN A RAY TRACING SYSTEM USING MULTIPLE RAY BUNDLE INTERSECTION TESTS
Ray tracing systems and computer-implemented methods are described for performing intersection testing on a bundle of rays with respect to a box. Silhouette edges of the box are identified from the perspective of the bundle of rays. For each of the identified silhouette edges, components of a vector providing a bound to the bundle of rays are obtained and it is determined whether the vector passes inside or outside of the silhouette edge. Results of determining, for each of the identified silhouette edges, whether the vector passes inside or outside of the silhouette edge, are used to determine an intersection testing result for the bundle of rays with respect to the box.
Methods and tessellation modules for tessellating a patch to generate tessellated geometry data representing the tessellated patch. A plurality of tessellation pipelines operating in parallel as a core each process a respective patch of a set of patches to identify tessellation factors for the patches of the set. Tessellation instances to be used in tessellating the patches of the set are determined based on the identified tessellation factors. An allocation of the tessellation instances amongst the tessellation pipelines of the core is determined, and the tessellation instances are processed at their allocated tessellation pipelines to generate tessellated geometry data associated with the respective allocated tessellation instances.
A data processing device for detecting motion in a sequence of frames each comprising one or more blocks of pixels, includes a sampling unit configured to determine image characteristics at a set of sample points of a block, a feature generation unit configured to form a current feature for the block, the current feature having a plurality of values derived from the sample points, and motion detection logic configured to generate a motion output for a block by comparing the current feature for the block to a learned feature representing historical feature values for the block.
A computer system has a plurality of operating systems, each operating system including a graphics processing unit (GPU) driver; a GPU including GPU firmware for controlling the execution of tasks at the graphics processing unit and, for each operating system: a firmware state register modifiable by the GPU firmware and indicating whether the GPU firmware is online; an OS state register modifiable by a GPU driver and indicating whether the GPU driver is online; and a memory management unit mediating access to GPU registers such that each operating system can access its respective registers but not those of other operating systems. One of the GPU drivers is a host GPU driver that initialises the GPU and brings the GPU firmware online. Each GPU driver submits tasks for processing only if its respective firmware state register indicates that the GPU firmware is online. The GPU processes tasks for an operating system if the respective OS state register of that operating system indicates that the GPU driver is online.
A circuit for mapping N coordinates to a 1D space receives N input bit-strings representing respective coordinates, which can be of different sizes; produces a grouped bit-string therefrom, in which the bits, including non-data bits, are grouped into groups of bits originating from the same bit position per group; and demultiplexes this into n=1 . . . N demultiplexed bit-strings, and sends each to a respective n-coordinate channel. The nth demultiplexed bit-string includes a respective part of the grouped bit-string that has n coordinate data bits and N-n non-data bits per group, and all other groups filled with null bits. Each channel but the N-coordinate channel includes bit-packing circuitry which packs down the respective demultiplexed bit-string by removing the non-data bits, and removing the same number of bits per group from the null bits.
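As a rough behavioural sketch of the end result of such a mapping (not the staged demultiplex-and-pack circuitry described above), an MSB-first bit interleave of N coordinates of differing widths, with exhausted coordinates skipped so that no null bits remain in the output, could look like this. The function name and interface are illustrative only.

```python
def interleave(coords, widths):
    """Behavioural reference for mapping N coordinates to a 1D
    (Morton-style) index by interleaving bits, MSB-first. Coordinates
    may have different widths; once a shorter coordinate runs out of
    bits, the remaining coordinates pack down contiguously."""
    out = 0
    max_w = max(widths)
    for bit in range(max_w - 1, -1, -1):   # walk from the MSB down
        for c, w in zip(coords, widths):
            if bit < w:                    # this coordinate has a data bit here
                out = (out << 1) | ((c >> bit) & 1)
    return out
```

For equal widths this reduces to the classic Morton interleave; for unequal widths the high bits of the wider coordinates are emitted alone, which is the compaction the bit-packing circuitry achieves in hardware.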
A method of performing anisotropic texture filtering includes generating one or more parameters describing an elliptical footprint in texture space; performing isotropic filtering at each of a plurality of sampling points in an ellipse to be sampled, the ellipse to be sampled based on the elliptical footprint; and combining results of the isotropic filtering at each of the plurality of sampling points to generate a combination result by a sequence of linear interpolations, wherein each linear interpolation in the sequence of linear interpolations comprises blending a result of a previous linear interpolation in the sequence with the isotropic filtering results for one or more of the plurality of sampling points, the one or more of the plurality of sampling points for a linear interpolation being closer to a midpoint of the major axis of the elliptical footprint than the one or more of the plurality of sampling points for the previous linear interpolation in the sequence.
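The combining step above can be illustrated with a small sketch: a chain of linear interpolations, each blending the running result with the next sample, can realise a normalised weighted sum. The sample ordering and blend factors below are an illustrative assumption, not the patent's specific filter.

```python
def lerp(a, b, t):
    return (1.0 - t) * a + t * b

def combine_by_lerps(samples, weights):
    """Combine isotropic filter results with a chain of linear
    interpolations. Samples are assumed ordered from the ends of the
    major axis toward its midpoint, so each lerp blends the running
    result with a sample nearer the midpoint. The blend factor w/acc_w
    makes the chain equal to the weighted sum sum(w*s)/sum(w)."""
    acc, acc_w = samples[0], weights[0]
    for s, w in zip(samples[1:], weights[1:]):
        acc_w += w
        acc = lerp(acc, s, w / acc_w)  # fold the new sample in with its share
    return acc
```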
A neural network is compressed by selecting two or more adjacent layers, each having one or more input channels and one or more output channels, a first layer performing a first operation and a second layer performing a second operation. First and second matrices representative of sets of coefficients of the first and second layers are determined, having a plurality of elements representative of non-zero values and a plurality of elements representative of zero values. An array is formed comprising the first matrix and the second matrix by aligning the columns or rows of the first matrix that are representative of the output channels of the first layer with the columns or rows of the second matrix that are representative of the input channels of the second layer. The rows and/or columns of the array are rearranged into respective first and second sub-matrices. A compressed neural network is then outputted.
A hardware implementation of a neural network and a method of processing data in such a hardware implementation are disclosed. Input data for a plurality of layers of the network is processed in blocks, to generate respective blocks of output data. The processing proceeds depth-wise through the plurality of layers, evaluating all layers of the plurality of layers for a given block, before proceeding to the next block.
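The depth-wise traversal order described above can be sketched as a pair of nested loops, with blocks on the outside and layers on the inside (a toy sketch: halo/overlap handling between adjacent blocks is omitted, and layers are plain functions).

```python
def depth_first(blocks, layers):
    """Process each input block through all of the layers before moving
    on to the next block, so only one block's intermediate data is live
    at a time."""
    outputs = []
    for block in blocks:       # outer loop: one block at a time
        x = block
        for layer in layers:   # inner loop: depth-wise through the layers
            x = layer(x)
        outputs.append(x)
    return outputs
```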
Methods and decoding units for decoding a compressed data structure to determine a decoded value. The compressed data structure encodes a set of Haar coefficients for a block of pixels, including a plurality of differential coefficients and a sum coefficient. The compressed data structure includes a set of exponent bits representing exponents of the differential coefficients, a set of sign bits representing signs for the differential coefficients, and a set of sum bits representing the sum coefficient. The compressed data structure is unpacked to identify the set of exponent bits, the set of sign bits and the set of sum bits. The identified set of exponent bits is used to determine exponents for the differential coefficients. The identified set of sign bits is used to determine signs of the differential coefficients. The identified set of sum bits is used to determine the sum coefficient. The decoded value is determined by determining the result of a weighted sum of the differential coefficients and the sum coefficient for the block of pixels.
A method and data processing system for resampling a first set of samples using a neural network accelerator. The first set of samples is arranged in a tensor extending in at least a first dimension defined in a first coordinate system. A set of resampling parameters is determined, having a first resampling factor a_1/b_1 for a first dimension, and a first offset d_1 for the first dimension. At least a first number of kernels is obtained, and the first set of samples is resampled to produce a second set of samples, based on the first resampling factor and the first offset.
Systems and methods of performing convolution efficiently by adapting the Winograd algorithm are provided. Methods of convolving an input tensor with weights w use hardware comprising a plurality of linear operation engines as part of performing adaptations of a Winograd algorithm, the Winograd algorithm splitting each input channel i of a total of Cin input channels into one or more tiles d_i and calculating a result A[Σ_{i=1}^{Cin} (G w_{ji} G^T) ∘ (B^T d_i B)] A^T for each output channel j, wherein G, B and A are constant matrices. The methods comprise determining a first filter F1 from matrix B, wherein the filter F1 comprises n kernels, each kernel being an outer product of two columns of the matrix B; and using the linear operation engines to perform a convolution of the input tensor with the first filter F1.
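For reference, the core Winograd formula above can be evaluated directly for a single tile and kernel. The sketch below uses the widely published F(2x2, 3x3) constant matrices (one common choice; the transpose convention for A differs between sources) and is a numerical reference, not the hardware scheme described in the abstract.

```python
import numpy as np

# Standard Winograd F(2x2, 3x3) constant matrices.
G = np.array([[1, 0, 0],
              [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0, 0, 1]])
Bt = np.array([[1, 0, -1, 0],
               [0, 1, 1, 0],
               [0, -1, 1, 0],
               [0, 1, 0, -1]], dtype=float)
At = np.array([[1, 1, 1, 0],
               [0, 1, -1, -1]], dtype=float)

def winograd_tile(d, w):
    """One 4x4 input tile d and one 3x3 kernel w -> 2x2 output tile,
    computed as A^T [ (G w G^T) ∘ (B^T d B) ] A."""
    U = G @ w @ G.T    # transformed weights (4x4)
    V = Bt @ d @ Bt.T  # transformed input tile (4x4)
    return At @ (U * V) @ At.T
```

The 2x2 result matches the valid correlation of the 4x4 tile with the 3x3 kernel while needing only 16 elementwise multiplies instead of 36.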
A method of managing task dependencies within a task queue of a GPU determines a class ID and a resource ID for a task, and also for any parent task of the task, and outputs the class IDs and resource IDs for both the task itself and any parent task of the task for storage associated with the task in a task queue. The class ID identifies a class of the task from a hierarchy of task classes and the resource ID of the task identifies resources allocated and/or written to by the task.
A method of managing shared register allocations in a GPU includes, in response to receiving an allocating task, searching a shared register allocation cache for a cache entry with a cache index that identifies a secondary program that is associated with the allocating task. In response to identifying a cache entry with a cache index that identifies the secondary program that is associated with the allocating task, the method returns an identifier of the cache entry and status information indicating a cache hit. Returning the identifier of the cache entry causes the identifier of the cache entry to be associated with the allocating task and returning the status information indicating a cache hit causes the allocating task not to be issued.
G06F 12/0871 - Allocation or management of cache space
G06F 12/084 - Multiuser, multiprocessor or multiprocessing cache systems with a shared cache
G06F 12/0891 - Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches using clearing, invalidating or resetting means
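The hit/miss behaviour of the shared register allocation cache described above can be sketched with a simple map keyed by the secondary program. All names here are illustrative, not the patent's interface.

```python
def lookup_or_allocate(cache, secondary_program, next_id):
    """Minimal sketch: the cache is indexed by the secondary program
    associated with an allocating task. A hit returns the existing
    entry's identifier and 'hit' status, signalling that the allocating
    task need not be issued; a miss allocates a new entry."""
    if secondary_program in cache:
        return cache[secondary_program], "hit"   # reuse existing allocation
    cache[secondary_program] = next_id           # record the new entry
    return next_id, "miss"
```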
74.
END-TO-END DATA FORMAT SELECTION FOR HARDWARE IMPLEMENTATION OF DEEP NEURAL NETWORK
Methods for selecting fixed point number formats for representing values input to and/or output from layers of a Deep Neural Network (DNN) which take into account the impact of the fixed point number formats for a particular layer in the DNN. The fixed point number format(s) used to represent sets of values input to and/or output from a layer are selected one layer at a time in a predetermined sequence wherein any layer is preceded in the sequence by the layer(s) on which it depends. The fixed point number format(s) for each layer is/are selected based on the error in the output of the DNN associated with the fixed point number formats. Once the fixed point number format(s) for a layer has/have been selected, any calculation of the error in the output of the DNN for a subsequent layer in the sequence is based on that layer being configured to use the selected fixed point number formats.
G06F 7/483 - Computations with numbers represented by a non-linear combination of denominational numbers, e.g. rational numbers, logarithmic number system or floating-point numbers
G06N 3/04 - Architecture, e.g. interconnection topology
A light map for a scene is determined for use in rendering the scene in a graphics processing system. Initial lighting indications representing lighting within the scene are determined. For a texel position of the light map, the initial lighting indications are sampled using an importance sampling technique to identify positions within the scene. Sampling rays are traced between a position in the scene corresponding to the texel position of the light map and the respective identified positions within the scene. A lighting value is determined for the texel position of the light map using results of the tracing of the sampling rays. By using the importance sampling method described herein, the rays which are traced are more likely to be directed towards more important regions of the scene which contribute more to the lighting of a texel.
A method of generating identifiers (IDs) for primitives and optionally vertices during tessellation. The IDs include a binary sequence of bits that represents the sub-division steps taken during the tessellation process and so encodes the way in which tessellation has been performed. Such an ID may subsequently be used to generate a random primitive or vertex and hence recalculate vertex data for that primitive or vertex.
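One way to encode a sequence of binary sub-division steps as an ID is to shift the step bits into an integer behind a sentinel bit, so that the depth and any leading zero steps survive. This encoding is an illustrative assumption, not the patent's exact bit layout.

```python
def subdivision_id(path):
    """Build a primitive/vertex ID from the binary sub-division steps
    taken during tessellation. `path` is a list of 0/1 choices (e.g.
    0 = left child, 1 = right child). The leading sentinel 1 keeps
    paths like [0] and [0, 0] distinct."""
    value = 1                        # sentinel bit encodes the depth
    for step in path:
        value = (value << 1) | step  # append one sub-division choice
    return value
```

Because the ID records the exact sequence of sub-divisions, walking its bits again reproduces the same primitive or vertex, which is what allows vertex data to be recalculated on demand.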
A graphics processing unit configured to process graphics data using a rendering space which is sub-divided into a plurality of tiles. The graphics processing unit comprises a tiling unit and rendering logic. The tiling unit is arranged to generate a tile control list for each tile, the tile control list identifying each graphics data item present in the tile. The rendering logic is arranged to render the tiles using the tile control lists generated by the tiling unit. The tiling unit comprises per-tile hash generation logic arranged to generate, for each tile, a per-tile hash value based on a set of textures that will be accessed when processing the tile in the rendering logic, and the tiling unit is further arranged to store the per-tile hash value for a tile within the tile control list for the tile.
A hardware monitor arranged to detect livelock in a hardware design for an integrated circuit. The hardware monitor includes monitor and detection logic configured to detect when a particular state has occurred in an instantiation of the hardware design; and assertion evaluation logic configured to periodically evaluate one or more assertions that assert a formal property related to reoccurrence of the particular state in the instantiation of the hardware design to detect whether the instantiation of the hardware design is in a livelock comprising the particular state. The hardware monitor may be used by a formal verification tool to exhaustively verify that the instantiation of the hardware design cannot enter a livelock comprising the particular state.
A binary logic circuit and method for rounding an unsigned normalised n-bit binary number to an m-bit binary number. A correction value of length n bits and a pre-truncation value of length n bits are determined. The correction value is determined by shifting the n-bit number by m bits. The pre-truncation value is determined based on at least the n-bit number, the correction value, a value for the most significant bit (MSB) of the n-bit number, and a rounding value having a ‘1’ at the n−mth bit position and a ‘0’ at all other bits. The rounded m-bit number is then obtained by truncating the n−m least significant bits (LSB) of the pre-truncation value.
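The function such a circuit computes is the re-quantisation of an unsigned normalised value, i.e. round(x · (2^m − 1)/(2^n − 1)). The sketch below is a behavioural reference for that function only; the circuit described above realises it with a shift-derived correction value and a rounding constant rather than a multiply and divide.

```python
def round_unorm(x, n, m):
    """Behavioural reference: round an unsigned normalised n-bit value
    x to the nearest m-bit normalised value, i.e.
    round(x * (2^m - 1) / (2^n - 1)). Since 2^n - 1 is odd, exact ties
    cannot occur, so round-to-nearest is unambiguous."""
    assert 0 <= x < (1 << n)
    num = x * ((1 << m) - 1)
    den = (1 << n) - 1
    return (num + den // 2) // den
```

This maps 0 to 0 and the maximum n-bit value to the maximum m-bit value, as required for normalised (UNORM-style) data.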
A set of payload flip-flops receives an input instance of a payload, and outputs an output instance from which a first instance of an error check signal is generated. One or more error check flip-flops receive the first instance and output a second instance. The input payload instance is clocked into the payload flip-flops if a payload enable signal is asserted, and the first error-check signal instance is clocked into the error check flip-flops if an error check enable signal is asserted. The input payload instance is input to the set of payload flip-flops over two clock cycles, and the payload enable signal is asserted for the two clock cycles. The error check enable signal is asserted on the second cycle. The first instance of the error check signal is compared with the second instance and an error signal is asserted if they do not match.
A hardware design for a main data transformation component is verified. The main data transformation component is representable as a hierarchical set of data transformation components which includes (i) leaf data transformation components which do not have children, and (ii) parent data transformation components which comprise one or more child data transformation components. For each of the leaf data transformation components, it is verified that an instantiation of the hardware design generates an expected output transaction. For each of the parent data transformation components, it is formally verified that an instantiation of an abstracted hardware design generates an expected output transaction in response to each of test input transactions. The abstracted hardware design represents each of the child data transformation components of the parent data transformation component with a corresponding abstracted component that for a specific input transaction to the child data transformation component produces a specific output transaction with a causal deterministic relationship to the specific input transaction.
Methods and encoding units for encoding a block of pixels into a compressed data structure. A set of Haar coefficients is determined for the block of pixels, including differential coefficients and a sum coefficient. A set of exponent bits is determined representing exponents for the differential coefficients. A set of sign bits is determined representing signs for the differential coefficients. A set of sum bits is determined representing the sum coefficient. The determined set of exponent bits is packed into a first portion of the compressed data structure; the determined set of sign bits is packed into a second portion of the compressed data structure; and the determined set of sum bits is packed into a third portion of the compressed data structure. The compressed data structure is stored.
H04N 19/176 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a block, e.g. a macroblock
H04N 19/182 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being a pixel
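For a 2x2 block, the coefficient set the encoder above packs can be sketched as one sum coefficient plus three differentials. The unnormalised form and the exact differential definitions here are assumptions for illustration; the patent's normalisation may differ.

```python
def haar_2x2(block):
    """Compute Haar coefficients for a 2x2 pixel block [[a, b], [c, d]]:
    a sum coefficient and three differential coefficients (horizontal,
    vertical, diagonal), unnormalised."""
    (a, b), (c, d) = block
    s  = a + b + c + d      # sum coefficient
    dh = (a + c) - (b + d)  # horizontal differential (left - right)
    dv = (a + b) - (c + d)  # vertical differential (top - bottom)
    dd = (a + d) - (b + c)  # diagonal differential
    return s, dh, dv, dd
```

The encoder then stores each differential as a sign bit plus an exponent, and the sum coefficient separately, which is the three-portion packing the abstract describes.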
83.
FLEXIBLE CACHE STRUCTURE FOR CACHING COMPRESSED AND UNCOMPRESSED DATA
A device in which each field in a first RAM together with a respective field in a second RAM form a respective entry of a cache RAM. Caching circuitry is operable to select between applying a first mode and a second mode in at least one entry in the cache RAM. In the first mode, the respective field in the first RAM is used to hold a first portion of a single cacheline in a first format, and the respective field in the second RAM is used to hold the corresponding tag of the single cacheline and a remaining portion of the single cacheline. In the second mode, the first RAM is used to hold a plurality of cachelines in a second format shorter than the first format, and the corresponding entry in the second RAM is used to hold the corresponding tags of the plural cachelines.
A device in which each field in a first RAM together with a respective field in a second RAM form a respective entry of a cache RAM. Caching circuitry is operable to use the respective field in the first RAM to hold a first portion of a single cacheline, and the respective field in the second RAM to hold the corresponding tag of the single cacheline and a remaining portion of the single cacheline. The caching circuitry is further arranged so as, upon a cache hit by a subsequent memory access operation requesting to access data for which a corresponding cacheline has already been cached, to retrieve the corresponding tag and the remaining portion of the respective cacheline from the second RAM in a first one of a sequence of clock cycles.
A computer implemented method of compressing a neural network, the method comprising: receiving a neural network; determining a matrix representative of a set of coefficients of a layer of the received neural network, the layer being arranged to perform an operation, the matrix comprising a plurality of elements representative of non-zero values and a plurality of elements representative of zero values; rearranging the rows and/or columns of the matrix so as to gather the plurality of elements representative of non-zero values of the matrix into one or more sub-matrices, the one or more sub-matrices having a greater number of elements representative of non-zero values per total number of elements of the one or more sub-matrices than the number of elements representative of non-zero values per total number of elements of the matrix; and outputting a compressed neural network comprising a compressed layer arranged to perform a compressed operation in dependence on the one or more sub-matrices.
Methods of implementing a sparse submanifold convolution on a graphics processing unit. The methods include: receiving, at the graphics processing unit, an input tensor in a dense format; identifying, at the graphics processing unit, active positions of the input tensor; performing, at the graphics processing unit, an indexed unfold operation on the input tensor based on the identified active positions to generate an input matrix comprising elements of the input tensor in each active window of the input tensor; and performing, at the graphics processing unit, a matrix multiplication between a weight matrix and the input matrix to generate an output matrix that comprises elements of an output tensor of the sparse submanifold convolution based on the active windows. The methods may further comprise performing, at the graphics processing unit, an indexed fold operation on the output matrix based on the active windows to generate an output tensor in a dense format.
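The unfold-then-matmul scheme above can be sketched for the single-channel 3x3 case: gather one column per active window, multiply by the flattened weights, and fold the results back to the active positions. Zero border padding and the dense single-channel layout are assumptions of this sketch.

```python
import numpy as np

def sparse_submanifold_conv(x, w):
    """Sketch of a 3x3 sparse submanifold convolution on a 2D
    single-channel tensor: outputs are produced only at active
    (non-zero) input positions, keeping the output as sparse as the
    input."""
    xp = np.pad(x, 1)                  # zero border padding (assumed)
    active = np.argwhere(x != 0)       # active positions of the input
    # Indexed unfold: one 9-element column per active window.
    cols = np.stack([xp[r:r + 3, c:c + 3].ravel() for r, c in active],
                    axis=1)
    vals = w.ravel() @ cols            # matrix product with the weights
    y = np.zeros_like(x, dtype=vals.dtype)
    for (r, c), v in zip(active, vals):
        y[r, c] = v                    # indexed fold back to dense format
    return y
```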
Methods of implementing a standard convolution on a graphics processing unit. The methods include: receiving, at the graphics processing unit, an input tensor in a dense format; identifying, at the graphics processing unit, active positions of the input tensor; performing, at the graphics processing unit, an indexed unfold operation on the input tensor based on the identified active positions of the input tensor to generate an input matrix comprising elements of the input tensor in each non-zero window of the input tensor; and performing, at the graphics processing unit, a matrix multiplication between a weight matrix and the input matrix to generate an output matrix that comprises elements of an output tensor of the standard convolution based on the non-zero windows of the input tensor.
Methods of implementing a sparse submanifold convolution using a neural network accelerator. The methods include: receiving, at the neural network accelerator, an input tensor in a sparse format; performing, at the neural network accelerator, for each position of a kernel of the sparse submanifold convolution, a 1×1 convolution between the received input tensor and weights of filters of the sparse submanifold convolution at that kernel position to generate a plurality of partial outputs; and combining appropriate partial outputs of the plurality of partial outputs to generate an output tensor of the sparse submanifold convolution in sparse format.
Methods of implementing a standard deconvolution on a graphics processing unit, the standard deconvolution being representable as a direct convolution between an input tensor to the standard deconvolution and each of a plurality of sub-filters, each sub-filter of the plurality of sub-filters comprising a subset of weights of a filter of the standard deconvolution. The methods include: receiving, at the graphics processing unit, the input tensor in a dense format; identifying, at the graphics processing unit, active positions of the received input tensor; performing, at the graphics processing unit, an indexed unfold operation on the input tensor based on the identified active positions to generate an input matrix comprising elements of the input tensor in each non-zero sub-window of the input tensor; and performing, at the graphics processing unit, a matrix multiplication between a weight matrix and the input matrix to generate an output matrix that comprises elements of an output tensor of the standard deconvolution that are based on the non-zero sub-windows of the input tensor.
Methods of implementing a sparse submanifold deconvolution on a graphics processing unit, the sparse submanifold deconvolution being representable as a direct convolution between an input tensor to the sparse submanifold deconvolution and each of a plurality of sub-filters, each sub-filter of the plurality of sub-filters comprising a subset of weights of a filter of the sparse submanifold deconvolution. The methods include: receiving, at the graphics processing unit, the input tensor in a dense format; receiving, at the graphics processing unit, information identifying target positions of an output tensor of the sparse submanifold deconvolution; performing, at the graphics processing unit, an indexed unfold operation on the input tensor based on the identified target positions of the output tensor to generate an input matrix comprising elements of the input tensor in each sub-window of the input tensor relevant to at least one of the identified target positions of the output tensor; and performing, at the graphics processing unit, a matrix multiplication between a weight matrix and the input matrix to generate an output matrix that comprises elements of the output tensor at the identified target positions.
A method of detecting an error at a graphics processing unit causes an instruction including a request for a response to be provided to the graphics processing unit. A timer configured to expire after a time period is initialised, and during the time period the graphics processing unit is monitored for the response. An error is determined to have occurred in response to determining that no response was received from the graphics processing unit before the timer expired.
Methods and systems for generating common priority information for a plurality of requestors in a computing system that share a plurality of computing resources for use in a next cycle to arbitrate between the plurality of requestors, include generating, for each resource, priority information for the next cycle based on an arbitration scheme; generating, for each resource, relevant priority information for the next cycle based on the priority information for the next cycle for that resource, the relevant priority information for a resource being the priority information that relates to requestors that requested access to the resource in the current cycle and were not granted access to the resource in the current cycle; and combining the relevant priority information for the next cycle for each resource to generate the common priority information for the next cycle.
G06F 9/48 - Program initiating; Program switching, e.g. by interrupt
G06F 9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
G06F 13/16 - Handling requests for interconnection or transfer for access to memory bus
G06F 13/18 - Handling requests for interconnection or transfer for access to memory bus with priority control
G06F 13/20 - Handling requests for interconnection or transfer for access to input/output bus
G06F 13/28 - Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access, cycle steal
G06F 13/364 - Handling requests for interconnection or transfer for access to common bus or bus system with centralised access control using independent requests or grants, e.g. using separated request and grant lines
A block of image data having a plurality of image element values each having a plurality of data values relating to a respective plurality of channels is compressed, wherein the channels comprise a reference channel and non-reference channels. For each of the non-reference channels: (i) a number of bits for a non-channel decorrelating mode is determined for losslessly representing a difference between a maximum and a minimum of the non-reference channel data values; (ii) decorrelated data values are determined by finding a difference between the data value of the non-reference channel and the data value of the reference channel; (iii) a number of bits for a channel decorrelating mode is determined for losslessly representing a difference between a maximum and a minimum of the decorrelated data values of the non-reference channel; (iv) the determined number of bits for the non-channel decorrelating mode is compared with the determined number of bits for the channel decorrelating mode; and (v) either the channel decorrelating mode or the non-channel decorrelating mode is selected, wherein if the channel decorrelating mode is selected then the decorrelated data values are used in place of the data values for determining compressed channel data for the non-reference channel.
H04N 19/103 - Selection of coding mode or of prediction mode
H04N 19/176 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a block, e.g. a macroblock
H04N 19/186 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being a colour or a chrominance component
H04N 19/46 - Embedding additional information in the video signal during the compression process
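The mode-selection rule in the block-compression method above amounts to comparing, per non-reference channel, the bits needed to code the raw value range against the bits needed to code the reference-decorrelated range. A minimal sketch (names illustrative; per-channel lists stand in for the block's data values):

```python
def bits_needed(values):
    """Bits to losslessly represent offsets spanning max - min."""
    span = max(values) - min(values)
    return span.bit_length()

def choose_mode(ref, chan):
    """For one non-reference channel, pick the cheaper of the
    non-channel-decorrelating mode (code raw values) and the channel
    decorrelating mode (code differences against the reference
    channel). Returns (mode, values_to_code)."""
    raw_bits = bits_needed(chan)
    diffs = [c - r for c, r in zip(chan, ref)]
    dec_bits = bits_needed(diffs)
    if dec_bits < raw_bits:
        return "decorrelated", diffs   # differences replace the data values
    return "raw", chan
```

When the channels are correlated (as RGB channels of natural images usually are), the differences span a much smaller range than the raw values, so the decorrelating mode wins.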
94.
VERIFYING FIRMWARE BINARY IMAGES USING A HARDWARE DESIGN AND FORMAL ASSERTIONS
Described herein are hardware monitors arranged to detect illegal firmware instructions in a firmware binary image using a hardware design and one or more formal assertions. The hardware monitors include monitor and detection logic configured to detect when an instantiation of the hardware design has started and/or stopped execution of the firmware and to detect when the instantiation of the hardware design has decoded an illegal firmware instruction. The hardware monitors also include assertion evaluation logic configured to determine whether the firmware binary image comprises an illegal firmware instruction by evaluating one or more assertions that assert that, if a stop of firmware execution has been detected, a decode of an illegal firmware instruction has (or has not) been detected. The hardware monitor may be used by a formal verification tool to exhaustively verify that the firmware boot image does not comprise an illegal firmware instruction, or during simulation to detect illegal firmware instructions in a firmware boot image.
G06F 21/57 - Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
G06F 21/51 - Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems at application loading time, e.g. accepting, rejecting, starting or inhibiting executable software based on integrity or source reliability
A method of rendering an image of a 3-D scene includes rendering a noisy image and obtaining one or more guide channels. For each of a plurality of local neighborhoods, the method comprises: calculating the parameters of a model that approximates the noisy image as a function of the one or more guide channels, and applying the calculated parameters to produce a denoised image. At least one of (i) the noisy image, (ii) the one or more guide channels, and (iii) the denoised image, are stored in a quantized low-bitdepth format.
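For the single-guide-channel case, the per-neighbourhood model above is a least-squares fit of noisy ≈ a·guide + b, evaluated back at each pixel (a guided-filter-style sketch; the quantised low-bitdepth storage is omitted, and the neighbourhood shape is an assumption).

```python
import numpy as np

def local_linear_denoise(noisy, guide, radius=1):
    """For each pixel, fit noisy ≈ a*guide + b over its local
    neighbourhood by least squares, then output a*guide + b at that
    pixel. One guide channel only; borders are clipped."""
    H, W = noisy.shape
    out = np.empty((H, W), dtype=float)
    for i in range(H):
        for j in range(W):
            i0, i1 = max(0, i - radius), min(H, i + radius + 1)
            j0, j1 = max(0, j - radius), min(W, j + radius + 1)
            g = guide[i0:i1, j0:j1].ravel()
            p = noisy[i0:i1, j0:j1].ravel()
            var = g.var()
            cov = (g * p).mean() - g.mean() * p.mean()
            a = cov / var if var > 1e-12 else 0.0  # flat guide -> mean only
            b = p.mean() - a * g.mean()
            out[i, j] = a * guide[i, j] + b
    return out
```

Because the output is linear in the guide within each neighbourhood, edges present in the guide channels survive while uncorrelated noise is averaged away.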
A method of performing anisotropic texture filtering includes generating one or more parameters describing an elliptical footprint in texture space; performing isotropic filtering at each sampling point of a set of sampling points in an ellipse to be sampled to produce a plurality of isotropic filter results, the ellipse to be sampled based on the elliptical footprint; selecting, based on one or more parameters of the set of sampling points and one or more parameters of the ellipse to be sampled, weights of an anisotropic filter that minimize a cost function that penalises high frequencies in the filter response of the anisotropic filter under a constraint that the variance of the anisotropic filter is related to an anisotropic ratio squared, the anisotropic ratio being the ratio of the major radius of the ellipse to be sampled to the minor radius of the ellipse to be sampled; and combining the plurality of isotropic filter results using the selected weights of the anisotropic filter to generate at least a portion of a filter result.
A method of converting 10-bit pixel data (e.g. 10:10:10:2 data) into 8-bit pixel data involves converting the 10-bit values to 7 bits or 8 bits and generating error values for each of the converted values. Two of the 8-bit output channels comprise a combination of a converted 7-bit value and one of the bits from the fourth input channel. A third 8-bit output channel comprises the converted 8-bit value and the fourth 8-bit output channel comprises the error values. In various examples, the bits of the error values may be interleaved when they are packed into the fourth output channel.
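One concrete bit-layout consistent with this description can be sketched as follows. The exact bit assignments (which channels carry the 7-bit values, where the alpha bits and error bits land, and the non-interleaved error packing) are assumptions for illustration, not the patented layout:

```python
def pack_1010102_to_8888(r10, g10, b10, a2):
    """Illustrative sketch: convert one 10:10:10:2 pixel into four 8-bit
    channels. Two channels carry a 7-bit truncation plus one alpha bit,
    one carries an 8-bit truncation, and the last packs the errors."""
    r7, r_err = r10 >> 3, r10 & 0x7   # 7-bit value, 3-bit error
    g7, g_err = g10 >> 3, g10 & 0x7   # 7-bit value, 3-bit error
    b8, b_err = b10 >> 2, b10 & 0x3   # 8-bit value, 2-bit error
    ch0 = (r7 << 1) | (a2 & 0x1)          # 7-bit R + alpha bit 0
    ch1 = (g7 << 1) | ((a2 >> 1) & 0x1)   # 7-bit G + alpha bit 1
    ch2 = b8                              # 8-bit B
    ch3 = (r_err << 5) | (g_err << 2) | b_err  # 3+3+2 error bits
    return ch0, ch1, ch2, ch3

def unpack_8888_to_1010102(ch0, ch1, ch2, ch3):
    """Inverse of the sketch above, recovering the original pixel."""
    a2 = ((ch1 & 0x1) << 1) | (ch0 & 0x1)
    r10 = ((ch0 >> 1) << 3) | ((ch3 >> 5) & 0x7)
    g10 = ((ch1 >> 1) << 3) | ((ch3 >> 2) & 0x7)
    b10 = (ch2 << 2) | (ch3 & 0x3)
    return r10, g10, b10, a2
```

Because the discarded low-order bits are preserved in the error channel, the conversion is lossless: the 32 input bits (10+10+10+2) map exactly onto the 32 output bits (4×8).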
H04N 19/132 - Sampling, masking or truncation of coding units, e.g. adaptive resampling, frame skipping, frame interpolation or high-frequency transform coefficient masking
G06T 1/20 - Processor architectures; Processor configuration, e.g. pipelining
H04N 19/176 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a block, e.g. a macroblock
H04N 19/42 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation
H04N 19/89 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using pre-processing or post-processing specially adapted for video compression involving methods or arrangements for detection of transmission errors at the decoder
98.
TEXTURE ADDRESS GENERATION USING FRAGMENT PAIR DIFFERENCES
Methods and hardware for texture address generation receive fragment coordinates for an input block of fragments and texture instructions for the fragments, and calculate gradients for at least one pair of fragments. Based on the gradients, the method determines whether a first mode or a second mode of texture address generation is to be used, and then uses the determined mode and the gradients to perform texture address generation. The first mode of texture address generation performs calculations at a first precision for a subset of the fragments and calculations for the remaining fragments at a second, lower, precision. The second mode of texture address generation performs calculations for all fragments at the first precision; if the second mode is used and more than half of the fragments in the input block are valid, the texture address generation is performed over two clock cycles.
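The mode and cycle-count decision can be sketched as a small helper. The abstract does not state the gradient criterion, so the `gradients_uniform` predicate below is an assumption standing in for whatever gradient-based test the hardware applies:

```python
def choose_texgen_mode(gradients_uniform, valid_mask):
    """Illustrative sketch of the mode/cycle decision. gradients_uniform
    is an assumed stand-in for the (unspecified) gradient-based test;
    valid_mask flags which fragments of the input block are valid."""
    # Mode 1: mixed precision; mode 2: full precision for all fragments.
    mode = 1 if gradients_uniform else 2
    valid = sum(valid_mask)
    # Mode 2 with more than half the fragments valid needs two cycles.
    cycles = 2 if (mode == 2 and valid > len(valid_mask) / 2) else 1
    return mode, cycles
```

For a four-fragment block, full-precision generation with three valid fragments would take two cycles, while a sparsely valid block fits in one.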
A method of scheduling instructions within a parallel processing unit is described. The method comprises decoding, in an instruction decoder, an instruction in a scheduled task in an active state, and checking, by an instruction controller, if an ALU targeted by the decoded instruction is a primary instruction pipeline. If the targeted ALU is a primary instruction pipeline, a list associated with the primary instruction pipeline is checked to determine whether the scheduled task is already included in the list. If the scheduled task is already included in the list, the decoded instruction is sent to the primary instruction pipeline.
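The dispatch check above amounts to two lookups, sketched here with an assumed representation in which each primary instruction pipeline maps to the list of tasks it is currently serving:

```python
def try_dispatch(task_id, target_alu, primary_lists):
    """Illustrative sketch of the dispatch check: primary_lists maps each
    primary-pipeline ALU to the list of scheduled tasks associated with
    it. The decoded instruction is sent only if the target is a primary
    pipeline and the task is already on that pipeline's list."""
    if target_alu not in primary_lists:
        return False   # targeted ALU is not a primary instruction pipeline
    if task_id in primary_lists[target_alu]:
        return True    # task already listed: send the decoded instruction
    return False       # task must first be added to the pipeline's list
```

What happens when the task is not yet on the list (presumably adding it before dispatch) is outside the abstract and is not modelled here.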
G06F 9/48 - Program initiating; Program switching, e.g. by interrupt
G06F 7/575 - Basic arithmetic logic units, i.e. devices selectable to perform either addition, subtraction or one of several logical operations, using, at least partially, the same circuitry
G06F 9/30 - Arrangements for executing machine instructions, e.g. instruction decode
G06F 9/38 - Concurrent instruction execution, e.g. pipeline or look ahead
100.
DATA COMPRESSION AND DECOMPRESSION METHODS AND SYSTEMS IN RAY TRACING
A method compresses data representing displacement information in a ray tracing system, wherein the displacement information indicates displacements to be applied to geometry in a scene to be rendered by the ray tracing system. The method includes retrieving a pair of datasets representing the displacement information, wherein a first of the datasets comprises a first array of values, and a second of the datasets comprises a second array of values; and retrieving values from a corresponding array position in each of the first and second arrays, wherein the retrieved values form a pair of values representing an upper and a lower bound of a magnitude of displacement for the corresponding array position. The method further includes identifying which of a plurality of predetermined conditions the pair of values satisfies, and encoding the pair of values as a single value in a compressed dataset, wherein the single value represents the identified predetermined condition.
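The condition-based encoding can be sketched as follows. The abstract does not enumerate the predetermined conditions, so the three used here (bounds both zero, bounds equal, bounds one unit apart) are illustrative placeholders:

```python
def encode_pair(lower, upper):
    """Illustrative sketch: map a (lower, upper) displacement-bound pair
    to a single code when it satisfies one of a few example conditions;
    the returned value represents the condition, not the bounds. None
    signals that the pair needs a full (uncompressed) encoding."""
    if lower == 0 and upper == 0:
        return 0      # no displacement at this array position
    if lower == upper:
        return 1      # bounds coincide: exact displacement
    if upper == lower + 1:
        return 2      # tight one-unit interval between the bounds
    return None       # no condition matched: store both values
```

Each matched pair thus collapses from two stored values to one small code in the compressed dataset, with a fallback path for pairs that satisfy none of the conditions.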