
AI accelerator

An AI accelerator is (as of 2016) an emerging class of microprocessor (or coprocessor) designed to accelerate artificial neural networks, machine vision and other machine learning algorithms for robotics, the internet of things and other data-intensive or sensor-driven tasks. They are frequently manycore designs (mirroring the massively parallel nature of biological neural networks). They are targeted at practical narrow-AI applications rather than artificial general intelligence research. Many vendor-specific terms exist for devices in this space.

They are distinct from GPUs (which are commonly used for the same role) in that they lack fixed-function units for graphics and generally focus on low-precision arithmetic.

History

Computer systems have frequently complemented the CPU with special-purpose accelerators for intensive tasks, most notably graphics, but also sound, video and so on. Over time, various accelerators have appeared that are applicable to AI workloads.

Early attempts

In the early days, DSPs (such as the AT&T DSP32C) were used as neural network accelerators, e.g. to accelerate OCR software, and there were attempts to create parallel high-throughput systems for workstations (e.g. TetraSpert in the 1990s, a parallel fixed-point vector processor) aimed at various applications including neural network simulation. ANNA was a neural network CMOS accelerator developed by Yann LeCun. Another attempt to build a neural network workstation was Synapse-1 (not to be confused with the current IBM SyNAPSE project).

Heterogeneous computing

Architectures such as the Cell microprocessor (itself inspired by the PS2's vector units, one of which was tied more closely to the CPU for general-purpose work) have exhibited features that significantly overlap with AI accelerators: support for packed low-precision arithmetic, a dataflow architecture, and a prioritisation of throughput over latency and "branchy" integer code. This was a move toward heterogeneous computing, with a number of throughput-oriented accelerators intended to assist the CPU with a range of intensive tasks: physics simulation, AI, video encoding/decoding, and certain graphics tasks beyond the capabilities of contemporary GPUs.

The physics processing unit was yet another example of an attempt to fill the gap between CPU and GPU in PC hardware; however, physics simulation tends to require 32-bit precision and up, whereas much lower precision can be a better trade-off for AI.

CPUs themselves have gained increasingly wide SIMD units (driven by video and gaming workloads) and have increased core counts, in a bid to eliminate the need for another accelerator as well as to accelerate application code. These SIMD units tend to support packed low-precision data types.

Use of GPGPU

Spontaneous, innovative software appeared that used vertex and pixel shaders for general-purpose computation through rendering APIs, by storing non-graphical data in vertex buffers and texture maps (including implementations of convolutional neural networks for OCR). Vendors of graphics processing units subsequently saw the opportunity to expand their market and generalised their shader pipelines with specific support for GPGPU, motivated mostly by the demands of video game physics but also targeting scientific computing.

This killed off the market for dedicated physics accelerators and superseded Cell in video game consoles, and it eventually led to GPUs being used to run convolutional neural networks such as AlexNet (which exhibited leading performance in the ImageNet Large Scale Visual Recognition Challenge).

As such, as of 2016 GPUs are popular for AI work, and they continue to evolve in a direction that facilitates deep learning, both for training and for inference in devices such as self-driving cars, gaining additional interconnect capability for the kind of dataflow workloads that AI benefits from (e.g. Nvidia NVLink).

Use of FPGA

Deep learning frameworks are still evolving, making it hard to design custom hardware. Reconfigurable devices such as field-programmable gate arrays (FPGAs) make it easier to evolve the hardware, frameworks and software alongside each other.

Microsoft has used FPGA chips to accelerate inference. This has motivated Intel to purchase Altera, with the aim of integrating FPGAs into server CPUs capable of accelerating AI as well as other tasks.

Use of ASIC

Whilst GPUs perform far better than CPUs for these tasks, a factor of 10 in efficiency can still be gained with a more specific design, via an application-specific integrated circuit (ASIC).

Memory access pattern

The memory access pattern of AI calculations differs from that of graphics: a more predictable but deeper dataflow, which benefits more from the ability to keep temporary variables on-chip (e.g. in scratchpad memory rather than caches). GPUs, by contrast, devote silicon to efficiently handling the highly non-linear gather-scatter addressing between texture maps and frame buffers, and to texture filtering, as needed for their primary role in 3D rendering.
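
A minimal sketch of this access pattern, in C (not from the source; the tile size, function name and the use of a plain array as a stand-in for scratchpad memory are illustrative assumptions): one tile of a weight matrix is staged into a small local buffer and then reused for a block of a matrix-vector product, the kind of predictable, reuse-heavy pattern that on-chip scratchpads serve well.

/* Illustrative sketch only: a tiled matrix-vector product that stages one
 * tile of the weight matrix in a small local buffer, standing in for an
 * on-chip scratchpad. On a real accelerator the buffer would be explicit
 * scratchpad memory filled by a DMA transfer; plain C arrays are used here
 * purely to show the access pattern. */
#include <stddef.h>

#define TILE 64  /* assumed tile size, chosen to fit the on-chip buffer */

/* Computes y += W*x for a rows-by-cols matrix W (row-major); the caller is
 * expected to zero y beforehand. */
void matvec_tiled(const float *w, const float *x, float *y,
                  size_t rows, size_t cols)
{
    static float tile[TILE][TILE];  /* stand-in for scratchpad memory */

    for (size_t i0 = 0; i0 < rows; i0 += TILE) {
        for (size_t j0 = 0; j0 < cols; j0 += TILE) {
            size_t th = (rows - i0 < TILE) ? rows - i0 : TILE;
            size_t tw = (cols - j0 < TILE) ? cols - j0 : TILE;

            /* Stage the tile once; each element is then reused from fast
             * local storage instead of being re-fetched from DRAM. */
            for (size_t i = 0; i < th; i++)
                for (size_t j = 0; j < tw; j++)
                    tile[i][j] = w[(i0 + i) * cols + (j0 + j)];

            /* Consume the staged tile: all inner-loop operands are local. */
            for (size_t i = 0; i < th; i++) {
                float acc = 0.0f;
                for (size_t j = 0; j < tw; j++)
                    acc += tile[i][j] * x[j0 + j];
                y[i0 + i] += acc;
            }
        }
    }
}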

Precision

AI researchers are often finding minimal accuracy losses when dropping to 16 or even 8 bits of precision, which suggests that a larger volume of low-precision arithmetic is a better use of the same memory bandwidth. Some researchers have even tried 1-bit precision (i.e. putting the emphasis entirely on spatial information in vision tasks). IBM's design is more radical, dispensing with scalar values altogether and accumulating timed pulses to represent activations stochastically, which requires converting traditional representations.
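
A minimal worked example in C of what low-precision arithmetic looks like (not from the source; the symmetric scaling scheme, scale parameters and function names are assumptions for illustration): float values are mapped to 8-bit integers via a scale factor, the dot product is accumulated in 32 bits, and one multiplication by the two scales recovers an approximate float result.

/* Illustrative sketch only: symmetric linear quantisation of float values to
 * 8 bits, with the dot product accumulated in 32 bits and the result scaled
 * back to float. The scaling scheme and names are assumptions made for this
 * example; real designs choose and apply scales in various ways. */
#include <stddef.h>
#include <stdint.h>
#include <math.h>

/* Map a float onto int8 using a per-tensor scale (value ~= q * scale). */
static int8_t quantize(float v, float scale)
{
    long q = lroundf(v / scale);
    if (q > 127)  q = 127;    /* clamp to the int8 range */
    if (q < -127) q = -127;
    return (int8_t)q;
}

/* 8-bit dot product with 32-bit accumulation. Packing four such operands
 * into the bandwidth of one 32-bit value is what makes low precision an
 * attractive trade-off when accuracy barely suffers. */
float dot_int8(const float *a, const float *b, size_t n,
               float scale_a, float scale_b)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)quantize(a[i], scale_a) * (int32_t)quantize(b[i], scale_b);
    return (float)acc * scale_a * scale_b;  /* dequantise the result */
}

In practice the scale factors are typically derived from the observed value range of each tensor, and the quantised model's accuracy is checked against a full-precision baseline.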

Nomenclature

As of 2016, the field is still in flux and vendors are pushing their own marketing terms for what amounts to an "AI accelerator", in the hope that their designs and APIs will come to dominate. There is no consensus on the boundary between these devices, nor on the exact form they will take; however, several examples clearly aim to fill this new space, with a fair amount of overlap in capabilities.

In the past, when consumer graphics accelerators emerged, the industry eventually adopted Nvidia's self-assigned term, "GPU", as the collective noun for graphics accelerators, which had taken many forms before settling on an overall pipeline implementing a model presented by Direct3D.

Slowing of Moore's law

As of 2016, the slowing (and possible imminent end) of Moore's law drives some to suggest refocussing industry efforts on application-led silicon design, whereas in the past increasingly powerful general-purpose chips were applied to varying applications via software. In this scenario, a diversification of dedicated AI accelerators makes more sense than continuing to stretch GPUs and CPUs.

Future

It remains to be seen, however, whether the eventual shape of an AI accelerator will be a radically new device like TrueNorth, or a more general-purpose processor that simply happens to be optimised for the right mix of precision and dataflow. There are also some even more exotic approaches on the horizon, e.g. attempting to use individual memristors as synapses.

Potential applications

  • Autonomous cars: Nvidia has targeted its Drive PX-series boards at this space.
  • Military robots
  • Agricultural robots, for example chemical-free weed control.
  • Voice control, e.g. in mobile phones, a target for Qualcomm Zeroth.
  • Machine translation
  • Unmanned aerial vehicles, e.g. navigation systems: the Movidius Myriad 2 has been demonstrated successfully guiding autonomous drones.
  • Industrial robots, increasing the range of tasks that can be automated by adding adaptability to variable situations.
  • Healthcare, assisting with diagnosis.
  • Search engines, increasing the energy efficiency of data centres and the ability to handle increasingly advanced queries.
  • Natural language processing

Examples

  • Vision processing units, e.g. the Movidius Myriad 2, which at its heart is a manycore VLIW AI accelerator complemented by video fixed-function units.
  • Tensor processing unit - presented as an accelerator for Google's TensorFlow framework, which is extensively used for convolutional neural networks. It focusses on a high volume of 8-bit precision arithmetic.
  • SpiNNaker, a manycore design combining traditional ARM architecture cores with an enhanced network fabric specialised for simulating a large neural network.
  • TrueNorth, the most unconventional example: a manycore design based on spiking neurons rather than traditional arithmetic, in which the frequency of pulses represents signal intensity. As of 2016 there is no consensus amongst AI researchers on whether this is the right way to go, but some results are promising, with large energy savings demonstrated for vision tasks.
  • Zeroth NPU, a design by Qualcomm aimed squarely at bringing speech and image recognition capabilities to mobile devices.
  • Nervana Engine, a deep learning accelerator by Nervana Systems.
  • Eyeriss, a design aimed explicitly at convolutional neural networks, using a scratchpad memory and an on-chip network architecture.
  • Adapteva Epiphany, targeted as a coprocessor, featuring a network-on-chip scratchpad memory model suited to a dataflow programming model, which in turn should suit many machine learning tasks.
  • Kalray have demonstrated an MPPA and report efficiency gains over GPUs for convolutional neural nets.
  • IIT Madras are designing a spiking neuron accelerator for new RISC-V systems, aimed at big-data analytics in servers.
  • Nvidia DGX-1, which is based on GPU technology; however, the use of multiple chips forming a fabric via NVLink specialises its memory architecture in a way that is particularly suitable for deep learning.