Taiwanese startup Neuchips has taped out its AI accelerator designed specifically for data center recommendation models. Emulation of the chip suggests it will be the only solution on the market to achieve one million DLRM inferences per joule of energy (equivalent to 20 million inferences per second for a 20-watt chip). The company has already demonstrated that its software can achieve world-beating INT8 DLRM accuracy, at 99.97% of FP32 accuracy.
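The two headline figures are the same claim in different units, since one watt is one joule per second. A quick check of the arithmetic, using only the numbers quoted above (the function names are illustrative):

```python
# Figures quoted in the article.
INFERENCES_PER_JOULE = 1_000_000
CHIP_POWER_WATTS = 20  # 1 W = 1 J/s


def inferences_per_second(inferences_per_joule, power_watts):
    """Throughput implied by an energy efficiency and a power budget."""
    return inferences_per_joule * power_watts


def microjoules_per_inference(inferences_per_joule):
    """Energy spent on a single inference, in microjoules."""
    return 1e6 / inferences_per_joule


throughput = inferences_per_second(INFERENCES_PER_JOULE, CHIP_POWER_WATTS)
energy_uj = microjoules_per_inference(INFERENCES_PER_JOULE)
print(f"{throughput:,.0f} inferences/s, {energy_uj:g} uJ per inference")
# prints "20,000,000 inferences/s, 1 uJ per inference"
```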
Neuchips was founded in response to a call by Facebook (now Meta) in 2019 for the industry to work on hardware acceleration for recommendation inference. The Taiwanese startup set out to do exactly this, and the company is one of only two startup entrants specifically targeting recommendation (the other is Esperanto with its 1,000-core RISC-V design).
“According to many reports, most of the AI inference cycles in the data center are actually for recommendation models, not vision or language… so we think recommendation is an important market,” Neuchips CEO Youn-Long Lin told EE Times, adding that the number of recommendation inferences required is growing steadily. “The power consumption is fixed, so the essential issue is that we have to do as much as possible within an energy budget in order to increase prediction accuracy.”
Prediction accuracy matters a great deal in recommendation applications such as online shopping, where any loss in accuracy means a corresponding loss in revenue for the platform.
DLRM (deep learning recommendation model), Meta’s open-source recommendation model, has quite different characteristics from the CNNs widely used for computer vision. Dense features, those with continuous values such as customer age or income, are processed by a multilayer perceptron (MLP, a type of neural network), while sparse features (such as answers to yes-or-no questions) are looked up in embedding tables. There may be many hundreds of features or more, and embedding tables can be gigabytes in size. Interactions between these features indicate the relationship between products and users for online shopping platforms. These interactions are computed explicitly (DLRM uses a dot product), and the results then pass through another neural network.
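The data flow described above can be sketched in a few lines of plain Python. This is a toy illustration with random weights, not Meta’s implementation and not Neuchips’ hardware mapping; the layer sizes, table sizes, input values and every function name here are assumptions chosen only to keep the example small.

```python
import math
import random
from itertools import combinations

random.seed(0)
DIM = 4  # embedding width (hypothetical)


def random_matrix(rows, cols):
    """Matrix of small random values, stored as a list of rows."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def mlp(x, layers, final_relu=True):
    """Fully connected layers (no bias) with ReLU; the last ReLU is optional."""
    for k, weights in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) for row in weights]
        if final_relu or k < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return x


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def dlrm_forward(dense, sparse_ids, bottom, tables, top):
    # 1. Dense (continuous) features go through the bottom MLP.
    dense_vec = mlp(dense, bottom)
    # 2. Each sparse feature is an index into its own embedding table.
    embedded = [table[i] for table, i in zip(tables, sparse_ids)]
    # 3. Explicit feature interaction: dot product of every pair of vectors.
    vectors = [dense_vec] + embedded
    interactions = [dot(a, b) for a, b in combinations(vectors, 2)]
    # 4. The top MLP maps the dense vector plus interactions to one logit.
    logit = mlp(dense_vec + interactions, top, final_relu=False)[0]
    return 1.0 / (1.0 + math.exp(-logit))  # predicted click probability


# Three dense features, two sparse features (tables of 10 and 5 rows).
bottom = [random_matrix(8, 3), random_matrix(DIM, 8)]
tables = [random_matrix(10, DIM), random_matrix(5, DIM)]
top = [random_matrix(8, DIM + 3), random_matrix(1, 8)]  # 3 vectors give 3 pairs

p = dlrm_forward([0.3, 0.7, 0.1], [4, 2], bottom, tables, top)
print(f"predicted probability: {p:.3f}")
```

Steps 1 and 4 are matrix multiplications, step 2 is a memory lookup and step 3 is a set of pairwise dot products, which is why the three kinds of work stress hardware differently.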
While neural network computation may be compute-bound, the other operations required for DLRM may be bound by memory capacity, memory bandwidth, or communication. This makes DLRM a very hard model to accelerate with general-purpose AI accelerators, including those developed for applications such as image processing.
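One way to see the difference is arithmetic intensity, the number of operations performed per byte moved to or from memory. The rough sketch below compares a fully connected layer with an embedding lookup that sums the rows it fetches. The layer sizes, batch size, lookup count and one-byte (INT8) elements are illustrative assumptions, not Neuchips or Meta figures, and the byte counts ignore caching.

```python
def fc_intensity(batch, n_in, n_out, bytes_per_elem=1):
    """Ops per byte for a fully connected layer (weights read once per batch)."""
    ops = 2 * batch * n_in * n_out  # one multiply and one add per MAC
    moved = bytes_per_elem * (n_in * n_out + batch * n_in + batch * n_out)
    return ops / moved


def embedding_intensity(lookups, dim, bytes_per_elem=1):
    """Ops per byte for fetching `lookups` rows and summing them into one vector."""
    ops = (lookups - 1) * dim  # element-wise additions only
    moved = bytes_per_elem * (lookups * dim + dim)  # rows read, one vector written
    return ops / moved


fc = fc_intensity(batch=128, n_in=512, n_out=512)
emb = embedding_intensity(lookups=32, dim=64)
print(f"FC layer: {fc:.1f} ops/byte, embedding lookup: {emb:.2f} ops/byte")
```

Under these assumptions the dense layer performs well over a hundred operations per byte while the lookup performs fewer than one, so the lookup’s speed is set by how quickly memory can deliver rows rather than by how many MAC units are available.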
Neuchips’ ASIC solution, RecAccel, includes specially designed engines to accelerate embeddings (marked purple in diagram below), matrix multiplication (orange) and feature interaction (green).
“In the embedding engine, mostly the issue is to look up multiple tables simultaneously and very fast,” Lin said. “Recommendation model sizes vary a lot: some are very small, some are very large. The important issue is how to allocate tables to both off-chip and on-chip memory appropriately.”
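Lin did not describe how RecAccel decides which tables go where. Purely to illustrate the trade-off, the sketch below uses a simple greedy heuristic: tables with the most lookups per byte are placed on-chip until a capacity budget runs out. The table names, sizes, lookup counts and the 8 MB budget are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Table:
    name: str
    size_bytes: int
    lookups_per_inference: float


def allocate(tables, on_chip_capacity):
    """Greedy placement: the hottest tables per byte go on-chip first."""
    on_chip, off_chip = [], []
    free = on_chip_capacity
    by_heat = sorted(
        tables,
        key=lambda t: t.lookups_per_inference / t.size_bytes,
        reverse=True,
    )
    for t in by_heat:
        if t.size_bytes <= free:
            on_chip.append(t.name)
            free -= t.size_bytes
        else:
            off_chip.append(t.name)
    return on_chip, off_chip


MB = 1 << 20
tables = [
    Table("country", 1 * MB, 1),
    Table("device_type", 2 * MB, 1),
    Table("item_category", 6 * MB, 4),
    Table("user_id", 4096 * MB, 1),
]
on_chip, off_chip = allocate(tables, on_chip_capacity=8 * MB)
print("on-chip:", on_chip)    # ['country', 'item_category']
print("off-chip:", off_chip)  # ['device_type', 'user_id']
```

A greedy ratio rule like this is only a starting point; a real allocator would also have to account for access skew within a table and for bandwidth contention between tables.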
Using a novel cache design and DRAM traffic optimization techniques, Neuchips’ embedding engine reduces off-chip memory accesses by 50% and increases bandwidth utilization by 30%, the company said.
Different recommendation models use different operations for feature interaction: DLRM uses a dot product, but there are others. Lin said Neuchips’ feature interaction engine supports this kind of flexibility.
The chip has 10 compute engines, each with 16K MACs.
“The important issue here is how to implement this compute engine with low power consumption and so it can handle sparse matrices efficiently,” Lin said. The compute engines consume 1 microjoule per inference at the SoC level.
Lin added that hardware features can also terminate computation early, once a certain level of accuracy is reached, to save power.
By EE Times