Intel has just announced its first AI-optimized FPGA – the Intel® Stratix® 10 NX FPGA – to address the rapid increase in AI model complexity
FPGAs have been used for hardware customization for decades. That customization capability taps into the core value proposition of FPGAs: pipelining for applications that require low batch sizes and low latency, plus flexible fabric and I/O functions to scale up and deliver entire systems. Intel's silicon and software portfolio, which includes FPGAs, empowers our customers' intelligent services from the cloud to the edge, and many Intel® FPGA customers have already started implementing AI accelerators using the hardware customization available through Intel FPGA technologies.

It starts with Intel's vision for device-agnostic AI development, which allows developers to focus on building their solutions rather than on specific devices. Intel has been fitting FPGAs into that vision for a while, focusing on acceleration and specifically on the trend of increasing AI model size and complexity. AI model complexity continues to double every 3.5 months, a factor of roughly 10X per year. These AI models are used in applications such as Natural Language Processing (NLP), fraud detection, and surveillance.

Intel has just announced its first AI-optimized FPGA, the Intel® Stratix® 10 NX FPGA, to address this rapid increase in AI model complexity. The Intel Stratix 10 NX FPGA embeds a new type of AI-optimized block called the AI Tensor Block, which delivers up to 15X more INT8 compute performance than today's Intel Stratix 10 MX FPGA. The INT8 data type is often used by AI inferencing algorithms. The AI Tensor Block is tuned for the matrix-matrix and vector-matrix multiplications common to these algorithms, with capabilities designed to work efficiently for both small and large matrix sizes.

David Moore, Corporate Vice President and General Manager for the Intel Programmable Solutions Group, holds up an Intel Stratix 10 NX FPGA, the company's first AI-optimized FPGA

The Intel Stratix 10 NX FPGA has several additional in-package features that support high-performance AI inferencing, including high-speed HBM2 memory and high-speed transceivers for fast networking. Note that Intel was able to develop the Intel Stratix 10 NX FPGA quickly thanks to its chiplet-based FPGA architecture strategy. Intel partnered with Microsoft to develop the AI Tensor Block to help accelerate AI workloads in the data center.

"As Microsoft designs our real-time multi-node AI solutions, we need flexible processing devices that deliver ASIC-level tensor performance, high memory and connectivity bandwidth, and extremely low latency. Intel® Stratix® 10 NX FPGAs meet Microsoft's high bar for these requirements, and we are partnering with Intel to develop next-generation solutions to meet our hyperscale AI needs." – Doug Burger, Technical Fellow, Microsoft Azure Hardware

Intel Stratix 10 NX FPGAs serve as multi-function AI accelerators for Intel® Xeon® processors, specifically addressing applications that require hardware customization, low latency, and real-time capabilities. The AI Tensor Blocks in the Intel Stratix 10 NX FPGA deliver more compute throughput by implementing more multipliers and accumulators than the DSP block found in other Intel Stratix 10 devices: each AI Tensor Block contains 30 multipliers and 30 accumulators instead of the DSP block's two multipliers and two accumulators. The multipliers in the AI Tensor Block are tuned for lower-precision numerical formats such as INT4, INT8, Block Floating Point 12, and Block Floating Point 16.
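To make that arithmetic concrete, here is a small NumPy sketch that mimics the multiply-accumulate pattern described above; the white-paper summary later in this feed notes the 30 multipliers are organized as three dot-product units of ten lanes each. The structure and names here are illustrative stand-ins for intuition, not a model of the actual hardware:

```python
# Illustrative only: a NumPy stand-in for one AI Tensor Block's INT8
# multiply-accumulate pattern, not a model of the actual silicon.
import numpy as np

UNITS, LANES = 3, 10  # 3 dot-product units x 10 lanes = 30 multipliers/accumulators

rng = np.random.default_rng(0)
acc = np.zeros((UNITS, LANES), dtype=np.int32)  # wide accumulators avoid overflow

for _ in range(8):  # stream eight operand groups through the block
    a = rng.integers(-128, 128, size=(UNITS, LANES), dtype=np.int8)  # activations
    w = rng.integers(-128, 128, size=(UNITS, LANES), dtype=np.int8)  # weights
    acc += a.astype(np.int32) * w.astype(np.int32)  # 30 MACs per "cycle"

dot_products = acc.sum(axis=1)  # each unit's lanes reduce to one dot product
print(dot_products)

# The 15X headline claim is a simple ratio of peak MAC resources per block:
print((30 + 30) / (2 + 2))  # AI Tensor Block vs. standard DSP block -> 15.0
```

Real deployments map these operations onto the fabric with vendor tools, of course; the point of the sketch is simply the density of low-precision multiply-accumulate resources.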
Precisions like these are frequently used for AI inferencing workloads.

The Intel Stratix 10 NX FPGA addresses today's AI challenges. For example, NLP typically uses large AI models, and these models are growing larger. Language translation applications, one NLP workload, increasingly need to detect, recognize, and understand the context of multiple languages and then translate to the target language. These expanded workload requirements drive model complexity, which in turn demands more compute cycles, more memory, and more networking bandwidth. The Intel Stratix 10 NX FPGA's in-package HBM2 memory allows large AI models to be stored on chip. Estimates suggest that an Intel Stratix 10 NX FPGA running a large AI model such as BERT at batch size 1 delivers 2.3X better compute performance than an NVIDIA V100 GPU.

Fraud detection is another application where Intel FPGAs enable real-time data processing, and one where every microsecond matters. The ability to create custom hardware solutions, with direct data ingestion through the FPGA's transceivers and deterministic, low-latency compute elements, makes microsecond-class real-time performance possible. Fraud detection typically employs LSTM (Long Short-Term Memory) AI models at batch size 1. Estimates suggest that the Intel Stratix 10 NX FPGA will deliver 9.5X better compute performance than an NVIDIA V100 GPU for LSTM models at batch size 1.

Finally, consider a video surveillance application. Intel FPGAs excel in video surveillance applications because their hardware customization allows implementation of custom processing and custom I/O protocols for direct data ingestion. For example, estimates suggest that the Intel Stratix 10 NX FPGA will provide 3.8X better compute performance than an NVIDIA V100 GPU for video surveillance using the ResNet50 model at batch size 1.

The Intel Stratix 10 NX FPGA extends the benefits of FPGA-based, high-performance hardware customization to AI inferencing through the introduction of the AI Tensor Block, delivering as much as 15X more compute performance for AI inferencing. This FPGA is Intel's first AI-optimized FPGA, and it will be available later this year. For more information about the Intel Stratix 10 NX FPGA, click here.

Intel's silicon and software portfolio empowers our customers' intelligent services from the cloud to the edge.

Notices and Disclaimers

15X more INT8 compute performance than today's Stratix 10 MX for AI workloads: An INT8 computation implemented in the standard Stratix 10 DSP block uses 2 multipliers and 2 accumulators; the AI Tensor Block provides 30 multipliers and 30 accumulators. Therefore 60/4 yields up to 15X more INT8 compute performance when comparing the AI Tensor Block with the standard Stratix 10 DSP block.

BERT 2.3X faster, LSTM 10X faster, ResNet50 3.8X faster: BERT batch 1 performance 2.3X faster than NVIDIA V100 (DGX-1 server w/ 1x NVIDIA V100-SXM2-16GB | TensorRT 7.0 | Batch Size = 1 | 20.03-py3 | Precision: Mixed | Dataset: Sample Text); LSTM batch 1 performance 9.5X faster than NVIDIA V100 (internal server w/ Intel® Xeon® CPU E5-2683 v3 and 1x NVIDIA V100-PCIE-16GB | TensorRT 7.0 | Batch Size = 1 | 20.01-py3 | Precision: FP16 | Dataset: Synthetic); ResNet50 batch 1 performance 3.8X faster than NVIDIA V100 (DGX-1 server w/ 1x NVIDIA V100-SXM2-16GB | TensorRT 7.0 | Batch Size = 1 | 20.03-py3 | Precision: INT8 | Dataset: Synthetic).
Estimated on Stratix 10 NX FPGA using -1 speed grade, tested in May 2020. Each end-to-end AI model includes all layers and computation as described in NVIDIA's published claims as of May 2020. Results were then compared against NVIDIA's published claims. Link for NVIDIA: https://developer.nvidia.com/deep-learning-performance-training-inference. Results have been estimated or simulated using internal Intel analysis, architecture simulation, and modeling, and are provided for informational purposes. Any differences in your system hardware, software, or configuration may affect your actual performance.

Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software, or service activation. Performance varies depending on system configuration. No product or component can be absolutely secure. Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. For more complete information about performance and benchmark results, visit http://www.intel.com/benchmarks.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/benchmarks.

Intel Advanced Vector Extensions (Intel AVX) provides higher throughput to certain processor operations. Due to varying processor power characteristics, utilizing AVX instructions may cause a) some parts to operate at less than the rated frequency and b) some parts with Intel® Turbo Boost Technology 2.0 to not achieve any or maximum turbo frequencies. Performance varies depending on hardware, software, and system configuration, and you can learn more at http://www.intel.com/go/turbo.

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction. Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web sites and confirm whether referenced data are accurate.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries.
Other names and brands may be claimed as the property of others.

Nokia AirFrame Edge Server based on 2nd Gen Intel® Xeon® Scalable CPUs and the Intel® FPGA PAC N3000 suits edge and far-edge cloud RAN, MEC, and 5G deployments
Nokia AirFrame open edge servers feature an ultra-small footprint, so they fit well in many locations, including existing base station facilities and far-edge sites. These compact servers are provisioned with a real-time, OPNFV-compatible OpenStack distribution that provides both low latency and high throughput for cloud RAN and other applications. The software runs atop integrated 2nd Generation Intel® Xeon® Scalable CPUs and the optional Intel® FPGA Programmable Acceleration Card (Intel FPGA PAC) N3000, which enhance the server's capabilities for artificial intelligence (AI) and machine learning (ML) workloads.

An optional Fronthaul Gateway module provides 5G/4G/CPRI connectivity to existing legacy radios and contains an L2/L3 switch and an Intel® Stratix® 10 FPGA, which provides high-performance L1 processing. The servers are available in OCP-accepted 2RU or 3RU chassis with as many as five server sled slots and dual redundant AC or DC power supplies. Nokia offers both 1U and 2U server sleds based on 2nd Generation Intel Xeon Scalable processors, with connectivity through front server slots for high accessibility.

A new 4-minute Nokia video details the features and benefits of the Nokia AirFrame open edge server for use in a variety of deployments, including cloud RAN, Multi-access Edge Computing (MEC), and 5G. For more technical details about the Nokia AirFrame open edge server, please contact Nokia directly.

Notices and Disclaimers

Intel technologies may require enabled hardware, software or service activation. No product or component can be absolutely secure. Your costs and results may vary. Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy. © Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.

Upcoming Webinar: Computational Storage Acceleration Using Intel® Agilex™ FPGAs
Today's computational workloads are larger, more complex, and more diverse than ever before. The explosion of applications such as high-performance computing (HPC), artificial intelligence (AI), machine vision, analytics, and other specialized tasks is driving the exponential growth of data. At the same time, the trend toward virtualized servers, storage, and network connections means that workloads are growing in scale and complexity.

Traditionally, data is conveyed to a computational engine, such as a central processing unit (CPU), for processing, but transporting the data takes time, consumes power, and is increasingly proving to be a bottleneck in the overall process. The solution is computational storage, also known as in-storage processing (ISP). The idea is that, instead of bringing the data to the computational engine, computational storage brings the computational engine to the data. In turn, this allows the data to be processed and analyzed where it is generated and stored.

To learn more about this concept, please join us for a webinar led by Sean Lundy from Eideticom, Craig Petrie from BittWare, and myself:

Computational Storage Using Intel® Agilex™ FPGAs: Bringing Acceleration Closer to Data

The webinar will take place on Thursday, 11 November 2021, starting at 11:00 a.m. EST.

Sean will introduce Eideticom's NoLoad computational storage technology for use in data center storage and compute applications. NoLoad technology provides CPU offload for applications, resulting in the dramatic acceleration of compute-plus-data intensive tasks like storage workloads, databases, AI inferencing, and data analytics. NoLoad's NVMe-compliant interface simplifies the deployment of computational offload by making it straightforward to deploy in servers of all types and across all major operating systems.

Craig will introduce BittWare's new IA-220-U2 FPGA-based Computational Storage Processor (CSP), which supports Eideticom's NoLoad technology as an option. The IA-220-U2 CSP, powered by an Intel Agilex F-Series FPGA with 1.4M logic elements (LEs), features PCIe Gen 4 for twice the bandwidth offered by PCIe Gen 3 solutions. This CSP works alongside traditional flash SSDs, providing accelerated computational storage services (CSS) by performing compute-intensive tasks, including compression and/or encryption. This allows users to build out their storage using standard SSDs instead of being locked into a single vendor's storage solutions.

BittWare's IA-220-U2 accelerates NVMe flash SSDs by sitting alongside them as another U.2 module. (Image source: BittWare)

We will also discuss the Intel Agilex FPGAs that power BittWare's new CSP. Built on Intel's 10nm SuperFin technology, these devices leverage heterogeneous 3D system-in-package (SiP) technology. Agilex I-Series FPGAs and SoC FPGAs are optimized for bandwidth-intensive applications that require high-performance processor interfaces, such as PCIe Gen 5 and Compute Express Link (CXL). Meanwhile, Agilex F-Series FPGAs and SoC FPGAs are optimized for applications in data center, networking, and edge computing. With transceiver support up to 58 Gbps, advanced DSP capabilities, and PCIe Gen 4 x16, the Agilex F-Series FPGAs that power BittWare's new CSP provide the customized connectivity and acceleration required by compute-plus-data intensive, power-sensitive applications such as HPC, AI, machine vision, and analytics.

This webinar will be of interest to anyone involved with these highly complex applications and environments.
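If you want a feel for why in-storage processing pays off before you attend, here is a toy Python sketch. The StorageDevice class and its offload() method are hypothetical stand-ins (a real NoLoad deployment presents an NVMe-compliant interface instead); the contrast in how many bytes cross the bus is the point:

```python
# Toy illustration of "bring the compute to the data". StorageDevice and
# offload() are hypothetical; real CSPs expose an NVMe interface instead.
import zlib

class StorageDevice:
    def __init__(self, blocks):
        self.blocks = blocks  # data "at rest" on the device

    def read_all(self):
        # Conventional path: every byte crosses the bus to the host CPU.
        return b"".join(self.blocks)

    def offload(self, kernel):
        # Computational-storage path: the kernel runs next to the data,
        # and only the (much smaller) results cross the bus.
        return [kernel(block) for block in self.blocks]

dev = StorageDevice([bytes(4096)] * 1024)  # 4 MiB of highly compressible data

host_side = zlib.compress(dev.read_all())  # ~4 MiB moved, then compressed on CPU
in_storage = dev.offload(zlib.compress)    # only compressed blocks move
print(len(host_side), sum(len(b) for b in in_storage))
```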
We hope to see you there, so Register Now before all the good virtual seats are taken.

More details on the Intel® Stratix® 10 NX FPGA, the first AI-optimized Intel® FPGA, now available in a new White Paper
The increasing complexity of AI models and the explosive growth of AI model size are both rapidly outpacing innovations in the compute resources and memory capacity available on a single device. AI model complexity now doubles every 3.5 months, or about 10X per year, driving rapidly increasing demand for AI computing capability. Memory requirements for AI models are also rising due to an increasing number of parameters, or weights, in each model.

The Intel® Stratix® 10 NX FPGA is Intel's first AI-optimized FPGA, developed to enable customers to scale their designs with increasing AI complexity while continuing to deliver real-time results. The Intel Stratix 10 NX FPGA fabric includes a new type of AI-optimized tensor arithmetic block called the AI Tensor Block. These AI Tensor Blocks are tuned for the common matrix-matrix or vector-matrix multiplications used for AI computations and contain dense arrays of the lower-precision multipliers typically used for AI model arithmetic. The smaller multipliers in these AI Tensor Blocks can also be aggregated to construct larger-precision multipliers.

The AI Tensor Block's architecture contains three dot-product units, each of which has ten multipliers and ten accumulators, for a total of 30 multipliers and 30 accumulators within each block. The AI Tensor Block multipliers' base precisions are INT8 and INT4, along with a shared exponent to support the Block Floating Point 16 (Block FP16) and Block Floating Point 12 (Block FP12) numerical formats. Multiple AI Tensor Blocks can be cascaded together to support larger vector calculations.

A new White Paper titled “Pushing AI Boundaries with Scalable Compute-Focused FPGAs” covers the new features and performance capabilities of the Intel Stratix 10 NX FPGAs. Click here to download the White Paper.

If you’d like to see the Intel Stratix 10 NX FPGA in action, please check out the recent blog “WaveNet Neural Network runs on Intel® Stratix® 10 NX FPGA, synthesizes 256 16 kHz audio streams in real time.”
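To illustrate what "shared exponent" means in practice, here is a minimal Python sketch of block floating point quantization. The parameters are illustrative; the exact Block FP16/FP12 encodings used by the AI Tensor Block are defined in Intel's documentation, not here:

```python
# Minimal sketch: quantize a block of values to one shared power-of-two
# exponent plus small integer mantissas, then reconstruct. Parameter
# choices are illustrative; they are not Intel's Block FP16/FP12 encodings.
import numpy as np

def bfp_quantize(x: np.ndarray, mantissa_bits: int = 8):
    max_mag = float(np.abs(x).max())
    if max_mag == 0.0:
        return np.zeros_like(x, dtype=np.int32), 0
    # Pick the exponent so the largest magnitude fits the mantissa width
    # (edge cases such as exact powers of two are ignored in this sketch).
    exp = int(np.ceil(np.log2(max_mag))) - (mantissa_bits - 1)
    mantissas = np.round(x / 2.0 ** exp).astype(np.int32)
    return mantissas, exp

def bfp_dequantize(mantissas: np.ndarray, exp: int) -> np.ndarray:
    return mantissas * 2.0 ** exp

block = np.array([0.11, -0.52, 0.034, 0.27])
m, e = bfp_quantize(block)
print(m, e)                  # integer mantissas plus one shared exponent
print(bfp_dequantize(m, e))  # values recovered to roughly mantissa precision
```

Because every value in the block shares a single exponent, the multipliers inside the block operate on plain integers, which is part of what lets the AI Tensor Block pack so many of them.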
Notices & Disclaimers

Intel technologies may require enabled hardware, software or service activation. No product or component can be absolutely secure. Your costs and results may vary. © Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.

BittWare’s 520NX Accelerator Card harnesses AI-optimized power of the Intel® Stratix® 10 NX FPGA

BittWare has just announced the 520NX AI Accelerator PCIe card based on the AI-optimized Intel® Stratix® 10 NX FPGA, which incorporates specialized AI Tensor Blocks with a theoretical peak computational speed of 143 INT8 TOPS and 8 Gbytes of in-package, stacked high-bandwidth memory (HBM2). In addition to the Intel Stratix 10 NX FPGA's internal resources, the 520NX AI Accelerator card's on-board resources include a PCIe Gen3 x16 host interface, four independently clocked QSFP28 card cages that support as many as four 100G optical transceiver modules, and two DIMM sockets that can accommodate as much as 256 Gbytes of memory.

The 520NX offers enterprise-class features and capabilities for application development and deployment, including:

- HDL developer toolkit: API, PCIe drivers, application example designs, and diagnostic self-test
- Passive, active, or liquid cooling options
- Multiple OCuLink expansion ports for additional PCIe, storage, or network I/O

The BittWare 520NX AI Accelerator card based on the AI-optimized Intel Stratix 10 NX FPGA

The Intel Stratix 10 NX FPGA was introduced earlier this year. (See “Intel has just announced its first AI-optimized FPGA – the Intel® Stratix® 10 NX FPGA – to address the rapid increase in AI model complexity.”) More recently, the FPGA's AI capabilities have been demonstrated by Myrtle.ai, running a WaveNet text-to-speech application that can synthesize 256 simultaneous streams of 16 kHz audio. (See “WaveNet Neural Network runs on Intel® Stratix® 10 NX FPGA, synthesizes 256 16 kHz audio streams in real time.”) The new BittWare 520NX AI Accelerator card makes it much easier to develop applications based on the Intel Stratix 10 NX FPGA by providing the FPGA on a proven, ready-to-integrate PCIe card.

For more information about the 520NX AI Accelerator card, please contact BittWare directly.

Notices & Disclaimers

Intel technologies may require enabled hardware, software or service activation. No product or component can be absolutely secure. Your costs and results may vary. Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy. © Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.

Terasic DE10-Agilex Accelerator PCIe board combines Intel® Agilex™ F-Series FPGA with four DDR4 SO-DIMM SDRAM sockets and two QSFP-DD connectors
If you're itching to get your hands on the innovative features built into the new family of Intel® Agilex™ FPGAs, like the second-generation Intel® HyperFlex™ architecture or the improved DSP capabilities, including half-precision floating point (FP16) and BFLOAT16 compute, then consider the new Terasic DE10-Agilex Accelerator board. This PCIe card combines an Intel Agilex F-Series FPGA with four independent DDR4 SO-DIMM SDRAM sockets and two QSFP-DD connectors on a three-quarter-length PCIe board.

The board's host interface is a PCIe Gen 4.0 x16 port. Each SO-DIMM memory socket accommodates 8 or 16 Gbytes of DDR4 memory, for a maximum total SDRAM capacity of 64 Gbytes, and each QSFP-DD connector accommodates Ethernet transceiver modules to 200G. The board is available with two different cooling options: a 2-slot version with integrated fans or a single-slot, passively cooled version.

The Terasic DE10-Agilex Accelerator PCIe card combines an Intel® Agilex™ F-Series FPGA with four independent DDR4 SO-DIMM SDRAM sockets and two QSFP-DD connectors

The Terasic DE10-Agilex PCIe board supports the Intel® OpenVINO™ toolkit, OpenCL™ BSP, and Intel® oneAPI Toolkits used for developing code for myriad high-performance workloads, including computer vision and deep learning. The Intel Agilex FPGA family delivers up to 40% higher performance1 or up to 40% lower power1 for data center, NFV and networking, and edge compute applications.

For more technical information about the Terasic DE10-Agilex Accelerator Board or to order the product, please contact Terasic directly.
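As a quick aside on those two 16-bit formats: FP16 spends its bits on fraction (1 sign, 5 exponent, 10 mantissa bits), while bfloat16 keeps FP32's 8-bit exponent and truncates the fraction to 7 bits. A hedged NumPy sketch, assuming round-to-nearest-even truncation, makes the trade-off visible (NumPy has no native bfloat16, so results are returned as FP32):

```python
# Sketch assuming round-to-nearest-even truncation of FP32 to bfloat16.
import numpy as np

def to_bfloat16(x) -> np.float32:
    bits = np.float32(x).view(np.uint32)
    # Round to nearest even, then keep only the top 16 bits of the word.
    rounded = bits + np.uint32(0x7FFF) + ((bits >> np.uint32(16)) & np.uint32(1))
    return (rounded & np.uint32(0xFFFF0000)).view(np.float32)

x = np.float32(0.1)
print(np.float16(x))                 # ~0.1 (10 fraction bits: finer steps)
print(to_bfloat16(x))                # ~0.10009766 (7 fraction bits: coarser)
print(np.float16(1e6))               # inf: beyond FP16's ~65,504 max (may warn)
print(to_bfloat16(np.float32(1e6)))  # 999424.0: bfloat16 keeps FP32's range
```

The practical upshot: bfloat16 trades precision for FP32's dynamic range, which is often the better bargain for neural network arithmetic.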
Notices and Disclaimers

1 This comparison is based on the Intel® Agilex™ FPGA and SoC family vs. Intel® Stratix® 10 FPGAs using simulation results and is subject to change. This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications, and roadmaps. Intel technologies may require enabled hardware, software or service activation. No product or component can be absolutely secure. Your costs and results may vary. Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy. © Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.

WaveNet Neural Network runs on Intel® Stratix® 10 NX FPGA, synthesizes 256 16 kHz audio streams in real time

State-of-the-art text-to-speech (TTS) synthesis systems generally employ two neural network models that run sequentially to generate audio. The first model generates acoustic features, such as spectrograms, from input text; Tacotron 2 is often used for this role. The second model, a vocoder, takes the intermediate features from the first model and produces speech.

A new White Paper from Myrtle.ai titled “Implementing WaveNet Using Intel® Stratix® 10 NX FPGA for Real-Time Speech Synthesis” focuses on the second model: a state-of-the-art vocoder based on a neural network model called WaveNet, which produces natural-sounding speech with near-human fidelity. The key to the WaveNet model's high speech quality is an autoregressive loop, but this property also makes the network exceptionally challenging to implement for real-time applications. Most efforts to accelerate WaveNet models have not achieved real-time audio synthesis.

The Myrtle.ai White Paper describes efforts to implement a WaveNet model using an Intel® Stratix® 10 NX FPGA. By using Block Floating Point (BFP16) quantization, which the Intel Stratix 10 NX FPGA supports, Myrtle.ai has been able to deploy a WaveNet model that synthesizes 256 16 kHz audio streams in real time.

For more details and to download the White Paper, click here. To see a video demo of this system in action, click here.
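A toy loop shows why the autoregressive structure resists acceleration: sample t cannot be computed until sample t-1 exists, so the whole network must run once per output sample, 16,000 times per second per stream at 16 kHz. The tiny_model function below is a placeholder for the real network:

```python
# tiny_model stands in for the WaveNet network; the loop structure, not
# the arithmetic, is what limits throughput.
import numpy as np

def tiny_model(history: np.ndarray) -> float:
    return float(np.tanh(history[-3:].sum()))  # placeholder computation

SAMPLE_RATE = 16_000
samples = np.zeros(SAMPLE_RATE, dtype=np.float32)  # one second of one stream
samples[:3] = 0.1                                  # seed values

for t in range(3, SAMPLE_RATE):
    # Each step consumes the previous output, so steps cannot run in
    # parallel across time: 16,000 full model evaluations per second per
    # stream, with roughly 62.5 microseconds of latency budget per sample.
    samples[t] = tiny_model(samples[:t])
```

Sustaining 256 such loops simultaneously, as the Myrtle.ai demo does, multiplies that evaluation rate by 256, which is why dense, low-latency compute such as the AI Tensor Block matters here.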
Notices & Disclaimers

Intel technologies may require enabled hardware, software or service activation. No product or component can be absolutely secure. Your costs and results may vary. Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy. © Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.

Are FPGAs good for accelerating AI? VentureBeat takes a closer look

VentureBeat has just posted an article titled “FPGA chips are coming on fast in the race to accelerate AI” that takes an in-depth look at the use of FPGAs for Artificial Intelligence (AI) applications. The article cites five AI application challenges that FPGAs help to overcome:

- Overcoming I/O bottlenecks
- Providing acceleration for high-performance computing (HPC) clusters
- Integrating AI into workloads
- Enabling sensor fusion
- Adding extra capabilities beyond AI

The article also discusses Microsoft's integration of FPGA-based AI into Microsoft Azure and Project Brainwave, and it ends with the following statement:

“Today’s FPGAs offer a compelling combination of power, economy, and programmable flexibility for accelerating even the biggest, most complex, and hungriest models.”

If you are developing applications that incorporate AI, be sure to take a look at “FPGA chips are coming on fast in the race to accelerate AI.”

Notices & Disclaimers

Intel technologies may require enabled hardware, software or service activation. No product or component can be absolutely secure. Your costs and results may vary. Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy. © Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.

Accelerating Memory Bound AI Inference Workloads with Intel® Stratix® 10 MX Devices
Artificial intelligence (AI) systems are increasingly constrained by memory solutions that overly restrict available memory bandwidth. For example, Recurrent Neural Networks (RNNs) used in AI applications such as finance, genome mapping, and speech AI, including Automatic Speech Recognition (ASR) and Natural Language Processing/Understanding (NLP/NLU), have two common traits:

- They are memory intensive
- They require very low latency

Consequently, RNN applications can become memory bound when implemented with the wrong memory architecture. Intel® Stratix® 10 MX FPGAs, with integrated, in-package, 3D stacked HBM2 DRAM, provide 10X more memory bandwidth with better performance per watt compared to conventional memory solutions such as DDR SDRAMs.1

Manjeera Digital Systems has developed a Universal Multifunction Accelerator (UMA) IP that solves memory-bound bottlenecks for applications like RNNs. The Manjeera UMA is a scalable, programmable datapath processor that delivers the performance of a hardware datapath while retaining software-programmable flexibility. Manjeera's UMA implemented in an Intel® FPGA like the Intel Stratix 10 MX FPGA is called a Programmable Inference Engine (PIE), which can accelerate a wide variety of deep neural network (DNN) workloads, including RNNs.

When instantiated in an Intel Stratix 10 MX FPGA, the Manjeera PIE connects to all sixteen of the HBM2 DRAM stack's pseudo-channels and partitions the available memory in the HBM2 stack into sixteen independent blocks, resulting in an aggregate data transfer rate of 170 GBps per HBM2 stack. (An Intel Stratix 10 MX FPGA incorporates one or two 3D stacked HBM2 memories.) This high data rate maximizes the PIE's use of the HBM2 stack's available bandwidth and delivers significantly more performance relative to the bandwidth available from external DDR SDRAM. High memory bandwidth has proven to be a key factor in achieving low-latency RNN performance.

The PIE is integrated into the OpenVINO environment for direct import of TensorFlow models. The PIE also comes with a software stack for direct import of Keras models.

A new Intel White Paper titled “Accelerating Memory Bound AI Inference Workloads with Intel® Stratix® 10 MX Devices” provides additional technical details on this topic. To download this White Paper, click here to access the Intel FPGA Partner Solution page, scroll down to the Manjeera Digital Systems section, and click on the White Paper link.

Intel's silicon and software portfolio empowers our customers' intelligent services from the cloud to the edge.
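As a back-of-the-envelope illustration (derived from the post's figures, not from the White Paper), the 170 GBps aggregate implies roughly 10.6 GBps per pseudo-channel, and the partitioning scheme amounts to sharding data so all sixteen channels can stream concurrently:

```python
# Back-of-the-envelope only: the per-channel figure is derived from the
# post's 170 GBps aggregate number, not separately published.
import numpy as np

CHANNELS = 16
aggregate_gbps = 170.0
print(f"~{aggregate_gbps / CHANNELS:.1f} GBps per pseudo-channel")  # ~10.6

# Sharding a weight matrix row-wise so each pseudo-channel's partition
# streams its own slice of the model concurrently (illustrative only):
weights = np.arange(256 * 4, dtype=np.float32).reshape(256, 4)
shards = np.array_split(weights, CHANNELS, axis=0)  # one shard per channel
assert len(shards) == CHANNELS and sum(s.shape[0] for s in shards) == 256
```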
Notices and Disclaimers

1 Tests measure performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit www.intel.com/benchmarks. Intel technologies may require enabled hardware, software or service activation. No product or component can be absolutely secure. Your costs and results may vary. © Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.

NextPlatform.com article describes Intel® oneAPI use at CERN for Large Hadron Collider (LHC) research

Independent consultant James Reinders has just published a comprehensive article on the NextPlatform.com Web site titled “CERN uses [Intel®] DL Boost, oneAPI to juice inference without accuracy loss,” which describes CERN's use of deep learning and Intel® oneAPI to accelerate Monte Carlo simulations for Large Hadron Collider (LHC) research. Reinders writes that CERN researchers “have demonstrated success in accelerating inferencing nearly two-fold by using reduced precision without compromising accuracy at all.” The work is being carried out as part of Intel's long-standing collaboration with CERN through CERN openlab.

If Reinders' name looks familiar to you, that's because he recently published a book about the use of Data Parallel C++ (DPC++), which is the foundation compiler technology at the heart of Intel oneAPI. (See “Springer and Intel publish new book on DPC++ parallel programming, and you can get a free PDF copy!”)

CERN researchers found that about half of the computations in a specific neural network, a Generative Adversarial Network (GAN), could be switched from FP32 to INT8 numerical precision, which is directly supported by Intel® DL Boost, without loss of accuracy. GAN performance doubled as a result, while accuracy was unaffected. Although this work was done using Intel® Xeon® Scalable Processors with direct INT8 support, Reinders' article also makes the next logical jump: “INT8 has broad support thanks to Intel Xeon [Scalable Processors], and it is also supported in Intel® Xe GPUs. FPGAs can certainly support INT8 and other reduced precision formats.”

Further, writes Reinders: “The secret sauce underlying this work and making it even better: oneAPI makes Intel DL Boost and other acceleration easily available without locking in applications to a single vendor or device.”

“It is worth mentioning how oneAPI adds value to this type of work. Key parts of the tools used, including the acceleration tucked inside TensorFlow and Python, utilize libraries with oneAPI support. That means they are openly ready for heterogeneous systems instead of being specific to only one vendor or one product (e.g. GPU).

“oneAPI is a cross-industry, open, standards-based unified programming model that delivers a common developer experience across accelerator architectures. Intel helped create oneAPI, and supports it with a range of open source compilers, libraries, and other tools. By programming to use INT8 via oneAPI, the kind of work done at CERN described in this article could be carried out using Intel Xe GPUs, FPGAs, or any other device supporting INT8 or other numerical formats for which they may quantize.”

For additional information about Intel oneAPI, see “Release beta09 of Intel® oneAPI Products Now Live – with new programming tools for FPGA acceleration including Intel® VTune™ Profiler.” You may also be interested in an instructor-led class titled “Using Intel® oneAPI Toolkits with FPGAs (IONEAPI).”
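For readers who want the gist of the FP32-to-INT8 switch described above, here is a minimal sketch of symmetric post-training quantization. It shows only the core arithmetic; the CERN flow uses Intel DL Boost through oneAPI-backed TensorFlow with proper calibration, not this toy code:

```python
# Core arithmetic of symmetric INT8 quantization: scale FP32 tensors into
# [-127, 127], multiply in INT8 with INT32 accumulation, then rescale.
import numpy as np

def quantize(x: np.ndarray):
    scale = float(np.abs(x).max()) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(1)
a = rng.standard_normal((4, 8)).astype(np.float32)  # activations
w = rng.standard_normal((8, 3)).astype(np.float32)  # weights

qa, sa = quantize(a)
qw, sw = quantize(w)
int32_out = qa.astype(np.int32) @ qw.astype(np.int32)  # integer matmul
approx = int32_out * (sa * sw)                         # rescale to float

print(np.abs(approx - a @ w).max())  # small error vs. the FP32 reference
```

The appeal is exactly what the CERN result demonstrates: integer multiplies are cheaper and faster on hardware with native INT8 support, and for many networks the rescaled result is accurate enough that nothing is lost.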
Notices & Disclaimers

Performance varies by use, configuration, and other factors. Learn more at www.Intel.com/PerformanceIndex. Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy. Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure. Your costs and results may vary. Intel technologies may require enabled hardware, software or service activation. © Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.