The OCP Accelerator Module (OAM) specification defines an open-hardware compute accelerator module form factor and its interconnects.
Facebook, Microsoft, and Baidu contributed the OAM design specification to the OCP community during the 2019 OCP Global Summit. Following this contribution, the Open Compute Project Foundation is chartering an Open Accelerator Infrastructure (OAI) sub-project within the OCP Server Project to develop the system-level aspects of deploying upcoming products that meet the OAM specification.
By Whitney Zhao (Facebook), Siamak Tavallaei (Microsoft), and Ruiquan Ding (Baidu).
What are OAM and OAI?
The OAM design specification defines the mezzanine form factor and common specifications for a compute accelerator module. In contrast with a PCIe add-in card form factor, the mezzanine module form factor of OAM facilitates scalability across accelerators by simplifying the system solution when interconnecting high-speed communication links among modules.
Outlining an open, modular, hierarchical infrastructure for interoperable accelerators, the OAI base specification covers OAM and a complementary set of subsystems such as a compliant Baseboard (UBB), PCIe Switch Board (PSB), Tray, and Chassis along with a compliant Secure Control Module (SCM) for rack and system level management.
- OAM: OCP Accelerator Module (a mezzanine module form factor for various accelerators)
- UBB: Universal Baseboard (interconnecting topologies between accelerators to scale up)
- PSB: PCIe Switch Board (common interface between UBB, Hosts, and other IO devices to scale out)
- SCM: a Secure Control Module for management
- Tray: a means for ease of field replacement and serviceability
- Chassis: an outline for a collection of Trays and input/output resources to scale out
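The hierarchy above can be pictured as nested building blocks: a Chassis holds Trays, each Tray carries a UBB, and each UBB interconnects several OAMs. The sketch below is purely illustrative; the class and field names are our own shorthand, not terms defined by the specification.

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative model of the OAI building-block hierarchy.
# Names and fields here are assumptions for the sketch, not spec definitions.

@dataclass
class OAM:
    vendor: str  # accelerator supplier for this mezzanine module

@dataclass
class UBB:
    # Universal Baseboard: interconnects OAMs to scale up
    modules: List[OAM] = field(default_factory=list)

@dataclass
class Tray:
    # Field-replaceable unit carrying one UBB for serviceability
    ubb: UBB

@dataclass
class Chassis:
    # Collection of Trays plus I/O resources to scale out
    trays: List[Tray] = field(default_factory=list)

# One chassis with a single tray whose UBB carries eight accelerator modules:
chassis = Chassis(trays=[Tray(ubb=UBB(modules=[OAM(vendor="example") for _ in range(8)]))])
print(sum(len(t.ubb.modules) for t in chassis.trays))  # total OAMs in the chassis
```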
What value does OAM bring?
As AI evolves, different suppliers produce new AI accelerators, but the technical challenges and design complexities of current proprietary AI hardware systems mean it generally takes about 6-12 months to integrate them into systems. This delay prevents quick adoption of new, competitive AI accelerators.
The OAM specification is being authored by Facebook, Microsoft, and Baidu. Other companies are collaborating on the specification:
- Big internet companies such as Google, Alibaba, and Tencent
- AI chip/module companies such as Nvidia, Intel, AMD, Qualcomm, Xilinx, Graphcore, Habana, BittWare (Molex), and Huawei
- OEM/ODMs such as Huawei, IBM, Lenovo, Inspur, Penguin Computing, QCT, and Wiwynn
“The OAM specification, along with the baseboard and enclosure infrastructure, will speed up the adoption of new AI accelerators and will establish a healthy and competitive ecosystem,” stated Bill Carter, Chief Technology Officer for the Open Compute Project Foundation.
Why is OAI needed?
Artificial Intelligence (AI) applications are rapidly evolving and producing an explosion of new types of hardware accelerators for Machine Learning (ML), Deep Learning (DL), and High-Performance Computing (HPC). Different implementations target similar requirements for power/cooling, robustness, serviceability, configuration, programming, management and debug, as well as inter-module communication to scale up and input/output bandwidth to scale out.
To take advantage of available industry-standard form factors and reduce the time and effort of producing suitable solutions, many implementations have adopted the PCIe CEM form factor as a quick path to market. Such solutions, however, are not optimized for upcoming AI workloads, which require ever-growing bandwidth and interconnect flexibility for data/model parallelism.
We need an open infrastructure to capture the rapid innovation in artificial intelligence. OAI is where open accelerator infrastructure meets open artificial intelligence.
Helpful links to learn more and get started:
Mailing List: https://ocp-all.groups.io/g/OCP-OAI
Server (Parent) Project: https://www.opencompute.org/projects/server