Working Group Charter and Direction
The OCP Future Technologies Initiative (FTI) has been a valuable vehicle for incubating emerging technologies, architectures, and directions. In recent years, one such FTI effort has been the AI HW/SW Co-Design initiative, which focuses both on emerging technology for AI computational infrastructure and on the ecosystem required for efficient resource management, including the discovery, configuration, and composition of computing resources to meet the service-level objectives of AI workloads. In 2023, this team's collaborative work culminated in the publication of an OCP white paper titled “Polymorphic Architecture for Future AI Applications.” The AI HW/SW Co-Design Working Group opened 2024 by building on this groundwork: it revisited the goals of the charter and refined the polymorphic architecture framework around hierarchical transformability and composability, enabling large-scale, dynamically reconfigurable heterogeneous computing infrastructure for high-performance, real-time execution of AI workloads.
Closing 2024
As a member of the FTI in OCP, the AI HW/SW Co-Design Working Group progressed the architecture definition outlined in the charter and evangelized its underlying principles. The group closed 2024 with the publication of the Polymorphic Architecture Definition v1.0, demonstrating the breadth of possibilities in discovering, composing, managing, and optimizing resources to meet the demands of ever-changing AI workloads. At the 2024 OCP Global Summit, the group held an FTI workshop highlighting the capabilities of AI HW/SW co-design and its applicability across the IT ecosystem, from applications to data management, topology awareness, and silicon design. Industry leaders discussed the co-design requirements at each layer of the infrastructure needed to meet the evolving demands of the AI ecosystem, and the workshop produced valuable ideas and directions for the group.
As the year closed, it was clear that the working group's mission needed to continue, with a sharper focus on specific areas: developing clear requirements for the needed polymorphic capabilities and delivering proofs of concept that demonstrate their full potential.
Launching 2025
With the new year, the AI HW/SW Co-Design Working Group transitioned from an OCP FTI effort to a sub-project under the OCP Server Project. This transition moves the working group from incubating ideas and concepts to a project with a mandate to develop and operationalize the Polymorphic Architecture, taking it from white papers to proofs of concept. The sub-project will not only demonstrate the ideas identified, but also develop and accept requirements for collaboration and integration with other OCP projects and sub-projects.
Polymorphic Architecture Subsystems
The Polymorphic Architecture enables hierarchical, fractal composability across large-scale computing resources so that they behave like a single logical accelerator. The resources can then be dynamically reconfigured and adapted to an application's runtime behavior for highly efficient execution. The capability of the polymorphic system relies on the polymorphism of its sub-systems, each supported by technologies in the corresponding domain; a minimal illustrative sketch of such a composable resource model follows Figure 1. As shown in Figure 1, these sub-systems include:
- Polymorphic computing resources, with characteristics of heterogeneity, modularity, virtualization, composability, and transformability. These should enable the various mechanisms of parallelism (such as 3D parallelism) and the flexibility common to AI training, and even to test-time compute scaling.
- Polymorphic interconnect, with characteristics of reconfigurability, composability, and scalability. Furthermore, the interconnect fabric can be co-optimized via topology-aware collective communication and/or in-network computing. The logical topology built on the polymorphic interconnect fabric can be formed via all-to-all switching, multi-pathing, or even relaying through other compute nodes.
- Polymorphic memory hierarchies: composable, with capabilities such as tiering and/or in-memory computing.
- Software ecosystem and optimizations: application characterization, computation partitioning & mapping, fault tolerance, compiler optimizations, etc.
Figure 1: Possible Sub-System Targets for the Working Group
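To make the notion of hierarchical, fractal composability more concrete, the following is a minimal Python sketch of how composable resources might be modelled. The class names (Resource, LogicalAccelerator), fields, and capacities are hypothetical, invented for illustration under the assumptions above; this is a sketch for discussion, not an OCP specification or working-group deliverable.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Resource:
    """A leaf resource such as an accelerator, a memory tier, or a link."""
    name: str
    kind: str          # e.g. "gpu", "hbm", "cxl-mem", "nic"
    capacity: float    # TFLOPS for compute, GB for memory, GB/s for links


@dataclass
class LogicalAccelerator:
    """A composed group of resources that behaves as one logical device.

    Composition is fractal: a LogicalAccelerator may contain other
    LogicalAccelerators (chip -> server -> super node -> cluster).
    """
    name: str
    children: List["LogicalAccelerator"] = field(default_factory=list)
    leaves: List[Resource] = field(default_factory=list)

    def total(self, kind: str) -> float:
        """Aggregate capacity of one resource kind across the whole hierarchy."""
        own = sum(r.capacity for r in self.leaves if r.kind == kind)
        return own + sum(c.total(kind) for c in self.children)

    def recompose(self, partitions: int) -> List["LogicalAccelerator"]:
        """Split the children into `partitions` smaller logical devices,
        modelling dynamic re-composition to match a workload's parallelism."""
        groups = [LogicalAccelerator(f"{self.name}/part{i}") for i in range(partitions)]
        for i, child in enumerate(self.children):
            groups[i % partitions].children.append(child)
        return groups


# Example with invented numbers: two servers with two GPUs each, first
# composed into one logical accelerator, then re-composed into two.
servers = [
    LogicalAccelerator(
        f"server{i}",
        leaves=[Resource(f"gpu{i}.{j}", "gpu", 1000.0) for j in range(2)],
    )
    for i in range(2)
]
cluster = LogicalAccelerator("cluster", children=servers)
print(cluster.total("gpu"))        # 4000.0 "TFLOPS" seen as a single device
print(len(cluster.recompose(2)))   # 2 independently usable logical devices
```

The point of the sketch is that the same grouping operation applies at every level of the hierarchy, so a chip, a server, a super node, or a cluster can each be presented as one logical accelerator and re-partitioned to match a workload's parallelism.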
As AI continues its rapid evolution, with exponential growth every year, its computational demands cannot be met by accelerator architecture innovations alone. Interconnect technologies play an equal or greater role in supporting AI development, whether by scaling up computation power (within a chip, a server, or a super node) or by scaling out across thousands of compute nodes.
Scope of Work for the AI HW/SW Co-Design Working Group in 2025
To ensure delivery of concrete ideas and demonstrations, the working group will focus on the polymorphic interconnect as its first phase. With this, the 2025 scope of work is defined as follows:
- The working group will explore the interconnect-related technologies outlined above to prototype the polymorphism of the interconnect fabric in AI systems. To that end, a simulation platform should be built to demonstrate resource discovery and fabric-composition efficiency in the polymorphic design.
- The working group should also track emerging AI fabric technologies from the community that aim to increase the efficiency of the compute-communication domain and to optimize the ecosystem for failure recovery, observability, spares, locality, and overall cost of AI applications. Two key touchpoints for this AI fabric work are the Ultra Ethernet Consortium (UEC), connecting systems, and Ultra Accelerator Link (UALink), connecting accelerators. Thus, the working group needs to explore whether to extend the state of the art, both through organic innovation from within this group and by incorporating emerging research that optimizes the data-management and communication elements of the fabric. For example, Topology Aware Collectives are under investigation, both in the working group and across the global community, as a way to increase the efficiency of the compute-communication domain (a simplified sketch of this effect follows this list).
- Another area of investigation is how to benchmark the composed compute-communication elements, mapping them to performance and efficiency targets. One possible partner for this benchmarking investigation is MLCommons.
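As a rough illustration of why topology awareness matters for collectives, the sketch below compares the number of cross-node chunk transfers in a flat ring allreduce against a hierarchical variant that reduces within each node before going across nodes. The cost model (counting equal-size chunk sends that cross a node boundary), the rank ordering, and the cluster size are simplifying assumptions chosen for illustration; they are not measurements or working-group results.

```python
# Hedged, simplified cost model: count point-to-point chunk transfers that
# cross a node boundary; chunk sizes and link bandwidths are ignored.

def flat_ring_cross_node_sends(nodes: int, gpus_per_node: int) -> int:
    """Ring allreduce over every rank, with ranks ordered node by node.

    The ring runs for 2*(P-1) steps; in each step every rank forwards one
    chunk to its ring neighbour, and exactly `nodes` of those hops cross a
    node boundary (one per node, including the wrap-around).
    """
    total_ranks = nodes * gpus_per_node
    return 2 * (total_ranks - 1) * nodes


def hierarchical_cross_node_sends(nodes: int, gpus_per_node: int) -> int:
    """Topology-aware variant: reduce inside each node first (no cross-node
    traffic), run a ring allreduce among one leader per node, then broadcast
    the result back inside each node."""
    # Each of the 2*(nodes-1) leader steps sends one cross-node chunk per leader.
    return 2 * (nodes - 1) * nodes


if __name__ == "__main__":
    nodes, gpus_per_node = 16, 8
    flat = flat_ring_cross_node_sends(nodes, gpus_per_node)
    hier = hierarchical_cross_node_sends(nodes, gpus_per_node)
    print(f"flat ring allreduce:      {flat} cross-node sends")
    print(f"topology-aware allreduce: {hier} cross-node sends "
          f"({flat / hier:.1f}x fewer)")
```

Under these assumptions, the topology-aware ordering cuts cross-node traffic by roughly the number of GPUs per node, which is the intuition behind evaluating such collectives on the polymorphic interconnect fabric.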
As the working group shifts to the composition and optimization of AI fabrics for the polymorphic architecture, interconnects need to be investigated at every level of the design. To that end, members of the team have proposed collaborating with the Open Chiplet Economy Sub-project to map the Abundance of Wires for chiplets as an interconnect option. Figure 2 depicts one aspect of the design space[1] for the AI ecosystem.
Figure 2: NIST presentation discussing the evolution of interconnect
To build a more complete model, the AI HW/SW Co-Design sub-project will need to collaborate closely with other OCP sub-projects, such as the Composable Memory Systems Working Group, the Open Chiplet Economy Sub-Project, and Short Reach Optical under FTI, as well as with the larger Server Project group governing all of this work. For each of these collaborations, the AI HW/SW Co-Design Working Group will build a set of requirements for deliverables.
Call to Action
The AI HW/SW Co-Design Working Group is revising its charter to reflect the scope change from purely conceptual development to implementation (proof of concept, or POC), initially addressing AI training at scale.
The goals we have identified for 2025:
Core requirements (work items):
- WI1 - Understand the performance bottlenecks and usage trends of AI/LLM models. Evaluate how a polymorphic interconnect fabric impacts AI workload performance.
- WI2 - Identify and develop methodologies for resource discovery mapped to the topology, and mechanisms to enable fabric re-composability.
- WI3 - Investigate and incorporate the re-composition mechanisms of the polymorphic architecture via simulation. One suggestion is to use ASTRA-Sim as a base for building up the simulation framework for the Polymorphic Architecture.
- WI4 - Refine the role of topology interfaces[2] and of collectives, for example by evaluating the potential of Topology Aware Collectives.
- WI5 - Articulate the roles of scale-up/scale-out networks in the polymorphic architecture and their impact on (and/or requirements from) topology-aware fabrics.
- WI6 - Establish the methodology for measuring the overall efficiency of composition approaches, as well as of failure handling and recovery (one possible framing is sketched after this list).
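For WI6, one way to frame the methodology discussion is a simple scalar that combines the throughput efficiency of a composition with the time lost to failure handling and re-composition. The sketch below, with hypothetical field names (CompositionRun, ideal_throughput, recovery_time_s) and made-up numbers, is only a starting point for that discussion, not an agreed metric or measured result.

```python
from dataclasses import dataclass


@dataclass
class CompositionRun:
    ideal_throughput: float     # e.g. tokens/s if the fabric were perfectly composed
    achieved_throughput: float  # measured tokens/s on the composed fabric
    wall_time_s: float          # total wall-clock time of the run
    recovery_time_s: float      # time spent detecting failures and re-composing


def composition_efficiency(run: CompositionRun) -> float:
    """Fraction of ideal throughput actually delivered by the composition."""
    return run.achieved_throughput / run.ideal_throughput


def availability(run: CompositionRun) -> float:
    """Fraction of wall-clock time not spent on failure handling/re-composition."""
    return 1.0 - run.recovery_time_s / run.wall_time_s


def overall_efficiency(run: CompositionRun) -> float:
    """One possible scalar: throughput efficiency discounted by availability."""
    return composition_efficiency(run) * availability(run)


# Example with made-up numbers: a composition reaching 82% of ideal throughput
# that loses 3% of the run to re-composition after a link failure.
run = CompositionRun(ideal_throughput=1000.0, achieved_throughput=820.0,
                     wall_time_s=3600.0, recovery_time_s=108.0)
print(f"overall efficiency: {overall_efficiency(run):.2%}")  # ~79.54%
```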
Additional requirements (additional items):
- AI1 - Establish requirements for collaborative design with projects/subprojects.
- AI2 - Establish POC definitions for the 2025 OCP Global Summit.
- AI3 - Enlist partners, collaborators, and members for the newly launched sub-project.
The journey of a small core group that has taken AI HW/SW Co-Design from a concept to this point has been both challenging and exciting. For 2025, we seek to enlist co-travelers from the larger OCP community to join us on this journey: to refine the ideas, establish concrete deliverables, and begin development.