Computing's Hidden Menace: The OCP Takes Action Against Silent Data Corruption (SDC)

In computing systems, there lurks an insidious threat: Silent Data Corruption (SDC). A hardware-induced SDC is an error in which data is corrupted within a computing system, often without any indication from the built-in error detection mechanisms. It can result from manufacturing defects, aging components, or environmental factors. These errors may cause anomalous or unpredictable software behavior and incorrect calculations, with consequences such as data loss.

Acknowledging the scale of this threat, the Open Compute Project (OCP) has stepped up to take action. The OCP's Server Project has established the Server Component Resilience Workstream, focused explicitly on tackling hardware-induced Silent Data Corruption. This workstream, which brings together key industry players Meta, Google, Intel, ARM, AMD, Microsoft, and NVIDIA, is launching a request for research proposals to fuel innovation in the detection and prevention of SDC.

"As infrastructure rapidly shifts towards AI, the need for coordinated efforts to combat Silent Data Corruptions at scale only grows. We need to treat hardware resilience as a first-order concern, and OCP has been instrumental in bringing together a collaborative ecosystem. We are proud to partner with the OCP Server Component Resilience workstream to enable this research domain," said David Ramku, Sr Director of Infrastructure at Meta and OCP Board member.

Facing the Challenge: The OCP Workstream

The scale of the SDC problem is daunting. Today's computing systems rely on advanced SoCs (System on a Chip) containing billions of transistors. Cloud computing involves millions of nodes running around the clock (24/7/365). Many applications simply cannot tolerate computational errors.

The OCP's Server Component Resilience Workstream aims to:

  1. Spread Awareness: Raise consciousness about SDC challenges within the entire computing community.

  2. Find Solutions: Develop methods to pinpoint the root causes of SDCs and prevent their occurrence.

  3. Collaborate with Academia: Foster partnerships with researchers to explore new approaches.

  4. Drive Innovation: Transfer research outcomes back to the industry, fueling the development of cutting-edge technologies.

Avenues for Research: Calling All Innovators

The OCP Foundation is inviting researchers to submit proposals in the following areas. The research vectors below are offered as guidance to help principal investigators focus their research tasks; the list is not exhaustive.

  • Effective screening and faster detection
  • Techniques in the software stack to detect/correct errors
  • Pre-silicon and post-silicon coverage/susceptibility assessment and improvement
  • Hardware techniques for SDC detection and resiliency

The Future of Computing Resilience

SDC is not a new problem, but the scale and complexity of modern computing systems have made it more pressing than ever. The OCP's initiative is a significant step toward building more resilient computing infrastructure. If you are a researcher with bright ideas to tackle SDC, don't miss out on this opportunity!

Timeline: 

Submission Details:

Researchers interested in contributing to this initiative can submit their proposals by April 3, 2024, via the designated form. A Sample Proposal, along with a comprehensive list of potential research areas to guide principal investigators, can be found here.

The workstream welcomes questions sent to scr-feedback@ocproject.net