Host firmware is an important component of servers running in a data center. It is fundamental to the performance, manageability, and security of cloud applications. It is also an essential enabler for the new computer system technologies that cloud operators need to serve billions of people.
Facebook has data centers around the globe, each housing a large number of servers, and the server infrastructure is constantly expanding. To address the resulting scalability challenges, Facebook and Intel have been collaborating to develop open source firmware solutions for OCP platforms using the Xeon Scalable Processor (Xeon-SP) Firmware Support Package (FSP). This experimental project reached an important milestone in March 2020 with the completion of a proof of concept (POC): we successfully developed an alternative host firmware approach for a Xeon-SP based OCP platform. This approach aligns well with the Open Compute Project (OCP) Open System Firmware (OSF) direction.
Traditional Host Firmware Approach
The first generation of host firmware is more commonly known as BIOS (Basic Input Output System). It was designed to provide an abstraction layer over the hardware, and it was instrumental to the success of the personal computer era.
In 2005, the computer industry invented a new generation of firmware architecture called UEFI (Unified Extensible Firmware Interface) to enable firmware to be:
- Extensible for a proliferating number of peripheral devices
- Able to support multiple OSes
- Able to support the growing complexity of computer and network systems
This dovetailed with an established firmware industry business model, where:
- The silicon vendor provides firmware reference code
- The IBV (Independent BIOS Vendor) makes the firmware feature-ready
- ODMs customize the firmware for specific platforms and make it production-ready
- Customers treat the firmware basically as a black box
This traditional approach has worked well for the industry (in particular for enterprise customers) over the past 20 years. It will remain the industry mainstream for years to come.
Alternative Host Firmware Approach
Hyperscalers, however, have different requirements from enterprise customers. For Facebook, for instance, an efficient and reliable open software stack helps manage systems securely at scale while providing the very high data center uptime needed to connect people around the world. To meet Facebook's scalability requirements for host firmware, we have been working with our partners on opening the host firmware. We refer to this alternative as the OSF (Open System Firmware) approach.
How it works:
In the traditional approach, UEFI was not initially designed to be a true operating system; it had to evolve to work on modern server systems of ever-growing complexity.
Figure-1: UEFI Platform Initialization boot phases
With the OSF architecture, the bootloader executes in two phases:
- Early silicon initialization, using coreboot and binary modules as necessary, brings the cores out of reset and sets up RAM. Coreboot then boots a payload. In our case, the payload is Linux.
- Linux is a mature operating system that includes device drivers, file system drivers, and network drivers. It programs all devices, then locates and executes the target OS. The advantage lies in its modern kernel design and a code base that is under the scrutiny of a very large developer community. Advancements in embedded Linux have made it easy to add production software and services to the pre-boot environment and have opened up firmware to a new generation of engineers.
(Tux the penguin is attributed to firstname.lastname@example.org)
Figure-2: OSF architecture
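The second phase, in which Linux locates and executes the target OS, can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: it assumes the hand-off uses kexec-tools (the article does not name the mechanism), and the `BootEntry` type, the helper names, and all paths are hypothetical. The sketch only builds the commands; it does not run them.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import List, Optional


@dataclass
class BootEntry:
    """Hypothetical description of one candidate target OS."""
    name: str
    kernel: Path
    initrd: Path
    cmdline: str


def locate_target(entries: List[BootEntry]) -> Optional[BootEntry]:
    """Return the first entry whose kernel and initrd both exist on disk."""
    for entry in entries:
        if entry.kernel.is_file() and entry.initrd.is_file():
            return entry
    return None


def kexec_commands(entry: BootEntry) -> List[List[str]]:
    """Build (but do not run) the kexec-tools commands for this entry.

    Assumption: the firmware Linux hands off with kexec. The first
    command loads the target kernel, the second jumps into it.
    """
    load = [
        "kexec", "-l", str(entry.kernel),
        f"--initrd={entry.initrd}",
        f"--command-line={entry.cmdline}",
    ]
    execute = ["kexec", "-e"]
    return [load, execute]
```

A caller would pass candidate entries in priority order (for example local disk first, then network) and run the returned commands only after `locate_target` has found a usable entry.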
OSF design principles are:
- Coreboot needs to be as slim as possible. We would like to push as much functionality into Linux as possible.
- Coreboot is fully open source. An open source strategy works for Linux, and it makes sense for a bootloader as well.
- All platforms share the same coreboot code base. Images for each platform are customized through configurations (similar to the Linux Kconfig mechanism).
- The Linux OS image is the same across all platforms. This standardizes firmware management. Necessary platform differences are handled only at run time.
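The last two principles can be illustrated with a small sketch: build-time customization as a shared base configuration plus per-platform overrides, and run-time customization as a lookup keyed on the platform the single Linux image finds itself on. Everything here is an assumption made for illustration: the `EXAMPLE_*` option names, the settings tables, and the choice of the Linux sysfs file `/sys/class/dmi/id/product_name` as the detection source are not taken from the project.

```python
from pathlib import Path
from typing import Dict

# Shared by every platform (hypothetical option names, not real coreboot options).
BASE_CONFIG: Dict[str, str] = {
    "EXAMPLE_PAYLOAD": "linux",
    "EXAMPLE_SERIAL_CONSOLE": "y",
}

# Per-platform build-time overrides (hypothetical values).
PLATFORM_OVERRIDES: Dict[str, Dict[str, str]] = {
    "monolake": {"EXAMPLE_SOCKET_COUNT": "1"},
    "tiogapass": {"EXAMPLE_SOCKET_COUNT": "2"},
}


def build_config(platform: str) -> Dict[str, str]:
    """Merge the shared base with one platform's overrides, leaving the base untouched."""
    merged = dict(BASE_CONFIG)
    merged.update(PLATFORM_OVERRIDES[platform])
    return merged


# Run-time differences applied by the single, shared Linux image (hypothetical).
DEFAULT_RUNTIME: Dict[str, int] = {"sockets": 1}
RUNTIME_SETTINGS: Dict[str, Dict[str, int]] = {
    "tioga pass": {"sockets": 2},
}


def detect_platform(dmi_path: Path = Path("/sys/class/dmi/id/product_name")) -> str:
    """Read the product name exposed by the kernel and normalize it."""
    return dmi_path.read_text().strip().lower()


def runtime_settings(product_name: str) -> Dict[str, int]:
    """Pick platform-specific settings at run time, falling back to defaults."""
    return RUNTIME_SETTINGS.get(product_name, DEFAULT_RUNTIME)
```

The point of the split is that only `build_config` varies the firmware image per platform, while the Linux side ships one image and defers every remaining difference to a lookup like `runtime_settings(detect_platform())`.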
The benefits of the OSF approach include:
- System uptime - The boot time is reduced significantly, due to better OS capability and driver maturity.
- Management - Linux engineers become firmware engineers. This taps Facebook’s engineering resources.
- Technology update turnaround time - With full control of the entire firmware stack and closer collaboration with Intel, Facebook is able to respond to issues and enable new technologies at a much quicker pace.
- Security - Open source firmware gives the ability to control security right from the reset vector.
Coreboot has been used successfully in commercial Google Chromebook devices. In addition, Facebook has been expanding coreboot’s enablement through network systems and OpenCellular in TIP (Telecom Infrastructure Project). However, the existing coreboot/Linux approach lacks server processor features and support for various platforms with server processors, in particular Intel® Xeon® Scalable processors.
Facebook conducted a proof of concept on the OCP platform MonoLake, which is based on a single-socket Broadwell-DE. MonoLake is used heavily for Facebook servers and other applications. With MonoLake, we added server features such as BMC integration, and we made the Facebook provisioning process work with the OCP OSF approach.
Facebook collaborated with Intel and with our ODMs. Engineers at these companies have been working together toward a common goal, supported by business agreements and technology infrastructure (including a code sharing model and improved build system support).
As the next step of enablement, we developed a proof of concept to enable coreboot for Intel® Xeon® Scalable processors (Xeon-SP).
For this POC, Intel provided a binary FSP for the first generation of Xeon-SP (Skylake-SP), the first time such an FSP had been developed. Support for multi-socket Xeon-SP also brought a number of challenges to coreboot. Through collaboration with Intel, we successfully completed the POC of FSP/coreboot on Skylake-SP and on the OCP platform Tioga Pass, which has two Skylake-SP processor sockets. The coreboot changes were upstreamed to https://github.com/coreboot/coreboot.
Following the POC, the collaboration continues in order to enable future generations of Xeon-SP processors and the OCP platforms that use them, such as Yosemite V3 and SonoraPass.
Making a new approach feature-complete, stable, and ready for deployment in a data center is a multi-year effort. Our next steps are:
- Infrastructure integration - The OSF approach needs to work seamlessly with Facebook infrastructure tools.
- Performance evaluation - We will test the OSF approach in the Facebook data center environment.
- Gap evaluation - We will reduce the gap between the OSF approach and the traditional approach in how soon firmware is ready for newer processors and platforms.
To learn more about this project, please look at the “FSP/coreboot status update by Intel and FB” and “OSF feature development for server by FB and Wiwynn” presentations at the upcoming 2020 OCP Virtual Summit. Learn more and register for free here: https://www.opencompute.org/summit/global-summit