The 2015 Open Compute Project (OCP) Hackathon took place from noon on Tuesday, March 10, until noon on Wednesday, March 11. Each team had 24 hours to develop, implement, and demonstrate a “hack” using only open source hardware and software technologies. The winning hack must be useful, novel, and non-obvious.
In 2014, the winning team consisted of Dimitar Boyn from I/O Switch Technologies, Inc., Jon Ehlen from Facebook, Inc., Ron Herardian, a self-described computer hobbyist and hacker, Rob Markovic, an independent consultant, Peter Mooshammer from Opscale, LLC, and Andreas Olofsson from Adapteva, Inc. The winning hack, Adaptive Storage, was a radical new storage architecture.
At the 2015 Hackathon, the team was re-formed. The new team, dubbed “Hardcore Hackers,” consisted of Dimitar Boyn, Ron Herardian, Rob Markovic, Peter Mooshammer and Austin McNiff, a freshman student at San Jose State University. Most of the team members met for the first time at the 2014 Hackathon and none of the team’s members were previously acquainted with Austin McNiff.
The team’s 2015 hack was to implement n-way redundancy for the Zettabyte File System (ZFS) using OpenZFS and to scale it horizontally using flash memory switched fabrics. If it worked, the hack could revolutionize ZFS. Instead of relying on traditional two-way clusters to provide high availability, ZFS could provide continuous availability using commodity hardware.
ZFS is an open source file system available in FreeBSD and, through the OpenZFS project, on Linux and Mac OS X, as well as on illumos, a continuation of OpenSolaris. The team used CentOS, an open source Linux distribution based on Red Hat Enterprise Linux. Red Hat is widely used in data centers and the company plays a major role in many open source projects.
The 2015 OCP Hackathon became the first demonstration of a non-proprietary ZFS storage system with a single file system spanning more than two servers. It was also the first ever public demonstration of a native Peripheral Component Interconnect Express (PCIe) flash memory switched fabric.
Flash Memory Switched Fabrics
Dimitar Boyn brought several flash memory switched PCIe fabric cards and dozens of I/O Switch Raijin Next Generation Form Factor (NGFF) solid state drives (SSDs) to the event. I/O Switch is a corporate member of OCP and is in the process of open sourcing its Raijin SSD card design through the OCP Storage project. The team quickly assembled a lab environment on site at the event using commodity x86_64 servers and networking equipment.
The SSDs, manufactured by Memoright International, Inc., are designed for the standard M.2 connectors (formerly known as the Next Generation Form Factor, or NGFF) used in notebook computers and mobile devices. However, the SSDs contain enterprise-grade Multi-Level Cell (MLC) NAND (“Not AND”) flash memory and have advanced wear leveling and error correcting firmware. As a result, standard M.2 SSDs can be used in servers. Using standard parts based on open standards reduces cost.
The open standards based I/O Switch Thunderstorm flash memory switched fabric PCIe card has four M.2 connectors and accepts either Advanced Host Controller Interface (AHCI) or Non-Volatile Memory Express (NVMe) M.2 SSDs. In systems with an AHCI-compatible Basic Input/Output System (BIOS), which includes virtually all computers manufactured in the last decade, the SSDs operate as native, locally attached block devices. In the AHCI implementation, built-in driver support is available in every major operating system (OS), which reduces complexity.
Figure 1. I/O Switch Thunderstorm Flash Memory Switched Fabric PCIe Card
Native PCIe SSDs are much faster than Serial Advanced Technology Attachment (SATA) or Serial Attached SCSI (SAS-2) SSDs because of flash memory de-indirection: the electrical and logical paths between application code and data are shorter, and input/output (I/O) latency is lower because there is no disk controller bottleneck. Applications or software data protection technologies, such as Logical Volume Manager (LVM), software Redundant Array of Inexpensive Disks (RAID) or Just a Bunch of Disks (JBOD), can be used instead of a conventional Host Bus Adapter (HBA).
Storage solutions can be optimized for I/O Operations per Second (IOPS), latency, throughput or capacity. Flash memory switched fabrics are optimized for throughput but provide competitive IOPS and latency. Some proprietary flash memory packaging approaches perform better in terms of IOPS, have higher storage density per PCIe slot, or offer lower latency, for example, by placing flash memory on Dual In-Line Memory Modules (DIMMs). However, proprietary approaches are more costly than standard M.2 SSDs.
The ability to add many SSDs to a server without using any hard disk drive (HDD) bays is especially useful for storage solutions based on ZFS. In ZFS, read I/O operations are served from the Adaptive Replacement Cache (ARC), which resides in Random Access Memory (RAM). A Level 2 ARC (L2ARC) is normally placed on SSDs so that rotating disk drives are accessed only as a last resort. With sufficient RAM and flash memory, after a cache warm-up period, cache hit rates become very high for most application workloads, which improves performance for read I/O operations. In the hack, each server had five 512GB SSDs, for a total of 2.5TB of flash memory per server.
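As a concrete sketch, attaching fabric SSDs as an L2ARC takes a single command per pool; the pool and device names below are hypothetical:

```shell
# Create a pool on two rotating disks (hypothetical device names).
zpool create tank mirror /dev/sda /dev/sdb

# Add two fabric-attached SSDs as L2ARC; reads that miss the ARC are
# then served from flash before falling back to the rotating disks.
zpool add tank cache /dev/sdc /dev/sdd

# The SSDs appear under the "cache" section of the pool status.
zpool status tank
```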
Figure 2. I/O Adding Multiple PCIe SSDs to a Commodity Server
OpenZFS on Flash Memory Switched Fabrics
As with other journaled file systems, write I/O operations in ZFS are logged. Writes are acknowledged as soon as they are committed to the log, which is much faster than waiting for other file system operations to complete, especially if the log is on an SSD while rotating HDDs provide storage capacity. In ZFS, the log is referred to as the ZFS Intent Log (ZIL).
Figure 3. ZFS Model
A single flash memory switched fabric PCIe card can deliver SSDs for OS boot, journaling (ZIL), read caching (L2ARC) and additional capacity. For example, both boot and ZIL can be mirrored on a single card, or, with two or more cards, they can be mirrored (RAID1), striped (RAID0) or mirrored and striped (RAID10) across cards. Since the cards are modular and use standard parts, different sizes and types of SSDs can be combined as needed. SSDs can normally be upgraded or replaced in the field without removing the card from its slot.
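For example, a mirrored log device spanning two fabric cards can be attached with one command (device names hypothetical), so that a single card failure does not lose uncommitted writes:

```shell
# Mirror the ZIL across SSDs on two different fabric cards.
zpool add tank log mirror /dev/sde /dev/sdf
```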
Implementing n-way redundancy and horizontal scaling in ZFS requires abstracting the underlying block devices away from the hardware before adding them to ZFS. Internet Small Computer System Interface (iSCSI) was used to abstract the block devices used to construct ZFS virtual devices (“vdevs”), and Distributed Replicated Block Device (DRBD) was used to abstract the block device used for the ZFS file system’s log.
Figure 4. ZFS with Virtualized Block Devices
Using DRBD for the ZFS Log Device
ZFS is not a clustered file system and is not designed to scale horizontally. Redundancy is normally provided by configuring two storage servers connected to the same SAS-2 drive enclosures in a traditional two-node cluster. When one server fails, the other server takes over the storage devices and begins providing the same Network Attached Storage (NAS) shares or Storage Area Network (SAN) Logical Unit Numbers (LUNs).
To ensure that uncommitted writes are preserved during a server failure, the ZIL must be attached to the SAS-2 bus, which is much slower than native PCIe storage devices. Redundancy in ZFS can be provided within vdevs or using RAID, but there can only be two storage heads. If both servers fail, the storage resources are unavailable until at least one server is repaired.
Creating horizontal redundancy for three or more servers requires replicating the ZIL across machines. The solution was to place the ZIL on a DRBD device.
Figure 5. Distributed Replicated Block Device Consisting of Native PCIe SSDs
DRBD works on top of block devices, such as hard disk partitions or LVM volumes. It synchronously mirrors each block of data that is written to the device to one or more peers over a network using Transmission Control Protocol / Internet Protocol (TCP/IP). The file system on the active server is notified that the write has completed only when the block is successfully written to all disks in the cluster. Using ZFS, data can be accessed using a ZFS file system only on the active node.
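A minimal DRBD resource definition for a three-way replicated ZIL might look like the following sketch. It assumes DRBD 9 syntax, which supports more than two peers; the hostnames, addresses and backing devices are hypothetical:

```
resource zil {
  device    /dev/drbd0;
  disk      /dev/sdg1;      # local SSD partition backing the ZIL
  meta-disk internal;
  net { protocol C; }       # synchronous: ack only after all peers commit
  on node1 { address 10.0.0.1:7789; node-id 0; }
  on node2 { address 10.0.0.2:7789; node-id 1; }
  on node3 { address 10.0.0.3:7789; node-id 2; }
  connection-mesh { hosts node1 node2 node3; }
}
```

Protocol C is what makes the mirroring synchronous: a write completes only once every peer has committed the block.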
Figure 6. Synchronous Mirroring of ZFS ZIL using DRBD
For a non-primary DRBD node to use the DRBD ZIL in ZFS, all of the DRBD peers have to be configured as primary nodes. DRBD’s dual-primary mode allows concurrent access to the device, which normally requires a clustered file system, such as Global File System (GFS), with a distributed lock manager. To eliminate the need for a distributed lock manager, the DRBD device was not added to a ZFS pool (“zpool”) on the non-primary nodes.
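In practice this means bringing the resource up and promoting it on every node, while only the active node actually puts the device to use; the resource and pool names below are hypothetical:

```shell
# On every node: bring up the resource and promote it to primary
# (requires allow-two-primaries in the resource's net options).
drbdadm up zil
drbdadm primary zil

# Only on the active node: use the replicated device as the pool's log.
zpool add tank log /dev/drbd0
```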
Figure 7. Rob Markovic (pointing) and Austin McNiff (foreground)
Once DRBD was working, the rest of the data in the file system had to be mirrored across servers.
Using iSCSI to Create Mirrored ZFS Virtual Devices
Data was mirrored using iSCSI, which allows block device commands and data to be sent over a network using TCP/IP. Storage devices exported by one server can be attached by another server where they appear as local block devices. For the hack, all of the drives used for data were exported as iSCSI LUNs, including the local disks in the primary server, and then added to ZFS vdevs configured as mirrors where each mirror spans multiple nodes.
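A sketch of this setup using targetcli and Open-iSCSI follows; the device names, iSCSI Qualified Names (IQNs) and addresses are hypothetical:

```shell
# On each server: export a local SSD as an iSCSI LUN.
targetcli /backstores/block create name=ssd0 dev=/dev/sdc
targetcli /iscsi create iqn.2015-03.org.example:node2
targetcli /iscsi/iqn.2015-03.org.example:node2/tpg1/luns \
    create /backstores/block/ssd0

# On the primary: discover and log in to each peer's target...
iscsiadm -m discovery -t sendtargets -p 10.0.0.2
iscsiadm -m node -p 10.0.0.2 --login

# ...then build mirrors that pair a local disk with a remote LUN,
# so that every mirror spans at least two servers.
zpool create tank mirror /dev/sdc /dev/sdh   # /dev/sdh is the remote LUN
```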
Figure 8. ZFS zpool with Mirror iSCSI LUN vdev and DRBD Log Device
The configuration ensures that both the data and the write log are mirrored across all of the servers. The file system can be read by the primary server, which shares its resources over the network using Network File System (NFS).
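ZFS can share a file system over NFS directly through its built-in sharing property; the pool and file system names here are hypothetical:

```shell
# On the primary: create a file system and share it via NFS.
zfs create tank/data
zfs set sharenfs=on tank/data
```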
In a real system, the IP address of the NFS server could be virtualized, for example, as a floating IP address, so that after a failure of the primary server, the non-primary servers would continue operating when a new primary server takes over.
Each non-primary server mounts the NFS shares on the primary server. Reads by the non-primary servers take advantage of Remote Direct Memory Access (RDMA). Once the read cache on the primary server warms up, reads are served directly from ARC which is accessed by the network interface card on the primary server through RDMA. Write I/O operations also use RDMA via iSCSI Extensions for RDMA (iSER). In a real system, a 10Gbps Ethernet or InfiniBand network would be used to reduce the latency of I/O operations.
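With Open-iSCSI, switching a session to the iSER transport is a matter of binding the node to an iSER interface; the interface and target names below are hypothetical:

```shell
# Create an iSER interface and attach the target through it so that
# iSCSI reads and writes use RDMA instead of plain TCP/IP.
iscsiadm -m iface -o new -I iser0
iscsiadm -m iface -o update -I iser0 -n iface.transport_name -v iser
iscsiadm -m node -T iqn.2015-03.org.example:node2 -I iser0 --login
```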
Promoting a Non-primary Server
If the primary server is down, a non-primary server can log in to the iSCSI targets. The iSCSI block devices are only mounted when the zpool is imported. Since DRBD preserves the logical names of devices across nodes, “zpool import -f” (force import) can be used to make a non-primary server the new primary server (the OS instance currently controlling the file system). When the zpool is imported, uncommitted writes in the ZIL can be recovered.
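The manual failover sequence on the new primary thus reduces to a few commands (resource, pool and file system names hypothetical):

```shell
# Attach the iSCSI LUNs exported by the surviving servers.
iscsiadm -m node --loginall=all

# Take over the replicated ZIL (a no-op if already primary).
drbdadm primary zil

# Force-import the pool; uncommitted writes in the ZIL are replayed.
zpool import -f tank

# Resume serving the file system over NFS.
zfs set sharenfs=on tank/data
```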
In the Hackathon implementation, the L2ARC on the new primary server had to warm up because the read cache was not replicated. However, since the zpool metadata contains the information for both the ZIL and the L2ARC, it is also possible to replicate the L2ARC using DRBD and to provide a hot read cache failover capability.
In a real system, the steps involved in detecting the failure of the primary server, importing the zpool on the new primary server and sharing its file systems using NFS would be automated.
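A real deployment would use a cluster manager such as Pacemaker for this, but a minimal watchdog sketch conveys the idea; the address and names are hypothetical:

```shell
#!/bin/sh
# Naive failover watchdog: if the primary stops answering pings,
# run the promotion sequence on this node. A production system
# would also need fencing to avoid split-brain.
PRIMARY=10.0.0.1
while sleep 5; do
    if ! ping -c 3 -W 1 "$PRIMARY" >/dev/null 2>&1; then
        iscsiadm -m node --loginall=all
        drbdadm primary zil
        zpool import -f tank
        zfs set sharenfs=on tank/data
        break
    fi
done
```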
Under normal operating conditions, the Central Processing Units (CPUs) on the non-primary servers are largely idle. As a result, the hack is ideal for a cluster of application servers or for an object store. For the hack, an Apache web server was used as a proxy for an object store Application Programming Interface (API).
The key insight, provided by Dimitar Boyn, was that access to data in the primary server’s ARC via RDMA over InfiniBand can be faster than access to local SSDs in the same machine, even native PCIe SSDs. Reads in the underlying iSCSI horizontal mirrored vdevs use iSER, and the latency of InfiniBand and of Direct Memory Access (DMA) is much lower than the latency of flash memory. The hack is therefore a read-optimized solution, and appropriate use cases will have a majority of reads. The only thing faster would be reads from local ARC.
Since the servers used in the hack were basically all-flash arrays, there was never any concern about the latency of accessing rotating disk drives. While the primary server can simply serve data from ARC and L2ARC after a cache warm up period, the non-primary servers can run applications. At the same time, each non-primary server has a complete copy of the data in standby mode.
The Moment of Truth
The hack took all night and into the next day. Most of the team rotated in and out, working on various problems late into the night. With the help of Ron Herardian and Rob Markovic, Dimitar Boyn (Founder and Chief Technology Officer of I/O Switch) pulled an all-nighter to make sure that the failover process worked as expected.
Figure 9. Dimitar Boyn after Working Through the Night
The team worked right up until the hack was presented. All of the team members participated in the presentation, except Austin McNiff, who was unavailable due to his class schedule. As a demonstration, a web browser was shown on a laptop accessing media files on an Apache web server running on a non-primary server while the media files were accessed using NFS.
Figure 10. Before a Server Failure
Figure 11. After a First Server Failure
In the hack, there was a single primary server and writes were mirrored from the primary server to the non-primary servers. If the primary failed, any non-primary server could become the new primary. Topologies with more than one primary server are also possible. In a ring topology, each server is primary for one file system and non-primary for other file systems in the ring. Any non-primary server can take over any file system in the ring. Each server has an active zpool controlling a subset of SSDs in its mirrored vdevs and DRBD ZIL device and shares its file system using NFS.
Figure 12. Two Dimensional Torus
In a four-server ring, the patterns formed by the replication paths of data mirrored across vdevs and ZIL devices, combined with the patterns formed by alternate forward or backward failover sequences, are described by a two-dimensional torus.
The 2015 OCP Hackathon proved that open technologies and open collaboration greatly accelerate innovation. Prior to the 2015 OCP Hackathon, there was no non-proprietary way to scale OpenZFS beyond two servers. Horizontal scaling could only be accomplished by placing multiple copies of data on a series of separate storage systems, each of which had only single redundancy. Using open source software and open standards based hardware, n-way ZFS redundancy and horizontal scaling were implemented by a small team of hackers in just 24 hours.