Caching for Datacenters and Data Lakes
D3N (Datacenter-scale Data Delivery Network)
The Datacenter-scale Data Delivery Network (D3N) project aims to improve application performance and reduce demand on storage systems and datacenter networks. Inspired by Content Delivery Networks (CDNs), D3N's architecture caches data on the access side of storage and network bottlenecks to accelerate throughput-bound storage workloads.
Data is increasingly important to the success of organizations today, so the way data is stored and accessed is as important as the data itself. Data lakes, low-cost object-storage repositories, are often part of an organization's private datacenter. In large distributed organizations, these data lakes are constantly accessed by many compute clusters operated by different parts of the organization. Even with a well-designed datacenter network, cluster-to-data lake bandwidth is typically much lower than the bandwidth to storage within the compute clusters. Because of this disconnect, many users manually copy a repeatedly accessed dataset to their local storage, adding complexity and performance overhead to manage data placement and replication.
The D3N project leverages insights drawn from CDNs by caching data on the access side of bandwidth bottlenecks (e.g., rack-to-rack and cluster-to-data lake), with CDN techniques used to direct I/O requests to the correct cache. D3N is designed to accelerate big data analytics workloads with strong locality and limited network connectivity between compute clusters and data storage.
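The CDN technique of directing each I/O request to the correct cache can be illustrated with a small consistent-hashing sketch. The node names, virtual-node count, and hash choice below are assumptions for illustration, not D3N's actual routing mechanism:

```python
import hashlib
from bisect import bisect

# Hypothetical cache-server names; in D3N these would be the rack-side caches.
CACHE_NODES = ["cache-rack1", "cache-rack2", "cache-rack3", "cache-rack4"]

def build_ring(nodes, vnodes=64):
    """Place each node at several points on a hash ring (consistent hashing),
    so adding or removing a node remaps only a small fraction of objects."""
    ring = []
    for node in nodes:
        for i in range(vnodes):
            h = int(hashlib.md5(f"{node}:{i}".encode()).hexdigest(), 16)
            ring.append((h, node))
    ring.sort()
    return ring

def node_for(ring, object_name):
    """Route a request to the cache owning the next ring point clockwise
    from the object's hash; every client resolves to the same cache."""
    h = int(hashlib.md5(object_name.encode()).hexdigest(), 16)
    keys = [k for k, _ in ring]
    return ring[bisect(keys, h) % len(ring)][1]

ring = build_ring(CACHE_NODES)
owner = node_for(ring, "datasets/train/part-0001.parquet")
```

Because the mapping is deterministic, every compute node that requests the same object is steered to the same cache, which is what makes pooling cache contents across servers effective.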
Collaboration with Red Hat
There are many benefits to developing in a collaborative environment and working with the open source community. The students working on the D3N project have been able to leverage the expertise and resources that Red Hat offers, allowing their project to advance more rapidly.
The D3N project has been implemented as a modification to Red Hat Ceph Storage's RADOS Gateway (RGW) in the Massachusetts Open Cloud (MOC) datacenter. RGW is the Ceph component that provides an S3/Swift-compatible object interface. The MOC prototype implements a two-layer version of D3N: layer one is a cache local to each rack server and acts independently, while layer two is a logically distributed cache formed by pooling the contents of all cache servers in the cluster. This maximizes the amount of data held on the cluster side of any cluster-to-storage bottleneck. Evaluation of this implementation shows significant performance improvements, serving data substantially faster than per-compute-node hard drives.
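The two-layer lookup described above can be sketched as follows. This is a minimal illustration under stated assumptions: the class, its dictionaries standing in for cache servers, and the CRC32-based peer selection are all illustrative, not D3N's actual interface or implementation inside RGW:

```python
import zlib

class TwoLayerCache:
    """Sketch of a D3N-style two-layer cache lookup (names are assumptions)."""

    def __init__(self, peer_caches, backing_store):
        self.l1 = {}                    # layer 1: local to this rack server
        self.peer_caches = peer_caches  # layer 2: pooled caches of all servers
        self.backing = backing_store    # the data lake behind the bottleneck

    def get(self, key):
        # Layer 1 hit: served locally, no network hop.
        if key in self.l1:
            return self.l1[key]
        # Layer 2: pick the owning peer deterministically, so every server
        # looks for a given object in the same place in the pooled cache.
        owner = self.peer_caches[zlib.crc32(key.encode()) % len(self.peer_caches)]
        if key in owner:
            value = owner[key]
        else:
            # Miss in both layers: fetch across the cluster-to-data-lake
            # link and populate the owning layer-2 cache for future readers.
            value = self.backing[key]
            owner[key] = value
        self.l1[key] = value            # keep a local copy as well
        return value
```

A first read of an object crosses the cluster-to-data-lake bottleneck once; subsequent reads from anywhere in the cluster are served from the pooled layer-2 cache, and repeat reads from the same rack are served from layer 1.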