Caching for Datacenters and Data Lakes
D3N: A Multi-Layer Cache for Data Centers
Overview & Motivation
In today’s world, data is king. The success or failure of organizations can depend on the data they collect and the insights they can glean from it via Big Data analytics. As such, data lakes, low-cost object-storage repositories that can store vast volumes of data, are becoming critical parts of organizations’ private datacenters. In large distributed organizations, centralized data lakes are often accessed by many compute clusters operated by different parts of the organization (e.g., business units within an enterprise). Even with a well-designed datacenter network, cluster-to-data lake bandwidth is typically much lower than the bandwidth to storage within the compute clusters. Consequently, many users manually copy repeatedly accessed datasets to local storage (e.g., HDFS). This incurs complexity and performance overhead: users must manage data placement and replication, maintain consistency between replicas, and copy the data to local storage.
D3N (datacenter-scale dataset delivery network) leverages insights drawn from CDNs by caching data on the access side of bandwidth bottlenecks (e.g., rack-to-rack and cluster-to-data lake), with CDN techniques used to direct I/O requests to the correct cache. D3N is designed to accelerate big data analytic workloads with strong locality and limited network connectivity between compute clusters and data storage.
Collaboration with Red Hat
There are many benefits to developing in a collaborative environment and working with the open source community. The students working on the D3N project have been able to leverage the expertise and resources that Red Hat offers, allowing their project to advance more rapidly.
The D3N project has been implemented as a modification to Red Hat Ceph Storage’s RADOS Gateway (RGW) in the Massachusetts Open Cloud (MOC) datacenter. RGW is a Ceph component that supports an S3/Swift-compatible object interface. The prototype on the MOC implements a two-layer version of D3N. Level one acts independently and is local to the rack server, while level two is a logically distributed cache formed by pooling the contents of all cache servers in the cluster. This maximizes the data held on the cluster side of any cluster-to-storage pool bottleneck. Evaluation of this implementation shows significant performance improvements, with the cache serving data substantially faster than per-compute-node hard drives.
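The two-layer lookup described above can be illustrated with a small sketch. It is not the RGW implementation: the names `HashRing` and `TwoLevelCache`, the `vnodes` parameter, and the use of plain dictionaries without eviction are all hypothetical, and consistent hashing is assumed here as one common CDN-style way to direct a request to the cache server responsible for an object. The text above does not specify the technique the prototype uses.

```python
import bisect
import hashlib


class HashRing:
    """Consistent-hash ring mapping object keys to cache servers (assumed technique)."""

    def __init__(self, servers, vnodes=64):
        # Place several virtual points per server on the ring to even out load.
        self._ring = sorted(
            (self._hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(text):
        digest = hashlib.sha256(text.encode()).digest()
        return int.from_bytes(digest[:8], "big")

    def owner(self, key):
        # First ring point clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect(self._points, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]


class TwoLevelCache:
    """Toy model: per-server level one, pooled level two, then the data lake."""

    def __init__(self, servers, backend):
        self.ring = HashRing(servers)
        self.l1 = {server: {} for server in servers}  # independent, rack-local
        self.l2 = {server: {} for server in servers}  # each server's share of the pool
        self.backend = backend  # stands in for the data lake
        self.backend_reads = 0

    def get(self, local_server, key):
        """Return (data, source), where source is 'l1', 'l2', or 'backend'."""
        l1 = self.l1[local_server]
        if key in l1:
            return l1[key], "l1"
        # Level-one miss: ask the server that owns this key in the pooled cache.
        l2 = self.l2[self.ring.owner(key)]
        if key in l2:
            source = "l2"
        else:
            # Level-two miss: read from the data lake once and keep it cluster-side.
            l2[key] = self.backend[key]
            self.backend_reads += 1
            source = "backend"
        l1[key] = l2[key]
        return l1[key], source
```

In this model, a first read of an object from one rack goes to the backend, a repeat read from the same rack is a level-one hit, and a read of the same object from another rack is served by level two without touching the data lake again. A realistic cache would also bound capacity and evict entries, which this sketch omits.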
Additional Project Information
Please also visit the Red Hat Research D3N page.