Summary

We recommend requesting at least 25% more storage space than the size of the files that you’d like to archive.

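The space your files need can be greatly reduced by bundling them into an archive format such as .zip, .tar, or .cpio and compressing the result (.zip compresses as it bundles; .tar and .cpio archives are usually compressed with a separate tool such as gzip). Compression is a particularly good idea if you don’t access the contents of those files very often.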

If you’d like more information on determining file size or compressing files, the IT Help Center or your local IT support organization can assist you. Basic instructions for using .zip can also be found online.

Technical Details & Background Information for Administrators

Content transferred to the archive service is allocated on disk according to its physical storage requirement, which includes data protection blocks. This will be at least 25% greater than the logical storage size reported by a traditional filesystem, which does not report parity/RAID overhead (most desktop operating systems report storage usage this way). If your files average more than 4 MB in size, expect your content to have an overhead requirement of 25-30%. (See the background information below for additional details.)
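As a rough illustration of the sizing rule above, the following Python sketch sums the logical size of a directory tree and adds the minimum 25% overhead. The function names and the integer-percent parameter are our own illustrative choices, not part of the archive service. The result is a lower bound only, since small files can require considerably more.

```python
import os

def logical_size_bytes(root):
    """Sum the logical (apparent) sizes of all regular files under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symbolic links so linked files are not counted twice.
            if os.path.isfile(path) and not os.path.islink(path):
                total += os.path.getsize(path)
    return total

def minimum_request_bytes(logical_bytes, overhead_percent=25):
    """Logical size plus protection overhead, rounded up to a whole byte.

    25 is the minimum overhead stated above; pass 30 for a more
    conservative estimate when files average more than 4 MB.
    """
    scaled = logical_bytes * (100 + overhead_percent)
    return -(-scaled // 100)  # integer ceiling division
```

For example, `minimum_request_bytes(logical_size_bytes("/path/to/data"))` gives the smallest request worth considering for that directory, where the path is a placeholder for your own data.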

If you have multiple terabytes, note that quota allocations for archive system “terabytes” are calculated using powers of 2, not the base-10 calculation typically used by disk vendors. The power-of-2 unit is more formally called a tebibyte and is about 10% larger than a “terabyte”:

  • 1000 B/kB * 1000 kB/MB * 1000 MB/GB * 1000 GB/TB = 10^12 B
  • 1024 B/KiB * 1024 KiB/MiB * 1024 MiB/GiB * 1024 GiB/TiB = 2^40 B
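The arithmetic behind the two units can be checked with a few lines of Python. This is only a sketch, and the helper names `tb_to_tib` and `tib_to_tb` are hypothetical.

```python
TB = 10**12   # decimal terabyte, as typically used by disk vendors
TIB = 2**40   # binary tebibyte, as used for archive quota allocations

def tb_to_tib(terabytes):
    """Convert decimal terabytes to tebibytes."""
    return terabytes * TB / TIB

def tib_to_tb(tebibytes):
    """Convert tebibytes to decimal terabytes."""
    return tebibytes * TIB / TB

print(TIB)                      # 1099511627776
print(round(tib_to_tb(1), 4))   # 1.0995 (a tebibyte is about 10% larger)
print(round(tb_to_tib(10), 3))  # 9.095 (10 vendor TB is about 9.1 TiB)
```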

Data copied into the archive in native formats has shown storage overhead ranging from 26% (a combination of full and incremental backup volumes), to 35% (native-format A/V classroom recordings), to 300% (a sample with millions of very small, 12 KB files not bundled into .tar or .zip archives). If you have large quantities of data structured as relatively small files, plan to store your data as backup images or in archive file formats like .zip, .tar, or .cpio to achieve file sizes of 4 MB or more (the most efficient to store).
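One way to bundle a directory of small files before transfer is Python's standard `tarfile` module. The sketch below is a minimal example, assuming a gzip-compressed .tar is acceptable for your data; the function name `bundle_directory` is hypothetical.

```python
import tarfile
from pathlib import Path

def bundle_directory(source_dir, archive_path):
    """Pack source_dir (recursively) into one gzip-compressed tar archive.

    Returns the size of the resulting archive in bytes so the caller can
    check whether it reaches the 4 MB+ range described above.
    """
    source = Path(source_dir)
    with tarfile.open(str(archive_path), "w:gz") as tar:
        # arcname keeps paths inside the archive relative to the folder name.
        tar.add(str(source), arcname=source.name)
    return Path(archive_path).stat().st_size
```

If the returned size is still well under 4 MB, consider bundling several directories into one archive.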

The Isilon storage filesystem (OneFS) organizes disk space using traditional 8 KB disk blocks and exposes the ability to select the degree of “RAID” protection on a folder-by-folder basis. An Isilon fileserver deployment is assembled as a cluster of storage nodes with a high-speed backplane for internal communication. The top two levels of the filesystem are replicated across all storage nodes. Boston University’s archive deployment uses the “N:2+1” protection level, which Isilon recommends for archive and near-line storage deployments. The term “N:2+1” means that data will remain available without interruption or data loss if any two disks, or a disk and a node (CPU + 36 disks), fail at the same time. A benefit of this implementation is that recovery from disk failure is much faster and less I/O intensive than with traditional hardware RAID implementations; this reduces the probability that a second disk will fail during the recovery operation.

The OneFS filesystem associates the data blocks required to meet the protection level for each file and directory as part of the stored object. This mechanism is relatively efficient for files that are 4 MB or larger, which, on a 5-node cluster, require 25% physical storage “overhead” to meet the protection requirement. Smaller files require more physical data blocks as a percentage of file size (in 8 KB blocks), and files smaller than 128 KB are stored as a 3-way mirror to guarantee N:2+1 protection.
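The two cases described above can be turned into a simplified estimate. The Python sketch below is our own illustration, not the actual OneFS allocation algorithm. It assumes file sizes round up to whole 8 KB blocks, that files under 128 KB are stored three times, and that files of 4 MB or more carry exactly 25% overhead. Sizes in between are not quantified here, so the sketch returns `None` for them. Metadata and directory overhead are ignored.

```python
BLOCK = 8 * 1024               # 8 KB disk block
MIRROR_LIMIT = 128 * 1024      # below this, files are 3-way mirrored
LARGE_FILE = 4 * 1024 * 1024   # at or above this, about 25% overhead

def estimate_physical_bytes(size):
    """Estimate physical bytes for one file, or None if not modeled."""
    blocks = -(-size // BLOCK)       # round up to whole blocks
    allocated = blocks * BLOCK
    if size < MIRROR_LIMIT:
        return allocated * 3         # assumption: three full copies
    if size >= LARGE_FILE:
        return allocated * 5 // 4    # assumption: exactly 25% overhead
    return None                      # 128 KB to 4 MB: not specified

def overhead_percent(size):
    """Overhead relative to logical size, or None if not modeled."""
    physical = estimate_physical_bytes(size)
    if physical is None or size == 0:
        return None
    return (physical - size) / size * 100

print(overhead_percent(12 * 1024))    # 300.0
print(overhead_percent(LARGE_FILE))   # 25.0
```

Under these assumptions a 12 KB file occupies two blocks in each of three copies, which is consistent with the roughly 300% overhead observed for very small files.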