Organizing Your Data
Why think in-depth about organizing your data?
Your research is important to your field and well-organized data will allow your colleagues to accurately assess, replicate, and evaluate your research results. Below are some framing questions to think about.
Types of Data
There are many types of data, each with its own properties. Which type(s) of data will you be dealing with?
- Observational: captured in real-time
  - Usually irreplaceable
  - Examples: sensor readings, telemetry, survey results, images
- Experimental: data from lab equipment
  - Often reproducible, but can be expensive
  - Examples: gene sequences, chromatograms, magnetic field readings
- Simulation: data generated from test models
  - Models and metadata, where the input is more important than the output data
  - Examples: climate models, economic models
- Derived or compiled
  - Reproducible (but very expensive)
  - Examples: text and data mining, compiled databases, 3D models
- discipline specific (FITS in astronomy, CIF in chemistry)
- instrument specific
Some file formats are better for preservation than others. For example, some formats are proprietary or classified, which means their structure is not publicly documented and cannot be independently reimplemented. If you save your data in such a format and the company that owns it ceases to exist, the data may be impossible to read later. As a best practice, preservation formats should:
- be accessible in the future;
- be non-proprietary;
- conform to an open, documented standard;
- be common, used by the relevant research community or communities;
- incorporate an open-standard character representation (ASCII, Unicode);
- (optimally) not be dependent on one specific software package to display/manipulate them.
Additionally, the best formats for preservation are unencrypted and uncompressed:
- For text: as a file format, use PDF rather than proprietary formats like Word or Excel (note that PDF has several subsets); as a character encoding, use ASCII or Unicode (plain unformatted text);
- For audio: MPEG-4, not Quicktime;
- For images: TIFF or JPEG2000, not GIF or JPG;
- For structured data: XML or RDF, not RDBMS.
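As an illustration of the structured-data guideline, here is a minimal sketch that writes a small set of records to an XML file encoded as UTF-8, using only the Python standard library. The function name `records_to_xml`, the tag names `dataset` and `record`, and the sample field names are assumptions made for this example; they are not part of any standard. The sketch also assumes every field name is a valid XML element name.

```python
import xml.etree.ElementTree as ET


def records_to_xml(records, path, root_tag="dataset", row_tag="record"):
    """Write a list of dicts to an XML file, one element per record.

    Hypothetical helper: tag names are illustrative, and each dict key
    must already be a valid XML element name.
    """
    root = ET.Element(root_tag)
    for record in records:
        row = ET.SubElement(root, row_tag)
        for field, value in record.items():
            # Store every value as text so the file stays plain and readable.
            ET.SubElement(row, field).text = str(value)
    # An explicit encoding and XML declaration record how the text is stored.
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)
```

Calling `records_to_xml([{"site": "Boston", "ph": 7.1}], "readings.xml")` would produce a file that any XML-aware tool can open without the software that created it.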
Some disciplines have already established conventions for file naming and directory structure. Has yours? If so, it is best to conform to that. Here are some examples (if you know of others, please point us to them and we’ll update this list):
If your discipline does not have file naming conventions, do what makes the most sense. Ask yourself, “What structure/naming conventions would a person entirely new to this data set find most comfortable?”
Here are some best-practice guidelines to follow:
- Use file naming consistently.
- Consider how people looking at your files in a decade might want them to be organized, and choose your naming scheme accordingly. One example could be: project_instrument_location_date_time_version.ext
- Avoid these symbols in file names: " / \ : * ? < > [ ] & $. These characters have special meanings in some operating systems, so using them can cause files to be misread or deleted.
- Use underscores (_), not spaces, to separate terms.
- Try to keep folder names short (15-20 characters or fewer) and descriptive of what’s inside.
- Try to keep file names under 25 characters.
- Include dates in file or folder names; use format YYYY-MM-DD.
- If you’re not using automatic versioning, include a version number at the end of the file name such as v01. Change this version number each time the file is saved. Don’t use confusing labels: revision, final, final2, etc.
- For the final version, substitute the word FINAL for the version number. (This is especially important if files are being shared.)
- File names should contain only one period, before the file extension (e.g. name_paper.doc NOT name.paper.doc OR name_paper..doc)
- If you have many files already named, use a file renaming application such as Bulk Rename Utility (free), ReNamer (Mac/Windows/Unix free trial), or PSRenamer (free).
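Several of the guidelines above can be checked mechanically. The following is a minimal sketch, not a complete validator: `build_name` and `check_name` are hypothetical helpers, the field order is a shortened form of the example pattern (the time field is left out), and the 25-character limit is passed as a parameter so you can adjust it.

```python
from datetime import date

# Symbols the guidelines say to avoid, plus the space character.
FORBIDDEN = set('"/\\:*?<>[]&$ ')


def build_name(project, instrument, location, day, version, ext):
    """Assemble project_instrument_location_date_version.ext (illustrative)."""
    stem = "_".join(
        [project, instrument, location, day.isoformat(), f"v{version:02d}"]
    )
    return f"{stem}.{ext}"


def check_name(name, max_length=25):
    """Return a list of guideline violations; an empty list means none found."""
    problems = []
    bad = sorted(c for c in set(name) if c in FORBIDDEN)
    if bad:
        problems.append(f"forbidden characters: {bad}")
    if name.count(".") != 1:
        problems.append("should contain exactly one period")
    if len(name) > max_length:
        problems.append(f"longer than {max_length} characters")
    return problems
```

For example, `check_name("name.paper.doc")` reports the extra period, while `check_name("name_paper.doc")` returns an empty list. A name built from five fields will often exceed 25 characters, so in practice you may need to trade off descriptive fields against length.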
The information provided above will help you organize your data sets. However, if you want to share or cite your data, consider a more sophisticated naming scheme. You’ll want to put your data sets where other people can access them, and give them identifiers that can be referenced easily.
Data identifiers must be globally unique and persistent. That is to say, they must not be repeated elsewhere, and they must not change over time.
There are many different schemes:
- PURL — A PURL is a Persistent Uniform Resource Locator. Functionally, a PURL is a URL. However, instead of pointing directly to the location of an Internet resource, a PURL points to an intermediate resolution service. The PURL resolution service associates the PURL with the actual URL and returns that URL to the client.
- DOI — A DOI (Digital Object Identifier) is a name (not a location) for an entity on digital networks. It provides a system for persistent and actionable identification and interoperable exchange of managed information on digital networks.
- ACCESSION — Accession numbers used by the National Center for Biotechnology Information (NCBI) are unique and citable.
- InChI — The IUPAC International Chemical Identifier (InChI™) is a non-proprietary identifier for chemical substances that can be used in printed and electronic data sources, thus enabling easier linking of diverse data compilations.
- URI — Uniform Resource Identifier (URI) consists of a string of characters used to identify or name a resource on the Internet. Such identification enables interaction with representations of the resource over a network, typically the World Wide Web, using specific protocols.
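To make the distinction between a name and a location concrete, here is a minimal sketch that turns a DOI name into a URL served by the public resolver at https://doi.org/. The helper name `doi_to_url` is hypothetical, the validity check is deliberately loose, and the DOI used in the usage note is only an illustrative value.

```python
from urllib.parse import quote

# Public DOI resolver; it redirects to wherever the object currently lives.
DOI_RESOLVER = "https://doi.org/"


def doi_to_url(doi):
    """Turn a DOI name into a resolver URL (hypothetical helper)."""
    doi = doi.strip()
    # Accept the common "doi:" prefix form as well.
    if doi.lower().startswith("doi:"):
        doi = doi[4:].strip()
    # Loose sanity check only: DOI names start with "10." and contain a slash.
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"not a DOI name: {doi!r}")
    # Percent-encode characters that are unsafe in a URL path, keeping "/".
    return DOI_RESOLVER + quote(doi, safe="/")
```

For instance, `doi_to_url("doi:10.1000/182")` yields `https://doi.org/10.1000/182`. The DOI name itself stays the same even if the resource moves; only the resolver's record of the current location changes.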
Back Up Your Data
It is generally advisable to make three copies of your data. For example, you might have the original set, a copy on a local external hard drive, and another copy on an external drive located elsewhere. Having these copies be geographically distributed is crucial for data recovery after natural disasters; the particulars of distribution depend on the recovery time you’ll need to ensure in case your original data set becomes unusable.
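The three-copy arrangement can be sketched in a few lines of Python. This is a minimal illustration, assuming Python 3.8 or later and that both backup locations are reachable as ordinary directories (for example, a mounted external drive and a mounted remote share); `make_copies` is a hypothetical helper, not a replacement for a real backup tool.

```python
import shutil
from pathlib import Path


def make_copies(source, destinations):
    """Copy the directory tree `source` into each destination directory."""
    source = Path(source)
    copies = []
    for dest_root in destinations:
        target = Path(dest_root) / source.name
        # dirs_exist_ok lets the copy be re-run to refresh an existing backup.
        shutil.copytree(source, target, dirs_exist_ok=True)
        copies.append(target)
    return copies
```

Note that re-running this refreshes changed files but does not remove files from the backup that were deleted from the original, so it is closer to an accumulating copy than a mirror.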
Here are some data backup options:
- hard drive (examples: via Vista backup, Mac Time Machine, UNIX rsync);
- tape backup system;
- MIT’s TSM service (Basic is free, up to 15 GB; Enterprise has a fee, provides up to 10 TB, and includes an off-site copy);
- cloud storage – some examples of private sector storage resources include:
- Amazon S3 – requires client software, no encryption support;
- S3-based Remote Hard Drive Services such as Elephant Drive and Jungle Disk;
- Mozy (from EMC) Free client software, 448-bit Blowfish encryption or AES key;
- Carbonite Free client software, 1024-bit Blowfish encryption;
- CrashPlan service supported by IS&T
Secure Your Data
Your data will be most easily read by you, and by others in the future, if it is unencrypted. However, if you do need to encrypt your data because of its sensitivity:
- Keep passwords and keys on paper (2 copies), and in a PGP (Pretty Good Privacy) encrypted digital file;
- Don’t rely on 3rd party encryption alone.
It’s also ideal to store data in an uncompressed format. If you do need to conserve space, we recommend limiting compression to your third backup copy.
Test your backup system
In order to be sure your backup system is working properly, try to retrieve your data files and make sure you can read them. You should do this on initial setup of the system, and thereafter on a regular schedule.
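One way to automate part of this test is to compare checksums of the original files against the restored or backed-up copies. The sketch below assumes both are available as local directories with the same layout; `sha256_of` and `verify_backup` are hypothetical helper names.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(original_dir, backup_dir):
    """Return relative paths missing from the backup or differing from the original."""
    original_dir, backup_dir = Path(original_dir), Path(backup_dir)
    failures = []
    for src in sorted(original_dir.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(original_dir)
        copy = backup_dir / rel
        # A file fails if it is absent or its contents do not match.
        if not copy.is_file() or sha256_of(src) != sha256_of(copy):
            failures.append(str(rel))
    return failures
```

An empty result means every original file has a byte-identical copy. It does not confirm that the files can still be opened by the software you rely on, so it is worth opening a sample of restored files by hand as well.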