The Toolkit: File Formats


The Toolkit brings together resources for creating, managing, and sharing digital collections to address concerns we often hear, like this one:

JPEG, WAVE, MOV, PDF – what do they all mean? Which should we use for things we’re digitizing?

Digital file formats quickly give way to alphabet soup: JPEG, WAVE, MOV, PDF… Here are several considerations and some best practices to help inform your decisions.

Primary files vs access files

First up, consider whether you’re creating primary or access files. The large, high-resolution files that result from scanning or reformatting are known as primary files, archival files, or preservation files. Because they often take up quite a bit of digital space, primary files are not typically used for general sharing; these are the files stored internally and preserved for the long term. Access files are the smaller, usually lower-quality versions that are made available for public use.

In general, you should save your original scans or conversions in the highest-quality format available, then make lower-quality copies for access purposes.

| | Primary file (also known as an archival file or preservation file) | Access file (also known as a derivative) |
| Use for | Long-term storage; selling reproductions; printing (e.g. publications, calendars, posters, exhibit panels) | Sharing on social media; emailing to researchers; posting on your website |
| File type | Images and text: TIFF | Images and text: JPEG or PDF |
| File size | Images and text: BIG! (one scanned postcard = approx. 20 MB) | Images and text: small (probably less than 1 MB) |
| Editing | Unedited or minimal editing (e.g. cropping or straightening) | May be edited (e.g. significant cropping, contrast adjustment, etc.) |
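
To see where a figure like "approx. 20 MB" for one scanned postcard can come from, here is a rough sketch of the arithmetic. The postcard dimensions (3.5 x 5.5 inches), the 600 dpi scanning resolution, and 24-bit color are assumptions chosen for illustration, not values stated in this guide, and the function name is hypothetical.

```python
def uncompressed_scan_bytes(width_in, height_in, dpi=600, bits_per_pixel=24):
    """Estimate the raw pixel data size of a scan, ignoring file headers."""
    width_px = round(width_in * dpi)    # pixels across
    height_px = round(height_in * dpi)  # pixels down
    return width_px * height_px * bits_per_pixel // 8


# Assumed example: a 3.5 x 5.5 inch postcard at 600 dpi in 24-bit color.
postcard = uncompressed_scan_bytes(3.5, 5.5)
print(f"{postcard / 1_000_000:.1f} MB")  # prints "20.8 MB"
```

Doubling the resolution quadruples the pixel count, so the same postcard at 1200 dpi would need about four times the space.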

Lossless vs lossy formats

Files can be compressed to save digital space, using either lossless or lossy compression. Lossless compression keeps all the data from the original file while making the file smaller. Lossy compression discards some of the data to make the file smaller still.

If a large image file, like a TIFF, is converted to a JPEG, it will lose data. The JPEG will be a lower-quality image, because JPEG is a lossy file format.

But if a TIFF is converted to a PNG, it will just be smaller. It won’t be as small as if it were converted to a JPEG, but it will usually be smaller than the TIFF. Because PNG compression is lossless, the image quality is the same in the PNG and the TIFF.

You can take away data, but you can’t put it back! If a TIFF is compressed using a lossy process, such as conversion to a JPEG, the discarded data is gone: saving the JPEG as a TIFF again will not restore the original quality.
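
The difference between the two kinds of compression can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `zlib` stands in for a lossless format, and a simple "keep the top 4 bits" step (the hypothetical `quantize` function) stands in for lossy compression. Real formats like PNG and JPEG are far more sophisticated, but the principle is the same.

```python
import zlib

# Stand-in for image data: 64 x 64 grayscale samples in a smooth gradient.
original = bytes((x * 3 + y) % 256 for y in range(64) for x in range(64))

# Lossless: compress, then decompress, and every byte comes back.
packed = zlib.compress(original, level=9)
restored = zlib.decompress(packed)


def quantize(data, keep_bits=4):
    """Lossy stand-in: zero out the low bits of every sample."""
    mask = (0xFF << (8 - keep_bits)) & 0xFF
    return bytes(b & mask for b in data)


lossy = quantize(original)

print(restored == original)  # True: nothing was lost
print(lossy == original)     # False: detail was discarded
print(len(packed) < len(original))  # True: the lossless copy is still smaller

# Two different original values collapse to the same lossy value,
# so there is no way to tell afterwards which one you started with.
print(quantize(bytes([16])) == quantize(bytes([31])))  # True
```

The last line is the "can't put it back" rule in miniature: once two different values have been merged, no later conversion can separate them again.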

Audio-visual file formats

The same principles hold true in audio-visual formats. WAVE files are high-quality, uncompressed audio files; FLAC files are losslessly compressed audio files; WMA files are lossy compressed audio files.

In videos transferred from analog sources, MOV or AVI files are the highest quality. MP4 files are compressed video files with minimal loss; H.264 files are lossy compressed video files.
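
Because WAVE audio is uncompressed, its size follows directly from the recording settings. The sketch below uses Python's standard `wave` module to write one second of a test tone and compares the result with the expected size. The settings (44,100 samples per second, 16-bit, mono) are assumed for illustration and are not recommendations from this guide.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 44_100  # samples per second (assumed)
SAMPLE_WIDTH = 2      # bytes per sample, i.e. 16-bit (assumed)
CHANNELS = 1          # mono (assumed)
SECONDS = 1

# One second of a 440 Hz tone as 16-bit signed samples.
frames = b"".join(
    struct.pack("<h", int(32767 * 0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)))
    for n in range(SAMPLE_RATE * SECONDS)
)

# Write the samples into an in-memory WAVE file.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav:
    wav.setnchannels(CHANNELS)
    wav.setsampwidth(SAMPLE_WIDTH)
    wav.setframerate(SAMPLE_RATE)
    wav.writeframes(frames)

wav_bytes = buffer.getvalue()
expected_audio_bytes = SAMPLE_RATE * SAMPLE_WIDTH * CHANNELS * SECONDS

print(expected_audio_bytes)  # prints 88200
print(len(frames) == expected_audio_bytes)  # True: no compression applied
```

By the same arithmetic, one minute of 16-bit stereo audio at 44,100 samples per second comes to 44,100 x 2 x 2 x 60 = 10,584,000 bytes, or roughly 10.6 MB, which is why uncompressed audio works well for primary files but is bulky for access copies.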

So which file formats should we use?

It depends!

Archival best practices suggest that successful long-term preservation is more likely when file formats have the following characteristics: 

  • complete and open documentation 
  • platform-independence 
  • non-proprietary (vendor-independent) 
  • no “lossy” or proprietary compression 
  • no embedded files, programs or scripts 
  • no full or partial encryption 
  • no password protection 
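
If you are reviewing many formats, the checklist above can be turned into a simple yes/no profile. This is a minimal sketch: the `FormatProfile` class, the `unmet_characteristics` function, and the example profile are all hypothetical, and the true/false answers for any real format would need to come from your own research.

```python
from dataclasses import dataclass, fields


@dataclass
class FormatProfile:
    """One yes/no answer per preservation characteristic in the list above."""
    name: str
    open_documentation: bool
    platform_independent: bool
    non_proprietary: bool
    no_lossy_or_proprietary_compression: bool
    no_embedded_content: bool
    no_encryption: bool
    no_password_protection: bool


def unmet_characteristics(profile):
    """Return the names of the characteristics the format does not meet."""
    return [
        f.name
        for f in fields(profile)
        if f.name != "name" and not getattr(profile, f.name)
    ]


# Hypothetical example, not an assessment of any real format.
example = FormatProfile(
    name="Example lossy format",
    open_documentation=True,
    platform_independent=True,
    non_proprietary=True,
    no_lossy_or_proprietary_compression=False,
    no_embedded_content=True,
    no_encryption=True,
    no_password_protection=False,
)

print(unmet_characteristics(example))
# prints ['no_lossy_or_proprietary_compression', 'no_password_protection']
```

An empty list means the format meets every characteristic on the checklist; anything else tells you which concerns to weigh before choosing it for primary files.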

The following table identifies various formats with a high, medium, or low probability of successful long-term preservation.

High-confidence formats are non-proprietary, open source, and either uncompressed or losslessly compressed. These formats can usually be viewed using multiple programs or file viewers and are the most likely to remain readable into the future.

Medium-confidence formats are usually proprietary, undocumented, lossy-compressed, or some combination of these. However, they are currently in widespread use (i.e. they are the most recent version of a particular format). These files can most likely be preserved for the long term, but may need to be reformatted to a preservation format as time goes on.

Low-confidence formats are usually proprietary, undocumented, or lossy-compressed. They are not currently supported by most mainstream programs or file viewers and may not be viewable at all without specialized software or hardware environments. These files may be preservable over the long term, but they will need to be reformatted before they can be made accessible.

Document Type | High Confidence | Medium Confidence | Low Confidence

Image (raster images, photographs, scanned documents)
TIFF (uncompressed) [.tif]
PDF/A or PDF/X (Graphic exchange format) [.pdf]
PNG [.png]
JPEG2000 (lossless) [.jp2]
TIFF (compressed) [.tif]
GIF [.gif]
BMP [.bmp]
RAW digital camera images [.raw]
Photoshop images [.psd]

Audio
WAVE [.wav]
Audio Interchange File Format [.aif, .aiff]
MP3 [.mp3]
Advanced Audio Coding [.aac, .mp4, .m4a]
MIDI [.mid, .midi]
Free Lossless Audio Codec [.flac]
Windows Media Audio [.wma]
RealAudio [.ra, .rm, .ram]
Protected AAC [.m4p]
All other audio formats not listed here

Video
AVI (uncompressed, motion JPEG) [.avi]
QuickTime Movie (uncompressed, motion JPEG) [.mov]
MPEG-4 AVC [.mp4]
Windows Media Video [.wmv]
MPEG-2 (wrapped in AVI or MOV) [.avi, .mov]
MPEG-4 (wrapped in AVI or MOV) [.avi, .mov]
Protected MPEG-4 [.m4p]
RealVideo [.rv]
All other video formats not listed here

Adapted from Preferred File Formats, University of Washington University Libraries.
