Taking Care of Digital Collections

Last week we covered how we take care of physical collections. This week we’re moving onto digital items and collections.

There are three different scenarios for dealing with digital collections: donors donate analog media that we digitize; donors send us digital files via the internet; donors send us digital files on older media. Let’s explore how we handle each of these scenarios.

Scenario 1: Donors donate analog media that we decide to digitize.

img_20161012_113720325_hdr1

As noted in last week’s post, we take note of the different types of media in a collection. If we find analog items (VHS, reel-to-reel audio, reel-to-reel film, BETA, etc,), then we decide if they are candidates for digitization. If they are, depending on the media type, we either digitize them in-house or send the items to a company like The Media Preserve or Scene Savers.

After the media has been digitized, we follow some standard in-house digital preservation steps.*

We keep high quality master files (.wav, .avi) and lower quality access copies (.mp3, .mp4) that patrons might use.
If possible, we embed metadata in the files. Basically, when you write information on the back of a photograph, you are adding metadata to it. We do something similar, but the metadata is inside the digital file.
We run a variety of reports on these files to extract information from them, like a video’s running time, type of file, creation date, etc.
We bundle up that information in a grouping of folders/files called an AIP (archival information packet). We create two aips: one for the masters and another for the access copies.
We save them on a secure server.

*Note: This is a bird’s eye view of what we do. These steps are much more involved than what you read here.

Scenario 2: Donors send us digital files via the internet.

google-drive

We receive digital files via email, Dropbox, and Google Drive.

For items donated in this way, we generally save the donation as it was submitted.
We make a copy of the files and start to process them.
1. For some files, we might need to convert them to a stable format. For example, Microsoft updates Word and sometimes older files can no longer be read. We would convert a Word file to a .pdf, which won’t have readability issues in a few years.
2. We will look at all of the files and look for duplicates and files that are not relevant to the collection. We will delete these files.
3. If needed, we organize digital files into folders.
4. We might also give them new names. A folder filled with .jpgs with the name DSC####, we might give it the name “Paris” + a sequential number. If the photographs are of a trip to Paris, then the file name is a bit more descriptive and the sequential number helps keep them more organized.
5. After everything is organized we move on to following our steps in scenario 1.

Scenario 3: Donors send us digital files on older media.

When we receive collections, there might be digital files on CDs, thumb drives, hard drives, DVDs, floppy disks, Zip disks, etc. All of these media are fragile and the files on them can become corrupted over time. With these items, we follow another workflow. If there’s a kink in this one, which happens from time to time, then we follow our workflows above.

We use our BitCurator machine to look at the files on the piece of media. This means plugging our floppy disk reader into the computer and placing a floppy disk from the collection in the reader.
If the floppy disk still works, then we create a disk image of the files. Basically, it creates a copy of the files and puts them in a wrapper to protect them for the long term.
We run reports on the files via BitCurator. These reports are similar to ones run in Scenario 1, but they provide a bit more information.
After that, we make another copy of the files**, and then process them the same way as in Scenario 1 and 2.

**It should be noted that if a floppy disk contains a few Word files that could easily be printed out and added to the physical collection, then we will do that. In that case, from a time perspective, it doesn’t make sense for us to follow digital preservation steps.

We hope all of that makes sense and gives you a better sense of what we do behind the scenes. As with anything digital, over time our workflow will be tweaked and new digital preservation software might come on the scene to make this easier. Digital preservation steps vary institution to institution based on what kinds of software they have and what workflows they have established.

5 thoughts on “Taking Care of Digital Collections”

Walter Underwood

October 12, 2016 at 5:04 pm

Converting to PDF is not a good idea.

PDF is a printing format, not a text format. It throws away all structure: columns, paragraphs, sections, even spaces between words. Instead of a space, it moves the cursor a little farther to print the next character. In order to reconstruct the spaces, you need accurate font metrics for the font family and size used to make the document.

I spent ten years working on an enterprise search engine. Part of that was parsing several generations of PDF “standards” with at least three different libraries.

We had two saying about PDF:

1. Turning PDF into a structured document is like turning hamburger back into a cow.
2. PDF is where text goes to die.

I’m glad to discuss this in more detail.

- norieguthrie
  
  October 12, 2016 at 5:42 pm
  
  We’re aware that there might be some structural changes. The alternative would be to create an emulation environment for various software. We do not have the resources to do that. We’re more concerned with saving the information in the file so that researchers can read it for the foreseeable future.
  
  PDFs also allow us to embed metadata in the files through Adobe Bridge or exiftool. Here’s more info about this from our Metadata Coordinator, Scott Carlson: http://www.indiepreserves.info/preservation-tips-blog/embedded-boogaloo-pdfs-exiftool
  
  - Walter Underwood
    
    October 31, 2016 at 1:48 am
    
    I’d love to talk more about this, maybe over e-mail. If archival experts are recommending PDF without reservation, they are wrong. PDF is not a text format. It is a printer command language.
norieguthrie

October 31, 2016 at 3:01 pm

Thanks for your input. Here’s some more information from the Library of Congress: http://www.digitalpreservation.gov/formats/fdd/fdd000318.shtml

Pingback: KTRU Tuesdays: Digitization Update | What's in Woodson