CSI: Computer Science Investigation

David J. Malan
Harvard University
http://www.cs.harvard.edu/malan/
dmalan@harvard.edu

For more problem sets like this one, visit cs50.tv.

Summary Each semester, I take a stroll around campus with a digital camera, taking photos of non-obvious but identifiable locations. Unfortunately, I then always "accidentally" format the camera's CompactFlash card. And so I make a "forensic image" of the card with dd, circulate that image among our students, and challenge them to write a program with which to recover my most treasured photos! The assignment then evolves into a scavenger hunt, whereby students are challenged, after recovering the photos, to find their locations on campus and take a photograph of themselves (or someone from their section) in front of that location. The section that finds the most ocations the fastest wins, e.g., pizza for their next class.
Topics Loops. Conditions. File I/O. Forensics. File formats (headers, footers, other metadata). Hexadecimal notation. Block sizes.
Audience CS1 or higher.
Difficulty Takes some thought but only doesn't require very much code.
Strengths Excites students, because it's both a real-world domain and a fun way of exploring campus. It's an interesting problem but not very hard to solve once one understands what a file really is and how JPEGs are implemented. JPEGs are a complex format but their headers are simple: they're identifiable with high probability by a sequence of just 4 bytes.
Weaknesses Requires familiarity with file I/O, which typically happens late in a semester (at least for us).
Dependencies File I/O. We have students implement a solution in C but most any language that provides byte-level access to files could be used.

Instructions for Teacher

  1. Obtain a digital camera and a memory card (that you don't mind erasing). We happened to use CompactFlash, but most any type should suffice.
  2. Zero the memory card (so that students aren't confused by extraneous bytes). Odds are Linux users know how to do this; allow me to suggest that Mac users and PC users use a Mac for simplicity:
    1. Connect the memory card to a Mac running OS X 10.4 or higher via a memory card reader.
    2. Open Applications → Utilities → Disk Utility.
    3. In Disk Utility's left-hand pane, click the memory card's icon. (If the memory card has an indented child partition, be sure to click the parent.)
    4. Click the Erase tab in Disk Utility's right-hand pane.
    5. Select MS-DOS (FAT) next to Format.
    6. Next to Name, type CARD.
    7. Click Security Options....
    8. Select Zero Out Data then OK.
    9. Select Erase... and then Erase when prompted.
    10. Quit Disk Utility and drag the memory card's icon from the desktop to the Trash to unmount it.
  3. Insert the memory card into the digital camera and go take some pictures!
  4. Make a "forensic image" of the memory card with dd. Odds are Linux users know how to do this too; allow me to suggest that Mac users and PC users again use a Mac for simplicity:
    1. Connect the memory card to a Mac running OS X 10.4 or higher via a memory card reader.
    2. Open Applications → Utilities → Terminal.
    3. Execute the below at the prompt:
      mount
      Among the lines of output should be one for your memory card (i.e., /Volumes/CARD) like the below:
      /dev/disk2s1 on /Volumes/CARD (msdos, local, nodev, nosuid, noowners)
      Take note of the name of the card's "device file" (e.g., /dev/disk2s1).
    4. Execute the below at the prompt to determine how many 512B blocks are being used by your photos:
      df Among the lines of output should be one for your memory card (e.g., /dev/disk2s1) like the below:
      Filesystem    512-blocks      Used Available Capacity  Mounted on
      /dev/disk2s1    31270976   6020992  25249984      20%  /Volumes/CARD
      Take note of that number (e.g., 6020992).
    5. Execute the below at the prompt to umount the card's volume:
      diskutil unmount /Volumes/CARD
      Do not physically remove the card from the Mac, though.
    6. Execute the below at the prompt to copy the first, e.g. 6020992 blocks from the card to a new file on your desktop called card.raw:
      dd if=/dev/disk6s1 of=$HOME/Desktop/card.raw bs=512 count=6020992
      For the curious, if denotes "input file," of denotes "output file," and bs denotes "block size." Take care not to transpose if and of, else you may have to take another stroll across campus!
  5. Distribute card.raw to your students along with the instructions for students below (or some variant thereof)!

Instructions for Students

adapted for SIGCSE from pages 12 – 14 of pset5.pdf

Just the other day, I took a stroll around campus with a friend (Dan Armendariz of MIT, whose skills with a camera outshine my point-and-shoot tendencies) snapping photos, all of which were stored as JPEGs on a 1GB CompactFlash (CF) card. Rather than act like typical tourists, taking photos of John Harvard's foot (ugh) and squirrels (I mean, really), we opted to shoot identifiable but non-obvious persons, places, and things on campus.

Unfortunately, I somehow corrupted that CF card the moment I got home. Both my Mac and PC refuse to recognize the card now as having any photos, even though I'm pretty sure we took them. Both operating systems want to format the card, but, thus far, I've refused to let them, hoping instead someone can come to the rescue.

Write a program called recover that recovers these photos. (Please!)

OMG, what?

Well, here's the thing: JPEGs have "signatures," patterns of bytes that distinguish them from other file formats. In fact, most JPEGs begin with one of two sequences of bytes. Specifically, the first four bytes of most JPEGs are either

0xff 0xd8 0xff 0xe0

or

0xff 0xd8 0xff 0xe1

from first byte to fourth byte, left to right. Odds are, if you find one of these patterns of bytes on a disk known to store photos (e.g., my CF card), they demark the start of a JPEG.

Fortunately, digital cameras tend to store photographs contiguously on CF cards, whereby each photo is stored immediately after the previously taken photo. Accordingly, the start of a JPEG usually demarks the end of another. However, digital cameras generally initialize CF cards with a FAT file system whose block size is 512 bytes (B). The implication is that these cameras only write to those cards in units of 512 B. A photo that's 1 MB (i.e., 1,048,576 B) thus takes up 1048576 ÷ 512 = 2048 blocks on a CF card. But so does a photo that's, say, one byte smaller (i.e., 1,048,575 B)! The wasted space on disk is called slack space. Forensic investigators often look at slack space for remnants of suspicious data.

The implication of all these details is that you, the investigator, can probably write a program that iterates over a copy of my CF card, looking for JPEGs' signatures. Each time you find a signature, you can open a new file for writing and start filling that file with bytes from my CF card, closing that file only once you encounter another signature. Moreover, rather than read my CF card's bytes one at a time, you can read 512 of them at a time into a buffer for efficiency's sake. Thanks to FAT, you can trust that JPEGs' signatures will be "block-aligned." That is, you need only look for those signatures in a block's first four bytes.

Realize, of course, that JPEGs can span contiguous blocks. Otherwise, no JPEG could be larger than 512 B. But the last byte of a JPEG might not fall at the very end of a block. Recall the possibility of slack space. Fortunately, I bought a brand-new CF card for my stroll about campus. Odds are, that CF card was "zeroed" (i.e., filled with 0s) by the manufacturer. Because I didn't outright delete any photos we took, the only bits on that CF card should belong to actual photos or be 0s. And it's okay if some trailing 0s (i.e., slack space) end up in the JPEGs your program spits out; they should still be viewable.

Since I've but one CF card, I've gone ahead and created a "forensic image" of the card, storing its contents, byte after byte, in a file called card.raw. So that you don't waste time iterating over millions of 0s unnecessarily, I've only imaged the first few MB of the CF card. Since you're only going to be reading it, you don't need your own copy of this forensic image. (Might as well save space!) Simply open our copy with fopen, as in the below.

FILE *fp = fopen("card.raw", "r");

You should find that this image contains a whole bunch of JPEGs.

It's up to you to create, at least, a Makefile and recover.c for this program. You probably don't need a recover.h, but you're welcome to create one. For simplicity, you may hard-code the path to card.raw in your program; your program need not accept any command-line arguments. When executed, though, your program should recover every one of the JPEGs from card.raw, storing each as a separate file in your current working directory. Your program should number the files it outputs by naming each ###.jpg, where ### is three-digit decimal number from 000 on up. (Befriend sprintf.) You need not try to recover the JPEGs' original names. To check whether the JPEGs your program spit out are correct, simply SFTP them to your own desktop, double-click, and take a look. If each photo appears intact, your operation was likely a success!

Odds are, though, the JPEGs that the first draft of your code spits out won't be correct. (If you open them up and don't see anything, they're probably not correct!) Execute the command below to delete all JPEGs in your current working directory.

rm *.jpg

If you'd rather not be prompted to confirm each deletion, execute the command below instead.

rm -f *.jpg

Just be careful with that -f switch, as it "forces" deletion.

And now the proverbial icing on the cake. You are hereby challenged to find as many of the persons, places, and things that we photographed on campus as possible. To prove that you found some place or thing, take a photo of yourself (or of someone in your section) posing next to or near that same person, place, or thing! Put your photos (i.e., the photos you took, not the ones that we took that you recovered) online somewhere (e.g., Picasa Web Albums) and link to those photos on a Google Map that indicates where you found each person, place, or thing.

Here's how to create a map:

Creating a Map
http://maps.google.com/support/bin/answer.py?hl=en&answer=68480#overview_my_maps

Adding Photos
http://maps.google.com/support/bin/answer.py?hl=en&answer=68480#photos

The section whose students collectively identify the most photographs shall win an amazing prize. In the event of a tie, the section that submits the most photos (and their URL) first shall be declared the winner!


This is CS50.

Copyright © 2010, David J. Malan, Harvard University

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

Creative Commons License


Extra info about this assignment: