Checking Data Integrity

Along with ensuring that we have all data files - and restoring the ones which we do not have - it is also useful to check the integrity of our data. In the file transfer process it is possible that files become corrupted: that they do not contain the complete and accurate data.

At the DCCN, it is typically recommended to use Uploader and the automatic upload protocol for the MEG and MRI. These processes transfer directly to the RDR and Project Folder - however, after this process you may unintentionally change some files. If you use another tool, such as Cyberduck or something else, you may even unintentionally upload the wrong file. Thus, we will want to ensure that we use the correct, uncorrupted files in our analyses: we check the integrity of our data. To do this, we will use a hash algorithm.

Hash Algorithm

A hash algorithm takes data as an input - for instance, a file - and produces a string of characters. This string of characters - called a hash or digest - can be used to compare the contents of one file to another. There are multiple types of hash algorithms, but each is designed to:

  • Uniquely identify a file’s contents

  • Be irreversible

  • Be fast (and computationally inexpensive)

The most commonly used hash algorithm is the SHA-256 algorithm, which produces a 64-character hexidecimal string (each hexidecimal character has 4 digits, hence the 256).

Practice Checking Data Integrity

Let’s first see a demonstration of how the SHA-256 algorithm works. To do this, we will need a file to check. To illustrate how the hash algorithm works, let’s create a new file which contains all of our raw behavioral data: /project/3010000.05/XXXXXXX.XX/raw/longData.csv.

cd /project/3010000.05/scripts
Rscript combineData.R /project/3010000.05/XXXXXXX.XX/raw/

If you open this file, you will see that it has many rows of data - one for each trial, per subject in our “experiment”.

  1. Compute the hash/digest for /project/3010000.05/XXXXXXX.XX/raw/longData.csv

  • Open the terminal emulator in TigerVNC

  • Type sha256sum /project/3010000.05/XXXXXXX.XX/raw/longData.csv

  1. Check if the hash/digest changes depending on the file name and location

  • Duplicate /project/3010000.05/XXXXXXX.XX/raw/longData.csv as /project/3010000.05/XXXXXXX.XX/raw/copyLongData.csv

  • Type sha256sum /project/3010000.05/XXXXXXX.XX/raw/copyLongData.csv

  • Compare the hash/digest from /project/3010000.05/XXXXXXX.XX/raw/longData.csv to /project/3010000.05/XXXXXXX.XX/raw/longData.csv: these should be identical

  1. Check if the hash/digest catches data falsification

  • Open /project/3010000.05/XXXXXXX.XX/raw/longData.csv in text editor, and change only one digit

  • Save this file and close it

  • Type sha256sum /project/3010000.05/XXXXXXX.XX/raw/copyLongData.csv

  • Compare the hash/digest from this to the hash/digest from before you falsified data: these should be very different

  1. Directly compare the hash/digest from one file to another

Answer
if [ "$(sha256sum /project/3010000.05/XXXXXXX.XX/raw/copyLongData.csv | awk '{print $1}')" = "$(sha256sum /project/3010000.05/XXXXXXX.XX/raw/longData.csv | awk '{print $1}')" ]; then
    echo "Files are identical."
else
    echo "One of the Files is corrupted"
fi

Advanced Example: Replacing Corrupted Files

Now, you know how to compare the SHA-256 sum of one file to another, in order to see if they have the same data. From the last lesson, you also know how to restore files in a missing folder. What would be nice to do now is to combine these two processes: let’s edit /project/3010000.05/XXXXXXX.XX/scripts/restoreMissing.sh to do two new things. The first thing we want to do is to check data integrity, and - if we find that the data in our Project Folder has been changed, we want to then restore the changes files.

We need to first delete and corrupt some files so that we can go back and restore them.

  1. Start a TigerVNC session

  2. Run /project/3010000.05/scripts/deleteAndCorrupt.sh

Open the terminal emulator and run the following code

cd /project/3010000.05/scripts/
chmod +x deleteAndCorrupt.sh
./deleteAndCorrupt.sh /project/3010000.05/XXXXXXX.XX/raw/
  1. Create /project/3010000.05/XXXXXXX.XX/scripts/checkIntegrity.sh

  2. Write a script which restores the corrupted files recursively

Hint 1: Find the Folder Each File Should Be In
#!/bin/bash
BASE_PATH="/project/3010000.05/XXXXXXX.XX"
MANIFEST="$BASE_PATH/docs/MANIFEST.txt"

while read -r sha path; do

    local_path="$BASE_PATH/$path"
    dir_path=$(dirname "$local_path")

done < "$MANIFEST"
Hint 2: Check the SHA-256 sum of a file in the DAC

We cannot compute the SHA-256 (or any other hash/digest) for a file in the RDR. Luckily, you can download a manifest file from the RDR which contains the SHA-256 for each file within a collection.

  1. Log in to https://data.ru.nl

  2. Go to My Collections

  3. Open di.dccn.DAC_3010000.05_873

  4. Navigate to the Manifest tab

  5. Click Generate manifest file

  6. Move the MANIFEST.txt file to /project/3010000.05/XXXXXXX.XX/docs

Answer
#!/bin/bash
if [ -z "$1" ]; then
    echo "Usage: $0 /project/3010000.05/XXXXXXX.XX"
    exit 1
fi
BASE_PATH="$1"
REPO_PATH="$2"
MANIFEST="$BASE_PATH/docs/MANIFEST.txt"

while read -r sha path; do

    local_path="$BASE_PATH/$path"
    dir_path=$(dirname "$local_path")

    if [[ "$path" == *old* ]]; then
        continue
    else
        echo "$dir_path/"
    fi

    if [ ! -d "$dir_path" ]; then
        mkdir -p "$dir_path/"
        echo "Made folder $dir_path"
        continue

    fi

    if [ -d "$dir_path" ]; then
        local_sha=$(sha256sum "$local_path" | awk '{print $1}')
        if [ "$sha" != "$local_sha" ]; then
            repocli get "$REPO_PATH/$path" "$dir_path"
            echo "Replaced corrupted file $dir_path"
        fi
    else
        repocli get "$REPO_PATH/$path" "$dir_path"
        echo "Restored repository file $REPO_PATH/$path to $dir_path"
    fi

done < "$MANIFEST"

Now save this file and run it in the terminal by typing the following:

cd /project/3010000.05/XXXXXXX.XX/scripts
chmod +x checkIntegrity.sh
./checkIntegrity.sh "/project/3010000.05/XXXXXXX.XX" "dccn/DAC_3010000.05_873"