Checking Data Integrity

Besides ensuring that we have all data files - and restoring the ones we are missing - it is also useful to check the integrity of our data. During file transfer, files can become corrupted: they no longer contain the complete and accurate data.

At the DCCN, it is typically recommended to use Uploader and the automatic upload protocol for the MEG and MRI. These processes transfer data directly to the RDR and Project Folder - however, after the transfer you may unintentionally change some files. If you use another tool, such as Cyberduck, you may even unintentionally upload the wrong file. We therefore want to ensure that we use the correct, uncorrupted files in our analyses: we check the integrity of our data. To do this, we will use a hash algorithm.

Hash Algorithm

A hash algorithm takes data as an input - for instance, a file - and produces a string of characters. This string of characters - called a hash or digest - can be used to compare the contents of one file to another. There are multiple types of hash algorithms, but each is designed to:

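Uniquely identify a file’s contents (in practice: it is extremely unlikely that two different files produce the same hash)
Be irreversible (the original data cannot be reconstructed from the hash)
Be fast (and computationally inexpensive)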

One of the most commonly used hash algorithms is SHA-256, which produces a 256-bit digest, usually written as a 64-character hexadecimal string (each hexadecimal character encodes 4 bits, and 64 × 4 = 256).
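As a quick illustration of the digest length, the following sketch hashes a short string instead of a file and counts the characters in the result. The input string hello and the variable name digest are chosen only for this illustration; sha256sum is assumed to be available, as it is on most Linux systems.

```shell
# sha256sum prints "<digest>  <name>"; keep only the first field (the digest).
digest=$(printf 'hello' | sha256sum | cut -d' ' -f1)

echo "digest: $digest"
echo "length: ${#digest} hexadecimal characters"
echo "bits:   $(( ${#digest} * 4 ))"
```

Whatever the size of the input, the digest has the same length, which is what makes digests convenient to compare.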

Practice Checking Data Integrity

Let’s first see a demonstration of how the SHA-256 algorithm works. To do this, we need a file to check, so let’s create a new file which contains all of our raw behavioral data: /project/3010000.05/XXXXXXX.XX/raw/longData.csv.

cd /project/3010000.05/scripts
Rscript combineData.R /project/3010000.05/XXXXXXX.XX/raw/

If you open this file, you will see that it has many rows of data: one for each trial of each subject in our “experiment”.

Compute the hash/digest for /project/3010000.05/XXXXXXX.XX/raw/longData.csv

Open the terminal emulator in TigerVNC
Type sha256sum /project/3010000.05/XXXXXXX.XX/raw/longData.csv

Check if the hash/digest changes depending on the file name and location

Duplicate /project/3010000.05/XXXXXXX.XX/raw/longData.csv as /project/3010000.05/XXXXXXX.XX/raw/copyLongData.csv
Type sha256sum /project/3010000.05/XXXXXXX.XX/raw/copyLongData.csv
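Compare the hash/digest from /project/3010000.05/XXXXXXX.XX/raw/longData.csv to the one from /project/3010000.05/XXXXXXX.XX/raw/copyLongData.csv: these should be identical, because the hash depends only on the contents of a file, not on its name or location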

Check if the hash/digest catches data falsification

Open /project/3010000.05/XXXXXXX.XX/raw/copyLongData.csv in a text editor, and change only one digit
Save this file and close it
Type sha256sum /project/3010000.05/XXXXXXX.XX/raw/copyLongData.csv
Compare the hash/digest from this to the hash/digest from before you falsified data: these should be very different
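The effect of a one-digit change can also be reproduced on a small toy file, without touching any real data. This is only a sketch: the file names, the toy CSV contents and the variable names before and after are made up for this illustration.

```shell
# Work in a throwaway directory so that no real data is touched.
workdir=$(mktemp -d)
printf 'subject,trial,rt\n1,1,532\n1,2,498\n' > "$workdir/toy.csv"

# "Falsify" a single digit (532 becomes 533) and save the result as a new file.
sed 's/532/533/' "$workdir/toy.csv" > "$workdir/toyEdited.csv"

# Reading the file via "<" makes sha256sum print "<digest>  -"; keep the digest only.
before=$(sha256sum < "$workdir/toy.csv" | cut -d' ' -f1)
after=$(sha256sum < "$workdir/toyEdited.csv" | cut -d' ' -f1)

echo "before: $before"
echo "after:  $after"
```

Although the two files differ in a single character, the two digests should look completely unrelated.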

Directly compare the hash/digest from one file to another
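One possible way to do this without comparing the strings by eye is a small shell function. The sketch below is an assumption about how such a comparison could look, not a prescribed method; the function name same_digest and the toy files are hypothetical.

```shell
# same_digest FILE1 FILE2
# Succeeds (exit status 0) when both files have the same SHA-256 digest.
same_digest() {
    # Refuse to compare when one of the files does not exist.
    [ -f "$1" ] && [ -f "$2" ] || return 2
    [ "$(sha256sum < "$1" | cut -d' ' -f1)" = "$(sha256sum < "$2" | cut -d' ' -f1)" ]
}

# Toy demonstration in a throwaway directory.
demo=$(mktemp -d)
printf '1,1,532\n' > "$demo/a.csv"
cp "$demo/a.csv" "$demo/b.csv"          # exact copy of a.csv
printf '1,1,533\n' > "$demo/c.csv"      # differs from a.csv by one digit

if same_digest "$demo/a.csv" "$demo/b.csv"; then echo "a and b: identical"; else echo "a and b: different"; fi
if same_digest "$demo/a.csv" "$demo/c.csv"; then echo "a and c: identical"; else echo "a and c: different"; fi
```

With the files from this lesson, the same function could be called as same_digest /project/3010000.05/XXXXXXX.XX/raw/longData.csv /project/3010000.05/XXXXXXX.XX/raw/copyLongData.csv.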

Advanced Example: Replacing Corrupted Files

Now you know how to compare the SHA-256 sum of one file to another in order to see whether they contain the same data. From the last lesson, you also know how to restore files in a missing folder. Let’s combine these two processes by editing /project/3010000.05/XXXXXXX.XX/scripts/restoreMissing.sh to do two new things: first, check data integrity and then, if we find that data in our Project Folder has been changed, restore the changed files.

We need to first delete and corrupt some files so that we can go back and restore them.

Start a TigerVNC session
Run /project/3010000.05/scripts/deleteAndCorrupt.sh

Open the terminal emulator and run the following code

cd /project/3010000.05/scripts/
chmod +x deleteAndCorrupt.sh
./deleteAndCorrupt.sh /project/3010000.05/XXXXXXX.XX/raw/

Create /project/3010000.05/XXXXXXX.XX/scripts/checkIntegrity.sh
Write a script which restores the corrupted files recursively
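The core logic of such a script can be sketched as follows. This is a minimal sketch under a loud assumption, not the checkIntegrity.sh of this lesson: it assumes that a trusted reference copy of the data is available as an ordinary directory and restores files with cp. When restoring from the repository, the cp line would have to be replaced by the restore command from the previous lesson. The function name restore_corrupted and the toy directories are hypothetical.

```shell
# restore_corrupted LOCAL_DIR REFERENCE_DIR
# ASSUMPTION: REFERENCE_DIR is a trusted copy reachable as a normal directory.
restore_corrupted() {
    local_dir=${1%/}
    reference_dir=${2%/}
    # Walk the reference copy recursively (file names containing newlines are not handled).
    find "$reference_dir" -type f | while IFS= read -r ref_file; do
        rel_path=${ref_file#"$reference_dir"/}
        local_file="$local_dir/$rel_path"
        ref_hash=$(sha256sum < "$ref_file" | cut -d' ' -f1)
        if [ -f "$local_file" ]; then
            local_hash=$(sha256sum < "$local_file" | cut -d' ' -f1)
        else
            local_hash="missing"
        fi
        # Restore files that are missing or whose digest differs from the reference.
        if [ "$ref_hash" != "$local_hash" ]; then
            echo "restoring: $rel_path"
            mkdir -p "$(dirname "$local_file")"
            cp "$ref_file" "$local_file"    # placeholder for the real restore command
        fi
    done
}

# Toy demonstration: one intact file and one "corrupted" file in a subfolder.
demo=$(mktemp -d)
mkdir -p "$demo/reference/sub-01" "$demo/local/sub-01"
printf '1,1,532\n' > "$demo/reference/sub-01/data.csv"
printf '1,2,498\n' > "$demo/reference/sub-01/intact.csv"
cp "$demo/reference/sub-01/intact.csv" "$demo/local/sub-01/intact.csv"
printf '1,1,999\n' > "$demo/local/sub-01/data.csv"
restore_corrupted "$demo/local" "$demo/reference"
```

Comparing digests before copying means that intact files are left untouched and only changed or missing files are restored.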

Now save this file and run it in the terminal by typing the following:

cd /project/3010000.05/XXXXXXX.XX/scripts
chmod +x checkIntegrity.sh
./checkIntegrity.sh "/project/3010000.05/XXXXXXX.XX" "dccn/DAC_3010000.05_873"