Data Storage Facilities

Typical data flow at the DCCN uses three main types of data storage facilities - each with its own functionality, advantages and disadvantages. Knowing where data can and should be stored, and when it should be stored in each location, is crucial for being an effective and efficient researcher.

Local Storage

About

Local storage includes the storage on any local device, such as the C:/ drive on your DCCN-issued PC or the D:/ drive on lab computers.

Advantages

Using local storage can be helpful in multiple stages of the research cycle. If you are collecting data in the lab and writing data while the experiment is running, you will likely first write the data to the local storage of the lab computer. Another common use case is a software package that cannot be installed on the HPC cluster - in this case you have the option of working with data on the local storage of the PC issued to you by the Technical Group. Thus, some advantages are:

  • Easy to access and work with data

  • Can change and update software freely

  • Requires no training

Disadvantages

When conducting analyses on the PC issued to you by the Technical Group, you will need to download your research data onto the local storage of your device. In such cases, your research data MUST be pseudonymized. Downloading all of the research data may also take a long time, depending on the size of the data set you are analyzing. In addition, such analyses generally run much faster on the HPC cluster and may require more RAM (i.e. working memory) than your PC has. Finally, your PC can crash at any moment, so data in local storage can be lost; you must therefore constantly re-upload your data to High Performance Storage to mitigate potential data loss. Some disadvantages are:

  • Constant involuntary risk of data loss

  • Potential for data breach

  • Downloading data is time-consuming

  • Requires constant re-uploading to mitigate potential data loss

  • Less RAM

  • Less storage space

  • Files are not visible to any other research team members

Ultimately, local storage is risky because data is not backed up anywhere, and it is inefficient for several reasons. Nonetheless, it does have its use cases, though you must always be careful to prevent data loss and breaches of data security and privacy.
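The re-uploading of local data to High Performance Storage mentioned above can be partly automated. The following is a minimal Python sketch, not an official DCCN tool: the function names (sha256_of, back_up) and any folder paths you pass in are hypothetical, and it assumes the High Performance Storage drive is mounted as an ordinary folder on your PC. It copies new or changed files and verifies each copy with a checksum.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(local_dir: Path, storage_dir: Path) -> list[Path]:
    """Copy new or changed files from local_dir to storage_dir.

    Both directories are hypothetical placeholders supplied by the caller.
    Returns the relative paths of the files that were copied.
    """
    copied = []
    for source in sorted(local_dir.rglob("*")):
        if not source.is_file():
            continue
        relative = source.relative_to(local_dir)
        target = storage_dir / relative
        # Skip files whose copy on the target already has identical content.
        if target.exists() and sha256_of(target) == sha256_of(source):
            continue
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target)
        # Verify the copy before trusting it as a backup.
        if sha256_of(target) != sha256_of(source):
            raise IOError(f"Checksum mismatch after copying {relative}")
        copied.append(relative)
    return copied
```

Calling back_up with your local data folder and the matching folder on the mounted drive copies only what is missing or outdated, so it can be run repeatedly during an analysis session. A script like this does not replace pseudonymization or any other DCCN data protection requirement.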

High Performance Storage

About

High Performance Storage includes several different drives (or volumes): most notably the Home drive where your private work-related files may be kept, the Groupshare drive where your lab group’s shared files may be kept, and the Project drive where your project’s research data must be kept. These drives are mounted on Network PCs in Trigon, such as those in the Instruction and Trainee rooms, as well as on all Lab PCs. High Performance Storage is also directly connected to the HPC cluster.

Advantages
  • Larger storage space than local storage on PCs

  • Easily accessible via both Network PCs and the HPC cluster

  • Easy to access and work with data

  • Set up to work with parallelization and large working memory on the HPC cluster, making analysis many times faster

  • Protected against data loss and backed-up

Disadvantages
  • Some analysis packages/software cannot be installed by users (the Technical Group may need time to make such software available)

  • Storage is limited to the duration of the research project

  • Can only be accessed by research team members who are checked into the DCCN

High Performance Storage is directly connected to the HPC cluster, the workhorse of data analysis at the DCCN. For the vast majority of use cases it is the ideal place to store data that you will analyze, since it offers easy access to files and is set up to function with other storage infrastructure. It is the primary storage facility for research data while a project is in progress, but due to limited space you cannot leave research data on High Performance Storage after the project has finished.

Radboud Data Repository

About

The Radboud Data Repository is an on-campus research data repository where data is backed up and ultimately archived or published for long-term preservation and sharing. It includes three types of data collections, which serve different purposes:

  • Data Acquisition Collections for raw (unprocessed) data

  • Research Documentation Collections for scripts, logs, and intermediate data representations describing the research process

  • Data Sharing Collections for all data and analysis scripts used in creating the results reported in your manuscript

The endpoint of a Data Acquisition Collection (DAC) or a Research Documentation Collection (RDC) is archiving, which is intended only for internal use (i.e. among members of the project). The endpoint of a Data Sharing Collection (DSC) is publishing.

Advantages
  • Practically unlimited storage

  • Secure

  • Facilitates compliance with Findable and Accessible principles of FAIR, thereby meeting funder requirements, many journal requirements, and University policy

  • Data for publication is reviewed for compliance with FAIR principles and privacy risks by a data steward

Disadvantages
  • Cannot read/write files directly

  • Platform is under continuous development, and services are sometimes down for routine maintenance

  • Time investment needed to become familiar with the platform and to upload, archive and publish data for a project

The Radboud Data Repository is the DCCN’s vault where data of finished projects is preserved.

Take Home Messages
  • Different storage locations have different use cases, pros and cons, and are used in various DCCN procedures for data management

  • High Performance Storage and the Radboud Data Repository are the main storage locations we will use, but Local Storage has certain use cases