Unified Data

Being a computational chemist, I acquire a large number of files. As of today: 22 official projects and over five thousand output files alone, with more piling up each day. I figured that today I would describe how I keep them all safe and don’t lose anything.

I usually run calculations on one of a couple of separate machines, but I keep nothing on those machines other than the jobs that are currently running. I move everything else to an external source, in this case a file server, organized by project number and sub-divided into folders that each hold 50 jobs. An Excel spreadsheet records the important data and a description for each job. But I assume that the average person could learn to keep their files neat and tidy. What I intend to discuss is the way that I keep the data on the file server and my personal laptop synchronized.

It’s actually a fairly simple command run within the terminal:

rsync -ruv $USER@$SERVER:$SERVER_PATH $MY_LAPTOP_LOCATION

rsync compares the files in $MY_LAPTOP_LOCATION to those at $SERVER_PATH and downloads anything that is new or has changed. The flags are -r (recurse into directories), -u (skip files that are newer on the laptop), and -v (verbose output); the data is only compressed in transit if you also add -z. The transfer is done in a “smart” way, however. Say a file was half-written when you copied it to the file server (for instance, because a calculation was still in progress) and is complete the next day. If you ran rsync on the first day and again on the second, only the portion of the file that differs is sent over the network.
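The two-day scenario can be tried out safely without touching a real server. The sketch below uses two temporary local directories as stand-ins for the file server and the laptop; the directory and file names are made up for illustration. One caveat: for purely local copies rsync defaults to copying whole files, so this shows the workflow and the effect of -u rather than the network savings.

```shell
#!/bin/sh
# Local stand-in for the server and the laptop; every path here is made up.
work=$(mktemp -d)
mkdir -p "$work/server/project01" "$work/laptop"

# Day 1: the job is still running, so the output file is incomplete.
echo "step 1 of 2" > "$work/server/project01/job001.out"
# The trailing slash on the source means "copy the contents of this directory".
rsync -ruv "$work/server/" "$work/laptop/"

# Day 2: the job has finished and the file has grown.
echo "step 2 of 2" >> "$work/server/project01/job001.out"
rsync -ruv "$work/server/" "$work/laptop/"

# The laptop copy should now match the finished file.
cat "$work/laptop/project01/job001.out"

# Remove the scratch directory when you are done: rm -rf "$work"
```

Adding -n (--dry-run) to either rsync line lists what would be transferred without copying anything, which is a handy check before the first real sync.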

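To simplify things even further, I set up an alias so that all I need to type is sync, and the above command runs. (sync is also the name of a standard Unix command, so the alias shadows it in your interactive shell; pick another name if that bothers you.) Doing this on a Mac is fairly simple:

Open Terminal and, in the home directory (which is where it opens by default), do the following: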

pico .bashrc

Add the following line:

alias sync='rsync -ruv $USER@$SERVER:$SERVER_PATH $MY_LAPTOP_LOCATION'

Exit and save. Terminal on a Mac opens login shells, and a bash login shell reads .bash_profile rather than .bashrc. So, to ensure that .bashrc is read every time the terminal is opened, you need to edit your .bash_profile file.

pico .bash_profile

Add the following line:

source ~/.bashrc

Exit and save, then log out and back in (or just open a new terminal window), and your alias should be working!
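If you want to pass extra flags now and then, a shell function in .bashrc is one alternative to the alias. This is only a sketch: the function name syncdata, the two variable names, and the account, host, and paths are all hypothetical placeholders, so substitute your own. The different name also avoids shadowing the system sync command.

```shell
# In ~/.bashrc. Every value below is a placeholder; replace it with your own.
SYNC_REMOTE="chemist@fileserver.example.org:/data/projects/"
SYNC_LOCAL="$HOME/projects/"

# Hypothetical helper: any extra arguments are passed straight through to rsync.
syncdata() {
    rsync -ruv "$@" "$SYNC_REMOTE" "$SYNC_LOCAL"
}
```

With this in place, syncdata on its own behaves like the alias, syncdata -n previews the transfer without copying anything, and syncdata -z compresses the data in transit.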