Use Rsync to Protect Against Data Loss
Written by Mie Haga
March 31, 2023, is World Backup Day. Let's dive into rsync, a powerful and efficient tool for data backup.
Before starting a discussion about rsync, let’s take a quick look at what backup is and why it's important.
What is backup?
Backup refers to copying physical or virtual files or databases to a secondary location for preservation in case of equipment failure or catastrophe. Backing up data is pivotal to a successful disaster recovery plan. – reference
One thing to note is that backup and replication are two different concepts. Backups create save points so that data can be restored to a specific point in the past after a disaster causes data loss. Replication makes copies of data and distributes them among many servers and data centers. This makes the data more easily accessible to clients in various geographical locations and removes a single point of failure.
Why should you care about backup?
Backups can be used to restore lost data and minimize downtime of critical business operations. Data can be lost for various reasons, such as equipment failure, accidental deletion, or natural disasters. Backups also help you restore your data after cyber-attacks.
Since rsync is widely used for efficient data backup, data recovery, and automating file transfer, I decided to write a blog post about it.
What is rsync?
rsync (remote synchronization) is a popular and powerful utility for synchronizing and copying files between computers, either locally or remotely, over a network.
More specifically, rsync transfers files:
- locally on the same machine:
  - e.g., between directories on the same machine
- from a local machine to a remote machine:
  - e.g., moving data from a desktop computer to a laptop, so you don't need to clean up your directory, create a branch, and commit just to move your work
- from a remote machine to a local machine:
  - e.g., updating a local file to match the version on a remote machine
rsync is widely used in the Linux and Unix communities for backup and replication. An example command for copying data from a source to a destination (rsync commands):
$ rsync -a <source> <destination>
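If you want to trigger the command above from a script, a minimal Python sketch could look like the following. The function names and the dry-run default are my own choices for illustration, not part of rsync; `-a` and `--dry-run` are standard rsync options, and the sketch assumes rsync is installed and on your PATH.

```python
import shlex
import subprocess


def build_rsync_command(source, destination, dry_run=True):
    """Return the argument list for a basic archive-mode rsync run."""
    command = ["rsync", "-a"]        # -a: archive mode (recursive, keeps permissions and times)
    if dry_run:
        command.append("--dry-run")  # only report what would change, copy nothing
    command += [source, destination]
    return command


def run_backup(source, destination, dry_run=True):
    """Run rsync and raise CalledProcessError if it exits with an error."""
    command = build_rsync_command(source, destination, dry_run)
    print("running:", shlex.join(command))
    subprocess.run(command, check=True)
```

For example, `run_backup("projects/", "user@backup-host:/srv/backups/projects/", dry_run=False)` would push a directory to a remote machine; the host name and paths here are hypothetical. Starting with a dry run is a cautious default, because it lets you see what rsync intends to do before any file is touched.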
You might ask: if rsync is used to copy data from one place to another, how is it different from the cp (copy) or scp (secure copy) commands?
cp vs. rsync: cp and scp copy the entire file every time. rsync skips files that have not changed and, for transfers over a network, sends only the parts of a file that changed.
Real-world examples of using rsync:
- backing up a web application from one server to another:
  - e.g., a web developer can use rsync to back up the code and database of a web application to a remote server. This can be done automatically on a schedule, such as every night.
- synchronizing files between two different machines, such as a desktop computer and a laptop:
  - e.g., this is useful for people who work on multiple machines and need to keep their files up to date, so that any changes made on a desktop are synced to a laptop. No need to clean up the local program, create a branch, or git commit/clone.
- deploying websites or applications:
  - e.g., a web developer can use rsync to move files from a staging environment to a production server, ensuring that the files are transferred efficiently and only the changes are synced.
- file distribution:
  - e.g., if you want to build your own CDN and host it across multiple geographical regions, you can use rsync to distribute the files to each server.
How rsync works in the big picture:
Let's look at how rsync works. Imagine you have files on your local machine and want to back them up to a remote machine. In this example, the files were backed up at some point in the past, and you have just modified them. I will call the copies that already exist on the remote machine the old files, and the updated copies that have not been backed up yet the new files.
First phase: establish a connection between the local and the remote
rsync starts a process on your local machine that connects to the remote machine, typically via SSH. Once the connection is established, it starts an rsync process on the remote machine.
Second phase: detect the difference between the new file and the old file
The local machine reads its disk and sends the file list to the remote machine over the network. The remote machine then reads the file list from the network and holds it in memory.
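To make the idea of a file list concrete, here is a small Python sketch that walks a directory and records each file's relative path, size, and modification time. This is only an illustration under my own assumptions: `build_file_list` is a hypothetical helper, and the real rsync file list carries more metadata and uses its own wire format.

```python
import os


def build_file_list(root):
    """Walk root and return sorted (relative_path, size, mtime) entries."""
    entries = []
    for directory, _subdirs, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(directory, name)
            info = os.stat(full_path)
            # Paths are stored relative to root so both machines can interpret them.
            relative = os.path.relpath(full_path, root)
            entries.append((relative, info.st_size, int(info.st_mtime)))
    return sorted(entries)
```

Sorting the entries gives both sides the same ordering to work through, which keeps the sketch deterministic.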
The remote machine splits each old file into chunks, in memory rather than on disk, and calculates a checksum for each chunk. Think of a checksum as a hash of the chunk: it identifies the chunk's contents, although two different chunks can occasionally produce the same checksum. The list of checksums is sent to the local machine, which calculates checksums for chunks of the new file in the same way and looks for matches. This tells the local machine which data in the new file matches a chunk of the old file and which data does not exist in the old file at all.
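The chunk-and-checksum step can be sketched in a few lines of Python. This is a simplified illustration, not rsync's actual code: the tiny block size, the helper names, and the use of Adler-32 and SHA-256 as the fast and strong checksums are all assumptions made for readability. It also compares blocks only at aligned positions, whereas rsync slides a window through the new file so it can find matches at any offset.

```python
import hashlib
import zlib

BLOCK_SIZE = 4  # tiny on purpose so the example is easy to follow


def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Cut data into consecutive blocks; the last one may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def block_checksums(data, block_size=BLOCK_SIZE):
    """Return a (fast, strong) checksum pair for every block of data."""
    return [
        (zlib.adler32(block), hashlib.sha256(block).hexdigest())
        for block in split_into_blocks(data, block_size)
    ]


def matching_blocks(old, new, block_size=BLOCK_SIZE):
    """Return indexes of blocks in new whose checksums also appear in old."""
    known = set(block_checksums(old, block_size))
    return [
        index
        for index, pair in enumerate(block_checksums(new, block_size))
        if pair in known
    ]
```

With `old = b"aaaabbbbcccc"` and `new = b"aaaaXXXXcccc"`, the first and last blocks match and only the middle block would need to be sent. Pairing a cheap checksum with a stronger hash is what keeps accidental collisions from being mistaken for real matches.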
Third phase: create instructions to copy files and send them
After the local machine has compared the checksums, it sends the remote machine instructions for building an up-to-date copy of the file, together with the data that is new. To do this, the local machine reads the file data from its disk into memory and sends the instructions over the network. For efficiency, the data can be compressed during transfer (rsync's -z option).
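One way to picture these instructions is as a list of "copy block N from the old file" and "here are some new bytes" steps. The Python sketch below builds such a list; the instruction format, the `make_delta` name, and the checksum choices (Adler-32 plus SHA-256) are assumptions for illustration and do not reflect rsync's real wire protocol. For simplicity it hashes every window in full, where rsync uses a rolling checksum to avoid that cost.

```python
import hashlib
import zlib


def checksum_pair(block):
    """Fast checksum plus strong hash, used together to identify a block."""
    return (zlib.adler32(block), hashlib.sha256(block).digest())


def make_delta(old, new, block_size=4):
    """Describe new as ("copy", block_index) and ("data", bytes) steps."""
    # Index the old file's blocks by checksum, as reported by the receiver.
    index = {}
    for number, start in enumerate(range(0, len(old), block_size)):
        index.setdefault(checksum_pair(old[start:start + block_size]), number)

    instructions = []
    literal = bytearray()  # new bytes collected since the last match
    position = 0
    while position < len(new):
        window = new[position:position + block_size]
        number = index.get(checksum_pair(window))
        if number is not None:
            if literal:
                instructions.append(("data", bytes(literal)))
                literal.clear()
            instructions.append(("copy", number))
            position += len(window)
        else:
            # No match here: keep this byte as new data and slide forward by one.
            literal.append(new[position])
            position += 1
    if literal:
        instructions.append(("data", bytes(literal)))
    return instructions
```

Inserting a single byte into a twelve-byte file, `make_delta(b"aaaabbbbcccc", b"aaaaXbbbbcccc")`, yields three copy steps and one data step carrying just `b"X"`, so only one byte of file content has to cross the network.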
Fourth phase: follow instructions to make a copy and merge it
The remote machine receives the instructions and the new data from the local machine and merges them with the old file. It does this by reading the instructions and the new data from the network, combining them with the chunks it already has, and writing the result back to its disk.
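The merge step can be sketched as replaying a list of instructions against the old file. The instruction format here, `("copy", block_index)` and `("data", bytes)`, is a made-up representation for illustration, and `apply_delta` is a hypothetical helper rather than anything rsync exposes.

```python
def apply_delta(old, instructions, block_size=4):
    """Rebuild the new file from the old file plus the sender's instructions."""
    rebuilt = bytearray()
    for kind, value in instructions:
        if kind == "copy":
            # Reuse a block the receiver already has on disk.
            start = value * block_size
            rebuilt += old[start:start + block_size]
        elif kind == "data":
            # Append new bytes that arrived over the network.
            rebuilt += value
        else:
            raise ValueError(f"unknown instruction: {kind!r}")
    return bytes(rebuilt)
```

Calling `apply_delta(b"aaaabbbbcccc", [("copy", 0), ("data", b"X"), ("copy", 1), ("copy", 2)])` rebuilds `b"aaaaXbbbbcccc"` while receiving only one new byte. Building the result separately from the old file, as this sketch does, means the old copy stays intact until the new one is complete.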
Data backup is critical because data is central to most business operations. The rsync command-line tool is a viable option for performing backups. I hope this post helps you get started on your data backup journey!
*****
[References]
Jenkov.com, https://jenkov.com/tutorials/rsync/index.html
rsync, article 3: How does rsync work? (2022), https://michael.stapelberg.ch/posts/2022-07-02-rsync-how-does-it-work/
rsync, https://rsync.samba.org/
Wikipedia, https://en.wikipedia.org/wiki/Rsync
Michael Stapelberg: Why I wrote my own rsync, https://www.youtube.com/watch?v=wpwObdgemoE