What is Deduplication?

Deduplication is a compression technique that removes duplicate copies of data.  Many cloud storage services use deduplication to reduce their storage costs when users upload files that already exist on their servers.  This can also benefit the user: a file that is already stored does not need to be transferred again, so it uses almost no bandwidth and appears to upload instantly.

Deduplication can also be performed by “offline” backup software.  For instance, if multiple computers in an office are backed up as disk images, the operating system files they have in common can be stored once, saving space on the target storage device.  This feature is typically found only in backup software tailored for businesses.

How It Works – Technical Details

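Each file is split into chunks, and each chunk is identified by its contents, typically by computing a hash of the chunk.  When a chunk matches one that is already stored, the duplicate is not stored again.  Instead, it is replaced with a small reference that points to the stored chunk.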
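
The process described above can be sketched in a few lines of Python.  This is only a minimal illustration, not how any particular product works: the DedupStore class, the 4-byte fixed chunk size and the use of SHA-256 as the chunk identifier are all assumptions made for the example.  Real systems use much larger chunks, and often variable-sized ones.

```python
import hashlib

CHUNK_SIZE = 4  # assumed tiny fixed chunk size, chosen only to keep the example readable


class DedupStore:
    """Hypothetical in-memory store that keeps one copy of each unique chunk."""

    def __init__(self):
        self.chunks = {}  # chunk hash -> chunk bytes (stored once)
        self.files = {}   # file name -> ordered list of chunk hashes (the references)

    def put(self, name, data):
        refs = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            # Store the chunk only if an identical one is not already stored.
            self.chunks.setdefault(digest, chunk)
            # The file itself keeps just a reference to the stored chunk.
            refs.append(digest)
        self.files[name] = refs

    def get(self, name):
        # Rebuild the file by following its references in order.
        return b"".join(self.chunks[d] for d in self.files[name])


store = DedupStore()
store.put("a.txt", b"AAAABBBBCCCC")
store.put("b.txt", b"AAAABBBBDDDD")

total_refs = sum(len(refs) for refs in store.files.values())
unique_chunks = len(store.chunks)
```

The two files are made of six chunks in total, but only four unique chunks are stored, because the chunks AAAA and BBBB are shared.  Both files can still be read back in full with store.get().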

Drawbacks

Because deduplication keeps only one copy of each piece of data, the potential impact of data loss is higher.  For instance, if 100 users store the same file, a deduplicated environment holds only one copy of that file.  Losing that copy would result in all 100 users losing the file.
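
The 100-user example can be simulated with a short Python sketch.  The dictionaries, user names and file contents below are hypothetical and exist only to illustrate the point; SHA-256 is assumed as the identifier for the stored copy.

```python
import hashlib

file_data = b"example file contents"  # hypothetical file shared by every user
digest = hashlib.sha256(file_data).hexdigest()

stored = {digest: file_data}                       # the single stored copy
users = {f"user{i}": digest for i in range(100)}   # 100 references to that copy


def can_read(user):
    # A user can read the file only while the referenced copy still exists.
    return users[user] in stored


readable_before = sum(can_read(u) for u in users)

del stored[digest]  # simulate losing the one stored copy

readable_after = sum(can_read(u) for u in users)
```

Before the loss, all 100 users can read the file.  Afterwards none can, even though only a single stored object was lost.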

References

http://en.wikipedia.org/wiki/Data_deduplication