Data deduplication (or ‘dedupe’) is a method for reducing the amount of data written to storage media by eliminating duplicate data. Dedupe techniques are found in most backup software, operating systems, storage arrays and dedicated deduplication appliances.
Deduplication operates first by breaking up the data to be written to storage into chunks or blocks. Each block is compared against what has already been written to determine whether it is unique or a duplicate of data already stored anywhere in the file system or logical volume. If the block is unique, it is written to disk. If it is an exact match for a previously stored block, the deduplication file system records a pointer to the existing block instead, thereby eliminating the redundancy.
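The write path described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular product works: the `DedupeStore` class, the 4 KiB fixed block size and the use of SHA-256 fingerprints to detect exact matches are all assumptions made for the example.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed block size; real systems may use variable-size chunks


class DedupeStore:
    """Hypothetical in-memory store that keeps each unique block only once."""

    def __init__(self):
        self.blocks = {}  # fingerprint -> block bytes (each unique block stored once)
        self.files = {}   # file name -> ordered list of fingerprints (the "pointers")

    def write(self, name, data):
        pointers = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            fingerprint = hashlib.sha256(chunk).hexdigest()
            if fingerprint not in self.blocks:
                self.blocks[fingerprint] = chunk  # unique block: store it
            pointers.append(fingerprint)          # duplicate or not, record a pointer
        self.files[name] = pointers

    def read(self, name):
        # "Re-hydrate" the file by following its pointers in order.
        return b"".join(self.blocks[fp] for fp in self.files[name])


store = DedupeStore()
store.write("monday.bak", b"A" * 4096 + b"B" * 4096)
store.write("tuesday.bak", b"A" * 4096 + b"C" * 4096)  # first block repeats
print(len(store.blocks))  # 3 unique blocks are stored, not 4
```

Comparing fingerprints rather than the raw bytes keeps the lookup cheap, at the price of trusting the hash function to distinguish different blocks.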
Much of the value of a deduplication system lies in how it ensures that change does not result in data loss. For example, files are added and deleted over time. When a deletion happens, a dedupe file system cannot just erase the file. It must first determine whether any other files are using any block associated with the file marked for deletion, then adjust the pointers to correctly reflect which blocks are still valid and which may be deleted during the garbage collection process.
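One common way to track which blocks are still in use is reference counting. The sketch below assumes that approach; the `RefCountedStore` class and its method names are hypothetical, and real dedupe file systems may handle deletion quite differently.

```python
import hashlib
from collections import Counter


class RefCountedStore:
    """Hypothetical dedupe store that counts how many pointers use each block."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size
        self.blocks = {}       # fingerprint -> block bytes
        self.refs = Counter()  # fingerprint -> number of pointers to the block
        self.files = {}        # file name -> ordered list of fingerprints

    def write(self, name, data):
        if name in self.files:
            self.delete(name)  # overwriting a file releases its old pointers first
        pointers = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            fingerprint = hashlib.sha256(chunk).hexdigest()
            self.blocks.setdefault(fingerprint, chunk)
            self.refs[fingerprint] += 1
            pointers.append(fingerprint)
        self.files[name] = pointers

    def delete(self, name):
        # Deleting a file only drops its pointers; the blocks stay on disk for now.
        for fingerprint in self.files.pop(name):
            self.refs[fingerprint] -= 1

    def garbage_collect(self):
        # Reclaim only the blocks that no remaining file points to.
        dead = [fp for fp in self.blocks if self.refs[fp] <= 0]
        for fp in dead:
            del self.blocks[fp]
            del self.refs[fp]
        return len(dead)


store = RefCountedStore(chunk_size=4)
store.write("a", b"AAAABBBB")
store.write("b", b"AAAACCCC")  # shares the AAAA block with file "a"
store.delete("a")
print(store.garbage_collect())  # only the BBBB block is reclaimed
```

Separating `delete` from `garbage_collect` mirrors the two-step process described above: the shared AAAA block survives because file "b" still points to it.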
What else does deduplication do?
- Allows for a cost-effective way to keep more backup data on disk, so individual files can be recovered faster than was possible using tape, reducing the time to recover from minutes and hours to seconds and minutes.
- Makes the movement of backup data across networks more efficient. When data is deduplicated at its source, significantly less infrastructure is required to support the movement of data associated with backup processes.
- Solves the challenge of the large storage requirements typically associated with database backups. Because deduplication operates at the sub-file level, only the sub-file changes need to be stored, even though the full database would still need to be copied to the dedupe system.
Is this too good to be true?
Less data on storage means the storage can last longer. But if deduplication is so good at eliminating redundancy and reducing data volumes, then why is it not used everywhere?
- Optimisation failure – for example, deduplicating previously deduped, compressed, or encrypted data may result in data sizes actually growing.
- Performance issues – due to the processing required for the initial ingest and deduplication process, as well as the retrieval and re-hydration process.
- Cost – while deduplication is increasingly built into many software and hardware platforms as a no-cost feature, there is a cost associated with the CPU and memory resources needed for performance. Dedicated dedupe platforms are sized to support deduplication and a premium is paid for the proprietary functionality.
- Vendor lock-in – there is no standard for deduplication between vendors; each offering is unique and incompatible with that of another vendor. This can restrict future flexibility, increasing the difficulty of replacing one dedupe hardware platform with another.
- ‘All eggs in one basket’ design – by fully eliminating any and all redundancy, you risk your entire archive should you run into a data corruption issue. This is especially true if systems use global deduplication and simply replicate the storage as a secondary ‘backup’.
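The optimisation failure in the first point of the list above is easy to reproduce. The sketch below uses random bytes as a stand-in for encrypted or already-compressed data, which is an assumption made for illustration; the `unique_chunk_fraction` helper and the 4 KiB block size are also hypothetical.

```python
import hashlib
import os
import zlib


def unique_chunk_fraction(data, chunk_size=4096):
    """Return the share of fixed-size chunks that are unique (1.0 = nothing to dedupe)."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    unique = {hashlib.sha256(chunk).digest() for chunk in chunks}
    return len(unique) / len(chunks)


# 64 identical 4 KiB blocks: ideal input for deduplication.
repetitive = b"same block of backup data".ljust(4096, b".") * 64
# 64 random 4 KiB blocks: stand-in for encrypted or already-compressed data.
random_like = os.urandom(4096 * 64)

print(unique_chunk_fraction(repetitive))   # a small fraction: most chunks are duplicates
print(unique_chunk_fraction(random_like))  # every chunk is unique, so dedupe saves nothing
print(len(zlib.compress(random_like)) - len(random_like))  # bytes added by compressing
```

On the random input, every chunk is unique, so the pointers and fingerprints a dedupe system keeps become pure overhead, and compression adds a few bytes rather than removing any.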
Data deduplication is not an optimal storage technique for every backup-related workload. Applications that use backup data as the source for things like business continuance and testing benefit from faster back-end storage input/output than is available with dedupe. But when long-term retention and data archival are more important, deduplication makes more economic sense.
Dedupe can also be combined with other data reduction techniques for greater effect. For example, a backup application will use an incremental technique which writes data to a deduplication file system. The deduplicated data will then typically be compressed as it is written to storage. The net effect on overall data reduction is significant.
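A rough way to see how deduplication and compression stack is to measure the same data at each stage. The sketch below is illustrative only: the `reduction_report` function, the 4 KiB block size, the use of zlib and the sample ‘backups’ are all assumptions, and real-world ratios depend heavily on the data.

```python
import hashlib
import zlib


def reduction_report(backups, chunk_size=4096):
    """Return (raw, deduped, compressed) byte counts for a list of backups."""
    raw = 0
    unique = {}  # fingerprint -> chunk, shared across all backups
    for data in backups:
        raw += len(data)
        for i in range(0, len(data), chunk_size):
            chunk = data[i:i + chunk_size]
            unique.setdefault(hashlib.sha256(chunk).digest(), chunk)
    deduped = sum(len(chunk) for chunk in unique.values())
    # Compress each unique chunk on its way to storage.
    compressed = sum(len(zlib.compress(chunk)) for chunk in unique.values())
    return raw, deduped, compressed


# Two hypothetical backups that differ only in their last 4 KiB block.
monday = b"".join(f"record {i:06d}\n".encode() for i in range(4096))
tuesday = monday[:4096 * 13] + b"X" * 4096

raw, deduped, compressed = reduction_report([monday, tuesday])
print(raw, deduped, compressed)
```

Here deduplication removes the blocks the two backups share, and compression then shrinks the unique blocks that remain.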
We partner with Veeam Software as the innovative provider of solutions that deliver ‘Availability for the Always-On Enterprise’. Customers save time, mitigate risks, and dramatically reduce capital and operational costs.
‘De dupe dedupe…’ extracts taken from Veeam paper entitled ‘Backup and Disaster Recovery: What you need to know about data reduction techniques!’