Variable-length deduplication is the process used by EMC's Avamar and Data Domain backup appliances to shrink stored data by eliminating common segments. Because it is among the most powerful and efficient ways to deduplicate data, its reduction rates are often much higher than those of more traditional methods.
With data broken down into smaller chunks, the individual pieces can be checked for commonality. Common segments may appear many times within a single backup and across multiple sources. Deduplicating these common segments greatly reduces the amount of data that must be stored on the backup target: only unique data gets stored, while the system maintains the metadata used to rehydrate files.
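The chunk-and-fingerprint idea above can be sketched in a few lines of Python. This is a toy illustration only: the window size, boundary mask, and size limits are made-up values, and real appliances use far more sophisticated fingerprinting than a rolling byte sum.

```python
import hashlib

WINDOW = 16                      # bytes in the rolling window (illustrative)
MASK = 0x3F                      # boundary fires when low 6 bits are all set
MIN_CHUNK, MAX_CHUNK = 16, 256   # toy chunk-size limits


def chunk(data: bytes):
    """Yield variable-length chunks whose boundaries follow the content."""
    start = 0
    for i in range(len(data)):
        size = i - start + 1
        if size < MIN_CHUNK:
            continue
        window = data[i - WINDOW + 1:i + 1]
        # Cut where the window checksum hits the target pattern, or at the cap.
        if (sum(window) & MASK) == MASK or size >= MAX_CHUNK:
            yield data[start:i + 1]
            start = i + 1
    if start < len(data):
        yield data[start:]          # final partial chunk


def dedup_store(streams):
    """Store each stream as unique chunks plus a per-stream recipe."""
    store = {}    # fingerprint -> chunk bytes (the unique data)
    recipes = []  # per-stream list of fingerprints (the metadata)
    for data in streams:
        recipe = []
        for c in chunk(data):
            fp = hashlib.sha256(c).hexdigest()
            store.setdefault(fp, c)   # store only chunks not seen before
            recipe.append(fp)
        recipes.append(recipe)
    return store, recipes


def rehydrate(store, recipe):
    """Rebuild the original stream from its recipe of fingerprints."""
    return b"".join(store[fp] for fp in recipe)
```

Because boundaries are chosen by content rather than offset, identical runs of data produce identical chunks with identical fingerprints, so repeated content across streams is stored only once.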
Avamar data chunks range in size from 1 byte to 64 KB and generally average 24 KB. Data Domain's process breaks data into 4 KB to 12 KB chunks, with an average chunk size of 8 KB.
Beyond deduplicating common segments, variable-length deduplication allows a backup system to store only the data that changed between backup jobs. If a system or file receives only a small change, the next backup needs to store only that new information. For example, take a 100 MB file that receives a 1% change per day. Over 30 days of daily backups, roughly 130 MB of storage would be utilized in a changed-block backup scheme versus 3,000 MB in a traditional full-file backup. The amount of data in an Avamar-powered backup could be even less, once common variable-length chunks are deduplicated against other backed-up data.
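The arithmetic behind the example above is straightforward; a quick back-of-the-envelope calculation:

```python
# The example's figures: a 100 MB file with a 1% daily change,
# backed up daily for 30 days.
FILE_MB = 100
DAYS = 30
DAILY_CHANGE = 0.01

# Traditional full-file backups store the whole file every day.
full_file_mb = FILE_MB * DAYS                          # 3,000 MB

# A changed-block scheme stores the file once, then only the ~1 MB
# of changed data on each subsequent backup.
changed_block_mb = FILE_MB + DAYS * FILE_MB * DAILY_CHANGE   # ~130 MB
```

Cross-chunk deduplication against other systems' data would reduce the changed-block figure further still.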
Two other common methods of deduplication are file-level and fixed-length. Fixed-length deduplication works the same way as variable-length deduplication but instead breaks data into fixed-size segments. A change in the middle of a stream often shifts every following segment, resulting in far more data that needs to be stored.
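The boundary-shift problem can be demonstrated with a short sketch (the 8-byte chunk size and sample data are illustrative): insert one byte mid-stream, and every fixed-size chunk after the insertion point gets a new fingerprint.

```python
import hashlib

CHUNK = 8  # fixed chunk size (illustrative)


def fixed_chunks(data: bytes):
    """Split data at fixed offsets, regardless of content."""
    return [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]


def fingerprints(chunks):
    return [hashlib.sha256(c).hexdigest() for c in chunks]


original = b"ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
edited = original[:10] + b"!" + original[10:]  # one byte inserted mid-stream

old = set(fingerprints(fixed_chunks(original)))
new = fingerprints(fixed_chunks(edited))

# Only the chunk before the insertion point survives; every chunk after
# it is shifted by one byte and must be stored again.
changed = [fp for fp in new if fp not in old]
```

Variable-length chunking avoids this because its boundaries move with the content, so chunks after the insertion resynchronize with the old boundaries.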
File-level deduplication stores only unique files, regardless of how similar they are to one another. Multiple Word documents containing the same content but differing by an extra period or comma would each have to be stored in full. The 100 MB file receiving a 1% change each day would consume 3,000 MB of backup space after 30 days.
Though it may seem counterintuitive, the more data stored on a backup device using variable-length deduplication, the higher the deduplication rate that can be achieved. More systems providing more data results in more commonality, so with each new system added, the amount of genuinely new data to be stored decreases.
Variable-length deduplication is a technological must in a backup storage device. By identifying and stripping out common data across multiple platforms, administrators can safely store more backups. Better storage utilization can also reduce operating costs and environmental requirements while improving the preservation of data.