Deduplication technology is quickly becoming the new hotness in the IT industry. Previously, deduplication was relegated to secondary storage tiers, as the controller could not always keep up with the storage IO demand. These devices were designed to handle streams of data in and out, versus the random IO that may show up on primary storage devices. Heck… deduplication has been around in email environments for some time, just not in the same form we are seeing today.
However, deduplication is slowly sneaking into new areas of IT… and we are seeing more and more benefit elsewhere. Backup clients, backup servers, primary storage, and who-knows-where in the future.
As deduplication is being deployed across the IT world, the technology continues to advance and become quicker and more efficient. So, in order to stay on top of your game, knowing a little about the techniques behind deduplication may add another tool to your tool belt and allow you to make a better decision for your company/clients.
Deduplication is accomplished by sharing common blocks of data on storage environments and only storing the changes to the data versus storing a copy of the data AGAIN! This allows for some significant storage savings… especially when you consider that many file changes are minor adjustments versus major data loads (at least as far as corporate IT user data goes).
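To make the "sharing common blocks" idea a bit more concrete, here is a minimal sketch in Python. It is not how any particular product does it: the BlockStore class, its method names, and the choice of SHA-256 as the block fingerprint are all assumptions made for illustration. It also leaves aside how the data gets cut into blocks in the first place.

```python
import hashlib


class BlockStore:
    """Toy deduplicating store (hypothetical, for illustration only)."""

    def __init__(self):
        # fingerprint -> block contents; each unique block is kept once
        self.blocks = {}

    def put(self, block: bytes) -> str:
        # Assumption: SHA-256 of the block is used as its fingerprint.
        digest = hashlib.sha256(block).hexdigest()
        # Only store the block if we have never seen this fingerprint.
        self.blocks.setdefault(digest, block)
        return digest

    def write_file(self, chunks: list[bytes]) -> list[str]:
        # A "file" becomes a recipe: the ordered list of block fingerprints.
        return [self.put(chunk) for chunk in chunks]

    def read_file(self, recipe: list[str]) -> bytes:
        # Rebuild the file by looking each fingerprint back up.
        return b"".join(self.blocks[digest] for digest in recipe)


store = BlockStore()
recipe = store.write_file([b"aaaa", b"bbbb", b"aaaa"])
print(len(recipe), "blocks referenced,", len(store.blocks), "blocks stored")
```

The file references three blocks, but the repeated one is only stored once. Saving a second, slightly edited copy of the file would only add the blocks that are actually new.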
So, how is this magic accomplished? Great question, I am glad you asked! Enter Fixed Block deduplication and Variable Block deduplication…
Fixed Block deduplication involves determining a block size and segmenting files/data into blocks of that size. Those blocks are then what get stored in the storage subsystem.
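Fixed block segmenting is about as simple as it sounds. Here is a rough sketch in Python; the function name fixed_blocks and the 8-byte block size are my own picks for illustration (real systems use much larger blocks, typically measured in kilobytes).

```python
def fixed_blocks(data: bytes, block_size: int = 8) -> list[bytes]:
    """Cut data into blocks of block_size bytes; the last one may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


sentence = b"deduplication technologies are becoming more an more important now."
for block in fixed_blocks(sentence):
    print(block)
```

Notice that the boundaries depend only on the offset into the file, never on what the data actually says.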
Variable Block deduplication involves using algorithms to determine a variable block size, typically by looking at the data itself rather than at a fixed offset. The data is split based on the algorithm’s determination. Then, those blocks are stored in the subsystem.
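One common way to pick variable boundaries is to slide a small window over the data, compute a rolling hash of it, and cut wherever the hash hits a chosen pattern. The sketch below assumes that approach; it is not taken from any specific vendor. The function name variable_blocks, the window, divisor and max_size values, and the simple "sum of bytes" hash are all placeholders chosen to keep the example short. Real implementations use much stronger rolling hashes.

```python
def variable_blocks(data: bytes, window: int = 4, divisor: int = 4,
                    max_size: int = 16) -> list[bytes]:
    """Cut data where a rolling hash of the last `window` bytes hits a pattern."""
    blocks = []
    start = 0      # where the current block begins
    rolling = 0    # sum of the last `window` bytes (a deliberately weak hash)
    for i, byte in enumerate(data):
        rolling += byte
        if i >= window:
            rolling -= data[i - window]   # drop the byte that left the window
        size = i - start + 1
        # Cut when the hash matches the pattern (and the block is not tiny),
        # or when the block has grown to the maximum allowed size.
        at_boundary = size >= window and rolling % divisor == 0
        if at_boundary or size >= max_size:
            blocks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        blocks.append(data[start:])       # whatever is left over
    return blocks


sentence = b"deduplication technologies are becoming more an more important now."
for block in variable_blocks(sentence):
    print(block)
```

Because the cut points come from the bytes in the window, inserting data early in a file tends to disturb only the blocks near the edit; boundaries further along usually land in the same places relative to the content.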
Check out the following example, based on this sentence: “deduplication technologies are becoming more an more important now.”
Notice how the variable block deduplication has some funky block sizes. While this does not look too efficient compared to fixed block, check out what happens when I make a correction to the sentence. Oops… it looks like I used ‘an’ when it should have been ‘and’. Time to change the file: “deduplication technologies are becoming more and more important now.” File —> Save
After the file was changed and deduplicated, this is what the storage subsystem saw:
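The red sections represent the blocks that changed. Adding a single character, a ‘d’, shifted everything after it by one position. Because fixed block boundaries sit at fixed offsets, every block from the edit onward now holds different content, so the Fixed Block solution saw 4 out of 9 blocks changed. The variable block boundaries moved along with the content, so the Variable Block solution saw 1 out of 9 blocks changed. Fewer changed blocks means less new data to store, which is why variable block deduplication ends up providing a higher storage density.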
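If you want to play with the numbers yourself, here is a small Python sketch of the same experiment. Two loud assumptions: I picked an 8-character fixed block, and for the variable side I used a toy rule that cuts after every space as a stand-in for a real hash-based boundary rule. I do not know what settings produced the original picture; these just happen to give the same counts. All function names are made up for this example.

```python
OLD = "deduplication technologies are becoming more an more important now."
NEW = "deduplication technologies are becoming more and more important now."


def fixed_blocks(text: str, block_size: int = 8) -> list[str]:
    # Boundaries depend only on the offset (assumed block size: 8 characters).
    return [text[i:i + block_size] for i in range(0, len(text), block_size)]


def space_blocks(text: str) -> list[str]:
    # Toy content-defined rule (an assumption): cut right after every space.
    blocks, start = [], 0
    for i, ch in enumerate(text):
        if ch == " ":
            blocks.append(text[start:i + 1])
            start = i + 1
    if start < len(text):
        blocks.append(text[start:])
    return blocks


def count_changed(old: list[str], new: list[str]) -> int:
    # A block counts as "changed" if the store has not seen it before.
    seen = set(old)
    return sum(1 for block in new if block not in seen)


for name, split in (("fixed", fixed_blocks), ("variable", space_blocks)):
    new_blocks = split(NEW)
    changed = count_changed(split(OLD), new_blocks)
    print(f"{name}: {changed} of {len(new_blocks)} blocks changed")
# prints:
# fixed: 4 of 9 blocks changed
# variable: 1 of 9 blocks changed
```

Try moving the edit closer to the start of the sentence: the fixed block count climbs toward 9 of 9, while the content-defined count stays at 1.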
Now, if you determine you have something doing fixed block deduplication, don’t go and return it right now. It probably rocks and you are definitely seeing value in what you have. However, if you are in the market for something that deduplicates data, it is not going to hurt to ask the vendor if they use fixed block or variable block deduplication. You should find that variable block gets you better density and lets you get even more out of your storage purchase.