The next version of Microsoft Windows Server includes integrated data deduplication technology. Microsoft is positioning this as a boon for server virtualization and claims it has very little performance impact. But how exactly does Microsoft’s deduplication technology work?
Introducing Windows 8 Deduplication
Let’s make one thing clear right from the start: Microsoft started from a clean sheet and invented its own deduplication technology. This is not a licensed, cloned, or copied feature as far as I can tell. There are some clever aspects to it, along with a few head-scratchers for folks like me who’ve seen lots of different deduplication approaches.
Microsoft’s deduplication is layered onto NTFS in Windows 8, and will be a feature add-on for Server users. It is implemented as a filter driver on a per-volume basis, with each volume a complete, self-describing unit. It is cluster-aware, and fully crash consistent on all operations. This is a pretty neat trick. As is typical for Microsoft, deduplication will be a simple, transparent feature.
Now let’s talk for a moment about what Windows 8 deduplication is not.
- It is not available on client editions of Windows: like so many of Microsoft’s storage developments, it is a server-only feature. But perhaps we might see it appear in low-end or home servers in the future.
- It is not supported on boot or system volumes.
- Although it should work just fine on removable drives, deduplication requires NTFS, so you can forget about FAT or exFAT. And of course the connected system must be running a server edition of Windows 8.
- Although deduplication does not work with Cluster Shared Volumes (CSV), it is supported in Hyper-V configurations that do not use CSV.
- Finally, deduplication does not function on encrypted files, files with extended attributes, tiny (less than 64 KB) files, or reparse points.
Some Technical Details on Deduplication in Windows 8
Microsoft Research spent two years experimenting with algorithms to find the “cheapest” one in terms of overhead. The result is a variable chunk size selected for each data set: chunks typically fall between 32 KB and 128 KB, though smaller chunks can be created as well, and Microsoft claims the average chunk works out to about 80 KB in most real-world use cases. The system scans all the data looking for “fingerprints” that mark candidate split points and selects the “best” on the fly for each file.
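If you’ve never seen content-defined chunking before, here is a rough sketch of the general idea in PowerShell. To be clear, this is my own toy illustration, not Microsoft’s algorithm: the hash function, the cut-point mask, and the exact use of the 32 KB and 128 KB bounds are all assumptions made for clarity.

# A minimal sketch of content-defined chunking: scan the data, keep a running
# "fingerprint," and cut a chunk wherever the fingerprint matches a pattern,
# bounded by minimum and maximum chunk sizes. Real implementations use a rolling
# hash over a small window (Rabin fingerprints, for example); this toy simply
# resets its hash at each cut.
function Split-FileIntoChunks {
    param(
        [string]$Path,
        [int]$MinSize = 32KB,
        [int]$MaxSize = 128KB,
        [uint32]$Mask = 0x1FFF    # illustrative; controls the average chunk size
    )
    $data   = [System.IO.File]::ReadAllBytes($Path)
    $chunks = @()
    $start  = 0
    $hash   = 0
    for ($i = 0; $i -lt $data.Length; $i++) {
        # Toy running hash, kept within 32-bit range
        $hash = ($hash * 31 + $data[$i]) % 4294967296
        $length = $i - $start + 1
        # Cut here if the fingerprint matches (and we are past the minimum),
        # or force a cut at the maximum chunk size
        if (($length -ge $MinSize -and ($hash -band $Mask) -eq 0) -or $length -ge $MaxSize) {
            $chunks += [pscustomobject]@{ Offset = $start; Length = $length }
            $start = $i + 1
            $hash = 0
        }
    }
    if ($start -lt $data.Length) {
        $chunks += [pscustomobject]@{ Offset = $start; Length = $data.Length - $start }
    }
    return $chunks
}

Each object that comes back describes a chunk by offset and length; a real deduplication engine would then hash each chunk and look it up in the chunk store to decide whether it is a duplicate.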
After data is deduplicated, Microsoft compresses the chunks and stores them in a special “chunk store” within NTFS. This actually lives in the System Volume Information folder at the root of the volume, so dedupe is volume-level. The entire setup is self-describing, so a deduplicated NTFS volume can be read by another server without any external data.
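If you want to poke around, the chunk store should be visible under that System Volume Information folder, given sufficient privileges. I haven’t confirmed the exact path or cmdlet names on the developer preview, so treat the following as an educated guess:

# NTFS normally blocks even administrators from System Volume Information,
# so -Force and an elevated context are required; the "Dedup" folder name is
# an assumption on my part
Get-ChildItem 'E:\System Volume Information\Dedup' -Recurse -Force
# The Deduplication module should also report chunk store statistics directly
Get-DedupMetadata -Volume E: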
There is some redundancy in the system as well. Any chunk that is referenced more than a set number of times (100 by default) will be kept in a second location. All data in the filesystem is checksummed and will be proactively repaired, and the same is done for the metadata. The deduplication service includes a scrubbing job as well as a file system optimization task to keep everything running smoothly.
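Both the scrubbing job and the redundancy threshold appear to be exposed through the Deduplication PowerShell module. The cmdlet and parameter names below are my best guess from the briefing material; verify them with Get-Command -Module Deduplication on your own build.

# Kick off an integrity scrub of the chunk store on demand
Start-DedupJob -Volume E: -Type Scrubbing
# Keep an extra copy of any chunk referenced more than 50 times (default is 100)
Set-DedupVolume -Volume E: -ChunkRedundancyThreshold 50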
Windows 8 deduplication cooperates with other elements of the operating system. The Windows caching layer is dedupe-aware, which greatly accelerates overall performance. Windows 8 also includes a new “express” library that Microsoft says makes compression “20 times faster”. Already-compressed files are detected by file type and are not re-compressed, so zip files, Office 2007+ documents, and the like are skipped by compression and simply deduplicated.
New writes are not deduped; this is a post-process technology. The data deduplication service can be scheduled or can run in “background mode” and wait for idle time. I/O impact therefore ranges from “none to 2x” depending on the operation. Opening a file requires less than 3% more I/O, and can even be faster if the data is cached. Copying a single large file (a 10 GB VHD, for example) can take somewhat longer since deduplication adds extra disk seeks, but multiple concurrent copies that share data can actually improve performance.
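Since it is post-process, you control when the work happens. A throttled background schedule might look something like this; again, the cmdlet names are assumed from the briefing, so check the module on your own preview build.

# Run optimization for up to six hours on weeknight evenings
New-DedupSchedule -Name "NightlyOptimization" -Type Optimization -Start "22:00" -DurationHours 6 -Days Monday,Wednesday,Friday
# Or fire off a one-time optimization pass right now
Start-DedupJob -Volume E: -Type Optimization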
Stephen’s Stance
Although I am intrigued by Microsoft’s new deduplication technology in Windows 8 server, I still have many questions about its usefulness and impact on performance. Concentrating duplicate data in the system volume makes sense from a technical perspective, but could lead to an I/O hotspot on the disk. This is especially true for external caching storage systems, since there is no integration between Microsoft deduplication and storage array features. I am particularly concerned about the use of deduplication with VHD files in Hyper-V, since it could eat up valuable system RAM and impact I/O performance.
If you would like to try Microsoft deduplication for yourself, I am happy to report that it is included in the developer preview of Windows 8 that is available on Dev Center. Here are a few commands to get you started, and read Rick Vanover’s post too!
Import-Module ServerManager
Add-WindowsFeature -Name FS-Data-Deduplication
Import-Module Deduplication
Enable-DedupVolume E:
Get-DedupVolume
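Once an optimization pass has run, you should be able to check the results. Assuming the status cmdlets match what I saw in the briefing, something like this will show space savings and job progress:

Get-DedupStatus
Get-DedupJob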
Note: I am a Microsoft MVP and Microsoft briefs me on upcoming technologies under NDA. This post is based on a Microsoft briefing from November which was said at the time not to be covered by any NDA. All of this information could be gleaned by experimenting with the Windows 8 developer preview, but it’s much easier to just go to the source.