Syndicated

Data Dedupe comes to ZFS

It’s official… Data deduplication has been added to ZFS (read the link if you’re new to data deduplication). Hats off to Jeff Bonwick and Bill Moore, who did a ton of the work, along with Mark Maybee, Matt Ahrens, Adam Leventhal, George Wilson and the entire ZFS team. The implementation is a synchronous, block-level one that deduplicates data immediately as it is written. This is analogous to how Data Domain does it in their dedupe appliances.

What’s interesting about this is that dedupe will now be available for *free* unless Oracle does something stupid. Sun’s implementation is complementary to the already-existing filesystem compression. I’m not sure how much of an issue this is yet, but the current iteration cannot take advantage of the SHA-256 acceleration in the SPARC Niagara 2 CPUs; eventually we should see hardware acceleration implemented.

When will it be available? It should be available in the OpenSolaris dev branches in the next couple of weeks, as the code was just committed as part of snv_128. General availability in Solaris 10 will take a bit longer, until the next update happens.

For OpenSolaris, you change your repository and switch to the development branches – should be available to public in about 3-3.5 weeks time.  Plenty of instructions on how to do this on the net and in this list.   — James Lever on the zfs-discuss mailing list
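For reference, on a stock OpenSolaris 2009.06 install the repository switch described above is usually just a publisher change plus an image update. The “dev” URL below is the commonly cited one, so double-check it before running:

pkg set-publisher -O http://pkg.opensolaris.org/dev/ opensolaris.org
pkg image-update
# image-update creates a new boot environment; reboot into it to pick up the new build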

How do I use it? If you haven’t built an OpenSolaris box before, you should start looking at this great blog post here. I wouldn’t get things rolling until dedupe is in the public release tree.

Ah, finally, the part you’ve really been waiting for.

If you have a storage pool named ‘tank’ and you want to use dedup, just type this:

zfs set dedup=on tank

That’s it.

Like all zfs properties, the ‘dedup’ property follows the usual rules for ZFS dataset property inheritance. Thus, even though deduplication has pool-wide scope, you can opt in or opt out on a per-dataset basis.

– Jeff Bonwick http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup#comments
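To make Jeff’s inheritance point concrete, here’s a quick sketch (the child dataset name ‘tank/scratch’ is just a made-up example):

zfs set dedup=on tank          # enable dedup at the pool’s top-level dataset
zfs set dedup=off tank/scratch # opt a single dataset back out
zfs get -r dedup tank          # show what each dataset inherited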

What does this mean to me? Depends. For people who like to tinker, you can build your own NAS or iSCSI server with dedupe *and* compression turned on. Modern CPUs keep increasing in speed and can handle this. This is huge. Now, should you abandon considering commercial dedupe appliances that are shipping today? Not if you want a solution for production, as this won’t be officially supported until it’s rolled into the next Solaris update. For commercial dedupe technology vendors, this is another mark on the scorecard for the commoditization of dedupe.
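If you do want to tinker, enabling both features is just a property flip each; the zvol at the end is an illustrative example of carving out a block device you could export over iSCSI (names and sizes are hypothetical):

zfs set compression=on tank
zfs set dedup=on tank
zfs create -V 100G tank/iscsi-lun   # a 100 GB zvol to share out as an iSCSI LUN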

What things do I need to be aware of? The bugs need to be worked out of this early on, so apply standard caution. READ JEFF’S BLOG POST FIRST!!! There is a verification feature; use it if you’re either worried about your data or using fletcher4 as the hashing algorithm to speed up dedupe performance (zfs set dedup=verify tank or zfs set dedup=fletcher4,verify tank).
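Once dedup is running, the achieved ratio is exposed at the pool level, so you can sanity-check whether your data actually dedupes (pool name ‘tank’ assumed again):

zpool get dedupratio tank   # read-only property showing the overall dedup ratio
zpool list tank             # the DEDUP column (in builds that have it) reports the same figure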

How do I stay up to date on ZFS in general? Subscribe to the zfs-discuss mailing list (also available in forum format). It can be high volume but it is worth it if you want to stay on top of all things ZFS.

http://mail.opensolaris.org/pipermail/onnv-notify/2009-November/010683.html

About the author

Ed Saipetch

Ed Saipetch is a virtualization practice lead and systems engineer. He has been in both the end user and value-added reseller space, with a focus on application infrastructure and web architecture scalability.

Comments

  • Interesting post, Ed, and the timing is perfect as we'll be seeing you later this week at Tech Field Day. You'll have a chance to see how a robust commercial dedupe platform like Ocarina works as opposed to a point technology such as this new ZFS offering. Here are some of the key differences:

    – Reliability and assured data integrity (clustering, auditability)
    – Performance: The ability to adjust the horsepower applied to compressing (cores, algorithm settings, clustering…)
    – Working transparently with 3rd party storage
    – End-to-end
    – Manageability and flexibility…controlling what gets compressed when
    – Content awareness – mechanisms for utilizing multiple algorithms for different scenarios

    In short, you want to be able to use any open or proprietary algorithms, at the right time, on the right data, and to be able to introduce that platform into standard data workflows and existing storage architectures in a non-disruptive way. Looking forward to having a chance to dig into this next Friday in more detail. Thanks for listening, Mike Davis, Ocarina

  • @Mike Davis

    Don't know if there are that many differences….

    -Reliability and assured data integrity: All possible with ZFS / OpenStorage
    -Performance: Many tweaks available, but the target is that this should not be needed
    -ZFS uses _any_ backend storage, but likes JBODs the most
    -End-to-end -> Don't know what you mean by that
    -Manageability and flexibility: Available down to the filesystem/volume level

    I would say, ZFS dedup (and compression!) is not a point technology. It is just a feature amongst many others.

    Sooner or later dedup is just a checkbox in a product description.

