Getting excited by a file system makes people give you weird looks. Whenever someone talks about snapshotting or journalling features, my ears perk up. But even if this leaves you unmoved, it’s impossible not to recognize that you’re always working within one. People might not get excited by a car frame, but that doesn’t mean they aren’t riding on one all the time.
If file systems are cool, then distributed file systems are Miles Davis. They’re also incredibly hard to build. A lot of “distributed” file systems only fit the term in the broadest sense. Many rely on a centralized model, which can be fine but really limits how far you can scale. Others offer true distribution but run into performance trouble. Elastifile’s answer is the Elastifile Cloud File System. This isn’t just simple distribution: it’s an application-level distributed file system, aimed at offering the flexibility of the public cloud along with all the enterprise data services expected in a hybrid cloud.
This isn’t a bunch of open-source options wrapped up with some optimizations and service offerings either. Elastifile did some hard computer science in their solution.
Consistency is Key
One of the biggest issues with distributed file systems is performance. Because it’s kind of a big deal for a file system not to lose data, a lot of distributed designs prioritize consistency over performance. That’s fine, but it often limits how far a system can scale: the more nodes that have to agree on every change, the more coordination each change requires, and that coordination costs latency.
This is because most distributed file systems use a log-based consensus algorithm. Essentially, there is a distributed, ordered record of state changes. Every server in the cluster knows the file system’s initial state, so by applying the same log of changes in the same order, the servers all arrive at the same consistent state. If one node fails, the others still hold the log, so the whole system keeps ticking. Awesome, right?
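To make the replay idea concrete, here is a minimal Python sketch of log-based state replication. It is a toy, single-process model built on my own assumptions: the `Entry` and `Replica` names are hypothetical, a log entry is just a key/value assignment, and the consensus step that gets nodes to agree on the log’s contents is left out entirely. What it does show is that two replicas starting from the same initial state and applying the same log end up identical, and that a node coming back from a failure has to replay the log to catch up.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entry:
    """One state change in the shared log (hypothetical): set `key` to `value`."""
    key: str
    value: int


@dataclass
class Replica:
    # Every replica starts from the same, known initial state (empty here).
    state: dict = field(default_factory=dict)
    applied: int = 0  # how many log entries this replica has applied

    def catch_up(self, log: list) -> None:
        # Apply, strictly in log order, every entry not yet applied.
        for entry in log[self.applied:]:
            self.state[entry.key] = entry.value
        self.applied = len(log)


shared_log = [Entry("a", 1), Entry("b", 2), Entry("a", 3)]

r1 = Replica()
r1.catch_up(shared_log)

# r2 crashed and came back with nothing: it has to replay the whole log.
r2 = Replica()
r2.catch_up(shared_log)

print(r1.state == r2.state)  # True
print(r2.state)              # {'a': 3, 'b': 2}
```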
Elastifile thinks there are a couple of issues with a log-based approach. First, performance tends to be inconsistent at best, since slow disk I/O or packet loss on a single change forces every other node to wait while that change is committed to the log.
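A rough way to see why one shared log makes latency unpredictable is to model head-of-line blocking. In this sketch all writes are issued at once, and each has a hypothetical time at which it becomes durable. In a single totally ordered log, an entry only counts as committed once every entry ahead of it has committed; with independent per-key ordering, a write only waits on earlier writes to the same key. The workload numbers are invented for illustration, not measurements of any real system.

```python
# Hypothetical workload: (key, time in seconds at which that write becomes
# durable). All writes are issued at t=0; the write to "b" hits a slow disk.
writes = [("a", 0.01), ("b", 0.50), ("c", 0.01), ("d", 0.01)]


def shared_log_commit_times(writes):
    # One totally ordered log: an entry is committed only once it is durable
    # AND every entry ahead of it is committed (head-of-line blocking).
    committed_at, out = 0.0, []
    for key, durable_at in writes:
        committed_at = max(committed_at, durable_at)
        out.append((key, committed_at))
    return out


def per_key_commit_times(writes):
    # Independent ordering per key: a write waits only on earlier writes
    # to the same key.
    last_commit = {}
    out = []
    for key, durable_at in writes:
        last_commit[key] = max(last_commit.get(key, 0.0), durable_at)
        out.append((key, last_commit[key]))
    return out


print(shared_log_commit_times(writes))
# [('a', 0.01), ('b', 0.5), ('c', 0.5), ('d', 0.5)]
print(per_key_commit_times(writes))
# [('a', 0.01), ('b', 0.5), ('c', 0.01), ('d', 0.01)]
```

With one slow write to `b`, the unrelated writes to `c` and `d` inherit its half-second delay in the shared log, but not when each key is ordered on its own.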
The log also serves as a single point of failure. If a node is lost, the log must be completely recovered so the other nodes can review the changes and confirm they are in a consistent state. The risk here isn’t so much data loss as a bottleneck for the entire system.
Cue: “There’s got to be a better way!”
Elastifile created its own consensus algorithm, Bizur, to get around the limitations of a log-based approach. It is the key to how they are able to better balance consistency and performance in their file system. Bizur takes a different approach: it uses a key-value model in which each key is independent of the others.
This matters because a request on a given key doesn’t require replaying the entire series of prior changes before it can complete. Operations no longer suffer an unpredictable latency period while the entire distributed file system waits on I/O and network reliability issues. Instead, changes to different keys can proceed concurrently. Rather than working with one large, unwieldy consistency log, the key-value approach lets Elastifile hash keys into predefined buckets, each of which reaches consensus on its own.
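The bucketing idea can be sketched roughly as follows. This is a toy, single-process Python model and not Bizur itself: it leaves out replication and the per-bucket consensus protocol, and the bucket count, the SHA-256 hash, and the `Bucket`/`BucketedStore` names are my own illustrative assumptions. It only shows the structural point: each key hashes deterministically to one of a fixed set of buckets, each bucket serializes its own updates, and writes that land in different buckets never wait on each other.

```python
import hashlib
import threading

NUM_BUCKETS = 8  # hypothetical; chosen small for illustration


def bucket_for(key: str) -> int:
    # Stable hash (unlike Python's per-process salted hash()) so every
    # node would map a given key to the same bucket.
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS


class Bucket:
    """One independent unit of agreement (hypothetical): own lock, own version."""

    def __init__(self):
        self.lock = threading.Lock()
        self.version = 0
        self.data = {}

    def put(self, key, value):
        # Only writers to *this* bucket serialize here; other buckets
        # are unaffected.
        with self.lock:
            self.version += 1
            self.data[key] = value


class BucketedStore:
    def __init__(self):
        self.buckets = [Bucket() for _ in range(NUM_BUCKETS)]

    def put(self, key, value):
        self.buckets[bucket_for(key)].put(key, value)

    def get(self, key):
        return self.buckets[bucket_for(key)].data.get(key)


# Usage: 100 concurrent writers spread across the buckets.
store = BucketedStore()
threads = [
    threading.Thread(target=store.put, args=(f"file-{i}", i)) for i in range(100)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(store.get("file-42"))                    # 42
print(sum(b.version for b in store.buckets))   # 100
```

Because contention is limited to keys that share a bucket, adding buckets spreads the coordination out instead of funnelling every change through one ordered log.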
Bizur is the key around which Elastifile has built its product. By the company’s own admission, it is purposefully less general-purpose than a distributed log approach. That specificity opens up a rethinking of what to expect from a distributed file system. Elastifile is at the forefront not just of distributed file systems, but of ones built with the modern data center in mind.
Elastifile also published a paper with more of Bizur’s technical details, if you’re inclined to dig into the computer science behind it. I’d also recommend their Storage Field Day presentation for a more general overview of their file system.