A lot of people seem to be curious about how GlusterFS works, not just in the sense of effects but in terms of internal algorithms etc. as well. Here’s an example from this morning. The documentation at this level really is kind of sparse, so I might as well start filling some of the gaps. Today I’ll talk about DHT, which is the real core of how GlusterFS aggregates capacity and performance across multiple servers. Its responsibility is to place each file on exactly one of its subvolumes – unlike either replication (which places copies on all of its subvolumes) or striping (which places pieces onto all of its subvolumes). It’s a routing function, not splitting or copying.
The basic method used in DHT is consistent hashing. Each subvolume (brick) is assigned a range within a 32-bit hash space, covering the entire range with no holes or overlaps. Then each file is also assigned a value in that same space, by hashing its name. Exactly one brick will have an assigned range including the file’s hash value, and so the file “should” be on that brick. However, there are many cases where that won’t be the case, such as when the set of bricks (and therefore the range assignment of ranges) has changed since the file was created, or when a brick is nearly full. Much of the complexity in DHT involves these special cases, which we’ll discuss in a moment. First, though, it’s worth making a couple more observations about the basic algorithm.
So, those are the basics. How about those special cases? It’s probably easiest to look at the “read” path first, where we’re trying to find a file that we expect to be there. Here’s the sequence of operations.
What’s a link file, then? Have you ever looked on one of your bricks and seen zero-length files with weird permissions (sticky bit set)? Those are link files. If you look closer, you’ll also see that they have trusted.dht.linkfile xattrs with brick names in them. That’s how we avoid the “broadcast” mentioned above. On subsequent lookups, if we find a link file we just follow it to the real brick. Considering that we only go through this lookup procedure once per file per client anyway (location information is cached), the cost of “guessing wrong” is therefore pretty minimal. I once implemented a scheme where we do an exponentially expanding search instead of an immediate broadcast, hoping to achieve a better balance of lookup latency vs. network traffic, but in the end it just didn’t seem to make a difference so the extra complexity wasn’t worth it. Now, let’s look at the file-creation path.
This brings us to rebalancing, which is one of the key challenges – and therefore one of the most interesting research areas IMO – in this kind of system. The first thing to know about GlusterFS rebalancing is that it’s not automatic. If you add a new brick, even new files won’t be put on it until you do the “fix-layout” part of rebalance, and old files won’t be put on it until you do the “migrate-data” part. What do these do?
In my personal opinion, there are problemsenhancement opportunities in both of these areas. Let’s take these in reverse order. Migrate-data is slow. What it should do is run in parallel on all of the bricks, with each brick either “pushing” data that is currently local but needs to be elsewhere or “pulling” data that’s the other way around. What it does instead is run on one node, potentially moving files for which it is neither source nor destination. This is a big deal, because it causes rebalance to take days when it should take hours – or weeks when it should take days, on larger installations. The amount of I/O involved is also why you don’t necessarily want this to be an automatic process.
While the migrate-data issue is at the level of mechanics and implementation, the fix-layout issue is at more of a conceptual level. To put it simply, when we add a new brick we should reallocate approximately 1/new_brick_count hash values. Because our layout calculations are naive, we will usually reallocate much more than that – exacerbating the migrate-data problem because reallocated hash values correspond to moved files. Time for a picture.
The outer ring represents the state with just three bricks – hash value zero at the top, split into three equal ranges. The inner ring represents the state after adding a fourth brick. Any place where the inner and outer rings are different colors represents a range that has been reassigned from one brick to another – implying a migration of data likewise. If you look carefully, you’ll see that we’re moving half of the data when it should be only a quarter – 8% blue to orange, 17% orange to yellow, and 25% yellow to green. What could we do that’s better? Not much, if we stay within the limitation of a single brick claiming a single range, but there really doesn’t need to be such a limitation. Instead, we could borrow from Dynamo and assign multiple “virtual node IDs” for brick four, giving it a total of 25% drawn equally from bricks one through three. (If you look at this as a clock, that’s one hour each at three, seven, and eleven o’clock). Taking this approach too far can cause layout tables to get unreasonably large, so sometimes it makes more sense to forego this optimization or even reverse it by combining/swapping ranges even though that will cause “unnecessary” data migration. Done sensibly, this can keep the table size under control without resorting to the scale-limiting “hash bucket” approach of some Dynamo derivatives. There are even more sophisticated ways to address this problem, but that’s a whole different post and this should be enough to convey the general problem space.
That’s how DHT works today, and some thoughts about how it might evolve in the future. The next article will focus on replication.
2020 has not been a year we would have been able to predict. With a worldwide pandemic and lives thrown out of gear, as we head into 2021, we are thankful that our community and project continued to receive new developers, users and make small gains. For that and a...
It has been a while since we provided an update to the Gluster community. Across the world various nations, states and localities have put together sets of guidelines around shelter-in-place and quarantine. We request our community members to stay safe, to care for their loved ones, to continue to be...
The initial rounds of conversation around the planning of content for release 8 has helped the project identify one key thing – the need to stagger out features and enhancements over multiple releases. Thus, while release 8 is unlikely to be feature heavy as previous releases, it will be the...