Gluster can have trouble delivering good performance for small-file workloads. The problem is especially acute for features such as tiering and RDMA, which rely on expensive hardware such as SSDs or InfiniBand. With small-file workloads the hardware's benefits go unrealized, so there is little return on the investment.
A major contributing factor to this problem has been excessive network overhead in fetching file and directory metadata; each such fetch is called a LOOKUP. The aggregated cost of these LOOKUPs exceeds the benefit of the hardware's accelerated data transfers. Note that for larger files the picture changes: the improved transfer times outweigh the LOOKUP costs, so in those cases RDMA and tiering work well.
The chart below depicts the problem with RDMA: large-file read workloads perform well, while small-file read workloads perform poorly.
The following examples use the "smallfile" [1] utility as a workload generator. I run a large 28-brick tiered volume, "vol1". The configuration's hot tier is a 2 × 2 RAM disk, and the cold tier is 2 × (8 + 4) HDDs. I run from a single client, mounted over FUSE. The entire working set of files resides on the hot tier. The tiering experiments can also be found in the SNIA SDC presentation here [3].
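For reference, a volume with that shape could be assembled roughly as follows. This is only a sketch with hypothetical hostnames and brick paths, not the exact commands used for these tests: the cold tier is created as two 8+4 disperse subvolumes (24 bricks), and the hot tier is attached afterwards as a 2 × 2 replicated set of RAM-disk bricks, giving 28 bricks in total.

# Cold tier: 2 x (8 + 4) disperse, i.e. two 12-brick subvolumes (hypothetical hosts and brick paths).
# "force" is needed because several bricks of each disperse set end up on the same server.
$ gluster volume create vol1 disperse 12 redundancy 4 \
    server{1..4}:/bricks/hdd{1..6} force
$ gluster volume start vol1
# Hot tier: 2 x 2 replicated RAM-disk bricks attached as the hot tier.
$ gluster volume tier vol1 attach replica 2 server{1..4}:/bricks/ramdisk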
Running Gluster's profile against the tiered volume counts the LOOKUPs and illustrates the problem.
$ ./smallfile_cli.py --top /mnt/p66.b --host-set gprfc066 --threads 8 \
--files 5000 --file-size 64 --record-size 64 --fsync N --operation read
$ gluster volume profile vol1 info cumulative | grep -E 'Brick|LOOKUP'
...
Brick: gprfs018:/t4
      93.29     386.48 us     100.00 us    2622.00 us      20997      LOOKUP
Roughly 20K LOOKUPs are sent to each brick on the first run.
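To tally the calls across every brick in one shot, the counts can be summed with a one-liner like the one below. It assumes the column layout shown above, where the call count is the second-to-last field of each LOOKUP line.

$ gluster volume profile vol1 info cumulative | \
    awk '$NF == "LOOKUP" {sum += $(NF-1)} END {print sum " LOOKUPs in total"}'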
The purpose behind most LOOKUPs is to confirm the existence and permissions of a given directory or file. The client sends such LOOKUPs for each level of the path. This phenomenon has been dubbed the "path traversal problem," and it is a well-known issue in distributed storage systems [2]. The round-trip time for each LOOKUP is not small, and the cumulative effect is large. Alas, Gluster has suffered from it for years.
The smallfile_cli.py utility opens a file, performs an I/O, and then closes it. The path is 4 levels deep (p66/file_srcdir/gprfc066/thrd_00/<file>).
The 20K figure can be derived: there are 5000 files and 4 levels of directories, so 5000 × 4 = 20K.
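A quick sanity check of that arithmetic against the profile output above:

# One LOOKUP per path component, per file opened.
$ files=5000; levels=4; echo "$((files * levels)) LOOKUPs expected per brick"
20000 LOOKUPs expected per brick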
The DHT and tier translators must determine on which brick a file resides. To do this, the first LOOKUP for a file is sent to all subvolumes. The brick that has the file is called the "cached subvolume." Normally it is the one predicted by the distributed hash algorithm, unless the set of bricks has recently changed. Subsequent LOOKUPs are sent only to the cached subvolume.
Even with the cached subvolume known, it still receives as many LOOKUPs as the path has levels, due to the path traversal problem. So when the test is run a second time, gluster profile still shows 20K LOOKUPs, but only on the bricks of the hot tier (the tier translator's cached subvolume), and nearly none on the cold tier. The round trips are still there, and the overall problem persists.
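One way to observe this, assuming your Gluster version supports clearing the profile counters, is to reset them between runs and compare the per-brick LOOKUP counts of the hot- and cold-tier bricks:

$ gluster volume profile vol1 info clear        # reset the per-brick counters
$ ./smallfile_cli.py --top /mnt/p66.b --host-set gprfc066 --threads 8 \
    --files 5000 --file-size 64 --record-size 64 --fsync N --operation read
$ gluster volume profile vol1 info cumulative | grep -E 'Brick|LOOKUP'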
To cope with this "lookup amplification," a project has been underway to improve Gluster's metadata cache translator (md-cache) so that the stat information LOOKUP requests fetch can be cached indefinitely on the client. This solution requires client-side cache entries to be invalidated if another client modifies a file or directory. The invalidation mechanism is called an "upcall." It is complex and has taken time to write, but as of October 2016 this new functionality is largely code complete and available in Gluster upstream.
Enabling upcall in md-cache:
$ gluster volume set <volname> features.cache-invalidation on
$ gluster volume set <volname> features.cache-invalidation-timeout 600
$ gluster volume set <volname> performance.stat-prefetch on
$ gluster volume set <volname> performance.cache-samba-metadata on
$ gluster volume set <volname> performance.cache-invalidation on
$ gluster volume set <volname> performance.md-cache-timeout 600
$ gluster volume set <volname> network.inode-lru-limit <big number here>
In the example, I used 90000 for the inode-lru-limit.
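To confirm the options took effect, they can be read back with gluster volume get (a quick check, assuming a release new enough to support that command):

$ gluster volume get vol1 all | \
    grep -E 'cache-invalidation|md-cache-timeout|stat-prefetch|inode-lru-limit'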
At the time of this writing, a cache entry will still expire after the configured timeout (600 seconds in the example above). The code will eventually be changed to allow an entry to never expire; that functionality will come once more confidence is gained in the upcall feature.
With this enabled, gluster profile shows the number of LOOKUPs dropping to a negligible number on all subvolumes. As reported by the smallfile_cli.py benchmark, this translates directly into better throughput for small-file workloads. YMMV, but in my experiments I saw tremendous improvements, and the SSDs' benefits were finally enjoyed.
Tuning notes: to gauge how effective the cache is, take a statedump of the Gluster client process (SIGUSR1 triggers one) and check the stat cache hit and miss counters:
$ kill -USR1 `pgrep gluster`
# wait a few seconds for the dump file to be created
$ find /var/run/gluster -name \*dump\* -exec grep -E 'stat_miss|stat_hit' {} \;
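To roll those counters up into totals across all of the dump files, something like this works, assuming the counters appear as name=value lines in the dump (adjust the parsing if your version formats them differently):

$ find /var/run/gluster -name '*dump*' -exec cat {} + | \
    awk -F= '/stat_(hit|miss)/ {sum[$1] += $2} END {for (k in sum) print k, sum[k]}'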
Some caveats
[1] The "smallfile" workload generator utility.
[2] Ceph: Reliable, Scalable, and High-Performance Distributed Storage, section 4.1.2.3.
[3] SNIA SDC 2016, "Challenges with Persistent Memory in Distributed Storage Systems".