GlusterFS is a distributed file system implemented in user space. Strictly speaking, it is not a native file system itself but an aggregator of other file systems. GlusterFS can aggregate individual file system mount points or directories (called bricks in gluster terminology) to provide a single unified file system namespace. In addition to NFS and CIFS, the most common way to access the GlusterFS namespace is via the FUSE-based Gluster native client.
More information on creating and mounting GlusterFS volumes can be found on the GlusterFS website.
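For completeness, a simple gluster volume can typically be created and FUSE-mounted with commands along the following lines; the server address, volume name and brick path here are purely illustrative:
gluster volume create testvol 1.2.3.4:/export/brick1
gluster volume start testvol
mount -t glusterfs 1.2.3.4:/testvol /mnt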
Until recently, using GlusterFS volumes to host VM images and data was sub-optimal due to the FUSE overhead involved in accessing gluster volumes via the GlusterFS native client. However, this has now changed with two specific enhancements:
– A new library called libgfapi is now available as part of GlusterFS that provides POSIX-like C APIs for accessing gluster volumes. libgfapi support will be available from the GlusterFS-3.4 release.
– QEMU (starting from QEMU-1.3) will have a GlusterFS block driver that uses libgfapi, so there is no longer any FUSE overhead when QEMU works with VM images on gluster volumes.
GlusterFS, with its pluggable translator model, can serve as a flexible storage backend for QEMU. QEMU just has to talk to GlusterFS, and GlusterFS hides the different file systems and storage types underneath. Various GlusterFS storage features like replication and striping automatically become available to QEMU. Efforts are also underway to add a block device backend in Gluster via the Block Device (BD) translator, which will expose underlying block devices as files to QEMU. This would allow GlusterFS to serve as a single storage backend for both file-based and block-based storage types.
A VM image residing on a gluster volume can be specified on the QEMU command line using the following URI format:
gluster[+transport]://[server[:port]]/volname/image[?socket=…]
gluster is the protocol.
transport specifies the transport type used to connect to the gluster management daemon (glusterd). Valid transport types are tcp, unix and rdma. If a transport type isn’t specified, tcp is assumed.
server specifies the server where the volume file specification for the given volume resides. This can be a hostname, an IPv4 address or an IPv6 address; an IPv6 address needs to be enclosed in square brackets [ ]. If the transport type is unix, the server field should not be specified; instead, the socket field needs to be populated with the path to the unix domain socket.
port is the port number on which glusterd is listening. This is optional; if not specified, QEMU sends 0, which makes gluster use the default port. If the transport type is unix, port should not be specified.
volname is the name of the gluster volume which contains the VM image.
image is the path to the actual VM image on the gluster volume.
gluster://1.2.3.4/testvol/a.img
gluster+tcp://1.2.3.4/testvol/a.img
gluster+tcp://1.2.3.4:24007/testvol/dir/a.img
gluster+tcp://[1:2:3:4:5:6:7:8]/testvol/dir/a.img
gluster+tcp://[1:2:3:4:5:6:7:8]:24007/testvol/dir/a.img
gluster+tcp://server.domain.com:24007/testvol/dir/a.img
gluster+unix:///testvol/dir/a.img?socket=/tmp/glusterd.socket
gluster+rdma://1.2.3.4:24007/testvol/a.img
(GlusterFS URI description and above examples are taken from QEMU documentation)
While building QEMU from source, in addition to the normal configuration options, ensure that the --enable-uuid and --enable-glusterfs options are specified explicitly with the ./configure script. (Update Feb 2013: A fix in the QEMU-1.3 time frame makes the use of --enable-uuid unnecessary for GlusterFS support in QEMU.)
Update Aug 2013: Starting with QEMU-1.6, pkg-config is used to configure the GlusterFS backend in QEMU. If you are using GlusterFS compiled and installed from sources, then the GlusterFS package config file (glusterfs-api.pc) might not be present at the standard path and you will have to explicitly add the path by executing this command before running the QEMU configure script:
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig/
Without this, the GlusterFS driver will not be compiled into QEMU even when GlusterFS is present on the system.
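Putting the above together, a typical build when GlusterFS is installed under /usr/local could look like the following; the pkg-config path and the target list are just examples:
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig/
pkg-config --modversion glusterfs-api   # confirms that glusterfs-api.pc is found
./configure --enable-glusterfs --target-list=x86_64-softmmu
make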
The qemu-img command can be used to create VM images on the gluster backend. The general syntax for image creation looks like this:
qemu-img create gluster://server/volname/path/to/image size
To create a raw image:
qemu-img create gluster://1.2.3.4/testvol/dir/a.img 5G
To create a qcow2 image:
qemu-img create -f qcow2 gluster://server.domain.com:24007/testvol/a.img 5G
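To verify that an image was created correctly on the gluster volume, qemu-img info can be pointed at the same URI (reusing the illustrative address from above):
qemu-img info gluster://1.2.3.4/testvol/dir/a.img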
A VM image a.img residing on the gluster volume testvol can be booted using QEMU like this:
qemu-system-x86_64 -drive file=gluster://1.2.3.4/testvol/a.img,if=virtio
In addition to VM images, gluster drives can also be used as data drives:
qemu-system-x86_64 -drive file=gluster://1.2.3.4/testvol/a.img,if=virtio -drive file=gluster://1.2.3.4/datavol/a-data.img,if=virtio
Here, a-data.img from the datavol gluster volume appears as a second drive to the guest.
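Inside the guest, the data drive shows up as a regular virtio block device. Assuming it appears as /dev/vdb (the exact device name may vary), it can be formatted and mounted like any local disk:
mkfs.ext4 /dev/vdb
mount /dev/vdb /data1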
The following numbers from the FIO benchmark show the performance advantage of using QEMU’s GlusterFS block driver over the usual FUSE mount for accessing the VM image.
Test setup
Host | Dual core x86_64 system running Fedora 17 kernel (3.5.6-1.fc17.x86_64)
Guest | Fedora 17 image, 4-way SMP, 2GB RAM, using virtio and cache=none QEMU options
QEMU options
FUSE mount | qemu-system-x86_64 --enable-kvm --nographic -smp 4 -m 2048 -drive file=/mnt/F17,if=virtio,cache=none (/mnt is the GlusterFS FUSE mount point)
GlusterFS block driver in QEMU (FUSE bypass) | qemu-system-x86_64 --enable-kvm --nographic -smp 4 -m 2048 -drive file=gluster://bharata/test/F17,if=virtio,cache=none
Base (VM image accessed directly from brick) | qemu-system-x86_64 --enable-kvm --nographic -smp 4 -m 2048 -drive file=/test/F17,if=virtio,cache=none (/test is the brick directory)
FIO load files
Sequential read direct IO:
; Read 4 files with aio at different depths
[global]
ioengine=libaio
direct=1
rw=read
bs=128k
size=512m
directory=/data1
[file1]
iodepth=4
[file2]
iodepth=32
[file3]
iodepth=8
[file4]
iodepth=16

Sequential write direct IO:
; Write 4 files with aio at different depths
[global]
ioengine=libaio
direct=1
rw=write
bs=128k
size=512m
directory=/data1
[file1]
iodepth=4
[file2]
iodepth=32
[file3]
iodepth=8
[file4]
iodepth=16
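Assuming the read job above is saved as seq-read.fio inside the guest (the file name is arbitrary) and /data1 is mounted as described earlier, the benchmark is then run simply as:
fio seq-read.fio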
FIO READ numbers
Configuration | aggrb (KB/s) | minb (KB/s) | maxb (KB/s)
FUSE mount | 15219 | 3804 | 5792
QEMU’s GlusterFS block driver (FUSE bypass) | 39357 | 9839 | 12946
Base | 43802 | 10950 | 12918
FIO WRITE numbers
Configuration | aggrb (KB/s) | minb (KB/s) | maxb (KB/s)
FUSE mount | 24579 | 6144 | 8423
QEMU’s GlusterFS block driver (FUSE bypass) | 42707 | 10676 | 17262
Base | 42393 | 10598 | 15646
Here are more recent FIO numbers, averaged over 5 runs, using the latest QEMU (git commit: 03a36f17d77) and GlusterFS (git commit: cee1b62d01). The test environment remains the same as above with the following two changes:
FIO READ numbers
Configuration | aggrb (KB/s) | % change from Base
Base | 44464 | 0
FUSE mount | 21637 | -51
QEMU’s GlusterFS block driver (FUSE bypass) | 38847 | -12.6
FIO WRITE numbers
Configuration | aggrb (KB/s) | % change from Base
Base | 45824 | 0
FUSE mount | 40919 | -10.7
QEMU’s GlusterFS block driver (FUSE bypass) | 45627 | -0.43
While I have described how to use GlusterFS as a storage backend for QEMU manually, there have also been efforts to enable QEMU-GlusterFS native support from libvirt, VDSM and oVirt. GlusterFS is now fully enabled in oVirt, which allows a user to create a GlusterFS volume from the oVirt self-service portal and use it as a storage backend for hosting VM images. The GlusterFS storage domain work in VDSM, and its enablement from oVirt, allows oVirt to exploit the QEMU-GlusterFS native integration rather than accessing the GlusterFS volume via FUSE.
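For reference, when libvirt drives QEMU’s GlusterFS block driver directly, the guest disk is described as a network disk in the domain XML, roughly as sketched below; the volume name, image path and host address are the illustrative ones used earlier:
<disk type='network' device='disk'>
  <driver name='qemu' type='raw' cache='none'/>
  <source protocol='gluster' name='testvol/a.img'>
    <host name='1.2.3.4' port='24007' transport='tcp'/>
  </source>
  <target dev='vda' bus='virtio'/>
</disk>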
Deepak C Shetty has created a nice video demo of how to use oVirt to create a GlusterFS storage domain and boot VMs off it.
UNMAP support in QEMU-GlusterFS is explained here.