GlusterFS Technical FAQ

From GlusterDocumentation

Contents

Does 'ls' query all servers in parallel to deliver the logical summation of the namespace?

Yes, As GlusterFS doesn't have a centralized metadata, it queries all the servers when a 'ls' command is issued.

'ls' performance

Q: How many files per second can you stat with "ls" and how many stats per second if I am querying on last-modified state?

A: Lets see the two possible cases of 'ls'.

# "ls"

"ls" issues a opendir(), multiple readdir() and closedir(). Just doing "ls" doesn't trigger stat() call. But, gets the directory entries.

# "ls -l"

"-l" options triggers stat() call for each entry obtained by readdir().

Now, In GlusterFS the performance depends on what translators are used to get the exact number of stat calls which can be made. If we have to send stat() call in AFR, its sent to only first child, and is very fast, in unify it will be sent only to namespace and should be fast, in dht (or hash), client knows where to find the file hence it should be fast too.

Where is the meta data stored?

There is no special metadata storage concept in GlusterFS. The metadata is stored with the file data itself in its backend disk (very much like ext3/ZFS/xfs etc). GlusterFS doesn't manage disk/block level filesystem. It manages backend filesystems as a whole. So, it has no need of keeping metadata.

Does each client maintain a cache of the summed namespace?

No we do not maintain any client side cache, because it will lead to severe cache coherency and scalability issues. Currently each translators has its own mechanism to maintain global namespace. Unify has separate namespace volume which is shared by all clients. DHT distributes it depending on the name hash.

How do I know which order the translators should be implemented?

There are no hard bound rule about places of translators. It may vary from setup to setup. In general, posix/bdb, locks, io-threads should be defined as a pack for required behavior and performance. write-behind and read-ahead makes more sense at client side. io-cache can be loaded on both client and server. cluster translators can sit as per need.

How is locking handled?

File level locking is handled distributedly across the bricks using features/posix-locks translator. GlusterFS supports both fcntl() and flock() calls.

Do clients communicate with each other?

No clients do not communicate with each other.

Do servers communicate with each other?

Yes, glusterd daemons communicate directly for administration purposes.

Additionally, each server re-shares the volume via nfs through a client process. That client process connects to it's volume servers.

Finally, a rebalance also uses the client to connect to a volume's servers to perform the rebalance.


What happens if a GlusterFS brick crashes?

You treat it like any other storage server. The underlying filesystem will run fsck and recover from crash. With journaled file system such as Ext3 or XFS, recovery is much faster and safer. When the brick comes back, glusterfs fixes all the changes on it by its self-heal feature.

What about deletion self/auto healing?

With auto healing only file creation is healed. If a brick is missing because of a disk crash re-creation of files is ok but if it's a temporary network problem synchronizing deletion is mandatory.

See also Gluster 3.2: Triggering Self-Heal on Replicate.

Can I add or remove a storage node while GlusterFS is online?

Dynamic volume management tools to add/remove bricks online were released in v3.1.

Can I directly access the data on the underlying storage volumes?

If you are just doing just read()/access()/stat() like operations, you should be fine. If you are not using any new features (like quota/geo-replication etc etc) then technically, you can modify (but surely not rename(2) and link(2)) the data inside.

Note that this is not tested as part of gluster's release cycle and not recommended for production use.

What happens in case of hardware or GlusterFS crash?

You don't risk any corruption. How ever if the crash happened in the middle of your application writing data, the data in transit may be lost. All file systems are vulnerable to such loses.

Metadata Storage - When using striping (unify), how does the file data get split?

Individual files are never split and stored on multiple bricks, rather, the scheduling algorithm you specify is used to determine which brick a file is stored on.

Metadata Storage - When using striping (unify), how/where is the metadata kept?

As said earlier, there is no metadata in unify (in whole GlusterFS itself). Unify keeps its namespace cache in an separate namespace volume.

How to make GlusterFS secure?

GlusterFS as of now supports only IP/port based authentication. You specify a range of IP addresses separately for clients and management nodes to allow access. Client side port is always restricted to less than 1024 to ensure only root can perform management operations including mount/umount. New GNU TLS (secure certificate) based authentication is under development. We are also planning to implement encryption translator in the upcoming release. Till then you can even stunnel GlusterFS connections.

Here is one article about setting up Encrypted Network between client and server.

How do I mount/umount GlusterFS?

Refer to Mounting a GlusterFS Volume.

Do I need to synchronize UIDs and GIDs on all servers using GlusterFS ?

No. Only clients machines need to be synchronized, since the access control is done on the client's side VFS layer.

Do I need to synchronize time on all servers using GlusterFS ?

Yes. You can use NTP (Network Time Protocol) client to do this if your hosting environment does not do this for you (for example Amazon EC2 already does this). Keeping all server time in sync is a good thing. Few translators like io-cache which works based on mtime may not work properly otherwise.

Simple example of NTP command:

bash# /usr/sbin/ntpdate pool.ntp.org

How do I add a new node to an already running cluster of GlusterFS

Yes, you can add more bricks in your volume specification file and restart GlusterFS (re-mount). Its schedulers (alu) are designed to balance the file system data as you grow.

For releases after 1.3.0-pre5

Just add the extra node in unify's or DHT's subvolumes list, and restart the GlusterFS, the directory structure is automatically replicated in the new server :D The much desired self-heal property of unify solves the burden of manually maintaining equal directory structure in all the servers before mount.

Note: We are planning to add on-the-fly addition of storage bricks in our next release. The above steps will be taken care automatically.

How do I add a new AFR namespace brick to an already running cluster?

The question is not quite clear. Right now, GlusterFS doesn't support on-the-fly change of volfiles to add volumes. You need to edit volfile, add afr volume, stop glusterfs process, start it with new volfile to achieve this.

Loop mounting image files stored in GlusterFS file system

To mount one image file stored in glusterfs file system, you have to disable the direct-io in the glusterfs mount. to do this with GlusterFS use the following command:

  #glusterfs -f <your_spec_file>  --disable-direct-io-mode /<mount_path> 

After that you can use your glusterfs file system mounted on /<mount_path> to store your images. If you disable direct-io you can use glusterfs to store xen virtual machines virtual block device as files. Xen + Live Migration works fine using the option above.

How do you allow more than one IP in auth.addr?

Q: If you can only have one auth.addr line in a config, how do you allow 127.0.0.1 as well as a 192.168.* range?

A: Make your auth.addr.<volumename>.allow look like this:

option auth.addr.<volumename>.allow 127.0.0.1,192.168*

Note the comma separated ip address patterns.

Stripe behavior not working as expected

Q: Striping doesn't work well. I made a file of 4MB with 'option block-size 2MB', but on my two servers the file is distributed like this:

PC1: file = 2MB
PC2: file = 4MB

A: GlusterFS's stripe translator saves files as files in backend, but with a filesystem holes. 'ls' doesn't understand the filesystem holes, but du does. Please check the disk used by the file with the help of du command. That should show 2MB each.

Duplicate volume name in volfile

Q: Is it possible to use the same brick name several times in the same glusterfs-server.vol like in the example below?

volume brick
   type storage/posix
   option directory /dfslarge
end-volume

volume brick
   type storage/posix
   option directory /dfssmall
end-volume

A: No, volume name should be unique across the volume file. GlusterFS process will be erroring out in this case.

File location

Q: For example, I have 3 servers(no ssh) in Unify mode with RR scheduler and I've uploaded some file. How can I find out on which brick a file is located?

A: Currently there is no mechanism to find out the files at server. This should be made available in next releases.

What is GlusterFS scheduler?

The GlusterFS scheduler handles load-balancing and high-availability in clustered mode when unify translator is used. You select a scheduler of your choice in your "unify" volume. Check this link for more information about type of schedulers, their options, benefits of using them etc..

Servers with Multiple Purposes

Q: Can a cluster of servers be used for multiple purposes, e.g. run GlusterFS + Apache/PHP?

A: Dedicated servers are recommended due to security and performance concerns. How ever there are no restrictions for GlusterFS to coexist with other services (such as Apache or MySQL).

How can I improve the performance of reading many small files?

Use the NFS client. For reading many small files, i.e. PHP web serving, the NFS client will perform much better.

That that for a write-heavy load the native client will perform better.

 

Copyright © Gluster, Inc. All Rights Reserved.