GlusterFS 2.0 I/O Benchmark Results

Test Objective

  • Test performance and stability under a basic configuration: a unified volume with the RR (round-robin) scheduler policy.
  • No AFR, no fail-over, and no performance translators such as io-cache or write-behind.
  • This test is only meant to give a first impression of GlusterFS, so losing files when one of the servers goes down is not a concern here. (With this configuration, that is the expected behavior.)
  • "Stability" therefore means no broken connections, no daemon crashes, and no unexpected behavior during operation.

For the test

  • We set up 3 clients. These are GlusterFS clients, not end-user clients, so each of the 3 clients communicates with the servers directly (no re-exporting nodes between clients and servers).
  • Single-threaded write and read loop test on 1 client.
  • Multi-threaded write and read loop test on 1 client.
  • The same tests on 3 clients simultaneously.
  • The same tests with millions of files already present on the volume.
  • We measure results, basically the total elapsed time of each test job.
  • It would be good to log every test job together with the servers' status (CPU, I/O, network, ...) using a system monitoring tool.
  • It would be good to set up the millions of files under a hierarchical directory structure, as in the real world.
  • We do not test unusual cases, such as what happens when 1000 directories are nested vertically.
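
The wish above to log server status during each test job could be sketched as follows. This is only a minimal sketch, not something used for the results on this page: the function name run_with_monitor, the MONITOR_CMD variable, and the choice of vmstat as the monitoring tool are all assumptions.

```shell
# Hypothetical helper: run a test command while a monitoring tool logs in the background.
# MONITOR_CMD is an assumed variable; it defaults to "vmstat 5" (one sample every 5 seconds).
run_with_monitor() {
    logfile=$1; shift
    # start the monitor in the background, writing to the log file
    ${MONITOR_CMD:-vmstat 5} > "$logfile" 2>&1 &
    monitor_pid=$!
    # run the actual test job and remember its exit code
    "$@"
    rc=$?
    # stop the monitor; ignore errors if it has already exited
    kill "$monitor_pid" 2>/dev/null || true
    wait "$monitor_pid" 2>/dev/null || true
    return "$rc"
}

# Example (not executed here):
#   run_with_monitor /tmp/vmstat.log dd if=/dev/zero of=/mnt/unify/file bs=65536 count=15625
```

The same wrapper would have to run on each server to capture server-side load; here it only records the node it is started on.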

Test environment

Server Specification

All servers have the same specification.

  • CPU : Dual AMD Dual-Core Opteron Processor 2212
  • Memory : 4GB
  • Disk : 1.4T RAID0
  • NIC : Intel PRO/1000
  • Network : Gigabit Ethernet Switch
  • OS : CentOS 4.7 (kernel 2.6.9-78.0.13.ELsmp)

RAID Configuration

  • Software RAID : md
  • RAID level : 0 (striping)
  • Striping size : 16K
  • Number of disks : 3 x SATA 500GB
  • Total size : 1.4T
  • RAID0 Performance :
# hdparm -tT /dev/md0
Timing cached reads:   2108 MB in  2.00 seconds = 1053.11 MB/sec
Timing buffered disk reads:  428 MB in  3.01 seconds = 142.12 MB/sec
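
An array matching the description above could be created roughly as shown here. This is a sketch under assumptions, not the command that was actually used: the member device names /dev/sdb, /dev/sdc and /dev/sdd are hypothetical, and the filesystem type is not stated on this page, so no mkfs step is shown.

```shell
# Assumed device names; replace with the three real SATA disks.
# --level=0 selects striping, --chunk=16 sets a 16K chunk (stripe) size.
mdadm --create /dev/md0 --level=0 --raid-devices=3 --chunk=16 /dev/sdb /dev/sdc /dev/sdd

# Verify the array and its chunk size
cat /proc/mdstat
mdadm --detail /dev/md0
```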

Test Platform Layout & Configuration

Unified GlusterFS Configuration

               +--------+   +--------+  +--------+
               |   C1   |   |   C2   |  |   C3   |
               |--------|   |--------|  |--------|
                    |            |           |  
                    +------------+-----------+
                                 |
                      (Gigabit Ethernet Switch)
                                 |
      +-------------+------------+------------+------------+
      |             |            |            |            |
 +---------+  +---------+  +---------+  +---------+  +---------+
 |    G1   |  |    G2   |  |    G3   |  |    G4   |  |    G5   |
 |---------|  |---------|  |---------|  |---------|  |---------|
 |file data|  |file data|  |file data|  |file data|  |file data|
 |namespace|  |         |  |         |  |         |  |         |
 +---------+  +---------+  +---------+  +---------+  +---------+

Unified GlusterFS Configuration

Starting with version 2, a namespace must be defined for a unified volume. Multiple namespace nodes can be configured with the AFR (Automatic File Replication) feature to avoid a single point of failure, but we did not configure that for this test.

  • Number of GlusterFS Servers : 5 nodes (G1 ~ G5)
  • S/W Version : GlusterFS v2.0.0 RC1
  • Compile option : ./configure --prefix=/usr/local --disable-fuse-client --disable-bdb --disable-mod_glusterfs
  • Configuration : G1
volume brick
  type storage/posix
  option directory /data/glusterfs
end-volume

volume brick-ns
  type storage/posix
  option directory /data/glusterfs-ns
end-volume

volume server
  type protocol/server
  option transport-type tcp
  subvolumes brick brick-ns
  option auth.addr.brick.allow *
  option auth.addr.brick-ns.allow *
end-volume
  • Configuration : G2 ~ G5
volume brick
  type storage/posix
  option directory /data/glusterfs
end-volume

volume server
  type protocol/server
  option transport-type tcp
  subvolumes brick
  option auth.addr.brick.allow *
end-volume
  • GlusterFS Client Node
    • S/W Version : GlusterFS v2.0.0 RC1 (FUSE 2.7.4)
    • Compile option : ./configure --prefix=/usr/local --disable-bdb --disable-mod_glusterfs
    • Unified Volume + Round-Robin Policy
    • Configuration : C1 ~ C3
volume g1
  type protocol/client
  option transport-type tcp
  option remote-host 10.40.197.105
  option remote-subvolume brick
end-volume

volume g2
  type protocol/client
  option transport-type tcp
  option remote-host 10.40.197.106
  option remote-subvolume brick
end-volume

volume g3
  type protocol/client
  option transport-type tcp
  option remote-host 10.40.197.107
  option remote-subvolume brick
end-volume

volume g4
  type protocol/client
  option transport-type tcp
  option remote-host 10.40.197.109
  option remote-subvolume brick
end-volume

volume g5
  type protocol/client
  option transport-type tcp
  option remote-host 10.40.197.112
  option remote-subvolume brick
end-volume

volume g1-ns
  type protocol/client
  option transport-type tcp
  option remote-host 10.40.197.105
  option remote-subvolume brick-ns
end-volume

volume unify
  type cluster/unify
  option namespace g1-ns
  option scheduler rr
  subvolumes g1 g2 g3 g4 g5
end-volume
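
To give an idea of what "option scheduler rr" means for file placement, here is a toy model. It is a simplification and not the GlusterFS scheduler code: the function name rr_pick is hypothetical, and the model ignores everything except the order of the subvolumes.

```shell
# Toy model only: which subvolume would a plain round-robin pick for the N-th new file?
# SUBVOLUMES mirrors the "subvolumes" line of the unify volume above.
SUBVOLUMES="g1 g2 g3 g4 g5"

rr_pick() {
    n=$1                    # index of the created file, starting at 0
    set -- $SUBVOLUMES      # load the subvolume names as positional parameters
    shift $((n % $#))       # skip (n mod number-of-subvolumes) names
    echo "$1"
}

# Example: files 0..6 would land on g1 g2 g3 g4 g5 g1 g2
for i in 0 1 2 3 4 5 6; do
    printf '%s ' "$(rr_pick "$i")"
done
echo ""
```

In this model each file lives entirely on one server, so a single large sequential transfer is served by one node at a time, while many files created in parallel spread across all five.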

Single GlusterFS & NFS Test Platform

               +--------+   +--------+  +--------+
               |   C1   |   |   C2   |  |   C3   |
               |--------|   |--------|  |--------|
                    |            |           |  
                    +------------+-----------+
                                 |
                      (Gigabit Ethernet Switch)
                                 |
                     +-----------+-----------+
                     |          G0           |
                     |-----------+-----------|
                     | GlusterFS |    NFS    |
                     |-----------+-----------|
                     |    RAID0 FileSystem   |
                     +-----------------------+

Single GlusterFS Configuration

  • Number of GlusterFS Servers : 1 node (G0)
  • Configuration : G0
volume brick
  type storage/posix
  option directory /data/glusterfs
end-volume

volume brick-ns
  type storage/posix
  option directory /data/glusterfs-ns
end-volume

volume server
  type protocol/server
  option transport-type tcp
  subvolumes brick brick-ns
  option auth.addr.brick.allow *
  option auth.addr.brick-ns.allow *
end-volume
  • GlusterFS Client Node
  • Configuration : C1 ~ C3
volume g0
  type protocol/client
  option transport-type tcp
  option remote-host 10.40.197.111
  option remote-subvolume brick
end-volume

volume g0-ns
  type protocol/client
  option transport-type tcp
  option remote-host 10.40.197.111
  option remote-subvolume brick-ns
end-volume

volume unify
  type cluster/unify
  option namespace g0-ns
  option scheduler rr
  subvolumes g0
end-volume

NFS Server Configuration

/data/nfs *(rw,insecure,no_root_squash)

Mount Status

# df -h
glusterfs             6.8T  552M  6.4T   1% /mnt/unify
glusterfs             1.1T  108M  1.1T   1% /mnt/single
d111:/data/nfs        1.1T  109M  1.1T   1% /mnt/nfs

Benchmark Results

Sequential Write : 1KB x 1,000,000 times = 1GB

# time dd if=/dev/zero of=/mnt/unify/file bs=1024 count=1000000
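Test Case           | Local SATA 500G    | Local RAID0        | NFS                | Single GlusterFS    | Unified GlusterFS
--------------------+--------------------+--------------------+--------------------+---------------------+--------------------
1 Worker - 1st test | 11.836s (82.5MB/s) | 11.371s (85.9MB/s) | 23.162s (42.2MB/s) | 2m19.597s (7.0MB/s) | 3m39.279s (4.4MB/s)
1 Worker - 2nd test | 10.537s (92.7MB/s) | 10.777s (90.6MB/s) | 24.181s (40.4MB/s) | 2m24.623s (6.7MB/s) | 3m40.334s (4.4MB/s)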
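
The MB/s figures in these tables can be reproduced from the elapsed time that "time" prints. The sketch below shows the arithmetic; the function names to_seconds and throughput_mbs are hypothetical. The numbers are consistent with 1MB being counted as 1,048,576 bytes and with dd writing 1024 x 1,000,000 = 1,024,000,000 bytes.

```shell
# Convert a "time" value such as "2m19.597s" or "11.836s" to seconds.
to_seconds() {
    echo "$1" | awk '{
        t = $0
        sub(/s$/, "", t)                 # strip the trailing "s"
        n = split(t, p, "m")             # split minutes from seconds, if present
        if (n == 2) printf "%.3f\n", p[1] * 60 + p[2]
        else        printf "%.3f\n", p[1]
    }'
}

# Throughput in MB/s (1MB = 1048576 bytes) for BYTES transferred in SECONDS.
throughput_mbs() {
    awk -v bytes="$1" -v secs="$2" 'BEGIN { printf "%.1f\n", bytes / 1048576 / secs }'
}

# Example: the first Unified GlusterFS result of the 1KB write test
throughput_mbs 1024000000 "$(to_seconds 3m39.279s)"    # prints 4.5
```

The table shows 4.4MB/s for the same run, which suggests the original figures were truncated rather than rounded.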

Sequential Write : 64KB x 15,625 times = 1GB

# time dd if=/dev/zero of=/mnt/unify/file bs=65536 count=15625
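Test Case           | Local SATA 500G    | Local RAID0        | NFS                | Single GlusterFS   | Unified GlusterFS
--------------------+--------------------+--------------------+--------------------+--------------------+-------------------
1 Worker - 1st test | 6.390s (152.8MB/s) | 7.939s (123.0MB/s) | 22.766s (42.9MB/s) | 24.637s (39.6MB/s) | 22.436s (43.5MB/s)
1 Worker - 2nd test | 6.588s (148.2MB/s) | 7.542s (129.5MB/s) | 21.901s (44.6MB/s) | 22.001s (44.4MB/s) | 23.378s (41.8MB/s)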

Sequential Read : 1KB x 1,000,000 times = 1GB

# echo 3 > /proc/sys/vm/drop_caches // clear buffer cache
# time dd if=/mnt/unify/file of=/dev/null bs=1024 count=1000000
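Test Case           | Local SATA 500G    | Local RAID0        | NFS                | Single GlusterFS   | Unified GlusterFS
--------------------+--------------------+--------------------+--------------------+--------------------+-------------------
1 Worker - 1st test | 17.230s (56.7MB/s) | 10.464s (93.3MB/s) | 10.493s (93.0MB/s) | 14.300s (68.3MB/s) | 18.532s (52.7MB/s)
1 Worker - 2nd test | 17.201s (56.8MB/s) | 10.242s (95.3MB/s) | 10.962s (89.0MB/s) | 14.596s (66.9MB/s) | 18.392s (53.1MB/s)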

Sequential Read : 64KB x 15,625 times = 1GB

# echo 3 > /proc/sys/vm/drop_caches // clear buffer cache
# time dd if=/mnt/unify/file of=/dev/null bs=65536 count=15625
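Test Case           | Local SATA 500G    | Local RAID0        | NFS                | Single GlusterFS   | Unified GlusterFS
--------------------+--------------------+--------------------+--------------------+--------------------+-------------------
1 Worker - 1st test | 21.590s (45.2MB/s) | 4.817s (202.7MB/s) | 9.712s (100.5MB/s) | 14.425s (67.7MB/s) | 14.777s (66.1MB/s)
1 Worker - 2nd test | 22.031s (44.3MB/s) | 4.645s (210.2MB/s) | 9.996s (97.7MB/s)  | 14.109s (69.2MB/s) | 14.971s (65.2MB/s)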

How many files can be created in 10 minutes

A simple script was written for this test. It continuously creates 1MB files in a 2-level hierarchical directory structure, with 100 files per directory.

#!/bin/bash
# Usage: genfiles TIMEOUT_SECONDS BASEDIR
# Creates 1MB files under BASEDIR until TIMEOUT_SECONDS have elapsed.
HOSTNAME=$(hostname)
TIMEOUT=$1
BASEDIR=$2
PID=$$

TIMESTART=$(date '+%s')
TIMEEND=$((TIMESTART + TIMEOUT))

FILECNT=0
RCNT=1
while true; do
       # one root directory per pass, unique per host and process
       ROOTPATH=${BASEDIR}/${HOSTNAME}_${PID}_${RCNT}
       mkdir "$ROOTPATH"

       DCNT=1
       while [ "$DCNT" -le 100 ]; do
               DIRPATH=${ROOTPATH}/dir_${DCNT}
               mkdir "$DIRPATH"

               FCNT=1
               while [ "$FCNT" -le 100 ]; do
                       FILEPATH=${DIRPATH}/file_${FCNT}.bin
                       # 64KB x 16 = 1MB per file
                       dd if=/dev/zero of="$FILEPATH" bs=65536 count=16 > /dev/null 2>&1
                       sync
                       echo -n "."
                       FILECNT=$((FILECNT + 1))
                       FCNT=$((FCNT + 1))

                       # stop as soon as the time budget is used up
                       TIMENOW=$(date '+%s')
                       if [ "$TIMENOW" -ge "$TIMEEND" ]; then
                               echo "PID $PID : $FILECNT files created in $((TIMENOW - TIMESTART)) seconds."
                               exit
                       fi
               done
               echo ""
               DCNT=$((DCNT + 1))
       done
       RCNT=$((RCNT + 1))
done

Run the test for 10 minutes.

# ./genfiles 600 /mnt/unify
(threads x clients) | Local SATA 500G     | Local RAID0             | NFS                     | Single GlusterFS        | Unified GlusterFS
--------------------+---------------------+-------------------------+-------------------------+-------------------------+--------------------------
1 Worker (1 x 1)    | 14,270 (23.8MB/s)   | 26,144 files (43.6MB/s) | 17,466 files (29.1MB/s) | 17,734 files (29.6MB/s) | 19,122 files (31.2MB/s)
10 Workers (10 x 1) | 12,282 (20.4MB/s)   | 33,916 files (56.5MB/s) | 16,881 files (28.1MB/s) | 27,938 files (46.6MB/s) | 36,096 files (60.2MB/s)
15 Workers (5 x 3)  | x                   | x                       | 17,837 files (29.7MB/s) | 22,367 files (37.3MB/s) | 61,280 files (102.1MB/s)
30 Workers (10 x 3) | x                   | x                       | x                       | x                       | 78,371 files (130.6MB/s)
n Workers (n x 3)   | (25MB/s may be max) | (50~60MB/s may be max)  | (30MB/s may be max)     | (40~50MB/s may be max)  | (150~200MB/s may be max)
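
The multi-worker rows were produced by running several copies of the script at once. How the copies were started is not recorded here; the sketch below assumes they were simply put in the background, as in the stress test commands further down. The function name run_workers is hypothetical.

```shell
# Hypothetical launcher: start COUNT copies of a command in parallel and wait for all of them.
run_workers() {
    count=$1; shift
    i=0
    while [ "$i" -lt "$count" ]; do
        "$@" &              # each worker runs as its own background process
        i=$((i + 1))
    done
    wait                    # block until every worker has finished
}

# Example (not executed here): 10 workers on one client for 10 minutes
#   run_workers 10 ./genfiles 600 /mnt/unify
```

For the "x 3" rows the same command would be started on C1, C2 and C3 at roughly the same time.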

How many files can be read in 10 minutes

This script continuously reads 1MB files using cp. To avoid buffer-cache effects, it randomizes the file list each time it starts and clears the buffer cache after every file read.

#!/bin/bash
# Usage: readfiles TIMEOUT_SECONDS FILELIST RANGE
# Reads the first RANGE files listed in FILELIST, in random order, until
# TIMEOUT_SECONDS have elapsed. Run as root: it writes to drop_caches.
HOSTNAME=$(hostname)
TIMEOUT=$1
FILELIST=$2
RANGE=$3
PID=$$

# make random list
RANDLIST=${FILELIST}.${PID}
for FILEPATH in $(head -n "$RANGE" "$FILELIST"); do
       echo "$RANDOM $FILEPATH"
done | sort -n | cut -d ' ' -f '2-' > "$RANDLIST"

TIMESTART=$(date '+%s')
TIMEEND=$((TIMESTART + TIMEOUT))

# read operation
FILECNT=0
while true; do
       for FILEPATH in $(cat "$RANDLIST"); do
               cp "$FILEPATH" /dev/null
               # drop page cache, dentries and inodes so the next read is not served locally
               echo 3 > /proc/sys/vm/drop_caches
               echo -n "."

               FILECNT=$((FILECNT + 1))
               TIMENOW=$(date '+%s')
               if [ "$TIMENOW" -ge "$TIMEEND" ]; then
                       echo "PID $PID : $FILECNT files read in $((TIMENOW - TIMESTART)) seconds."
                       rm "$RANDLIST"
                       exit
               fi
       done
done

Make a list of files.

# find /mnt/single/10 -type f > /mnt/single/filelist
# head /mnt/single/filelist
/mnt/single/10/dev-fs103_30755_1/dir_1/file_1.bin
/mnt/single/10/dev-fs103_30755_1/dir_1/file_2.bin
/mnt/single/10/dev-fs103_30755_1/dir_1/file_3.bin
/mnt/single/10/dev-fs103_30755_1/dir_1/file_4.bin
(...)

Run the test.

# ./readfiles 600 /mnt/unify/filelist 5000 // read random-ordered 5000 files continuously
(threads x clients) | Local SATA 500G     | Local RAID0             | NFS                     | Single GlusterFS        | Unified GlusterFS
--------------------+---------------------+-------------------------+-------------------------+-------------------------+--------------------------
1 Worker (1 x 1)    | 8,523 (14.2MB/s)    | 10,804 files (18.0MB/s) | 15,605 files (26.0MB/s) | 7,314 files (12.2MB/s)  | 10,825 files (18.0MB/s)
10 Workers (10 x 1) | x                   | 9,680 files (16.1MB/s)  | 16,215 files (27.0MB/s) | 13,056 files (21.8MB/s) | 41,453 files (69.1MB/s)
15 Workers (5 x 3)  | x                   | x                       | 19,969 files (33.3MB/s) | 10,189 files (17.0MB/s) | 88,724 files (147.9MB/s)
30 Workers (10 x 3) | x                   | x                       | x                       | x                       | 101,979 files (170.0MB/s)
n Workers (n x 3)   | (15MB/s may be max) | (20MB/s may be max)     | (35MB/s may be max)     | (25MB/s may be max)     | (200MB/s may be max)
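
Each worker prints its own "PID ... : N files ..." line, so the totals in these tables are sums over all workers. A minimal sketch of that aggregation follows; the function name sum_files is hypothetical, and it assumes the combined output of all workers has been collected into one stream. Because every file is 1MB, the MB/s figure is simply the total file count divided by the duration.

```shell
# Sum the per-worker file counts found on stdin and print the total and the MB/s
# for a test of SECONDS duration (1 file = 1MB).
sum_files() {
    awk -v secs="$1" '
        # the summary line is preceded by progress dots, so search inside the line
        match($0, /PID [0-9]+ : [0-9]+ files/) {
            split(substr($0, RSTART, RLENGTH), f, " ")
            total += f[4]           # f = "PID", pid, ":", count, "files"
        }
        END { printf "%d files (%.1fMB/s)\n", total, total / secs }
    '
}

# Example with two made-up worker lines:
printf '....PID 101 : 300 files created in 600 seconds.\n..PID 102 : 300 files created in 600 seconds.\n' | sum_files 600
# prints: 600 files (1.0MB/s)
```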

Stress test

Failed cases

  • 5 writing threads and 20 reading threads per client, for a total of 15 writing threads and 60 reading threads. - FAILED
  • 3 writing threads and 10 reading threads per client, for a total of 9 writing threads and 30 reading threads. - FAILED
# ./genfiles 43200 /mnt/unify/big &  // x 5
# ./readfiles 43200 /mnt/unify/filelist 100000 &  // x 20

This test did not succeed. I tried several times with different stress loads. Whenever I increased the load with a combination of reading and writing, it did not work correctly. At first I thought a race condition while clearing the buffer cache was causing the trouble, so I tried without clearing it, but the failure happened again.

I am not sure this is a problem in GlusterFS itself, because the failure shows up in the Ethernet connection and could be caused by the Linux Ethernet driver, by hardware including the Gigabit switch (NETGEAR), or by other Linux modules. One thing is certain: *it only happens when reading and writing threads run together*. I do not think FUSE is the problem, because the Ethernet problem occurred on both the server side and the client side, while FUSE is installed only on the client side. 2.0.0rc1 is the current release, so the problem might not occur in a future version or in the legacy 1.x stable version.

Details follow.

On the server side, it works fine for about 1 or 2 minutes. After that, some servers stop responding. When I checked on the console, the server was alive, but its Ethernet connection was not working correctly.

# ping 10.40.197.90
(network unreachable error messages)
# service network restart
# ping 10.40.197.90
(works fine)
# ps ax | grep gluster
32545 ?        Rsl    6:08 glusterfsd -f /usr/local/etc/glusterfs/glusterfs-server.vol
// no need to restart glusterfsd, which is why I am not sure GlusterFS causes this problem

On the client side, it looks like this:

('w' means writing, '.' means reading)
.w........w.w....w...ww..........ww..w....w.ww......w....w....w...w..w.w.....
ww...w...w...w....ww.w.w..w.......w.w...w..w........w..w......w....w......w..
..ww.....w..w.....ww....w......w.w......w.w.w...w.w.......w.w..........w..ww.
......w...w.w......w.w.....w...w.w..ww.......ww....w..w....w....w.w....w.....
.w......w.w.....w.ww.......ww.....w...w...w........ww..w....w.....w.ww.......
.ww..ww...ww...ww...w.w..w.
============ after 1-2 minutes
cp: cannot open `/mnt/unify/10/dev-fs108_21569_1/dir_6/file_50.bin' for reading: Input/output error
..ww........cp: cannot open `/mnt/unify/10/dev-fs108_21564_1/dir_12/file_83.bin' for reading: Input/output error
.cp: cannot open `/mnt/unify/10/dev-fs108_21565_1/dir_26/file_1.bin' for reading: Input/output error
.w....w.wcp: cannot open `/mnt/unify/10/dev-fs108_21564_1/dir_21/file_100.bin' for reading: Input/output error
============ getting slower
......cp: cannot open `/mnt/unify/10/dev-fs108_21564_1/dir_37/file_9.bin' for reading: Input/output error
.cp: cannot open `/mnt/unify/10/dev-fs108_21564_1/dir_16/file_70.bin' for reading: Input/output error
..cp: cannot open `/mnt/unify/10/dev-fs108_21563_1/dir_25/file_100.bin' for reading: Input/output error
.....ww.......w.cp: cannot open `/mnt/unify/10/dev-fs108_21566_1/dir_17/file_42.bin' for reading: Input/output error
============ getting even slower
.....wcp: cannot open `/mnt/unify/10/dev-fs108_21569_1/dir_22/file_73.bin' for reading: Input/output error
.......ww..w.....cp: cannot open `/mnt/unify/10/dev-fs108_21566_1/dir_12/file_56.bin' for reading: Input/output error
..w.....cp: cannot open `/mnt/unify/10/dev-fs108_21570_1/dir_24/file_10.bin' for reading: Input/output error
.wcp: cannot open `/mnt/unify/10/dev-fs108_21569_1/dir_23/file_39.bin' for reading: Input/output error
============ finally both reading and writing operations stopped
  • At this point, the 'df' command blocked just before printing the 'unify' volume information.
  • Sometimes the Ethernet connection died.
  • To recover, I did the following:
# killall genfiles readfiles
# service network restart (if it does not work)
# umount /mnt/unify
# killall glusterfs

# glusterfs -f /usr/local/glusterfs/glusterfs-unify.vol /mnt/unify // remounting
# df | grep unify
glusterfs            7210805248 257527808 6586988544   4% /mnt/unify

Successful cases

  • 10 writing threads on C1 and 30 reading threads on each of C2 and C3, for a total of 10 writing threads and 60 reading threads. - SUCCEEDED
# ./genfiles 43200 /mnt/unify/big &  // x 5 on C1
# ./readfiles 43200 /mnt/unify/filelist 100000 &  // x 30 on C2 and C3

Problems

While deleting file structures

Whenever I tried to delete a huge file structure, rm reported errors like the ones below. To remove the entire structure cleanly, the rm command had to be run repeatedly. This happens because some files are not actually removed (the removal reports success, but the file remains), so the directory is not empty and removing the directory fails.

[question: was this ever identified and fixed as a bug?]

# rm -rfv *
removed `dev-fs108_4508_1/dir_18/file_72.bin'
removed `dev-fs108_4508_1/dir_18/file_88.bin'
removed directory: `dev-fs108_4508_1/dir_18'
rm: cannot remove directory `dev-fs108_4508_1/dir_18': No such file or directory
removed `dev-fs108_4508_1/dir_19/file_1.bin'
removed `dev-fs108_4508_1/dir_19/file_8.bin'
removed `dev-fs108_4508_1/dir_19/file_9.bin'
removed `dev-fs108_4508_1/dir_19/file_17.bin'
removed directory: `dev-fs108_4508_1/dir_19'
rm: cannot remove directory `dev-fs108_4508_1/dir_19': No such file or directory

# rm -rfv *
(...)

# rm -rfv * 
# // done. all removed
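
The manual repetition above could be wrapped in a small retry loop. This is only a workaround sketch, not a fix for the underlying behavior; the function name remove_until_gone and the default of 5 attempts are assumptions.

```shell
# Hypothetical helper: repeat "rm -rf" until TARGET is gone or MAX attempts are used up.
# Returns 0 if TARGET no longer exists afterwards, non-zero otherwise.
remove_until_gone() {
    target=$1
    max=${2:-5}
    n=1
    while [ -e "$target" ] && [ "$n" -le "$max" ]; do
        rm -rf "$target" 2>/dev/null || true    # errors are expected on some passes
        n=$((n + 1))
    done
    [ ! -e "$target" ]
}

# Example (not executed here):
#   remove_until_gone /mnt/unify/10 || echo "still not empty after 5 passes"
```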

 
