GlusterFS 2.0 I/O Benchmark Results
Abstract
About author
- Seungyoung Kim <wolkykim at gmail.com>
- The leader of qDecoder Project : http://www.qdecoder.org
Test Objective
- Test performance and stability under the basic configuration: a unified volume with the round-robin (RR) scheduler.
- No AFR, no fail-over, and no performance translators such as io-cache or write-behind.
- This test is only meant to get a feel for GlusterFS, so losing files when one of the servers goes down is not a concern (with this configuration, that is expected behavior).
- "Stability" therefore means no broken connections, no daemon crashes, and no unexpected behavior during operation.
For the test
- We set up 3 clients. These are GlusterFS clients, not end-user clients, so each of the 3 clients communicates with the servers directly (there are no re-exporting nodes between clients and servers).
- Single-threaded write and read loop test on 1 client.
- Multi-threaded write and read loop test on 1 client.
- The same tests on 3 clients simultaneously.
- The same tests with millions of files already present on the volume.
- We measure results, basically the total elapsed time of each test job.
- It would be good to log every test job together with the servers' status (CPU, I/O, network, ...) using a system monitoring tool.
- It would be good to set up millions of files under a hierarchical directory structure, as in the real world.
- We do not test unusual cases, such as what happens when 1000 directories are created vertically (nested inside each other).
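The elapsed-time measurement mentioned in the list above could be wrapped in a small helper like the one below. This is only an illustrative sketch: the function name `run_timed` and its output format are assumptions and are not part of the original test scripts.

```shell
#!/bin/sh
# Hypothetical helper: run a command and report its wall-clock time.
# Resolution is one second because it relies on `date +%s`.
run_timed() {
    label=$1
    shift
    start=$(date '+%s')
    "$@" > /dev/null 2>&1
    status=$?
    end=$(date '+%s')
    echo "$label: $((end - start)) seconds (exit status $status)"
    return "$status"
}

# Example: time a trivial command.
run_timed "noop" true
```

The helper returns the exit status of the wrapped command, so a failed test job can still be detected by the caller.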
Test environment
Server Specification
All servers have same specification
- CPU : Dual AMD Dual-Core Opteron Processor 2212
- Memory : 4GB
- Disk : 1.4T RAID0
- NIC : Intel PRO/1000
- Network : Gigabit Ethernet Switch
- OS : CentOS 4.7 (kernel 2.6.9-78.0.13.ELsmp)
RAID Configuration
- Software RAID : md
- RAID level : 0 (striping)
- Stripe size : 16K
- Number of disks : SATA 500GB x 3
- Total size : 1.4T
- RAID0 Performance :
# hdparm -tT /dev/md0
Timing cached reads: 2108 MB in 2.00 seconds = 1053.11 MB/sec
Timing buffered disk reads: 428 MB in 3.01 seconds = 142.12 MB/sec
Test Platform Layout & Configuration
Unified GlusterFS Configuration
              +--------+   +--------+   +--------+
              |   C1   |   |   C2   |   |   C3   |
              |--------|   |--------|   |--------|
                  |            |            |
                  +------------+------------+
                               |
                   (Gigabit Ethernet Switch)
                               |
     +------------+------------+------------+------------+
     |            |            |            |            |
+---------+  +---------+  +---------+  +---------+  +---------+
|   G1    |  |   G2    |  |   G3    |  |   G4    |  |   G5    |
|---------|  |---------|  |---------|  |---------|  |---------|
|file data|  |file data|  |file data|  |file data|  |file data|
|namespace|  |         |  |         |  |         |  |         |
+---------+  +---------+  +---------+  +---------+  +---------+
Unified GlusterFS Configuration
As of v2, a namespace must be defined for a unified volume. Multiple namespace nodes can be configured with the AFR (Automatic File Replication) feature to avoid a single point of failure, but we did not configure that for this test.
- Number of GlusterFS Servers : 5 nodes (G1 ~ G5)
- S/W Version : GlusterFS v2.0.0 RC1
- Compile option : ./configure --prefix=/usr/local --disable-fuse-client --disable-bdb --disable-mod_glusterfs
- Configuration : G1
volume brick
  type storage/posix
  option directory /data/glusterfs
end-volume

volume brick-ns
  type storage/posix
  option directory /data/glusterfs-ns
end-volume

volume server
  type protocol/server
  option transport-type tcp
  subvolumes brick brick-ns
  option auth.addr.brick.allow *
  option auth.addr.brick-ns.allow *
end-volume
- Configuration : G2 ~ G5
volume brick
  type storage/posix
  option directory /data/glusterfs
end-volume

volume server
  type protocol/server
  option transport-type tcp
  subvolumes brick
  option auth.addr.brick.allow *
end-volume
- GlusterFS Client Node
- S/W Version : GlusterFS v2.0.0 RC1 (FUSE 2.7.4)
- Compile option : ./configure --prefix=/usr/local --disable-bdb --disable-mod_glusterfs
- Unified Volume + Round-Robin Policy
- Configuration : C1 ~ C3
volume g1
  type protocol/client
  option transport-type tcp
  option remote-host 10.40.197.105
  option remote-subvolume brick
end-volume

volume g2
  type protocol/client
  option transport-type tcp
  option remote-host 10.40.197.106
  option remote-subvolume brick
end-volume

volume g3
  type protocol/client
  option transport-type tcp
  option remote-host 10.40.197.107
  option remote-subvolume brick
end-volume

volume g4
  type protocol/client
  option transport-type tcp
  option remote-host 10.40.197.109
  option remote-subvolume brick
end-volume

volume g5
  type protocol/client
  option transport-type tcp
  option remote-host 10.40.197.112
  option remote-subvolume brick
end-volume

volume g1-ns
  type protocol/client
  option transport-type tcp
  option remote-host 10.40.197.105
  option remote-subvolume brick-ns
end-volume

volume unify
  type cluster/unify
  option namespace g1-ns
  option scheduler rr
  subvolumes g1 g2 g3 g4 g5
end-volume
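Because the five protocol/client volumes in the client configuration differ only in name and address, they could be generated instead of typed by hand. The following is a minimal sketch; the helper name `client_volume` is hypothetical, and the option names and addresses are taken from the configuration above.

```shell
#!/bin/sh
# Hypothetical helper: print one protocol/client volume definition.
# Arguments: volume name, server address, remote subvolume name.
client_volume() {
    printf 'volume %s\n' "$1"
    printf '  type protocol/client\n'
    printf '  option transport-type tcp\n'
    printf '  option remote-host %s\n' "$2"
    printf '  option remote-subvolume %s\n' "$3"
    printf 'end-volume\n\n'
}

# Data bricks on G1..G5, using the addresses listed in the client configuration.
i=1
for host in 10.40.197.105 10.40.197.106 10.40.197.107 10.40.197.109 10.40.197.112; do
    client_volume "g$i" "$host" brick
    i=$((i + 1))
done

# The namespace brick is exported by G1 only.
client_volume g1-ns 10.40.197.105 brick-ns
```

The cluster/unify volume is not generated here; it would still be appended by hand, since it appears only once.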
Single GlusterFS & NFS Test Platform
              +--------+   +--------+   +--------+
              |   C1   |   |   C2   |   |   C3   |
              |--------|   |--------|   |--------|
                  |            |            |
                  +------------+------------+
                               |
                   (Gigabit Ethernet Switch)
                               |
                   +-----------+-----------+
                   |          G0           |
                   |-----------+-----------|
                   | GlusterFS |    NFS    |
                   |-----------+-----------|
                   |   RAID0 FileSystem    |
                   +-----------------------+
Single GlusterFS Configuration
- Number of GlusterFS Servers : 1 node (G0)
- Configuration : G0
volume brick
  type storage/posix
  option directory /data/glusterfs
end-volume

volume brick-ns
  type storage/posix
  option directory /data/glusterfs-ns
end-volume

volume server
  type protocol/server
  option transport-type tcp
  subvolumes brick brick-ns
  option auth.addr.brick.allow *
  option auth.addr.brick-ns.allow *
end-volume
- GlusterFS Client Node
- Configuration : C1 ~ C3
volume g0
  type protocol/client
  option transport-type tcp
  option remote-host 10.40.197.111
  option remote-subvolume brick
end-volume

volume g0-ns
  type protocol/client
  option transport-type tcp
  option remote-host 10.40.197.111
  option remote-subvolume brick-ns
end-volume

volume unify
  type cluster/unify
  option namespace g0-ns
  option scheduler rr
  subvolumes g0
end-volume
NFS Server Configuration
/data/nfs *(rw,insecure,no_root_squash)
Mount Status
- df -h
glusterfs        6.8T  552M  6.4T  1%  /mnt/unify
glusterfs        1.1T  108M  1.1T  1%  /mnt/single
d111:/data/nfs   1.1T  109M  1.1T  1%  /mnt/nfs
Benchmark Results
Sequential Write : 1KB x 1,000,000 times = 1GB
# time dd if=/dev/zero of=/mnt/unify/file bs=1024 count=1000000
| Test Case | Local SATA 500G | Local RAID0 | NFS | Single GlusterFS | Unified GlusterFS |
|---|---|---|---|---|---|
| 1 Worker - 1st test | 11.836s (82.5MB/s) | 11.371s (85.9MB/s) | 23.162s (42.2MB/s) | 2m19.597s (7.0MB/s) | 3m39.279s (4.4MB/s) |
| 1 Worker - 2nd test | 10.537s (92.7MB/s) | 10.777s (90.6MB/s) | 24.181s (40.4MB/s) | 2m24.623s (6.7MB/s) | 3m40.334s (4.4MB/s) |
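The MB/s figures in these tables appear to be derived from the `time` output, with 1 MB taken as 1,048,576 bytes. The sketch below reproduces two cells from the table above under that assumption; the function names `to_seconds` and `mb_per_sec` are illustrative and do not come from the original test.

```shell
#!/bin/sh
# Convert the "real" value printed by `time` (e.g. 2m19.597s or 11.836s) to seconds.
to_seconds() {
    echo "$1" | awk '{
        t = $0
        sub(/s$/, "", t)              # drop the trailing "s"
        n = split(t, part, "m")       # "2m19.597" -> part[1]=2, part[2]=19.597
        if (n == 2) printf "%.3f\n", part[1] * 60 + part[2]
        else        printf "%.3f\n", part[1]
    }'
}

# Throughput in MB/s, assuming 1 MB = 1,048,576 bytes.
mb_per_sec() {
    awk -v bytes="$1" -v secs="$2" 'BEGIN { printf "%.1f\n", bytes / 1048576 / secs }'
}

total=$((1024 * 1000000))                        # bs=1024 count=1000000
mb_per_sec "$total" "$(to_seconds 11.836s)"      # Local SATA 500G, 1st test
mb_per_sec "$total" "$(to_seconds 2m19.597s)"    # Single GlusterFS, 1st test
```

Run as written, the two calls print 82.5 and 7.0, matching the corresponding table cells.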
Sequential Write : 64KB x 15,625 times = 1GB
# time dd if=/dev/zero of=/mnt/unify/file bs=65536 count=15625
| Test Case | Local SATA 500G | Local RAID0 | NFS | Single GlusterFS | Unified GlusterFS |
|---|---|---|---|---|---|
| 1 Worker - 1st test | 6.390s (152.8MB/s) | 7.939s (123.0MB/s) | 22.766s (42.9MB/s) | 24.637s (39.6MB/s) | 22.436s (43.5MB/s) |
| 1 Worker - 2nd test | 6.588s (148.2MB/s) | 7.542s (129.5MB/s) | 21.901s (44.6MB/s) | 22.001s (44.4MB/s) | 23.378s (41.8MB/s) |
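Both write tests move the same 1,024,000,000 bytes; only the block size and the block count change. A small sketch of that relationship follows, with `dd_count` as a hypothetical helper name. The loop only prints the dd commands and does not run them.

```shell
#!/bin/sh
# Number of dd blocks needed to write `total` bytes with a given block size.
# Fails if the block size does not divide the total evenly.
dd_count() {
    total=$1
    bs=$2
    if [ $((total % bs)) -ne 0 ]; then
        echo "block size $bs does not divide $total" >&2
        return 1
    fi
    echo $((total / bs))
}

total=1024000000
for bs in 1024 65536; do
    echo "dd if=/dev/zero of=/mnt/unify/file bs=$bs count=$(dd_count "$total" "$bs")"
done
```

Other block sizes could be added to the loop to extend the comparison, as long as they divide the total evenly.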
Sequential Read : 1KB x 1,000,000 times = 1GB
# echo 3 > /proc/sys/vm/drop_caches    // clear buffer cache
# time dd if=/mnt/unify/file of=/dev/null bs=1024 count=1000000
| Test Case | Local SATA 500G | Local RAID0 | NFS | Single GlusterFS | Unified GlusterFS |
|---|---|---|---|---|---|
| 1 Worker - 1st test | 17.230s (56.7MB/s) | 10.464s (93.3MB/s) | 10.493s (93.0MB/s) | 14.300s (68.3MB/s) | 18.532s (52.7MB/s) |
| 1 Worker - 2nd test | 17.201s (56.8MB/s) | 10.242s (95.3MB/s) | 10.962s (89.0MB/s) | 14.596s (66.9MB/s) | 18.392s (53.1MB/s) |
Sequential Read : 64KB x 15,625 times = 1GB
# echo 3 > /proc/sys/vm/drop_caches    // clear buffer cache
# time dd if=/mnt/unify/file of=/dev/null bs=65536 count=15625
| Test Case | Local SATA 500G | Local RAID0 | NFS | Single GlusterFS | Unified GlusterFS |
|---|---|---|---|---|---|
| 1 Worker - 1st test | 21.590s (45.2MB/s) | 4.817s (202.7MB/s) | 9.712s (100.5MB/s) | 14.425s (67.7MB/s) | 14.777s (66.1MB/s) |
| 1 Worker - 2nd test | 22.031s (44.3MB/s) | 4.645s (210.2MB/s) | 9.996s (97.7MB/s) | 14.109s (69.2MB/s) | 14.971s (65.2MB/s) |
How many files can be created in 10 minutes
A simple script was written for this test. It continuously creates 1MB files in a 2-level hierarchical directory structure, with 100 directories per root directory and 100 files per directory.
#!/bin/bash
# genfiles TIMEOUT BASEDIR
# Creates 1MB files under BASEDIR until TIMEOUT seconds have elapsed.
# Uses bash because of `let` and `&>`.
HOSTNAME=`hostname`
TIMEOUT=$1
BASEDIR=$2
PID=$$
TIMESTART=`date '+%s'`
TIMEEND=`expr $TIMESTART + $TIMEOUT`
FILECNT=0
RCNT=1
while [ 1 ]; do
    # one root directory per pass, unique per host and process
    ROOTPATH=${BASEDIR}/${HOSTNAME}_${PID}_${RCNT}
    mkdir "$ROOTPATH"
    DCNT=1
    while [ $DCNT -le 100 ]; do
        DIRPATH=${ROOTPATH}/dir_${DCNT}
        mkdir "$DIRPATH"
        FCNT=1
        while [ $FCNT -le 100 ]; do
            FILEPATH=${DIRPATH}/file_${FCNT}.bin
            # 64KB x 16 = 1MB per file
            dd if=/dev/zero of="$FILEPATH" bs=65536 count=16 &> /dev/null
            sync
            echo -n "."
            let FILECNT=FILECNT+1
            let FCNT=FCNT+1
            TIMENOW=`date '+%s'`
            if [ $TIMENOW -ge $TIMEEND ]; then
                echo "PID $PID : $FILECNT files created in $((TIMENOW - TIMESTART)) seconds."
                exit
            fi
        done
        echo ""
        let DCNT=DCNT+1
    done
    let RCNT=RCNT+1
done
Run the test for 10 minutes.
# ./genfiles 600 /mnt/unify
| (threads x clients) | Local SATA 500G | Local RAID0 | NFS | Single GlusterFS | Unified GlusterFS |
|---|---|---|---|---|---|
| 1 Worker (1 x 1) | 14,270 files (23.8MB/s) | 26,144 files (43.6MB/s) | 17,466 files (29.1MB/s) | 17,734 files (29.6MB/s) | 19,122 files (31.2MB/s) |
| 10 Workers (10 x 1) | 12,282 files (20.4MB/s) | 33,916 files (56.5MB/s) | 16,881 files (28.1MB/s) | 27,938 files (46.6MB/s) | 36,096 files (60.2MB/s) |
| 15 Workers (5 x 3) | x | x | 17,837 files (29.7MB/s) | 22,367 files (37.3MB/s) | 61,280 files (102.1MB/s) |
| 30 Workers (10 x 3) | x | x | x | x | 78,371 files (130.6MB/s) |
| n Workers (n x 3) | (25MB/s may be max) | (50~60MB/s may be max) | (30MB/s may be max) | (40~50MB/s may be max) | (150~200MB/s may be max) |
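For the multi-worker rows, each worker prints its own summary line, so the per-worker counts have to be added up. The sketch below shows one way this could be done; the function name `summarize` and the log file names in the usage comment are assumptions, while the summary line format is the one printed by genfiles.

```shell
#!/bin/sh
# Sum the per-worker file counts from lines of the form
#   "PID 1234 : 5678 files created in 600 seconds."
# Progress dots printed before "PID" on the same line are tolerated.
# Prints the total and the aggregate MB/s, assuming 1 MB per file.
# The elapsed time is taken from the last summary line seen.
summarize() {
    awk '
        match($0, /PID [0-9]+ : [0-9]+ files/) {
            split(substr($0, RSTART, RLENGTH), f, " ")   # f[4] is the file count
            total += f[4]
            n = split($0, w, " ")
            secs = w[n - 1] + 0                          # word before "seconds."
        }
        END {
            if (secs > 0) printf "%d files (%.1f MB/s)\n", total, total / secs
        }'
}

# Usage sketch (not executed here): run 10 workers, then summarize their logs.
#   for i in $(seq 10); do ./genfiles 600 /mnt/unify > "worker.$i.log" & done; wait
#   cat worker.*.log | summarize
```

Because the MB/s value is simply files divided by seconds, it relies on every file being exactly 1MB, as in genfiles.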
How many files can be read in 10 minutes
This script continuously reads 1MB files using cp. To avoid buffer-cache effects, it randomizes the file list every time it starts and clears the buffer cache after every file read.
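#!/bin/bash
# readfiles TIMEOUT FILELIST RANGE
# Reads the first RANGE files of FILELIST in random order until TIMEOUT seconds have elapsed.
# Must run as root, because it writes to /proc/sys/vm/drop_caches.
HOSTNAME=`hostname`
TIMEOUT=$1
FILELIST=$2
RANGE=$3
PID=$$
# make random list: prefix each path with a random key, sort by key, strip the key
RANDLIST=${FILELIST}.${PID}
for FILEPATH in `head -n $RANGE $FILELIST`; do
    echo "$RANDOM $FILEPATH"
done | sort -n | cut -d ' ' -f '2-' > $RANDLIST
TIMESTART=`date '+%s'`
TIMEEND=`expr $TIMESTART + $TIMEOUT`
# read operation
FILECNT=0
while [ 1 ]; do
    for FILEPATH in `cat $RANDLIST`; do
        cp "$FILEPATH" /dev/null
        # clear buffer cache so the next read is not served from memory
        echo 3 > /proc/sys/vm/drop_caches
        echo -n "."
        let FILECNT=FILECNT+1
        TIMENOW=`date '+%s'`
        if [ $TIMENOW -ge $TIMEEND ]; then
            echo "PID $PID : $FILECNT files read in $((TIMENOW - TIMESTART)) seconds."
            rm $RANDLIST
            exit
        fi
    done
done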
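The randomization step in the read script is a decorate-sort-undecorate shuffle. Here is a standalone sketch of the same idea. Note the substitutions: it uses awk's `rand()` instead of bash's `$RANDOM` so that it also runs under plain sh, and a tab as the separator so that paths containing spaces survive. The function name `shuffle_lines` is hypothetical.

```shell
#!/bin/sh
# Shuffle lines from stdin: prefix each line with a random key, sort by the
# key, then strip the key again.
# Argument: a seed for the random generator (the same seed gives the same order).
shuffle_lines() {
    awk -v seed="$1" 'BEGIN { srand(seed) } { printf "%.9f\t%s\n", rand(), $0 }' |
        sort -n |
        cut -f 2-
}

# Example: shuffle five sample names, seeded with this shell's PID.
printf 'file_%d.bin\n' 1 2 3 4 5 | shuffle_lines "$$"
```

Seeding with the PID gives each worker process its own order, which is also what the per-PID list file in the script achieves.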
Make a list of files.
# find /mnt/single/10 -type f > /mnt/single/filelist
# head /mnt/single/filelist
/mnt/single/10/dev-fs103_30755_1/dir_1/file_1.bin
/mnt/single/10/dev-fs103_30755_1/dir_1/file_2.bin
/mnt/single/10/dev-fs103_30755_1/dir_1/file_3.bin
/mnt/single/10/dev-fs103_30755_1/dir_1/file_4.bin
(...)
Run the test.
# ./readfiles 600 /mnt/unify/filelist 5000 // read random-ordered 5000 files continuously
| (threads x clients) | Local SATA 500G | Local RAID0 | NFS | Single GlusterFS | Unified GlusterFS |
|---|---|---|---|---|---|
| 1 Worker (1 x 1) | 8,523 files (14.2MB/s) | 10,804 files (18.0MB/s) | 15,605 files (26.0MB/s) | 7,314 files (12.2MB/s) | 10,825 files (18.0MB/s) |
| 10 Workers (10 x 1) | x | 9,680 files (16.1MB/s) | 16,215 files (27.0MB/s) | 13,056 files (21.8MB/s) | 41,453 files (69.1MB/s) |
| 15 Workers (5 x 3) | x | x | 19,969 files (33.3MB/s) | 10,189 files (17.0MB/s) | 88,724 files (147.9MB/s) |
| 30 Workers (10 x 3) | x | x | x | x | 101,979 files (170.0MB/s) |
| n Workers (n x 3) | (15MB/s may be max) | (20MB/s may be max) | (35MB/s may be max) | (25MB/s may be max) | (200MB/s may be max) |
Stress test
Failed cases
- 5 writing threads and 20 reading threads per client, for a total of 15 writing threads and 60 reading threads. - FAILED
- 3 writing threads and 10 reading threads per client, for a total of 9 writing threads and 30 reading threads. - FAILED
# ./genfiles 43200 /mnt/unify/big &    // x 5
# ./readfiles 43200 /mnt/unify/filelist 100000 &    // x 20
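The "x 5" and "x 20" annotations mean that each command was started that many times in the background. A launcher along the following lines could automate this. It is only a sketch: the function name `launch_n` and the log file paths are hypothetical, and the original test may simply have repeated the commands by hand.

```shell
#!/bin/sh
# Hypothetical launcher: start N copies of a command in the background,
# each logging to its own file. The PIDs are left in $pids for `wait`.
launch_n() {
    count=$1
    logprefix=$2
    shift 2
    i=1
    pids=""
    while [ "$i" -le "$count" ]; do
        "$@" > "${logprefix}.${i}.log" 2>&1 &
        pids="$pids $!"
        i=$((i + 1))
    done
}

# Usage sketch for the mixed load on one client (not executed here):
#   launch_n 5  /tmp/gen  ./genfiles  43200 /mnt/unify/big;             writers=$pids
#   launch_n 20 /tmp/read ./readfiles 43200 /mnt/unify/filelist 100000; readers=$pids
#   wait $writers $readers
```

Keeping one log per worker makes it easier to see afterwards which workers hit errors and when.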
This test did not succeed. I tried several times with different stress loads. Whenever I applied more stress with a combination of reading and writing, it did not work correctly. At first I thought a race condition while clearing the buffer cache was causing the trouble, so I tried without it, but the failure happened again.
I'm not sure this is a problem in GlusterFS itself, because the failure shows up at the Ethernet connection and could be caused by the Linux Ethernet driver, by hardware including the gigabit switch (NETGEAR), or by Linux kernel modules. One thing is certain: *it only happens when reading and writing threads run together*. I don't think FUSE is the problem, because the Ethernet failure occurred on both the server side and the client side, while FUSE is installed only on the client side. 2.0.0rc1 is the current release, so this might not happen in a future version or in the legacy 1.x stable version.
Here is what happened.
On the server side, everything works fine for 1 or 2 minutes. After that, some servers stop responding. When I checked on the console, the server was alive, but its Ethernet connection was not working correctly.
# ping 10.40.197.90 (network unreachable error messages)
# service network restart
# ping 10.40.197.90 (works fine)
# ps ax | grep gluster
32545 ? Rsl 6:08 glusterfsd -f /usr/local/etc/glusterfs/glusterfs-server.vol
// no need to restart glusterfsd, which is why I'm not sure GlusterFS causes this problem
On the client side, the output looks like this:
('w' means writing, '.' means reading)
.w........w.w....w...ww..........ww..w....w.ww......w....w....w...w..w.w.....
ww...w...w...w....ww.w.w..w.......w.w...w..w........w..w......w....w......w..
..ww.....w..w.....ww....w......w.w......w.w.w...w.w.......w.w..........w..ww.
......w...w.w......w.w.....w...w.w..ww.......ww....w..w....w....w.w....w.....
.w......w.w.....w.ww.......ww.....w...w...w........ww..w....w.....w.ww.......
.ww..ww...ww...ww...w.w..w.
============ after 1-2 minutes
cp: cannot open `/mnt/unify/10/dev-fs108_21569_1/dir_6/file_50.bin' for reading: Input/output error
..ww........cp: cannot open `/mnt/unify/10/dev-fs108_21564_1/dir_12/file_83.bin' for reading: Input/output error
.cp: cannot open `/mnt/unify/10/dev-fs108_21565_1/dir_26/file_1.bin' for reading: Input/output error
.w....w.wcp: cannot open `/mnt/unify/10/dev-fs108_21564_1/dir_21/file_100.bin' for reading: Input/output error
============ getting slower
......cp: cannot open `/mnt/unify/10/dev-fs108_21564_1/dir_37/file_9.bin' for reading: Input/output error
.cp: cannot open `/mnt/unify/10/dev-fs108_21564_1/dir_16/file_70.bin' for reading: Input/output error
..cp: cannot open `/mnt/unify/10/dev-fs108_21563_1/dir_25/file_100.bin' for reading: Input/output error
.....ww.......w.cp: cannot open `/mnt/unify/10/dev-fs108_21566_1/dir_17/file_42.bin' for reading: Input/output error
============ getting even slower
.....wcp: cannot open `/mnt/unify/10/dev-fs108_21569_1/dir_22/file_73.bin' for reading: Input/output error
.......ww..w.....cp: cannot open `/mnt/unify/10/dev-fs108_21566_1/dir_12/file_56.bin' for reading: Input/output error
..w.....cp: cannot open `/mnt/unify/10/dev-fs108_21570_1/dir_24/file_10.bin' for reading: Input/output error
.wcp: cannot open `/mnt/unify/10/dev-fs108_21569_1/dir_23/file_39.bin' for reading: Input/output error
============ finally both reading and writing operations stopped
- At this point, the 'df' command blocked just before printing the 'unify' volume information.
- Sometimes the Ethernet connection died.
- To recover, I did the following:
# killall genfiles readfiles
# service network restart
(if it does not work)
# umount /mnt/unify
# killall glusterfs
# glusterfs -f /usr/local/glusterfs/glusterfs-unify.vol /mnt/unify    // remounting
# df | grep unify
glusterfs 7210805248 257527808 6586988544 4% /mnt/unify
Successful cases
- 10 writing threads on C1 and 30 reading threads on each of C2 and C3, for a total of 10 writing threads and 60 reading threads. - SUCCEEDED
# ./genfiles 43200 /mnt/unify/big &    // x 5 on C1
# ./readfiles 43200 /mnt/unify/filelist 100000 &    // x 30 on C2 and C3
Problems
While deleting file structures
Whenever I tried to delete a huge file structure, rm reported errors like the ones below. To remove the entire structure cleanly, the rm command has to be issued repeatedly. This happens because some files are not actually removed (the removal reports success, but the file remains), so the directory is not empty and removing the directory fails.
[question: was this ever identified and fixed as a bug?]
# rm -rfv *
removed `dev-fs108_4508_1/dir_18/file_72.bin'
removed `dev-fs108_4508_1/dir_18/file_88.bin'
removed directory: `dev-fs108_4508_1/dir_18'
rm: cannot remove directory `dev-fs108_4508_1/dir_18': No such file or directory
removed `dev-fs108_4508_1/dir_19/file_1.bin'
removed `dev-fs108_4508_1/dir_19/file_8.bin'
removed `dev-fs108_4508_1/dir_19/file_9.bin'
removed `dev-fs108_4508_1/dir_19/file_17.bin'
removed directory: `dev-fs108_4508_1/dir_19'
rm: cannot remove directory `dev-fs108_4508_1/dir_19': No such file or directory
# rm -rfv *
(...)
# rm -rfv *
# // done. all removed
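The manual repetition of rm shown in the session above could be scripted as a bounded retry loop. The sketch below is a workaround only, not a fix for the underlying behavior; the function name `remove_tree` and the default of 5 attempts are assumptions, and it takes a single path rather than `*`.

```shell
#!/bin/sh
# Hypothetical workaround: repeat `rm -rf` until the target is really gone,
# giving up after a fixed number of attempts (default 5).
remove_tree() {
    target=$1
    max_attempts=${2:-5}
    attempt=1
    while [ -e "$target" ]; do
        if [ "$attempt" -gt "$max_attempts" ]; then
            echo "giving up on $target after $max_attempts attempts" >&2
            return 1
        fi
        rm -rf "$target" 2> /dev/null
        attempt=$((attempt + 1))
    done
    return 0
}

# Usage sketch (not executed here):
#   remove_tree /mnt/unify/10/dev-fs108_4508_1 10
```

The attempt limit keeps the loop from spinning forever if a file can never be removed.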


