[Gluster-users] replication problems

Adrian Moisey adrian at careerjunction.co.za
Thu Oct 1 07:59:48 UTC 2009


Hi

Yes, I have 2 servers and 2 clients.

Each of the two clients connects to both servers, and both clients' 
configs are exactly the same.

If server1 goes down, client1 breaks (as per my mail below); client1 
doesn't just fail over and continue working against server2.

Is that enough info?


Adrian Moisey
Systems Designer | CareerJunction | Better jobs. More often.
Web: www.careerjunction.co.za | Email: adrian at careerjunction.co.za
Phone: +27 21 818 8621 | Mobile: +27 82 858 7830 | Fax: +27 21 818 8855


Pavan Vilas Sondur wrote:
> Hi Adrian,
> Correct me if I've got you wrong - you have 2 servers, and a client replicates to both servers. If the first server goes down, the client also stops responding. You also mentioned more than one client - can you clarify this so that we can try to understand the issue?
> 
> Pavan
> 
> On 01/10/09 08:41 +0200, Adrian Moisey wrote:
>> Hi
>>
>> I am currently testing GlusterFS with replication.
>> I am running Ubuntu hardy using packages from the PPA on launchpad.net.  
>> I am currently using glusterfs 2.0.6.
>>
>> I have 2 machines, both exporting 1 brick each. This is the config I'm  
>> using:
>> ----8<----8<----8<----8<----8<----8<----8<----8<----8<----
>> volume posix
>>  type storage/posix
>>  option directory /home/export/
>> end-volume
>>
>> volume locks
>>   type features/locks
>>   subvolumes posix
>> end-volume
>>
>> volume cache
>>   type performance/io-cache
>>   subvolumes locks
>> end-volume
>>
>> volume brick
>>   type performance/io-threads
>>   option thread-count 8
>>   subvolumes cache
>> end-volume
>>
>> ### Add network serving capability to above brick.
>> volume server
>>  type protocol/server
>>  option transport-type tcp
>>  subvolumes brick
>>  option auth.addr.brick.allow * # Allow access to "brick" volume
>> end-volume
>> ----8<----8<----8<----8<----8<----8<----8<----8<----8<----
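>>
>> One note on that server volume: "option auth.addr.brick.allow *" accepts
>> connections from any host. As a sketch (addresses taken from the client
>> config further down, and assuming the comma-separated allow-list syntax),
>> access could instead be restricted to the two machines:
>> ----8<----8<----8<----8<----8<----8<----8<----8<----8<----
>> volume server
>>  type protocol/server
>>  option transport-type tcp
>>  subvolumes brick
>>  # comma-separated list of permitted client addresses (assumed syntax)
>>  option auth.addr.brick.allow 172.19.45.102,172.19.45.103
>> end-volume
>> ----8<----8<----8<----8<----8<----8<----8<----8<----8<----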
>>
>> I then have 2 clients (which happen to be the same 2 machines) that  
>> connect to both bricks and replicate them using this config:
>>
>> ----8<----8<----8<----8<----8<----8<----8<----8<----8<----
>> ### Add client feature and attach to remote subvolume of server1
>> volume brick1
>>  type protocol/client
>>  option transport-type tcp
>>  option remote-host 172.19.45.102      # IP address of the remote brick
>>  option remote-subvolume brick        # name of the remote volume
>> end-volume
>>
>> ### Add client feature and attach to remote subvolume of server2
>> volume brick2
>>  type protocol/client
>>  option transport-type tcp
>>  option remote-host 172.19.45.103      # IP address of the remote brick
>>  option remote-subvolume brick        # name of the remote volume
>> end-volume
>>
>> volume replicate
>>  type cluster/replicate
>>  subvolumes brick1 brick2
>> end-volume
>> ----8<----8<----8<----8<----8<----8<----8<----8<----8<----
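>>
>> On the client side, how quickly a dead server is noticed is governed by
>> the protocol/client timeouts; the debug log further down shows them
>> defaulting (ping-timeout 10s, frame-timeout 30min). As a sketch, with
>> option names as understood for glusterfs 2.0.x, they can be set
>> explicitly in each client volume:
>> ----8<----8<----8<----8<----8<----8<----8<----8<----8<----
>> volume brick1
>>  type protocol/client
>>  option transport-type tcp
>>  option remote-host 172.19.45.102
>>  option remote-subvolume brick
>>  option ping-timeout 10      # seconds of silence before the server is marked down
>>  option frame-timeout 600    # seconds before an in-flight call is failed
>> end-volume
>> ----8<----8<----8<----8<----8<----8<----8<----8<----8<----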
>>
>> If I start the 2 servers up and then mount both clients, everything works 
>> fine. I have shared storage which is replicated to each host.
>>
>> If I shut one brick down, the client on that machine also dies and I 
>> get strange errors:
>> ----8<----8<----8<----8<----8<----8<----8<----8<----8<----
>> # cd /mnt/gluster
>> bash: cd: /mnt/gluster: Transport endpoint is not connected
>> # df -h
>> Filesystem            Size  Used Avail Use% Mounted on
>> /dev/sda1             9.5G  1.1G  7.9G  13% /
>> varrun                125M   68K  125M   1% /var/run
>> varlock               125M     0  125M   0% /var/lock
>> udev                  125M   44K  125M   1% /dev
>> devshm                125M     0  125M   0% /dev/shm
>> df: `/mnt/gluster': Transport endpoint is not connected
>> # mount
>> /dev/sda1 on / type ext3 (rw,relatime,errors=remount-ro)
>> proc on /proc type proc (rw,noexec,nosuid,nodev)
>> /sys on /sys type sysfs (rw,noexec,nosuid,nodev)
>> varrun on /var/run type tmpfs (rw,noexec,nosuid,nodev,mode=0755)
>> varlock on /var/lock type tmpfs (rw,noexec,nosuid,nodev,mode=1777)
>> udev on /dev type tmpfs (rw,mode=0755)
>> devshm on /dev/shm type tmpfs (rw)
>> devpts on /dev/pts type devpts (rw,gid=5,mode=620)
>> securityfs on /sys/kernel/security type securityfs (rw)
>> /etc/glusterfs/glusterfs.vol on /mnt/gluster type fuse.glusterfs  
>> (rw,allow_other,default_permissions,max_read=131072)
>> ----8<----8<----8<----8<----8<----8<----8<----8<----8<----
>>
>> Here is a copy of debug logs:
>> [2009-10-01 08:16:15] D [glusterfsd.c:354:_get_specfp] glusterfs:  
>> loading volume file /etc/glusterfs/glusterfs.vol
>> ================================================================================
>> Version      : glusterfs 2.0.6 built on Aug 31 2009 20:14:31
>> TLA Revision : v2.0.6
>> Starting Time: 2009-10-01 08:16:15
>> Command line : glusterfs --log-level=DEBUG  
>> --volfile=/etc/glusterfs/glusterfs.vol /mnt/gluster/
>> PID          : 17884
>> System name  : Linux
>> Nodename     : cj-cpt-molb01
>> Kernel Release : 2.6.24-24-server
>> Hardware Identifier: i686
>>
>> Given volfile:
>> +------------------------------------------------------------------------------+
>>   1: ### Add client feature and attach to remote subvolume of server1
>>   2: volume brick1
>>   3:  type protocol/client
>>   4:  option transport-type tcp
>>   5:  option remote-host 172.19.45.102      # IP address of the remote  
>> brick
>>   6:  option remote-subvolume brick        # name of the remote volume
>>   7: end-volume
>>   8:
>>   9: ### Add client feature and attach to remote subvolume of server2
>>  10: volume brick2
>>  11:  type protocol/client
>>  12:  option transport-type tcp
>>  13:  option remote-host 172.19.45.103      # IP address of the remote  
>> brick
>>  14:  option remote-subvolume brick        # name of the remote volume
>>  15: end-volume
>>  16:
>>  17: volume replicate
>>  18:  type cluster/replicate
>>  19:  subvolumes brick1 brick2
>>  20: end-volume
>>
>> +------------------------------------------------------------------------------+
>> [2009-10-01 08:16:15] D [glusterfsd.c:1205:main] glusterfs: running in  
>> pid 17884
>> [2009-10-01 08:16:15] D [client-protocol.c:5952:init] brick1: defaulting  
>> frame-timeout to 30mins
>> [2009-10-01 08:16:15] D [client-protocol.c:5963:init] brick1: defaulting  
>> ping-timeout to 10
>> [2009-10-01 08:16:15] D [transport.c:141:transport_load] transport:  
>> attempt to load file /usr/lib/glusterfs/2.0.6/transport/socket.so
>> [2009-10-01 08:16:15] D [transport.c:141:transport_load] transport:  
>> attempt to load file /usr/lib/glusterfs/2.0.6/transport/socket.so
>> [2009-10-01 08:16:15] D [client-protocol.c:5952:init] brick2: defaulting  
>> frame-timeout to 30mins
>> [2009-10-01 08:16:15] D [client-protocol.c:5963:init] brick2: defaulting  
>> ping-timeout to 10
>> [2009-10-01 08:16:15] D [transport.c:141:transport_load] transport:  
>> attempt to load file /usr/lib/glusterfs/2.0.6/transport/socket.so
>> [2009-10-01 08:16:15] D [transport.c:141:transport_load] transport:  
>> attempt to load file /usr/lib/glusterfs/2.0.6/transport/socket.so
>> [2009-10-01 08:16:15] D [client-protocol.c:6280:notify] brick1: got  
>> GF_EVENT_PARENT_UP, attempting connect on transport
>> [2009-10-01 08:16:15] D [client-protocol.c:6280:notify] brick1: got  
>> GF_EVENT_PARENT_UP, attempting connect on transport
>> [2009-10-01 08:16:15] D [client-protocol.c:6280:notify] brick2: got  
>> GF_EVENT_PARENT_UP, attempting connect on transport
>> [2009-10-01 08:16:15] D [client-protocol.c:6280:notify] brick2: got  
>> GF_EVENT_PARENT_UP, attempting connect on transport
>> [2009-10-01 08:16:15] D [client-protocol.c:6280:notify] brick1: got  
>> GF_EVENT_PARENT_UP, attempting connect on transport
>> [2009-10-01 08:16:15] D [client-protocol.c:6280:notify] brick1: got  
>> GF_EVENT_PARENT_UP, attempting connect on transport
>> [2009-10-01 08:16:15] D [client-protocol.c:6280:notify] brick2: got  
>> GF_EVENT_PARENT_UP, attempting connect on transport
>> [2009-10-01 08:16:15] D [client-protocol.c:6280:notify] brick2: got  
>> GF_EVENT_PARENT_UP, attempting connect on transport
>> [2009-10-01 08:16:15] N [glusterfsd.c:1224:main] glusterfs: Successfully  
>> started
>> [2009-10-01 08:16:15] D [client-protocol.c:6294:notify] brick1: got  
>> GF_EVENT_CHILD_UP
>> [2009-10-01 08:16:15] D [client-protocol.c:6294:notify] brick1: got  
>> GF_EVENT_CHILD_UP
>> [2009-10-01 08:16:15] N [client-protocol.c:5559:client_setvolume_cbk]  
>> brick1: Connected to 172.19.45.102:6996, attached to remote volume 
>> 'brick'.
>> [2009-10-01 08:16:15] N [afr.c:2203:notify] replicate: Subvolume  
>> 'brick1' came back up; going online.
>> [2009-10-01 08:16:15] N [client-protocol.c:5559:client_setvolume_cbk]  
>> brick1: Connected to 172.19.45.102:6996, attached to remote volume 
>> 'brick'.
>> [2009-10-01 08:16:15] N [afr.c:2203:notify] replicate: Subvolume  
>> 'brick1' came back up; going online.
>> [2009-10-01 08:16:15] D [client-protocol.c:6294:notify] brick2: got  
>> GF_EVENT_CHILD_UP
>> [2009-10-01 08:16:15] D [client-protocol.c:6294:notify] brick2: got  
>> GF_EVENT_CHILD_UP
>> [2009-10-01 08:16:15] N [client-protocol.c:5559:client_setvolume_cbk]  
>> brick2: Connected to 172.19.45.103:6996, attached to remote volume 
>> 'brick'.
>> [2009-10-01 08:16:15] N [client-protocol.c:5559:client_setvolume_cbk]  
>> brick2: Connected to 172.19.45.103:6996, attached to remote volume 
>> 'brick'.
>> [2009-10-01 08:17:24] N [client-protocol.c:6246:notify] brick1: disconnected
>> [2009-10-01 08:17:27] E [socket.c:745:socket_connect_finish] brick1:  
>> connection to 172.19.45.102:6996 failed (Connection refused)
>> [2009-10-01 08:17:27] E [socket.c:745:socket_connect_finish] brick1:  
>> connection to 172.19.45.102:6996 failed (Connection refused)
>>
>>
>>
>> Any ideas?
>>
>>


