[Gluster-users] cluster.min-free-disk separate for each brick

Wed Aug 17 15:19:46 UTC 2011

> Dan Bretherton wrote:
>>
>> On 15/08/11 20:00, gluster-users-request at gluster.org wrote:
>>> Message: 1
>>> Date: Sun, 14 Aug 2011 23:24:46 +0300
>>> From: "Deyan Chepishev - SuperHosting.BG"<dchepishev at superhosting.bg>
>>> Subject: [Gluster-users] cluster.min-free-disk  separate for each
>>>     brick
>>> To: gluster-users at gluster.org
>>> Message-ID:<4E482F0E.3030604 at superhosting.bg>
>>> Content-Type: text/plain; charset=UTF-8; format=flowed
>>>
>>> Hello,
>>>
>>> I have a gluster set up with very different brick sizes.
>>>
>>> brick1: 9T
>>> brick2: 9T
>>> brick3: 37T
>>>
>>> With this configuration, if I set the parameter cluster.min-free-disk
>>> to 10%, it applies to all bricks, which is quite uncomfortable with
>>> these brick sizes, because 10% of the small bricks is ~1T but for the
>>> big brick it is ~3.7T. What happens in the end is that if all bricks
>>> go to 90% usage and I continue writing, the small ones eventually fill
>>> up to 100% while the big one still has enough free space.
>>>
>>> My question is: is there a way to set cluster.min-free-disk per brick
>>> instead of setting it for the entire volume, or any other way to work
>>> around this problem?
>>>
>>> Thank you in advance
>>>
>>> Regards,
>>> Deyan
>>>
>> Hello Deyan,
>>
>> I have exactly the same problem and I have asked about it before - 
>> see links below.
>>
>> http://community.gluster.org/q/in-version-3-1-4-how-can-i-set-the-minimum-amount-of-free-disk-space-on-the-bricks/ 
>>
>> http://gluster.org/pipermail/gluster-users/2011-May/007788.html
>>
>> My understanding is that the patch referred to in Amar's reply in the 
>> May thread prevents a "migrate-data" rebalance operation failing by 
>> running out of space on smaller bricks, but that doesn't solve the 
>> problem we are having.  Being able to set min-free-disk for each 
>> brick separately would be useful, as would being able to set this 
>> value as a number of bytes rather than a percentage.  However, even 
>> if these features were present we would still have a problem when the 
>> amount of free space becomes less than min-free-disk, because this 
>> just results in a warning message in the logs and doesn't actually 
>> prevent more files from being written.  In other words, min-free-disk 
>> is a soft limit rather than a hard limit.  When a volume is more than 
>> 90% full there may still be hundreds of gigabytes of free space 
>> spread over the large bricks, but the small bricks may each only have 
>> a few gigabytes left or even less.  Users do "df" and see lots of 
>> free space in the volume so they continue writing files.  However, 
>> when GlusterFS chooses to write a file to a small brick, the write 
>> fails with "device full" errors if the file grows too large, which is 
>> often the case here with files typically several gigabytes in size 
>> for some applications.
>>
>> I would really like to know if there is a way to make min-free-disk a 
>> hard limit.  Ideally, GlusterFS would choose a brick on which to write 
>> a file based on how much free space it has left rather than choosing 
>> a brick at random (or however it is done now).  That would solve the 
>> problem of non-uniform brick sizes without the need for a hard 
>> min-free-disk limit.
>>
>> Amar's comment in the May thread about QA testing being done only on 
>> volumes with uniform brick sizes prompted me to start standardising 
>> on a uniform brick size for each volume in my cluster.  My impression 
>> is that implementing the features needed for users with non-uniform 
>> brick sizes is not a priority for Gluster, and that users are all 
>> expected to use uniform brick sizes.  I really think this fact should 
>> be stated clearly in the GlusterFS documentation, in the sections on 
>> creating volumes in the Administration Guide for example.  That would 
>> stop other users from going down the path that I did initially, which 
>> has given me a real headache because I am now having to move tens of 
>> terabytes of data off bricks that are larger than the new standard size.
>>
>> Regards
>> Dan.
>>
> Hello,
>
> This is really bad news, because I already migrated my data and I just 
> realized that I am screwed because Gluster just does not care about 
> the brick sizes.
> It is impossible to move to uniform brick sizes.
>
> Currently we use 2TB  HDDs, but the disks are growing and soon we will 
> probably use 3TB hdds or whatever other larger sizes appear on the 
> market. So if we choose to use raid5 and some level of redundancy (for 
> example 6hdds in raid5, no matter what their size is) this sooner or 
> later will lead us to non uniform bricks which is a problem and it is 
> not correct to expect that we always can or want to provide uniform 
> size bricks.
>
> With this way of thinking if we currently have 10T from 6x2T in raid5, 
> at some point when there is a 10T on a single disk we will have to use 
> no raid just because gluster can not handle non uniform bricks.
>
> Regards,
> Deyan
>

I think Amar might have provided the answer in his posting to the thread 
yesterday, which has just appeared in my autospam folder.

http://gluster.org/pipermail/gluster-users/2011-August/008579.html

> With size option, you can have a hardbound on min-free-disk
This means that you can set a hard limit on min-free-disk, and set a 
value in GB that is bigger than the biggest file that is ever likely to 
be written.  This looks likely to solve our problem and make non-uniform 
brick sizes a practical proposition.  I wish I had known about this back 
in May when I embarked on my cluster restructuring exercise; the issue 
was discussed in this thread in May as well:  
http://gluster.org/pipermail/gluster-users/2011-May/007794.html
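
To illustrate why a size-based value suits mixed brick sizes better than a percentage, here is a minimal Python sketch using the brick sizes from Deyan's original post.  The function names and the 50 GB reserve are made up for illustration, and decimal units are assumed.  The code only does the arithmetic; it does not call GlusterFS or model how it actually places files.

```python
# Brick sizes from the original post, in GB (decimal units assumed).
BRICKS_GB = {"brick1": 9000, "brick2": 9000, "brick3": 37000}

# Hypothetical size-based reserve; pick something larger than the
# biggest file that is ever likely to be written.
FIXED_RESERVE_GB = 50


def percent_reserve_gb(size_gb, percent):
    """Free space (GB) that a percentage-based threshold keeps on a brick."""
    return size_gb * percent // 100


def below_reserve(size_gb, used_gb, reserve_gb):
    """True once the brick's free space has dropped below the reserve."""
    return size_gb - used_gb < reserve_gb


if __name__ == "__main__":
    for name, size in BRICKS_GB.items():
        pct = percent_reserve_gb(size, 10)
        print(f"{name}: 10% keeps {pct} GB free, "
              f"fixed reserve keeps {FIXED_RESERVE_GB} GB free")
```

With a 10% threshold the 37T brick holds back 3700 GB while each 9T brick holds back 900 GB; a fixed reserve holds back the same amount on every brick regardless of its size.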

Once I have moved all the data off the large bricks and standardised on 
a uniform brick size, it will be relatively easy to stick to this 
because I use LVM.  I create logical volumes for new bricks when a 
volume needs extending.  The only problem with this approach is what 
happens when the amount of free space left on a server is less than the 
size of the brick you want to create.  The only option then would be to 
use new servers, potentially wasting several TB of free space on 
existing servers.  The standard brick size for most of my volumes is 
3TB, which allows me to use a mixture of small servers and large servers 
in a volume and limits the amount of free space that would be wasted if 
there wasn't quite enough free space on a server to create another 
brick.  Another consequence of having 3TB bricks is that a single server 
typically has two or more bricks belonging to the same volume, 
although I do my best to distribute the volumes across different servers 
in order to spread the load.  I am not aware of any problems associated 
with exporting multiple bricks from a single server, and it has not 
caused me any so far.
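
The trade-off between brick size and wasted space can be sketched in a few lines of Python.  The server names and free-space figures below are invented for illustration; only the 3TB brick size comes from the text above.

```python
def plan_bricks(free_gb, brick_gb=3000):
    """Return (whole bricks that fit, leftover GB too small for a brick)."""
    return divmod(free_gb, brick_gb)


if __name__ == "__main__":
    # Hypothetical servers and their unallocated LVM space in GB.
    servers = {"small-server": 7000, "large-server": 20000}
    for name, free_gb in servers.items():
        for brick_gb in (3000, 9000):
            count, leftover = plan_bricks(free_gb, brick_gb)
            print(f"{name}: {count} x {brick_gb} GB bricks, "
                  f"{leftover} GB left over")
```

The leftover is always less than one brick, so a smaller standard brick size puts a lower ceiling on the space that can be stranded on any one server.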

-Dan.

-- 
Mr. D.A. Bretherton
Computer System Manager
Environmental Systems Science Centre
Harry Pitt Building
3 Earley Gate
University of Reading
Reading, RG6 6AL
UK

Tel. +44 118 378 5205
Fax: +44 118 378 6413