linux-ext4 - Re: [PATCH] e2fsprogs: don't set stripe/stride to 1 block in mkfs

Message-ID: <4D9E55C0.5000607@redhat.com>
Date:	Thu, 07 Apr 2011 17:24:32 -0700
From:	Eric Sandeen <sandeen@...hat.com>
To:	Andreas Dilger <adilger@...ger.ca>
CC:	ext4 development <linux-ext4@...r.kernel.org>,
	Zeev Tarantov <zeev.tarantov@...il.com>,
	Alex Zhuravlev <bzzz@...mcloud.com>
Subject: Re: [PATCH] e2fsprogs: don't set stripe/stride to 1 block in mkfs

On 4/7/11 5:13 PM, Andreas Dilger wrote:
> 
> On 2011-04-05, at 10:56 AM, Eric Sandeen wrote:
> 
>> On 4/5/11 9:39 AM, Eric Sandeen wrote:
>>> Andreas Dilger wrote:
>>>> I don't think it is harmful to specify an mballoc alignment that is
>>>> an even multiple of the underlying device IO size (e.g. at least
>>>> 256kB or 512kB).
>>>>
>>>> If the underlying device (e.g. zram) is reporting 16kB or 64kB opt_io
>>>> size because that is PAGE_SIZE, but blocksize is 4kB, then we will
>>>> have the same performance problem again.
>>>>
>>>> Cheers, Andreas
>>>
>>> I need to look into why ext4_mb_scan_aligned is so inefficient for a block-sized stripe.
>>>
>>> In practice I don't think we've seen this problem with stripe size at 4 or 8 or 16 blocks; it may just be less apparent.  I think the function steps through by stripe-sized units, and if that is 1 block, it's a lot of stepping.  
>>>
>>>        while (i < EXT4_BLOCKS_PER_GROUP(sb)) {
>>> ...
>>>                if (!mb_test_bit(i, bitmap)) {
>>
>> Offhand I think maybe mb_find_next_zero_bit would be more efficient.
>>
>> --- a/fs/ext4/mballoc.c
>> +++ b/fs/ext4/mballoc.c
>> @@ -1939,16 +1939,14 @@ void ext4_mb_scan_aligned(struct ext4_allocation_context *ac,
>>        i = (a * sbi->s_stripe) - first_group_block;
>>
>>        while (i < EXT4_BLOCKS_PER_GROUP(sb)) {
>> -               if (!mb_test_bit(i, bitmap)) {
>> -                       max = mb_find_extent(e4b, 0, i, sbi->s_stripe, &ex);
>> -                       if (max >= sbi->s_stripe) {
>> -                               ac->ac_found++;
>> -                               ac->ac_b_ex = ex;
>> -                               ext4_mb_use_best_found(ac, e4b);
>> -                               break;
>> -                       }
>> +               i = mb_find_next_zero_bit(bitmap, EXT4_BLOCKS_PER_GROUP(sb), i);
>> +               max = mb_find_extent(e4b, 0, i, sbi->s_stripe, &ex);
>> +               if (max >= sbi->s_stripe) {
>> +                       ac->ac_found++;
>> +                       ac->ac_b_ex = ex;
>> +                       ext4_mb_use_best_found(ac, e4b);
>> +                       break;
>>                }
>> -               i += sbi->s_stripe;
>>        }
>> }
>>
>> totally untested, but I think we have better ways to step through the bitmap.
> 
> This changes the allocation completely, AFAICS. Instead of doing
> checks for chunks of free space aligned on sbi->s_stripe boundaries,
> it is instead finding the first free space of size s_stripe
> regardless of alignment. That is not good for RAID back-ends, and is
> the primary reason for ext4_mb_scan_aligned() to exist.

Oh, er, right.  It's what I get for coding-at-conference, sorry.

I do wonder if test-bit/advance/test-bit/advance can be made a bit more efficient with something like find_next_bit.  I just did it wrong. :(

I'll revisit it when I get back home.
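
One way the skip-ahead could keep the alignment check, as a minimal userspace sketch rather than the kernel code: jump to the next free bit, round that position up to a stripe boundary, and only then test for a full stripe of free blocks. All names here (test_bit_u8, find_next_zero, run_is_free, scan_aligned) are hypothetical stand-ins, not the mb_* helpers in mballoc.c, and alignment is assumed to be measured from bit 0 of the bitmap, whereas the real function offsets by the group's first block.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: 1 if bit i is set (block in use), else 0. */
static int test_bit_u8(const uint8_t *bitmap, size_t i)
{
	return (bitmap[i / 8] >> (i % 8)) & 1;
}

/* Index of the first zero bit in [start, nbits), or nbits if there is none.
 * Skips whole bytes that are completely in use. */
static size_t find_next_zero(const uint8_t *bitmap, size_t nbits, size_t start)
{
	size_t i = start;

	while (i < nbits) {
		if (i % 8 == 0 && i + 8 <= nbits && bitmap[i / 8] == 0xff) {
			i += 8;
			continue;
		}
		if (!test_bit_u8(bitmap, i))
			return i;
		i++;
	}
	return nbits;
}

/* 1 if every bit in [start, start + len) is zero. */
static int run_is_free(const uint8_t *bitmap, size_t start, size_t len)
{
	for (size_t i = start; i < start + len; i++)
		if (test_bit_u8(bitmap, i))
			return 0;
	return 1;
}

/* Start of the first stripe-aligned run of `stripe` free bits, or nbits.
 * Assumption: alignment is relative to bit 0 of this bitmap. */
static size_t scan_aligned(const uint8_t *bitmap, size_t nbits, size_t stripe)
{
	size_t i = 0;

	assert(stripe > 0);
	while (i + stripe <= nbits) {
		size_t z = find_next_zero(bitmap, nbits, i);
		if (z >= nbits)
			break;
		/* A stripe that starts before z begins with a used bit, so the
		 * earliest possible candidate is z rounded up to a boundary. */
		size_t cand = (z + stripe - 1) / stripe * stripe;
		if (cand + stripe > nbits)
			break;
		if (run_is_free(bitmap, cand, stripe))
			return cand;
		i = cand + stripe;
	}
	return nbits;
}
```

With a bitmap whose first free run of four bits starts at bit 5, a plain first-fit search would return 5, while scan_aligned with a stripe of 4 skips to the next boundary-aligned free stripe. That is the property the untested patch above lost.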

> I think my original assertion holds - that regardless of what the
> "optimal IO" size reported by the underlying device, doing larger
> allocations at the mballoc level that are even multiples of this size
> isn't harmful. That avoids not only the performance impact of
> 4kB-sized "optimal IO", but also the (lesser) impact of 8kB-64kB
> "optimal IO" allocations as well.
>
> Cheers, Andreas

I'll give that some thought; really, the whole align-on-a-stripe mechanism needs work, at least outside of the Lustre workload :)
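
To make the suggestion concrete, here is a rough sketch of picking a stripe that is a whole multiple of the device's reported optimal IO size and at least some minimum (256kB in the example from the thread). The helper name stripe_blocks and its parameters are hypothetical and not part of e2fsprogs or the kernel; it only illustrates the arithmetic.

```c
#include <stdint.h>

/* Hypothetical helper: smallest whole multiple of opt_io_bytes that is at
 * least min_bytes, expressed in filesystem blocks. Returns 0 when the
 * inputs give no usable hint (zero sizes, or an opt_io that is not a
 * multiple of the block size). */
static uint64_t stripe_blocks(uint64_t opt_io_bytes, uint64_t block_size,
			      uint64_t min_bytes)
{
	uint64_t mult;

	if (opt_io_bytes == 0 || block_size == 0 ||
	    opt_io_bytes % block_size != 0)
		return 0;

	/* How many opt_io units are needed to reach the minimum size. */
	mult = (min_bytes + opt_io_bytes - 1) / opt_io_bytes;
	if (mult == 0)
		mult = 1;

	return (opt_io_bytes / block_size) * mult;
}
```

With 4kB blocks and a 256kB minimum, a device reporting 4kB or 64kB opt_io ends up with a 64-block stripe in both cases, while one reporting 1MB keeps its own 256-block stripe.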

Thanks,
-Eric