linux-ext4 - Re: [PATCH v3 4/9] ext4: fix slab-out-of-bounds in ext4_mb_find_good_group_avg_frag

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <469c58c5-1095-cb9d-bd1d-514476e262e0@huawei.com>
Date: Tue, 19 Mar 2024 18:05:53 +0800
From: Baokun Li <libaokun1@...wei.com>
To: Ojaswin Mujoo <ojaswin@...ux.ibm.com>, Jan Kara <jack@...e.cz>
CC: <linux-ext4@...r.kernel.org>, <tytso@....edu>, <adilger.kernel@...ger.ca>,
	<ritesh.list@...il.com>, <adobriyan@...il.com>,
	<linux-kernel@...r.kernel.org>, <yi.zhang@...wei.com>,
	<yangerkun@...wei.com>, <stable@...r.kernel.org>, Baokun Li
	<libaokun1@...wei.com>
Subject: Re: [PATCH v3 4/9] ext4: fix slab-out-of-bounds in
 ext4_mb_find_good_group_avg_frag_lists()

On 2024/3/18 20:39, Ojaswin Mujoo wrote:
> On Thu, Mar 14, 2024 at 10:09:01PM +0800, Baokun Li wrote:
>> --- a/fs/ext4/mballoc.c
>> +++ b/fs/ext4/mballoc.c
>> @@ -831,6 +831,8 @@ static int mb_avg_fragment_size_order(struct super_block *sb, ext4_grpblk_t len)
>>      return 0;
>>    if (order == MB_NUM_ORDERS(sb))
>>      order--;
>> + if (WARN_ON_ONCE(order > MB_NUM_ORDERS(sb)))
>> +   order = MB_NUM_ORDERS(sb) - 1;
> Hey Baokun,
Hi Ojaswin,
>
> Thanks for fixing this. This patch looks good to me, feel free to add:
>
> Reviewed-by: Ojaswin Mujoo <ojaswin@...ux.ibm.com>
Thanks for the review!
> my comments after this are less about the patch and more about some
> thoughts on the working of average fragment lists.
>
> So going through the v2 and this patch got me thinking about what really
> is going to happen when a user tries to allocate 32768 blocks which is also
> the maximum value we could have in say ac->ac_g_ex.fe_len.
>
> When this happens, ext4_mb_regular_allocator() will directly set the
> criteria as CR_GOAL_LEN_FAST. Now, we'll follow:
>
> ext4_mb_choose_next_group_goal_fast()
>    for (i=mb_avg_fragment_size_order(); i < MB_NUM_ORDERS; i++) { .. }
>
> Here, mb_avg_fragment_siz_order() will do something like:
>
>    order = fls(32768) - 2 = 14
>    ...
>    if (order == MB_NUM_ORDERS(sb))
>      order--;
>
>    return order;
>
> And we'll look in the fragment list[13] and since none of the groups
> there would have 32768 blocks free (since we dont track it here) we'll
> unnecessarily traverse the full list before falling to CR_BEST_AVAIL_LEN
> (this will become a noop due to the way order and min_order
> are calculated) and eventually to CR_GOAL_LEN_SLOW where we might get
> something or end up splitting.
That's not quite right, in ext4_mb_choose_next_group_goal_fast() even
though we're looking for the group with order 13, the group with 32768
free blocks is also in there. So after passing ext4_mb_good_group() in
ext4_mb_find_good_group_avg_frag_lists(), we get a group with 32768
free blocks. And in ext4_mb_choose_next_group_best_avail() we were
supposed to allocate blocks quickly by trim order, so it's necessary
here too. So there are no unnecessary loops here.

But this will trigger the freshly added WARN_ON_ONCE, so in the
new iteration I need to change it to:

if (WARN_ON_ONCE(order > MB_NUM_ORDERS(ac->ac_sb) + 1))
         order = MB_NUM_ORDERS(ac->ac_sb) - 1;

In addition, when the block size is 4k, there are these limitations:

1) Limit the maximum size of the data allocation estimate to 8M in
     ext4_mb_normalize_request().
2) #define MAX_WRITEPAGES_EXTENT_LEN 2048
3) #define DIO_MAX_BLOCKS 4096
4) Metadata is generally not allocated in many blocks at a time

So it seems that only group_prealloc will allocate more than 2048
blocks at a time.

And I've tried removing those 8M/2048/4096 limits before, but the
performance of DIO write barely changed, and it doesn't look like
the performance bottleneck is here in the number of blocks allocated
at a time at the moment.

Thanks,
Baokun
> I think something more optimal would be to:
>
> 1. Add another entry to average fragment lists for completely empty
> groups. (As a sidenote i think we should use something like MB_NUM_FRAG_ORDER
> instead of MB_NUM_ORDERS in calculating limits related to average
> fragment lists since the NUM_ORDERS seems to be the buddy max order ie
> 8192 blocks only valid for CR_POWER2 and shouldn't really limit the
> fragment size lists)
>
> 2. If we don't want to go with 1 (maybe there's some history for that),
> then probably should exit early from CR_GOAL_LEN_FAST so that we don't
> iterate there.
>
> Would like to hear your thoughts on it Baokun, Jan.
>
> Regards,
> ojaswin
>