Date:	Thu, 23 Apr 2009 15:02:05 -0700
From:	Curt Wohlgemuth <curtw@...gle.com>
To:	Andreas Dilger <adilger@....com>
Cc:	ext4 development <linux-ext4@...r.kernel.org>
Subject: Re: Question on block group allocation

Hi Andreas:

On Thu, Apr 23, 2009 at 12:08 PM, Andreas Dilger <adilger@....com> wrote:
> On Apr 23, 2009  09:41 -0700, Curt Wohlgemuth wrote:
>> I'm seeing a performance problem on ext4 vs ext2, and in trying to
>> narrow it down, I've got a question about block allocation in ext4
>> that I'm having trouble figuring out.
>>
>> Using dd, I created (in this order) two 4GB files and a 10GB file in
>> the mount directory.
>>
>> The extent blocks are reasonably close together for the two 4GB files,
>> but the extents for the 10GB file show a huge gap, which seems to hurt
>> the random read performance pretty substantially.  Here's the output
>> from debugfs:
>>
>> BLOCKS:
>> (IND):8396832, (0-106495):8282112-8388607,
>> (106496-399359):11241472-11534335, (399360-888831):20482048-20971519,
>> (888832-1116159):23889920-24117247,
>> (1116160-1277951):71665664-71827455, (1277952-1767423):78678016-79167487,
>> (1767424-2125823):102402048-102760447,
>> (2125824-2148351):102768672-102791199,
>> (2148352-2621439):102793216-103266303
>> TOTAL: 2621441
>>
>> Note the gap between blocks 79167487 and 102402048.
>
> Well, there are other even larger gaps for other chunks of the file.

Really?  Not that it's important, but I'm not seeing them...
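
For reference, the gaps can be checked mechanically from the debugfs listing. The following is a minimal C sketch, not part of the original exchange: the struct and function names are made up for illustration, and the ranges are copied by hand from the BLOCKS output above with the (IND) block left out. By this arithmetic the largest hole follows the extent ending at 24117247 (47548416 blocks before 71665664), roughly twice the 23234560-block hole before 102402048.

```c
#include <assert.h>
#include <stddef.h>

/* One extent from the debugfs listing: first and last physical block.
 * (Hypothetical helper type, not an ext4 structure.) */
struct extent {
    unsigned long long start;
    unsigned long long end;
};

/* Physical ranges of the 10GB file, in logical order, as quoted above. */
static const struct extent file_extents[] = {
    {   8282112ULL,   8388607ULL },
    {  11241472ULL,  11534335ULL },
    {  20482048ULL,  20971519ULL },
    {  23889920ULL,  24117247ULL },
    {  71665664ULL,  71827455ULL },
    {  78678016ULL,  79167487ULL },
    { 102402048ULL, 102760447ULL },
    { 102768672ULL, 102791199ULL },
    { 102793216ULL, 103266303ULL },
};

#define NUM_EXTENTS (sizeof(file_extents) / sizeof(file_extents[0]))

/* Number of blocks skipped between extent i and extent i + 1.
 * Assumes the extents are sorted by ascending physical block. */
static unsigned long long gap_after(const struct extent *ext, size_t i)
{
    return ext[i + 1].start - ext[i].end - 1;
}

/* Index of the extent that is followed by the largest gap. */
static size_t largest_gap_index(const struct extent *ext, size_t n)
{
    size_t best = 0;

    assert(n >= 2); /* at least one gap must exist */
    for (size_t i = 1; i + 1 < n; i++)
        if (gap_after(ext, i) > gap_after(ext, best))
            best = i;
    return best;
}
```

Calling largest_gap_index(file_extents, NUM_EXTENTS) and passing the result to gap_after() gives the position and size of the biggest hole.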

>> I was lucky enough to capture the mb_history from this 10GB create:
>>
>> 29109 14       735/30720/32758@...4112 735/30720/2048@...4112
>> 735/30720/2048@...4112  1     0     0  1568  M     0     0
>> 29109 14       736/0/32758@...6160     736/0/2048@...6160
>> 2187/2048/2048@...6160  1     1     0  1568        0     0
>> 29109 14       2187/4096/32758@...8208 2187/4096/2048@...8208
>> 2187/4096/2048@...8208  1     0     0  1568  M     2048  4096
>>
>> I've been staring at ext4_mb_regular_allocator() trying to understand
>> why an allocation with a goal block of 736 ends up with a best found
>> extent group of 2187, and I'm stuck -- at least without a lot of
>> printk messages.  It seems to me that we just cycle through the block
>> groups starting with the goal group until we find a group that fits.
>> Again, according to dumpe2fs, block groups 737, 738, 739, ... all have
>> 32768 free blocks.  So why we end up with a best fit group of 2187 is
>> a mystery to me.
>
> This is likely the "uninit_bg" feature that is causing the allocations
> to skip groups which are marked BLOCK_UNINIT.  In some sense the benefit
> of skipping the block bitmap read during e2fsck is probably not at all
> beneficial compared to the cost of the extra seeking during IO.  As the
> filesystem gets more full, the BLOCK_UNINIT flags would be cleared anyways,
> so we might as well just keep the early allocations contiguous.

Ah, thanks!  That's what I was missing.  Yes, I sort of skipped over
the "is this a good group?" question.
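
To make the effect concrete, here is a simplified user-space model of the scan, written as a sketch under stated assumptions rather than as the kernel code: every toy_* name is hypothetical, the group layout is invented, and only two criteria passes are modelled. It mimics the idea that the first pass (cr == 0) walks the groups from the goal group onward and rejects any group flagged as uninitialized, so a goal in an empty but uninitialized region can land far away.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a block group descriptor (hypothetical). */
struct toy_group {
    unsigned free_blocks;
    int block_uninit;   /* models the EXT4_BG_BLOCK_UNINIT flag */
};

/* Models the "is this a good group?" test: at cr == 0 an uninitialized
 * group is rejected when skip_uninit is set, even if it is empty. */
static int toy_good_group(const struct toy_group *g, unsigned want,
                          int cr, int skip_uninit)
{
    if (g->free_blocks < want)
        return 0;
    if (cr == 0 && skip_uninit && g->block_uninit)
        return 0;
    return 1;
}

/* Walk all groups starting at the goal group, wrapping around, once per
 * criteria pass. Returns the chosen group, or -1 if none fits. */
static long toy_find_group(const struct toy_group *groups, size_t ngroups,
                           size_t goal, unsigned want, int skip_uninit)
{
    assert(ngroups > 0 && goal < ngroups);
    for (int cr = 0; cr < 2; cr++) {
        for (size_t i = 0; i < ngroups; i++) {
            size_t group = (goal + i) % ngroups;

            if (toy_good_group(&groups[group], want, cr, skip_uninit))
                return (long)group;
        }
    }
    return -1;
}

/* Invented layout: group 0 is full, groups 1-4 are empty but still
 * uninitialized, group 5 is empty and initialized. Goal is group 1. */
static long toy_demo(int skip_uninit)
{
    const struct toy_group groups[] = {
        { 0,     0 },
        { 32768, 1 },
        { 32768, 1 },
        { 32768, 1 },
        { 32768, 1 },
        { 32768, 0 },
    };

    return toy_find_group(groups, sizeof(groups) / sizeof(groups[0]),
                          1, 2048, skip_uninit);
}
```

With skip_uninit set, toy_demo() jumps past the four uninitialized groups to group 5; with it cleared, which is roughly what the patch below is meant to test, the allocation stays in the goal group.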

> A simple change to verify this would be something like the following,
> but it hasn't actually been tested.

Tell you what:  I'll try this out and see if it helps out my test case.

Thanks,
Curt

>
> --- ./fs/ext4/mballoc.c.uninit    2009-04-08 19:13:13.000000000 -0600
> +++ ./fs/ext4/mballoc.c 2009-04-23 13:02:22.000000000 -0600
> @@ -1742,10 +1723,6 @@ static int ext4_mb_good_group(struct ext
>        switch (cr) {
>        case 0:
>                BUG_ON(ac->ac_2order == 0);
> -               /* If this group is uninitialized, skip it initially */
> -               desc = ext4_get_group_desc(ac->ac_sb, group, NULL);
> -               if (desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT))
> -                       return 0;
>
>                bits = ac->ac_sb->s_blocksize_bits + 1;
>                for (i = ac->ac_2order; i <= bits; i++)
> @@ -2039,9 +2035,7 @@ repeat:
>                        ac->ac_groups_scanned++;
>                        desc = ext4_get_group_desc(sb, group, NULL);
> -                       if (cr == 0 || (desc->bg_flags &
> -                               cpu_to_le16(EXT4_BG_BLOCK_UNINIT) &&
> -                               ac->ac_2order != 0))
> +                       if (cr == 0)
>                                ext4_mb_simple_scan_group(ac, &e4b);
>                        else if (cr == 1 &&
>                                        ac->ac_g_ex.fe_len == sbi->s_stripe)
>
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@...r.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>