linux-ext4 - Re: [PATCH v3] Add largedir feature

Message-ID: <20170702233056.oommhqsip46fruuc@thunk.org>
Date:   Sun, 2 Jul 2017 19:30:56 -0400
From:   Theodore Ts'o <tytso@....edu>
To:     Благодаренко Артём 
        <artem.blagodarenko@...il.com>
Cc:     linux-ext4@...r.kernel.org
Subject: Re: [PATCH v3] Add largedir feature

Some more information about the failure that I'm seeing.

It reproduces *extremely* reliably using:

   gce-xfstests -c lustre_mds generic/027

I'm testing on the ext4 dev branch, and it only shows up with the
largedir setup.  The test in question creates lots of 1k files in
separate directories to hit ENOSPC.  So I'm guessing it's some kind of
problem in the error handling path.

From looking at the console logs, it looks like things are coming to a
dead halt due to a blocked wait_on_buffer() in jbd2_write_superblock()
in the commit thread.  Everything else ends up waiting for the commit
to finish, and it's all she wrote.

The generic/027 test passes on the 4k and 1k configurations.  It also
passes when run under kvm-xfstests with the same parameters, so it's
likely there is some kind of timing component as well.

I started doing some more digging, and it looks like it has nothing to
do with largedir.  Instead it seems to be something weird with
lazy_itable initialization.   This works fine:

/sbin/mkfs.ext4 -F -b 4096 /dev/mapper/xt-vdc 65536
mount /dev/mapper/xt-vdc /xt-vdc
sleep 1 ; df ; sleep 1
umount /xt-vdc

Replace the first mkfs command with:

/sbin/mkfs.ext4 -F -I 2048 -b 4096 /dev/mapper/xt-vdc 65536

and the system locks up in the same way as generic/027 when run using
the lustre_mds configuration.

Replace the first mkfs with:

/sbin/mkfs.ext4 -F -I 2048 -b 4096 -E lazy_itable_init=0 /dev/mapper/xt-vdc 65536

and there are no problems.  So, it looks like it's some combination of
using a 2048 inode size and lazy itable initialization.
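
For anyone who wants to try the three variants side by side, here is a
rough sketch of the sequence above wrapped in a small script.  The
helper names (run, try_variant) and the DEV, MNT and DRY_RUN variables
are made up for illustration; the commands themselves are the ones
quoted in this message.  It defaults to a dry run that only prints the
commands, since the second variant is the one that locked up my test
system.

```shell
#!/bin/sh
# Sketch only: replays the mkfs/mount/df/umount sequence from this message.
# DEV, MNT and DRY_RUN are illustrative knobs, not part of any real tool.
DEV=${DEV:-/dev/mapper/xt-vdc}
MNT=${MNT:-/xt-vdc}
DRY_RUN=${DRY_RUN:-1}     # 1 = just print the commands, 0 = really run them

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

try_variant() {
    # $1 holds extra mkfs options; left unquoted on purpose so it word-splits
    run /sbin/mkfs.ext4 -F $1 -b 4096 "$DEV" 65536
    run mount "$DEV" "$MNT"
    run sleep 1
    run df
    run sleep 1
    run umount "$MNT"
}

try_variant ""                               # reported above as working
try_variant "-I 2048"                        # reported above as locking up
try_variant "-I 2048 -E lazy_itable_init=0"  # reported above as working
```

With DRY_RUN=0 this needs root and will destroy whatever is on DEV, and
if the hang reproduces the third variant will never get to run, so in
practice you would want to run one variant per boot.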

I haven't figured out whether this is a recent regression, or whether
it is something older that we're only noticing now.  It also seems to
be related to some SCSI tag aborts that we aren't seeing elsewhere, so
it may have to do with how we are issuing discards.  Whether this is a
GCE issue, or something which doesn't show up under KVM because the
KVM I am using handles discards differently, is another unknown.  But
I thought I would at least ease your mind that this doesn't seem to be
specifically a largedir issue.

Cheers,

						- Ted

Download attachment "console.out.gz" of type "application/gzip" (20510 bytes)