Message-ID: <20200925172232.GA2180331@carbon.dhcp.thefacebook.com>
Date:   Fri, 25 Sep 2020 10:22:32 -0700
From:   Roman Gushchin <guro@...com>
To:     Shakeel Butt <shakeelb@...gle.com>
CC:     Ming Lei <ming.lei@...hat.com>, "Theodore Y. Ts'o" <tytso@....edu>,
        Jens Axboe <axboe@...nel.dk>, <linux-ext4@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "open list:BLOCK LAYER" <linux-block@...r.kernel.org>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        Linux MM <linux-mm@...ck.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Johannes Weiner <hannes@...xchg.org>,
        Vlastimil Babka <vbabka@...e.cz>
Subject: Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling
 into blk_mq_get_driver_tag

On Fri, Sep 25, 2020 at 09:47:43AM -0700, Shakeel Butt wrote:
> On Fri, Sep 25, 2020 at 9:32 AM Shakeel Butt <shakeelb@...gle.com> wrote:
> >
> > On Fri, Sep 25, 2020 at 9:19 AM Ming Lei <ming.lei@...hat.com> wrote:
> > >
> > > On Fri, Sep 25, 2020 at 03:31:45PM +0800, Ming Lei wrote:
> > > > On Thu, Sep 24, 2020 at 09:13:11PM -0400, Theodore Y. Ts'o wrote:
> > > > > On Thu, Sep 24, 2020 at 10:33:45AM -0400, Theodore Y. Ts'o wrote:
> > > > > > HOWEVER, thanks to a hint from a colleague at $WORK, and realizing
> > > > > > that one of the stack traces had virtio balloon in the trace, I
> > > > > > realized that when I switched the GCE VM type from e1-standard-2 to
> > > > > > n1-standard-2 (where e1 VM's are cheaper because they use
> > > > > > virtio-balloon to better manage host OS memory utilization), problem
> > > > > > has become, much, *much* rarer (and possibly has gone away, although
> > > > > > I'm going to want to run a lot more tests before I say that
> > > > > > conclusively) on my test setup.  At the very least, using an n1 VM
> > > > > > (which doesn't have virtio-balloon enabled in the hypervisor) is
> > > > > > enough to unblock ext4 development.
> > > > >
> > > > > .... and I spoke too soon.  A number of runs using -rc6 are now
> > > > > > failing even with the n1-standard-2 VM, so virtio-balloon may not be an
> > > > > indicator.
> > > > >
> > > > > This is why debugging this is frustrating; it is very much a heisenbug
> > > > > --- although 5.8 seems to work completely reliably, as does commits
> > > > > before 37f4a24c2469.  Anything after that point will show random
> > > > > failures.  :-(
> > > >
> > > > It does not make sense to mention 37f4a24c2469, which is reverted in
> > > > > 4e2f62e566b5. Later the patch in 37f4a24c2469 is fixed and re-committed
> > > > as 568f27006577.
> > > >
> > > > However, I can _not_ reproduce the issue by running the same test on
> > > > kernel built from 568f27006577 directly.
> > > >
> > > > Also you have confirmed that the issue can't be fixed after reverting
> > > > 568f27006577 against v5.9-rc4.
> > > >
> > > > > It looks like the real issue (slab list corruption) was introduced between
> > > > > 568f27006577 and v5.9-rc4.
> > >
> > > git bisect shows the first bad commit:
> > >
> > >         [10befea91b61c4e2c2d1df06a2e978d182fcf792] mm: memcg/slab: use a single set of
> > >                 kmem_caches for all allocations
> > >
> > > And I have double checked that the above commit is really the first bad
> > > commit for the list corruption issue of 'list_del corruption, ffffe1c241b00408->next
> > > is LIST_POISON1 (dead000000000100)', see the detailed stack trace and
> > > kernel oops log in the following link:
> > >
> > >         https://lore.kernel.org/lkml/20200916202026.GC38283@mit.edu/
> >
> > The failure signature is similar to
> > https://lore.kernel.org/lkml/20200901075321.GL4299@shao2-debian/
> >
> > >
> > > And the kernel config is the one(without KASAN) used by Theodore in GCE VM, see
> > > the following link:
> > >
> > >         https://lore.kernel.org/lkml/20200917143012.GF38283@mit.edu/
> > >
> > > The reproducer is xfstests generic/038. In my setting, test device is virtio-scsi, and
> > > scratch device is virtio-blk.
> 
> Is it possible to check SLUB as well to confirm that the issue is only
> happening on SLAB?

Can you also please check whether passing cgroup.memory=nokmem as a boot argument
fixes the issue?

Thanks!
