lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAHk-=wgbfGCKKn=RTX9gF9Q+FYOGn2kD1OLnQuJSD+A0Rvg9Pw@mail.gmail.com>
Date:   Wed, 23 Sep 2020 14:50:36 -0700
From:   Linus Torvalds <torvalds@...ux-foundation.org>
To:     Gerald Schaefer <gerald.schaefer@...ux.ibm.com>
Cc:     Peter Xu <peterx@...hat.com>, Heiko Carstens <hca@...ux.ibm.com>,
        Qian Cai <cai@...hat.com>,
        Alexander Gordeev <agordeev@...ux.ibm.com>,
        Vasily Gorbik <gor@...ux.ibm.com>,
        Christian Borntraeger <borntraeger@...ibm.com>,
        linux-s390 <linux-s390@...r.kernel.org>,
        Linux-MM <linux-mm@...ck.org>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: BUG: Bad page state in process dirtyc0w_child

On Wed, Sep 23, 2020 at 2:33 PM Gerald Schaefer
<gerald.schaefer@...ux.ibm.com> wrote:
>
> Thanks, very nice walk-through, need some time to digest this. The TLB
> aspect is interesting, and we do have our own __tlb_remove_page_size(),
> which directly calls free_page_and_swap_cache() instead of the generic
> batched approach.

So I don't think it's the free_page_and_swap_cache() itself that is the problem.

As mentioned, the actual pages themselves should be handled by the
reference counting being atomic.

The interrupt disable is really about just the page *tables* being
free'd - not the final page level.

So the issue is that at least on x86-64, we have the serialization
that we will only free the page tables after a cross-CPU IPI has
flushed the TLB.

I think s390 just RCU-free's the page tables instead, which should fix it.

So I think this is special, and s390 is very different from x86, but I
don't think it's the problem.

In fact, I think you pinpointed the real issue:

> Meanwhile, out of curiosity, while I still fail to comprehend commit
> 09854ba94c6a ("mm: do_wp_page() simplification") in its entirety, there
> is one detail that I find most confusing: the unlock_page() has moved
> behind the wp_page_reuse(), while it was the other way round before.

You know what? That was just a mistake, and I think you may actually
have hit the real cause of the problem.

It means that we keep the page locked until after we do the
pte_unmap_unlock(), so now we have no guarantees that we hold the page
referecne.

And then we unlock it - while somebody else might be freeing it.

So somebody is freeing a locked page just as we're unlocking it, and
that matches the problem you see exactly: the debug thing will hit
because the last free happened while locked, and then by the time the
printout happens it has become unlocked so it doesn't show any more.

Duh.

Would you mind testing just moving the unlock_page() back to before
the wp_page_reuse()?

Does that make your debug check go away?

              Linus

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ