Message-ID: <CAHk-=wiuVAHSqQdGX837J279WaXXmOk883PmYojRihYC4oCLNg@mail.gmail.com>
Date:   Mon, 31 Oct 2022 10:19:43 -0700
From:   Linus Torvalds <torvalds@...ux-foundation.org>
To:     Peter Zijlstra <peterz@...radead.org>
Cc:     Nadav Amit <nadav.amit@...il.com>, Jann Horn <jannh@...gle.com>,
        John Hubbard <jhubbard@...dia.com>, X86 ML <x86@...nel.org>,
        Matthew Wilcox <willy@...radead.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        kernel list <linux-kernel@...r.kernel.org>,
        Linux-MM <linux-mm@...ck.org>,
        Andrea Arcangeli <aarcange@...hat.com>,
        "Kirill A . Shutemov" <kirill.shutemov@...ux.intel.com>,
        jroedel@...e.de, ubizjak@...il.com,
        Alistair Popple <apopple@...dia.com>
Subject: Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment

On Mon, Oct 31, 2022 at 2:29 AM Peter Zijlstra <peterz@...radead.org> wrote:
>
> On Sun, Oct 30, 2022 at 03:47:36PM -0700, Linus Torvalds wrote:
>
> > + * This is the simplified form of page_remove_rmap(), that only
> > + * deals with last-level pages, so 'compound' is always false,
> > + * and the caller does 'munlock_vma_page(page, vma, compound)'
> > + * separately.
> > + *
> > + * This allows for a much simpler calling convention and code.
> > + *
> > + * The caller holds the pte lock.
> > + */
> > +void page_zap_pte_rmap(struct page *page)
> > +{
>
> One could consider adding something like:
>
> #ifdef USE_SPLIT_PTE_PTLOCKS
>         lockdep_assert_held(ptlock_ptr(page))
> #endif

Yes, except that the pte lock requirement goes away in the next few
patches and gets replaced by just using the irq-safe
dec_lruvec_page_state() instead, so it's not really worth it.
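
In other words, something like this sketch - inferred from the above,
not the literal later patch, so the final form in the series may
differ:

        void page_zap_pte_rmap(struct page *page)
        {
                /* page still mapped by someone else? */
                if (!atomic_add_negative(-1, &page->_mapcount))
                        return;

                lock_page_memcg(page);
                /*
                 * The irq-safe accounting update no longer relies on
                 * the caller holding the pte lock (which is what made
                 * the __dec variant safe to use).
                 */
                dec_lruvec_page_state(page,
                        PageAnon(page) ? NR_ANON_MAPPED : NR_FILE_MAPPED);
                unlock_page_memcg(page);
        }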

> > +     if (!atomic_add_negative(-1, &page->_mapcount))
> > +             return;
> > +
> > +     lock_page_memcg(page);
> > +     __dec_lruvec_page_state(page,
> > +             PageAnon(page) ? NR_ANON_MAPPED : NR_FILE_MAPPED);
> > +     unlock_page_memcg(page);
> > +}
>
> Took me a little while, but yes, .compound=false seems to reduce to
> this.

Yeah - it's why I kept that thing as three separate patches, because
even if each of the patches isn't "immediately obvious", you can at
least go back and follow along and see what I did.

The *full* simplification end result just looks like magic.

Admittedly, I think a lot of that "looks like magic" is because the
rmap code has seriously cruftified over the years.

Go back a decade, and we literally used to do pretty much exactly what
the simplified form does. The transformation to complexity hell starts
with commit 89c06bd52fb9 ("memcg: use new logic for page stat
accounting"), but just look at what it looked like before that:

  git show 89c06bd52fb9^:mm/rmap.c

gets you the state back when it was simple. And look at what it did:

        void page_remove_rmap(struct page *page)
        {
                /* page still mapped by someone else? */
                if (!atomic_add_negative(-1, &page->_mapcount))
                        return;
                ... statistics go here ..

so in the end the simplified form of this page_zap_pte_rmap() really
isn't *that* surprising.
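
Spelled out, the reduction Peter mentions looks roughly like this - a
paraphrase of the shape of the current page_remove_rmap(), not a quote
of the exact source:

        void page_remove_rmap(struct page *page,
                              struct vm_area_struct *vma, bool compound)
        {
                lock_page_memcg(page);

                /*
                 * PageHuge() and compound handling elided here: it
                 * is all dead code when compound == false and the
                 * page is a plain pte-mapped last-level page.
                 */

                /* page still mapped by someone else? */
                if (!atomic_add_negative(-1, &page->_mapcount))
                        goto out;

                __dec_lruvec_page_state(page,
                        PageAnon(page) ? NR_ANON_MAPPED : NR_FILE_MAPPED);
        out:
                unlock_page_memcg(page);
                munlock_vma_page(page, vma, compound);
        }

Drop the dead branches and hoist the munlock_vma_page() call out to
the caller, and you're left with page_zap_pte_rmap() above.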

In fact, that old code handled PageHuge() fairly naturally too, and
almost all the mess comes from the memcg accounting - and locking -
changes.

And I actually really wanted to go one step further, and then try to
batch up the page state accounting too. It's kind of stupid how we do
all that memcg locking for each page, and with this new setup we have
one nice array of pages that we *could* try to just batch things with.

The pages in normal situations *probably* mostly all share the same
memcg and node, so you could just optimistically do something like "as
long as it's the same memcg as last time, keep the lock", instead of
locking and unlocking for every single page.
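
Purely as an illustration of that idea - a hypothetical helper, not
anything in this series, assuming the zap path hands it the page
array it already has:

        static void zap_batch_rmap(struct page **pages, int nr)
        {
                struct page *locked = NULL;
                int i;

                for (i = 0; i < nr; i++) {
                        struct page *page = pages[i];

                        /* page still mapped by someone else? */
                        if (!atomic_add_negative(-1, &page->_mapcount))
                                continue;

                        /* only re-lock when the memcg changes */
                        if (!locked ||
                            page_memcg(page) != page_memcg(locked)) {
                                if (locked)
                                        unlock_page_memcg(locked);
                                lock_page_memcg(page);
                                locked = page;
                        }

                        __dec_lruvec_page_state(page, PageAnon(page) ?
                                NR_ANON_MAPPED : NR_FILE_MAPPED);
                }
                if (locked)
                        unlock_page_memcg(locked);
        }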

But just looking at it exhausted me enough that I don't think I'll go there.

Put another way: after this series, it's not the 'rmap' code that
makes me go "ugh" - it's the memcg tracking.

(But the hugepage rmap code is incredibly nasty stuff still, and I
think the whole 'compound=true' case would be better with somebody
taking a look at that case too, but that somebody won't be me).

                  Linus
