linux-kernel - Re: [PATCH 01/13] mm: Update ptep_get

Message-Id: <44A8D373-24CA-4777-AFC8-DB48F0DC4FAE@gmail.com>
Date:   Sun, 30 Oct 2022 12:34:51 -0700
From:   Nadav Amit <nadav.amit@...il.com>
To:     Linus Torvalds <torvalds@...ux-foundation.org>
Cc:     Peter Zijlstra <peterz@...radead.org>,
        Jann Horn <jannh@...gle.com>,
        John Hubbard <jhubbard@...dia.com>, X86 ML <x86@...nel.org>,
        Matthew Wilcox <willy@...radead.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        kernel list <linux-kernel@...r.kernel.org>,
        Linux-MM <linux-mm@...ck.org>,
        Andrea Arcangeli <aarcange@...hat.com>,
        "Kirill A . Shutemov" <kirill.shutemov@...ux.intel.com>,
        jroedel@...e.de, ubizjak@...il.com,
        Alistair Popple <apopple@...dia.com>
Subject: Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment

On Oct 30, 2022, at 11:19 AM, Linus Torvalds <torvalds@...ux-foundation.org> wrote:

> And page_remove_rmap() could *almost* be called later, but it does
> have code that also depends on the page table lock, although it looks
> like realistically that's just because it "knows" that means that
> preemption is disabled, so it uses non-atomic statistics update.
> 
> I say "knows" in quotes, because that's what the comment says, but it
> turns out that __mod_node_page_state() has to deal with CONFIG_RT
> anyway and does that
> 
>        preempt_disable_nested();
>        ...
>        preempt_enable_nested();
> 
> thing.
> 
> And then it wants to see the vma, although that's actually only to see
> if it's 'mlock'ed, so we could just squirrel that away.
> 
> So we *could* move page_remove_rmap() later into the TLB flush region,
> but then we would have lost the page table lock anyway, so then
> folio_mkclean() can come in regardless.
> 
> So that doesn't even help.

Well, if you combine it with the per-page-table stale TLB detection
mechanism that I proposed, I think this could work.

Reminder (feel free to skip): you would have a per-mm “completed
TLB-generation” in addition to the current one, which would be renamed to
“pending TLB-generation”. Whenever you update the page-tables in a manner
that might require a TLB flush, you would increase the “pending
TLB-generation” and save the new pending TLB-generation in the page-table’s
page-struct. All of that is done once under the page-table lock. When you
finish a TLB flush, you update the “completed TLB-generation”.

Then in page_vma_mkclean_one(), you would check if the page-table’s
TLB-generation is greater than the completed TLB-generation, which would
indicate that TLB entries for PTEs in this table might be stale. In that
case you would just flush the TLB. [ Of course you could instead just flush
if mm_tlb_flush_pending() is true, but nobody likes that mechanism: it has a
very coarse granularity, and can therefore lead to many unnecessary TLB
flushes. ]
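
To make the scheme above concrete, here is a minimal userspace sketch of
the bookkeeping. It is only a model under stated assumptions, not kernel
code: struct mm_model, struct pt_page_model and all three helpers are
hypothetical names, C11 atomics stand in for the kernel's atomic helpers,
and it assumes that a flush for generation N also covers every earlier
generation.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-mm state (not struct mm_struct). */
struct mm_model {
	_Atomic uint64_t pending_tlb_gen;   /* bumped on PTE changes that may need a flush */
	_Atomic uint64_t completed_tlb_gen; /* highest generation whose flush has finished */
};

/* Hypothetical stand-in for the page-table page's page-struct. */
struct pt_page_model {
	uint64_t tlb_gen; /* pending generation at last change; protected by the PTL */
};

/*
 * Record a PTE change that might require a TLB flush. Assumed to be called
 * once per batch, with the page-table lock held. Returns the generation
 * that the eventual flush has to cover.
 */
static uint64_t pt_note_change(struct mm_model *mm, struct pt_page_model *pt)
{
	uint64_t gen = atomic_fetch_add(&mm->pending_tlb_gen, 1) + 1;

	pt->tlb_gen = gen;
	return gen;
}

/*
 * Called after the flush for @gen has finished. The completed generation
 * only ever moves forward. Assumption: a flush for @gen also covers all
 * earlier generations.
 */
static void tlb_flush_done(struct mm_model *mm, uint64_t gen)
{
	uint64_t cur = atomic_load(&mm->completed_tlb_gen);

	while (cur < gen &&
	       !atomic_compare_exchange_weak(&mm->completed_tlb_gen, &cur, gen))
		; /* cur was reloaded by the failed exchange; retry */
}

/*
 * The check a page_vma_mkclean_one()-like path would do under the
 * page-table lock: true means TLB entries for PTEs in this table might
 * still be stale, so the caller should flush before relying on them.
 */
static bool pt_may_have_stale_tlb(struct mm_model *mm, struct pt_page_model *pt)
{
	return pt->tlb_gen > atomic_load(&mm->completed_tlb_gen);
}
```

In this model a zap path calls pt_note_change() under the PTL, drops the
lock, flushes, and then calls tlb_flush_done(). A writeback path that takes
the PTL in between sees pt_may_have_stale_tlb() return true and flushes on
its own instead of trusting the cleaned PTE.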

Indeed, there would potentially be some overhead in extreme cases, since
the cacheline that holds the mm's TLB-generation is already highly contended
in such cases. But I think it is worth it to have simple logic that allows
us to reason about correctness.

My intuition is that although you appear to be right that we can just mark
this case as “extreme case nobody cares about”, it might have, now or in the
future, other implications that are hard to predict and prevent.