linux-kernel - Re: [PATCH v2 00/35] Speculative page faults

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-Id: <20220131171434.89870a8f1ae294912e7ff19e@linux-foundation.org>
Date:   Mon, 31 Jan 2022 17:14:34 -0800
From:   Andrew Morton <akpm@...ux-foundation.org>
To:     Michel Lespinasse <michel@...pinasse.org>
Cc:     Linux-MM <linux-mm@...ck.org>, linux-kernel@...r.kernel.org,
        kernel-team@...com, Laurent Dufour <ldufour@...ux.ibm.com>,
        Jerome Glisse <jglisse@...gle.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Michal Hocko <mhocko@...e.com>,
        Vlastimil Babka <vbabka@...e.cz>,
        Davidlohr Bueso <dave@...olabs.net>,
        Matthew Wilcox <willy@...radead.org>,
        Liam Howlett <liam.howlett@...cle.com>,
        Rik van Riel <riel@...riel.com>,
        Paul McKenney <paulmck@...nel.org>,
        Song Liu <songliubraving@...com>,
        Suren Baghdasaryan <surenb@...gle.com>,
        Minchan Kim <minchan@...gle.com>,
        Joel Fernandes <joelaf@...gle.com>,
        David Rientjes <rientjes@...gle.com>,
        Axel Rasmussen <axelrasmussen@...gle.com>,
        Andy Lutomirski <luto@...nel.org>,
        Sebastian Andrzej Siewior <bigeasy@...utronix.de>
Subject: Re: [PATCH v2 00/35] Speculative page faults

On Fri, 28 Jan 2022 05:09:31 -0800 Michel Lespinasse <michel@...pinasse.org> wrote:

> Patchset summary:
> 
> Classical page fault processing takes the mmap read lock in order to
> prevent races with mmap writers. In contrast, speculative fault
> processing does not take the mmap read lock, and instead verifies,
> when the results of the page fault are about to get committed and
> become visible to other threads, that no mmap writers have been
> running concurrently with the page fault. If the check fails,
> speculative updates do not get committed and the fault is retried
> in the usual, non-speculative way (with the mmap read lock held).
> 
> The concurrency check is implemented using a per-mm mmap sequence count.
> The counter is incremented at the beginning and end of each mmap write
> operation. If the counter is initially observed to have an even value,
> and has the same value later on, the observer can deduce that no mmap
> writers have been running concurrently with it between those two times.
> This is similar to a seqlock, except that readers never spin on the
> counter value (they would instead revert to taking the mmap read lock),
> and writers are allowed to sleep. One benefit of this approach is that
> it requires no writer side changes, just some hooks in the mmap write
> lock APIs that writers already use.
> 
> The first step of a speculative page fault is to look up the vma and
> read its contents (currently by making a copy of the vma, though in
> principle it would be sufficient to only read the vma attributes that
> are used in page faults). The mmap sequence count is used to verify
> that there were no mmap writers concurrent to the lookup and copy steps.
> Note that walking rbtrees while there may potentially be concurrent
> writers is not an entirely new idea in linux, as latched rbtrees
> are already doing this. This is safe as long as the lookup is
> followed by a sequence check to verify that concurrency did not
> actually occur (and abort the speculative fault if it did).

I'm surprised that descending the rbtree locklessly doesn't flat-out
oops the kernel.  How are we assured that every pointer which is
encountered actually points at the right thing?  Against things
which tear that tree down?

> The next step is to walk down the existing page table tree to find the
> current pte entry. This is done with interrupts disabled to avoid
> races with munmap().

Sebastian, could you please comment on this from the CONFIG_PREEMPT_RT
point of view?

> Again, not an entirely new idea, as this repeats
> a pattern already present in fast GUP. Similar precautions are also
> taken when taking the page table lock.
> 
> Breaking COW on an existing mapping may require firing MMU notifiers.
> Some care is required to avoid racing with registering new notifiers.
> This patchset adds a new per-cpu rwsem to handle this situation.