linux-kernel - Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <X+PY3zQeYNnheZhT@redhat.com>
Date:   Wed, 23 Dec 2020 18:55:11 -0500
From:   Andrea Arcangeli <aarcange@...hat.com>
To:     Nadav Amit <nadav.amit@...il.com>
Cc:     Yu Zhao <yuzhao@...gle.com>, Peter Zijlstra <peterz@...radead.org>,
        Minchan Kim <minchan@...nel.org>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        Peter Xu <peterx@...hat.com>, linux-mm <linux-mm@...ck.org>,
        lkml <linux-kernel@...r.kernel.org>,
        Pavel Emelyanov <xemul@...nvz.org>,
        Mike Kravetz <mike.kravetz@...cle.com>,
        Mike Rapoport <rppt@...ux.vnet.ibm.com>,
        stable <stable@...r.kernel.org>,
        Andy Lutomirski <luto@...nel.org>,
        Will Deacon <will@...nel.org>
Subject: Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

On Wed, Dec 23, 2020 at 02:45:59PM -0800, Nadav Amit wrote:
> I think it may be reasonable.

Whatever solution used, there will be 2 users of it: uffd-wp will use
whatever technique used by clear_refs_write to avoid the
mmap_write_lock.

My favorite is Yu's patch and not the group lock anymore. The cons is
it changes the VM rules (which kind of reminds me my initial proposal
of adding a spurious tlb flush if mm_tlb_flush_pending is set, except
I didn't correctly specify it'd need to go in the page fault), but it
still appears the simplest.

> Just a proposal: At some point we can also ask ourselves whether the
> “artificial" limitation of the number of software bits per PTE should really
> limit us, or do we want to hold some additional metadata per-PTE by either
> putting it in an adjacent page (holding 64-bits of additional software-bits
> per PTE) or by finding some place in the page-struct to link to this
> metadata (and have the liberty of number of bits per PTE). One of the PTE
> software-bits can be repurposed to say whether there is “extra-metadata”
> associated with the PTE.
> 
> I am fully aware that there will be some overhead associated, but it
> can be limited to less-common use-cases.

That's a good point, so far far we didn't run out so it's not an
immediate concern. (as opposed we run out in page->flags where the
PG_tail went to some LSB).

In general kicking the can down the road sounds like the best thing to
do for those bit shortage matters, until we can't anymore at
least.. There's no gain to the kernel runtime, in doing something
generically good here (again see where PG_tail rightfully went).

So before spending RAM and CPU, we'd need to find a more compact
encoding with the bits we already have available.

This reminds me again we could double check if we could make
VM_UFFD_WP mutually exclusive with VM_SOFTDIRTY.

I wasn't sure if it could ever happen in a legit way to use both at
the same time (CRIU on a app using uffd-wp for its own internal mm
management?).

Theoretically it's too late already for it, but VM_UFFD_WP is
relatively new, if we're sure it cannot ever happen in a legit way, it
would be possible to at least evaluate/discuss it. This is an
immediate matter.

What we'll do if we later run out, is not an immediate matter instead,
because it won't make our life any simpler to resolve it now.

Thanks,
Andrea