linux-kernel - Re: [PATCH v2] mm: page_alloc: move mlocked flag clearance into free_pages

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <ZxfZ_VSeOo2Vnmmg@casper.infradead.org>
Date: Tue, 22 Oct 2024 17:59:41 +0100
From: Matthew Wilcox <willy@...radead.org>
To: Sean Christopherson <seanjc@...gle.com>
Cc: Yosry Ahmed <yosryahmed@...gle.com>,
	Roman Gushchin <roman.gushchin@...ux.dev>,
	Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org,
	Vlastimil Babka <vbabka@...e.cz>, linux-kernel@...r.kernel.org,
	stable@...r.kernel.org, Hugh Dickins <hughd@...gle.com>,
	kvm@...r.kernel.org, Paolo Bonzini <pbonzini@...hat.com>
Subject: Re: [PATCH v2] mm: page_alloc: move mlocked flag clearance into
 free_pages_prepare()

On Tue, Oct 22, 2024 at 08:39:34AM -0700, Sean Christopherson wrote:
> On Tue, Oct 22, 2024, Yosry Ahmed wrote:
> > Even if we don't want mlock() to err in this case, shouldn't we just do
> > nothing?
> 
> Ideally, yes.

Agreed.  There's no sense in having this count against the NR_MLOCK
stats, for example.

> > I see a lot of checks at the beginning of mlock_fixup() to check
> > whether we should operate on the vma, perhaps we should also check for
> > these KVM vmas?
> 
> Definitely not.  KVM may be doing something unexpected, but the VMA certainly
> isn't unique enough to warrant mm/ needing dedicated handling.
> 
> Focusing on KVM is likely a waste of time.  There are probably other subsystems
> and/or drivers that .mmap() kernel allocated memory in the same way.  Odds are
> good KVM is just the messenger, because syzkaller knows how to beat on KVM.  And
> even if there aren't any other existing cases, nothing would prevent them from
> coming along in the future.

They all need to be fixed.  How to do that is not an answer I have at
this point.  Ideally we can fix them without changing them all immediately
(but they will all need to be fixed eventually because pages will no
longer have a refcount and so get_page() will need to go away ...)

> > Trying to or maybe set VM_SPECIAL in kvm_vcpu_mmap()? I am not
> > sure tbh, but this doesn't seem right.
> 
> Agreed.  VM_DONTEXPAND is the only VM_SPECIAL flag that is remotely appropriate,
> but setting VM_DONTEXPAND could theoretically break userspace, and other than
> preventing mlock(), there is no reason why the VMA can't be expanded.  I doubt
> any userspace VMM is actually remapping and expanding a vCPU mapping, but trying
> to fudge around this outside of core mm/ feels kludgy and has the potential to
> turn into a game of whack-a-mole.

Actually, VM_PFNMAP is probably ideal.  We're not really mapping pages
here (I mean, they are pages, but they're not filesystem pages or
anonymous pages ... there's no rmap to them).  We're mapping blobs of
memory whose refcount is controlled by the vma that maps them.  We don't
particularly want to be able to splice() this memory, or do RDMA to it.
We probably do want gdb to be able to read it (... yes?) which might be
a complication with a PFNMAP VMA.

We've given a lot of flexibility to device drivers about how they
implement mmap() and I think that's now getting in the way of some
important improvements.  I want to see a simpler way of providing the
same functionality, and I'm not quite there yet.