Message-ID: <887df26d-b8bb-48df-af2f-21b220ef22e6@redhat.com>
Date: Thu, 30 Jan 2025 12:42:26 +0100
From: David Hildenbrand <david@...hat.com>
To: linux-kernel@...r.kernel.org, linux-doc@...r.kernel.org,
dri-devel@...ts.freedesktop.org, linux-mm@...ck.org,
nouveau@...ts.freedesktop.org, Andrew Morton <akpm@...ux-foundation.org>,
Jérôme Glisse <jglisse@...hat.com>,
Jonathan Corbet <corbet@....net>, Alex Shi <alexs@...nel.org>,
Yanteng Si <si.yanteng@...ux.dev>, Karol Herbst <kherbst@...hat.com>,
Lyude Paul <lyude@...hat.com>, Danilo Krummrich <dakr@...nel.org>,
David Airlie <airlied@...il.com>, Simona Vetter <simona@...ll.ch>,
"Liam R. Howlett" <Liam.Howlett@...cle.com>,
Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
Vlastimil Babka <vbabka@...e.cz>, Jann Horn <jannh@...gle.com>,
Pasha Tatashin <pasha.tatashin@...een.com>, Peter Xu <peterx@...hat.com>,
Alistair Popple <apopple@...dia.com>, Jason Gunthorpe <jgg@...dia.com>
Subject: Re: [PATCH v1 12/12] mm/rmap: keep mapcount untouched for
device-exclusive entries
On 30.01.25 11:37, Simona Vetter wrote:
> On Wed, Jan 29, 2025 at 12:54:10PM +0100, David Hildenbrand wrote:
>> Now that conversion to device-exclusive no longer performs an
>> rmap walk and the main page_vma_mapped_walk() users were taught to
>> properly handle non-swap entries, let's treat device-exclusive
>> entries just as if they were present, similar to how we already
>> handle device-private entries.
>
> So the reason for handling device-private entries in rmap is so that
> drivers can rely on try_to_migrate() and related code to invalidate all
> the various PTEs even for device-private memory. Otherwise no one
> should hit this path, at least if my understanding is correct.
Right, device-private entries probably only happen to be seen on the
migration path so far.
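For context, the driver side of that migration path looks roughly like
this (illustrative sketch only, not code from any real driver; the
function name and pfn arrays are made up, only the migrate_vma API from
<linux/migrate.h> is real):

static int drv_migrate_to_ram(struct vm_area_struct *vma,
			      unsigned long start, unsigned long end,
			      unsigned long *src_pfns,
			      unsigned long *dst_pfns, void *drv_owner)
{
	struct migrate_vma args = {
		.vma		= vma,
		.start		= start,
		.end		= end,
		.src		= src_pfns,
		.dst		= dst_pfns,
		.pgmap_owner	= drv_owner,
		.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
	};

	/*
	 * Collects and unmaps the PTEs; this is where the rmap
	 * machinery has to understand device-private entries.
	 */
	if (migrate_vma_setup(&args))
		return -EBUSY;

	/* ... allocate system RAM pages and copy into them ... */

	migrate_vma_pages(&args);
	migrate_vma_finalize(&args);
	return 0;
}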
>
> So I'm very much worried about opening a can of worms here because I think
> this adds a genuine new case to all the various callers.
To be clear: it can all already happen.
Assume you have a THP (or any mTHP today). You can easily trigger a
scenario where folio_mapcount() != 0 with active device-exclusive
entries, and rmap walks then stumble over these device-exclusive
entries and do *not* handle them properly. Note that more and more
systems are configured to just give you THP unless you explicitly
opted out early using MADV_NOHUGEPAGE.
Note that commit b756a3b5e7ea added the hunk below, which makes
page_vma_mapped_walk() walk these device-exclusive entries, but didn't
actually update the rmap walkers:
@@ -102,7 +104,8 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
 		/* Handle un-addressable ZONE_DEVICE memory */
 		entry = pte_to_swp_entry(*pvmw->pte);
 
-		if (!is_device_private_entry(entry))
+		if (!is_device_private_entry(entry) &&
+		    !is_device_exclusive_entry(entry))
 			return false;
 
 		pfn = swp_offset(entry);
That was the right thing to do, because these entries resemble
PROT_NONE entries, not migration entries or anything else that doesn't
hold a folio reference.
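Spelled out, what a page_vma_mapped_walk() caller sees for such a
mapping is roughly the following (simplified sketch, not literal kernel
code; the swapops.h helpers are the real ones):

	pte_t pteval = ptep_get(pvmw->pte);

	if (!pte_present(pteval)) {
		swp_entry_t entry = pte_to_swp_entry(pteval);

		if (is_device_exclusive_entry(entry)) {
			/*
			 * Non-present, but unlike a migration entry the
			 * folio is still referenced and logically mapped:
			 * a CPU access faults and restores a present PTE,
			 * much like PROT_NONE.
			 */
			struct page *page = pfn_swap_entry_to_page(entry);

			/* ... must be handled, not treated as unmapped ... */
		}
	}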
Fortunately, it's only the page_vma_mapped_walk() callers that need care:

mm/rmap.c: handled with this series.
mm/page_vma_mapped.c: should work already.
mm/migrate.c: does not apply.
mm/page_idle.c: likely should just skip !pte_present().
mm/ksm.c: might be fine, but likely we should just reject !pte_present().
kernel/events/uprobes.c: likely should reject !pte_present().
mm/damon/paddr.c: likely should reject !pte_present().
I briefly thought about a flag to indicate whether a
page_vma_mapped_walk() caller supports these non-present entries, but
likely just fixing the callers up is easier and cleaner.

Now that I've looked at them all, I might just write patches for them.
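For the callers that only care about present PTEs, the fix would be the
obvious pattern (sketch only, not an actual patch from this series):

	while (page_vma_mapped_walk(&pvmw)) {
		/* pte == NULL means PMD-mapped; handled via pvmw.pmd. */
		if (!pvmw.pte)
			continue;

		/*
		 * Reject anything non-present; with this series that
		 * includes device-exclusive entries, which these
		 * callers cannot do anything useful with.
		 */
		if (!pte_present(ptep_get(pvmw.pte)))
			continue;

		/* ... existing present-PTE processing ... */
	}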
>
>> This fixes swapout/migration of folios with device-exclusive entries.
>>
>> Likely there are still some page_vma_mapped_walk() callers that are not
>> fully prepared for these entries, and where we simply want to refuse
>> !pte_present() entries. They have to be fixed independently; the ones in
>> mm/rmap.c are prepared.
>
> The other worry is that maybe breaking migration is a feature, at least in
> parts.
Maybe breaking swap and migration is a feature in some reality; in this
reality it's a BUG :)
> If thp constantly reassembles a pmd entry because hey all the
> memory is contig and userspace allocated a chunk of memory to place
> atomics that alternate between cpu and gpu nicely separated by 4k pages,
> then we'll thrash around invalidating ptes to no end. So might be more
> fallout here.
khugepaged will back off once it sees a device-exclusive entry, so
collapse can only happen once everything is non-exclusive again. See
__collapse_huge_page_isolate() as an example.
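Roughly, simplified (not the literal kernel code):

	pte_t pteval = ptep_get(_pte);

	/*
	 * Device-exclusive entries are non-present swap-format PTEs,
	 * so the collapse attempt bails out before touching the folio.
	 */
	if (!pte_present(pteval)) {
		result = SCAN_PTE_NON_PRESENT;
		goto out;
	}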
It's really only page_vma_mapped_walk() callers that are affected by
this change, not any other page table walkers.
It's unfortunate that we now have to fix it all up; that original
series should never have been merged that way.
--
Cheers,
David / dhildenb