linux-kernel - Re: [RFC PATCH 1/4] mm: use current as mmu notifier's owner

Message-ID: <1e854923-c746-45ce-9f56-1c01a41992b3@redhat.com>
Date: Fri, 15 Aug 2025 10:11:41 +0300
From: Mika Penttilä <mpenttil@...hat.com>
To: Alistair Popple <apopple@...dia.com>
Cc: Jason Gunthorpe <jgg@...dia.com>, linux-mm@...ck.org,
 linux-kernel@...r.kernel.org, David Hildenbrand <david@...hat.com>,
 Leon Romanovsky <leonro@...dia.com>, Balbir Singh <balbirs@...dia.com>
Subject: Re: [RFC PATCH 1/4] mm: use current as mmu notifier's owner

On 8/15/25 08:23, Alistair Popple wrote:

> On Thu, Aug 14, 2025 at 08:45:43PM +0300, Mika Penttilä wrote:
>> On 8/14/25 20:20, Jason Gunthorpe wrote:
>>
>>> On Thu, Aug 14, 2025 at 08:00:01PM +0300, Mika Penttilä wrote:
>>>> as well as hmm test module with :
>>>>
>>>>         /*
>>>>          * Ignore invalidation callbacks for device private pages since
>>>>          * the invalidation is handled as part of the migration process.
>>>>          */
>>>>         if (range->event == MMU_NOTIFY_MIGRATE &&
>>>>             range->owner == dmirror->mdevice)
>>>>                 return true;
>>> If I recall this was about a very specific case where migration does a
>>> number of invalidations and some of the earlier ones are known to be
>>> redundant in this specific case. Redundant means it can be ignored
>>> without causing an inconsistency.
>>>
>>> Alistair would know, but I assumed this works OK because the above
>>> invalidation doesn't actually go on to free any pages but keeps them
>>> around until a later invalidation?

Thanks, Alistair, for your deep insights!

> Right, the pages don't actually get freed because a reference is taken on them
> during migrate_vma_setup(). However other device MMUs still need invalidating
> because the driver will go on to copy the page after this step. It's just
> assumed that the driver is able to be consistent with itself (i.e. it will unmap/
> invalidate its own MMU prior to initiating the copy).

A reference is also taken in the migrate-on-fault case, during hmm_range_fault(),
when migrating.

>
> In practice I suspect what Mika is running into is that the page table
> synchronisation for migration works slightly differently for migrate_vma_*().
>
> Instead of using mmu_interval_notifier's which have a sequence number drivers
> typically use normal mmu_notifier's and take a device specific lock to block
> page table downgrades (eg. RW -> RO). This ensures it's safe to update the
> device page tables with the PFNs/permissions collected in migrate_vma_setup()
> (or the new PFN) by blocking other threads from updating the page table.
>
> The usual problem with this approach is that when migrate_vma_setup() calls
> the mmu_notifier it deadlocks on the device specific lock in the notifier
> callback because it already holds the lock, which it can't drop before calling
> migrate_vma_setup().
>
> I think one of the main benefits of a series which consolidates these two
> page-table mirroring techniques into common code would also be to make the
> mirroring/invalidation logic the same for migration as hmm_range_fault(). I.e. to
> move to mmu_interval notifiers with sequence numbers for migration, perhaps with
> filtering if required/safe and retries.

Yes. With migrate_vma_setup() and its collection step removed, firing the mmu notifiers
and "collecting" become an integral part of the hmm_range_fault() flow, so it is logical to
use interval notifiers for migration as well.
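
For readers less familiar with the sequence-number scheme, here is a minimal
userspace sketch of the collect-and-retry pattern that interval notifiers enable.
It is only a single-threaded toy: every model_* name is hypothetical and not a
kernel API, and the "race" is injected by hand rather than coming from another
thread.

```c
#include <stdbool.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct model_notifier {
	unsigned long seq;	/* bumped by every invalidation */
};

/* Snapshot the sequence number before collecting PFNs. */
static unsigned long model_read_begin(const struct model_notifier *n)
{
	return n->seq;
}

/* True if an invalidation happened since the snapshot was taken. */
static bool model_read_retry(const struct model_notifier *n, unsigned long seq)
{
	return n->seq != seq;
}

/* Stand-in for an invalidation callback firing on the range. */
static void model_invalidate(struct model_notifier *n)
{
	n->seq++;
}

/*
 * Collect-and-commit loop: snapshot, "collect", and commit only if no
 * invalidation raced with the collection. races_to_inject simulates
 * invalidations arriving during collection. Returns the attempt count.
 */
static int model_fault_and_map(struct model_notifier *n, int races_to_inject)
{
	int attempts = 0;

	for (;;) {
		unsigned long seq = model_read_begin(n);

		attempts++;
		/* Collection would happen here; optionally inject a race. */
		if (races_to_inject > 0) {
			model_invalidate(n);
			races_to_inject--;
		}
		if (!model_read_retry(n, seq))
			return attempts;	/* safe to commit the mapping */
	}
}
```

Each injected invalidation costs exactly one extra pass through the loop, which
is the behaviour one would hope to share between faulting and migration.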

I have removed the commit with the owner games. I studied it more, and it seems it was added
to mitigate a bug in an early version, which led me to the wrong conclusion about the root cause
of the hang. That version had an unbalanced mmu_notifier_invalidate_range_start()
after returning from hmm_range_fault() with -EBUSY (after having done a folio split).
With that fixed, driving migrate on fault using the interval notifiers seems to work well,
filtering MMU_NOTIFY_MIGRATE for the device for retries.
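
As a rough illustration of the two points above (filtering the device's own
migrate events, and keeping start/end balanced), here is a small userspace
sketch. All model_* and MODEL_* names are hypothetical stand-ins, not kernel
symbols, and the active_invalidations counter is an assumption made for the
example rather than a description of the actual kernel bookkeeping.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the notifier event types. */
enum model_event { MODEL_NOTIFY_UNMAP, MODEL_NOTIFY_MIGRATE };

struct model_range {
	enum model_event event;
	const void *owner;	/* who triggered the invalidation */
};

struct model_device {
	unsigned long seq;		/* forces a retry when bumped */
	int active_invalidations;	/* start calls not yet matched by end */
};

/* Migrate events raised by this device itself need no retry. */
static bool model_should_ignore(const struct model_device *dev,
				const struct model_range *r)
{
	return r->event == MODEL_NOTIFY_MIGRATE && r->owner == dev;
}

static void model_invalidate_start(struct model_device *dev,
				   const struct model_range *r)
{
	dev->active_invalidations++;
	if (!model_should_ignore(dev, r))
		dev->seq++;
}

/* Every start must be paired with an end, on error paths too. */
static void model_invalidate_end(struct model_device *dev)
{
	dev->active_invalidations--;
}
```

In this toy model, a start without a matching end leaves active_invalidations
non-zero forever, which is the sort of state a waiter could block on.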

>
>  - Alistair

--Mika