linux-kernel - Re: [RFC PATCH 0/4] KVM: x86/mmu: Rework marking folios dirty/accessed

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <36e1592a-e590-48a0-9a79-eeac6b41314b@redhat.com>
Date: Fri, 5 Apr 2024 08:53:49 +0200
From: David Hildenbrand <david@...hat.com>
To: Sean Christopherson <seanjc@...gle.com>
Cc: David Matlack <dmatlack@...gle.com>, Paolo Bonzini <pbonzini@...hat.com>,
 kvm@...r.kernel.org, linux-kernel@...r.kernel.org,
 David Stevens <stevensd@...omium.org>, Matthew Wilcox <willy@...radead.org>
Subject: Re: [RFC PATCH 0/4] KVM: x86/mmu: Rework marking folios
 dirty/accessed

On 05.04.24 00:02, Sean Christopherson wrote:
> On Thu, Apr 04, 2024, David Hildenbrand wrote:
>> On 04.04.24 19:31, Sean Christopherson wrote:
>>> On Thu, Apr 04, 2024, David Hildenbrand wrote:
>>>> On 04.04.24 00:19, Sean Christopherson wrote:
>>>>> Hmm, we essentially already have an mmu_notifier today, since secondary MMUs need
>>>>> to be invalidated before consuming dirty status.  Isn't the end result essentially
>>>>> a sane FOLL_TOUCH?
>>>>
>>>> Likely. As stated in my first mail, FOLL_TOUCH is a bit of a mess right now.
>>>>
>>>> Having something that makes sure the writable PTE/PMD is dirty (or
>>>> alternatively sets it dirty), paired with MMU notifiers notifying on any
>>>> mkclean would be one option that would leave handling how to handle dirtying
>>>> of folios completely to the core. It would behave just like a CPU writing to
>>>> the page table, which would set the pte dirty.
>>>>
>>>> Of course, if frequent clearing of the dirty PTE/PMD bit would be a problem
>>>> (like we discussed for the accessed bit), that would not be an option. But
>>>> from what I recall, only clearing the PTE/PMD dirty bit is rather rare.
>>>
>>> And AFAICT, all cases already invalidate secondary MMUs anyways, so if anything
>>> it would probably be a net positive, e.g. the notification could more precisely
>>> say that SPTEs need to be read-only, not blasted away completely.
>>
>> As discussed, I think at least madvise_free_pte_range() wouldn't do that.
> 
> I'm getting a bit turned around.  Are you talking about what madvise_free_pte_range()
> would do in this future world, or what madvise_free_pte_range() does today?  Because
> today, unless I'm really misreading the code, secondary MMUs are invalidated before
> the dirty bit is cleared.

Likely I missed it, sorry! I was talking about the possible future. :)

> 
> 	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm,
> 				range.start, range.end);
> 
> 	lru_add_drain();
> 	tlb_gather_mmu(&tlb, mm);
> 	update_hiwater_rss(mm);
> 
> 	mmu_notifier_invalidate_range_start(&range);
> 	tlb_start_vma(&tlb, vma);
> 	walk_page_range(vma->vm_mm, range.start, range.end,
> 			&madvise_free_walk_ops, &tlb);
> 	tlb_end_vma(&tlb, vma);
> 	mmu_notifier_invalidate_range_end(&range);
> 

Indeed, we do setup the MMU notifier invalidation. We do the start/end 
.. I was looking for PTE notifications.

I spotted the set_pte_at(), not a set_pte_at_notify() like we do in 
other code. Maybe that's not required here (digging through 
documentation I'm still left clueless).

"
set_pte_at_notify() sets the pte _after_ running the notifier. This is 
safe to start by updating the secondary MMUs, because the primary MMU 
pte invalidate must have already happened with a ptep_clear_flush() 
before set_pte_at_notify() has been invoked. ...
"

Absolutely unclear to me when we *must* to use it, or if it is. Likely 
its a pure optimization when we're *changing* a PTE.

KSM and wp_page_copy() uses it with MMU_NOTIFY_CLEAR, when replacing the 
mapped page by another page (or write-protecting an existing one). 
__replace_page() uses it with __replace_page() when replacing the mapped 
page. And migrate_vma_insert_page() uses it with MMU_NOTIFY_MIGRATE.

Other code (e.g., khugepaged) doesn't seem to use it as well.

.. so I guess it's fine? Confusing.

-- 
Cheers,

David / dhildenb