Message-ID: <a2fff462-dfe6-4979-a7b2-131c6e0b5017@redhat.com>
Date: Tue, 2 Apr 2024 20:31:35 +0200
From: David Hildenbrand <david@...hat.com>
To: David Matlack <dmatlack@...gle.com>
Cc: Sean Christopherson <seanjc@...gle.com>,
Paolo Bonzini <pbonzini@...hat.com>, kvm@...r.kernel.org,
linux-kernel@...r.kernel.org, David Stevens <stevensd@...omium.org>,
Matthew Wilcox <willy@...radead.org>
Subject: Re: [RFC PATCH 0/4] KVM: x86/mmu: Rework marking folios dirty/accessed

On 02.04.24 19:38, David Matlack wrote:
> On Wed, Mar 20, 2024 at 5:56 AM David Hildenbrand <david@...hat.com> wrote:
>>
>> On 20.03.24 01:50, Sean Christopherson wrote:
>>> Rework KVM to mark folios dirty when creating shadow/secondary PTEs (SPTEs),
>>> i.e. when creating mappings for KVM guests, instead of when zapping or
>>> modifying SPTEs, e.g. when dropping mappings.
>>>
>>> The motivation is twofold:
>>>
>>> 1. Marking folios dirty and accessed when zapping can be extremely
>>> expensive and wasteful, e.g. if KVM shattered a 1GiB hugepage into
>>> 512*512 4KiB SPTEs for dirty logging, then KVM marks the huge folio
>>> dirty and accessed for all 512*512 SPTEs.
>>>
>>> 2. x86 diverges from literally every other architecture, which updates
>>> folios when mappings are created. AFAIK, x86 is unique in that it's
>>> the only KVM arch that prefetches PTEs, so it's not quite an apples-
>>> to-apples comparison, but I don't see any reason for the dirty logic
>>> in particular to be different.
>>>
>>
>> Apologies in advance for the lengthy reply.
>>
>>
>> On "ordinary" process page tables on x86, it behaves as follows:
>>
>> 1) A page might be mapped writable but the PTE might not be dirty. Once
>> written to, HW will set the PTE dirty bit.
>>
>> 2) A page might be mapped but the PTE might not be young. Once accessed,
>> HW will set the PTE young bit.
>>
>> 3) When zapping a page (zap_present_folio_ptes), we transfer the dirty
>> PTE bit to the folio (folio_mark_dirty()), and the young PTE bit to
>> the folio (folio_mark_accessed()). The latter is done conditionally
>> only (vma_has_recency()).
>>
>> BUT, when zapping an anon folio, we don't do that, because there zapping
>> implies "gone for good" and not "content must go to a file".
>>
>> 4) When temporarily unmapping a folio for migration/swapout, we
>> primarily only move the dirty PTE bit to the folio.
>>
>>
>> GUP is different, because the PTEs might change after we pinned the page
>> and wrote to it. We don't modify the PTEs and expect the GUP user to do
>> the right thing (set dirty/accessed). For example,
>> unpin_user_pages_dirty_lock() would mark the page dirty when unpinning,
>> where the PTE might long be gone.
>>
>> So GUP does not really behave like HW access.
>>
>>
>> Secondary page tables are different from ordinary GUP, and KVM ends up
>> using GUP to some degree to simulate HW access; regarding NUMA-hinting,
>> KVM has already turned out to be very different from all other GUP
>> users. [1]
>>
>> And I recall that at some point I raised that we might want to have a
>> dedicated interface for these "mmu-notifier" based page table
>> synchronization mechanisms.
>>
>> But KVM ends up setting folio dirty/accessed flags itself, like other GUP
>> users. I do wonder if secondary page tables should be messing with folio
>> flags *at all*, and if there would be ways to do it differently using PTEs.
>>
>> We make sure to synchronize the secondary page tables to the process
>> page tables using MMU notifiers: when we write-protect/unmap a PTE, we
>> write-protect/unmap the SPTE. Yet, we handle accessed/dirty completely
>> differently.
>
> Accessed bits have the test/clear young MMU-notifiers. But I agree
> there aren't any notifiers for dirty tracking.
>
Yes, and I am questioning whether the "test" part should exist -- or
whether having an SPTE in the secondary MMU should require the accessed
bit to be set (derived from the primary MMU). (Again, see my explanation
about fake HW page table walkers.)

There might be a good reason to do it like that nowadays, so I'm only
raising it as something I was wondering about. Likely, frequent clearing
of the accessed bit would result in many PTEs in the secondary MMU getting
invalidated, requiring a new GUP-fast lookup where we would set the
accessed bit in the primary MMU PTE. But I'm not an expert on the
implications with MMU notifiers and accessed bit clearing.

> Are there any cases where the primary MMU transfers the PTE dirty bit
> to the folio _other_ than zapping (which already has an MMU-notifier
> to KVM)? If not then there might not be any reason to add a new
> notifier. Instead the contract should just be that secondary MMUs must
> also transfer their dirty bits to folios in sync with (or before) the
> primary MMU zapping its PTE.

Grepping for pte_mkclean(), there might be some cases. Many cases use
MMU notifiers, because they either clear the PTE or also remove write
permissions.

But there are madvise_free_pte_range() and
clean_record_shared_mapping_range()...->clean_record_pte(), which might
only clear the dirty bit without clearing/changing permissions and
consequently without calling MMU notifiers.

Getting a writable PTE without the dirty bit set should be possible.

So I am questioning whether a writable PTE in the secondary MMU
together with a clean PTE in the primary MMU should be a valid state.
It can exist today, and I am not sure if that's the right approach.
>
>>
>>
>> I once had the following idea, but I am not sure about all implications,
>> just wanted to raise it because it matches the topic here:
>>
>> Secondary page tables kind-of behave like "HW" access. If there is a
>> write access, we would expect the original PTE to become dirty, not the
>> mapped folio.
>
> Propagating SPTE dirty bits to folios indirectly via the primary MMU
> PTEs won't work for guest_memfd where there is no primary MMU PTE. In
> order to avoid having two different ways to propagate SPTE dirty bits,
> KVM should probably be responsible for updating the folio directly.
>
But who really cares about accessed/dirty bits for guest_memfd?

guest_memfd already wants to disable/bypass all of core-MM, so different
rules are to be expected. This discussion is all about integration with
core-MM that relies on correct dirty bits, which does not really apply
to guest_memfd.

--
Cheers,
David / dhildenb