linux-kernel - Re: [RFC PATCH 0/4] KVM: x86/mmu: Rework marking folios dirty/accessed

Message-ID: <a2fff462-dfe6-4979-a7b2-131c6e0b5017@redhat.com>
Date: Tue, 2 Apr 2024 20:31:35 +0200
From: David Hildenbrand <david@...hat.com>
To: David Matlack <dmatlack@...gle.com>
Cc: Sean Christopherson <seanjc@...gle.com>,
 Paolo Bonzini <pbonzini@...hat.com>, kvm@...r.kernel.org,
 linux-kernel@...r.kernel.org, David Stevens <stevensd@...omium.org>,
 Matthew Wilcox <willy@...radead.org>
Subject: Re: [RFC PATCH 0/4] KVM: x86/mmu: Rework marking folios
 dirty/accessed

On 02.04.24 19:38, David Matlack wrote:
> On Wed, Mar 20, 2024 at 5:56 AM David Hildenbrand <david@...hat.com> wrote:
>>
>> On 20.03.24 01:50, Sean Christopherson wrote:
>>> Rework KVM to mark folios dirty when creating shadow/secondary PTEs (SPTEs),
>>> i.e. when creating mappings for KVM guests, instead of when zapping or
>>> modifying SPTEs, e.g. when dropping mappings.
>>>
>>> The motivation is twofold:
>>>
>>>     1. Marking folios dirty and accessed when zapping can be extremely
>>>        expensive and wasteful, e.g. if KVM shattered a 1GiB hugepage into
>>>        512*512 4KiB SPTEs for dirty logging, then KVM marks the huge folio
>>>        dirty and accessed for all 512*512 SPTEs.
>>>
>>>     2. x86 diverges from literally every other architecture, which updates
>>>        folios when mappings are created.  AFAIK, x86 is unique in that it's
>>>        the only KVM arch that prefetches PTEs, so it's not quite an apples-
>>>        to-apples comparison, but I don't see any reason for the dirty logic
>>>        in particular to be different.
>>>
>>
>> Sorry in advance for the lengthy reply.
>>
>>
>> On "ordinary" process page tables on x86, it behaves as follows:
>>
>> 1) A page might be mapped writable but the PTE might not be dirty. Once
>>      written to, HW will set the PTE dirty bit.
>>
>> 2) A page might be mapped but the PTE might not be young. Once accessed,
>>      HW will set the PTE young bit.
>>
>> 3) When zapping a page (zap_present_folio_ptes), we transfer the dirty
>>      PTE bit to the folio (folio_mark_dirty()), and the young PTE bit to
>>      the folio (folio_mark_accessed()). The latter is done conditionally
>>      only (vma_has_recency()).
>>
>> BUT, when zapping an anon folio, we don't do that, because there zapping
>> implies "gone for good" and not "content must go to a file".
>>
>> 4) When temporarily unmapping a folio for migration/swapout, we
>>      primarily only move the dirty PTE bit to the folio.
>>
>>
>> GUP is different, because the PTEs might change after we pinned the page
>> and wrote to it. We don't modify the PTEs and expect the GUP user to do
>> the right thing (set dirty/accessed). For example,
>> unpin_user_pages_dirty_lock() would mark the page dirty when unpinning,
>> where the PTE might long be gone.
>>
>> So GUP does not really behave like HW access.
>>
>>
>> Secondary page tables are different from ordinary GUP, and KVM ends up
>> using GUP to some degree to simulate HW access; regarding NUMA-hinting,
>> KVM has already turned out to be very different from all other GUP users. [1]
>>
>> And I recall that at some point I raised that we might want to have a
>> dedicated interface for these "mmu-notifier"-based page table
>> synchronization mechanisms.
>>
>> But KVM ends up setting folio dirty/access flags itself, like other GUP
>> users. I do wonder if secondary page tables should be messing with folio
>> flags *at all*, and if there would be ways to do it differently using PTEs.
>>
>> We make sure to synchronize the secondary page tables to the process
>> page tables using MMU notifiers: when we write-protect/unmap a PTE, we
>> write-protect/unmap the SPTE. Yet, we handle accessed/dirty completely
>> differently.
> 
> Accessed bits have the test/clear young MMU-notifiers. But I agree
> there aren't any notifiers for dirty tracking.
> 

Yes, and I am questioning if the "test" part should exist -- or if 
having a spte in the secondary MMU should require the access bit to be 
set (derived from the primary MMU). (again, my explanation about fake HW 
page table walkers)

There might be a good reason to do it like that nowadays, so I'm only 
raising it as something I was wondering. Likely, frequent clearing of 
the access bit would result in many PTEs in the secondary MMU getting 
invalidated, requiring a new GUP-fast lookup where we would set the 
access bit in the primary MMU PTE. But I'm not an expert on the 
implications with MMU notifiers and access bit clearing.
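
To make the trade-off concrete, here is a toy userspace model of that
hypothetical rule ("an SPTE may only exist while the primary PTE is
young"). It is a sketch, not kernel code: every toy_* name is made up for
illustration, and the real GUP-fast and MMU-notifier paths are far more
involved.

```c
#include <stdbool.h>

/* Toy model, NOT kernel code: one primary PTE shadowed by one SPTE. */
struct toy_pte  { bool present; bool young; };
struct toy_spte { bool present; };

struct toy_mapping {
	struct toy_pte pte;	/* primary MMU (process page table) */
	struct toy_spte spte;	/* secondary MMU (e.g. KVM) */
	unsigned int faults;	/* secondary-MMU faults taken so far */
};

/*
 * Secondary-MMU fault: look up the primary PTE (standing in for a
 * GUP-fast lookup), set its young bit, then install the SPTE.
 */
static bool toy_secondary_fault(struct toy_mapping *m)
{
	if (!m->pte.present)
		return false;
	m->faults++;
	m->pte.young = true;
	m->spte.present = true;
	return true;
}

/*
 * Aging under the hypothetical rule: clearing the young bit in the
 * primary PTE also invalidates the SPTE, so the next guest access must
 * fault again. Returns whether the page was young.
 */
static bool toy_clear_young(struct toy_mapping *m)
{
	bool was_young = m->pte.young;

	m->pte.young = false;
	m->spte.present = false;
	return was_young;
}

/* Guest access: hits the SPTE if present, otherwise takes a fault. */
static bool toy_guest_access(struct toy_mapping *m)
{
	if (m->spte.present)
		return true;
	return toy_secondary_fault(m);
}
```

In this model every toy_clear_young() costs the guest one extra fault on
its next access, which is the overhead concern above: the more often the
access bit is cleared, the more SPTEs have to be rebuilt.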

> Are there any cases where the primary MMU transfers the PTE dirty bit
> to the folio _other_ than zapping (which already has an MMU-notifier
> to KVM)? If not then there might not be any reason to add a new
> notifier. Instead the contract should just be that secondary MMUs must
> also transfer their dirty bits to folios in sync with (or before) the
> primary MMU zapping its PTE.

Grepping for pte_mkclean(), there might be some cases. Many cases use 
MMU notifier, because they either clear the PTE or also remove write 
permissions.

But there are madvise_free_pte_range() and 
clean_record_shared_mapping_range()...->clean_record_pte(), which might 
only clear the dirty bit without clearing/changing permissions, and 
consequently without calling MMU notifiers.

Getting a writable PTE without the dirty bit set should be possible.

So I am questioning whether having a writable PTE in the secondary MMU 
with a clean PTE in the primary MMU should be valid to exist. It can 
exist today, and I am not sure if that's the right approach.
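
The state in question can be reproduced in a small userspace toy model.
This is only a sketch under simplifying assumptions: the toy_* names are
hypothetical, the "notifier" is just a direct update of the SPTE, and
nothing here reflects the actual KVM or core-MM data structures.

```c
#include <stdbool.h>

/* Toy model, NOT kernel code: one primary PTE shadowed by one SPTE. */
struct toy_pte  { bool writable; bool dirty; };
struct toy_spte { bool present; bool writable; bool dirty; };

struct toy_page {
	struct toy_pte pte;	/* primary MMU */
	struct toy_spte spte;	/* secondary MMU */
};

/* Write-protect path: the (modelled) MMU notifier keeps the SPTE in sync. */
static void toy_wrprotect(struct toy_page *p)
{
	p->pte.writable = false;
	p->spte.writable = false;
}

/*
 * Clean-only path, in the spirit of a pte_mkclean() that leaves the
 * permissions alone: no notifier is called, so the SPTE is untouched.
 */
static void toy_clean_only(struct toy_page *p)
{
	p->pte.dirty = false;
}

/* Guest write through the SPTE: only the SPTE dirty bit gets set. */
static bool toy_guest_write(struct toy_page *p)
{
	if (!p->spte.present || !p->spte.writable)
		return false;
	p->spte.dirty = true;
	return true;
}

/* The questionable state: writable SPTE on top of a clean primary PTE. */
static bool toy_writable_spte_over_clean_pte(const struct toy_page *p)
{
	return p->spte.present && p->spte.writable && !p->pte.dirty;
}
```

Starting from a dirty, writable PTE with a writable SPTE,
toy_clean_only() followed by toy_guest_write() leaves the SPTE dirty
while the primary PTE stays clean; toy_wrprotect() closes that window
because the secondary MMU gets to follow the permission change.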

> 
>>
>>
>> I once had the following idea, but I am not sure about all implications,
>> just wanted to raise it because it matches the topic here:
>>
>> Secondary page tables kind-of behave like "HW" access. If there is a
>> write access, we would expect the original PTE to become dirty, not the
>> mapped folio.
> 
> Propagating SPTE dirty bits to folios indirectly via the primary MMU
> PTEs won't work for guest_memfd where there is no primary MMU PTE. In
> order to avoid having two different ways to propagate SPTE dirty bits,
> KVM should probably be responsible for updating the folio directly.
> 

But who really cares about access/dirty bits for guest_memfd?

guest_memfd already wants to disable/bypass all of core-MM, so different 
rules are to be expected. This discussion is all about integration with 
core-MM that relies on correct dirty bits, which does not really apply 
to guest_memfd.

-- 
Cheers,

David / dhildenb