linux-kernel - Re: [RFC PATCH 0/4] KVM: x86/mmu: Rework marking folios dirty/accessed

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <42dbf562-5eab-4f82-ad77-5ee5b8c79285@redhat.com>
Date: Thu, 4 Apr 2024 17:44:18 +0200
From: David Hildenbrand <david@...hat.com>
To: Sean Christopherson <seanjc@...gle.com>
Cc: David Matlack <dmatlack@...gle.com>, Paolo Bonzini <pbonzini@...hat.com>,
 kvm@...r.kernel.org, linux-kernel@...r.kernel.org,
 David Stevens <stevensd@...omium.org>, Matthew Wilcox <willy@...radead.org>
Subject: Re: [RFC PATCH 0/4] KVM: x86/mmu: Rework marking folios
 dirty/accessed

On 04.04.24 00:19, Sean Christopherson wrote:
> On Wed, Apr 03, 2024, David Hildenbrand wrote:
>> On 03.04.24 02:17, Sean Christopherson wrote:
>>> On Tue, Apr 02, 2024, David Hildenbrand wrote:
>>> Aha!  But try_to_unmap_one() also checks that refcount==mapcount+1, i.e. will
>>> also keep the folio if it has been GUP'd.  And __remove_mapping() explicitly states
>>> that it needs to play nice with a GUP'd page being marked dirty before the
>>> reference is dropped.
>>
>>>
>>> 	 * Must be careful with the order of the tests. When someone has
>>> 	 * a ref to the folio, it may be possible that they dirty it then
>>> 	 * drop the reference. So if the dirty flag is tested before the
>>> 	 * refcount here, then the following race may occur:
>>>
>>> So while it's totally possible for KVM to get a W=1,D=0 PTE, if I'm reading the
>>> code correctly it's safe/legal so long as KVM either (a) marks the folio dirty
>>> while holding a reference or (b) marks the folio dirty before returning from its
>>> mmu_notifier_invalidate_range_start() hook, *AND* obviously if KVM drops its
>>> mappings in response to mmu_notifier_invalidate_range_start().
>>>
>>
>> Yes, I agree that it should work in the context of vmscan. But (b) is
>> certainly a bit harder to swallow than "ordinary" (a) :)
> 
> Heh, all the more reason to switch KVM x86 from (b) => (a).

:)

> 
>> As raised, if having a writable SPTE would imply having a writable+dirty
>> PTE, then KVM MMU code wouldn't have to worry about syncing any dirty bits
>> ever back to core-mm, so patch #2 would not be required. ... well, it would
>> be replaces by an MMU notifier that notifies about clearing the PTE dirty
>> bit :)
> 
> Hmm, we essentially already have an mmu_notifier today, since secondary MMUs need
> to be invalidated before consuming dirty status.  Isn't the end result essentially
> a sane FOLL_TOUCH?

Likely. As stated in my first mail, FOLL_TOUCH is a bit of a mess right now.

Having something that makes sure the writable PTE/PMD is dirty (or 
alternatively sets it dirty), paired with MMU notifiers notifying on any 
mkclean would be one option that would leave handling how to handle 
dirtying of folios completely to the core. It would behave just like a 
CPU writing to the page table, which would set the pte dirty.

Of course, if frequent clearing of the dirty PTE/PMD bit would be a 
problem (like we discussed for the accessed bit), that would not be an 
option. But from what I recall, only clearing the PTE/PMD dirty bit is 
rather rare.

I think this GUP / GUP-user setting of folio dirty bits is a bit of a 
mess, but it seems to be working so far, so ... who am I to judge :)

-- 
Cheers,

David / dhildenb