Message-ID: <e0e7e18f-6282-b95e-62d4-1f136649e2cb@amd.com>
Date: Tue, 22 Aug 2023 17:30:50 -0500
From: "Kalra, Ashish" <ashish.kalra@....com>
To: Mingwei Zhang <mizhang@...gle.com>,
Sean Christopherson <seanjc@...gle.com>,
Jacky Li <jackyli@...gle.com>
Cc: isaku.yamahata@...el.com, kvm@...r.kernel.org,
linux-kernel@...r.kernel.org, isaku.yamahata@...il.com,
Michael Roth <michael.roth@....com>,
Paolo Bonzini <pbonzini@...hat.com>, erdemaktas@...gle.com,
Sagi Shahar <sagis@...gle.com>,
David Matlack <dmatlack@...gle.com>,
Kai Huang <kai.huang@...el.com>,
Zhi Wang <zhi.wang.linux@...il.com>, chen.bo@...el.com,
linux-coco@...ts.linux.dev,
Chao Peng <chao.p.peng@...ux.intel.com>,
Ackerley Tng <ackerleytng@...gle.com>,
Vishal Annapurve <vannapurve@...gle.com>,
Yuan Yao <yuan.yao@...ux.intel.com>,
Jarkko Sakkinen <jarkko@...nel.org>,
Xu Yilun <yilun.xu@...el.com>,
Quentin Perret <qperret@...gle.com>, wei.w.wang@...el.com,
Fuad Tabba <tabba@...gle.com>
Subject: Re: [PATCH 4/8] KVM: gmem: protect kvm_mmu_invalidate_end()
On 8/21/2023 4:44 PM, Kalra, Ashish wrote:
> Hello Mingwei & Sean,
>
> On 8/18/2023 9:08 PM, Mingwei Zhang wrote:
>> +Jacky Li
>>
>> On Fri, Aug 18, 2023 at 3:45 PM Sean Christopherson
>> <seanjc@...gle.com> wrote:
>>>
>>> +Mingwei to correct me if I'm wrong
>>>
>>> On Fri, Aug 18, 2023, Ashish Kalra wrote:
>>>>
>>>> On 8/18/2023 12:55 PM, Sean Christopherson wrote:
>>>>> On Tue, Aug 15, 2023, isaku.yamahata@...el.com wrote:
>>>>>> From: Isaku Yamahata <isaku.yamahata@...el.com>
>>>>>>
>>>>>> kvm_mmu_invalidate_end() updates struct kvm::mmu_invalidate_in_progress,
>>>>>> which is protected by kvm::mmu_lock. Call kvm_mmu_invalidate_end() before
>>>>>> unlocking it, not after the unlock.
>>>>>>
>>>>>> Fixes: 8e9009ca6d14 ("KVM: Introduce per-page memory attributes")
>>>>>
>>>>> This Fixes tag is wrong. It won't matter in the long run, but it makes
>>>>> my life that much harder.
>>>>>
>>>>>> Signed-off-by: Isaku Yamahata <isaku.yamahata@...el.com>
>>>>>> ---
>>>>>> virt/kvm/kvm_main.c | 15 ++++++++++++++-
>>>>>> 1 file changed, 14 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>>>>>> index 8bfeb615fc4d..49380cd62367 100644
>>>>>> --- a/virt/kvm/kvm_main.c
>>>>>> +++ b/virt/kvm/kvm_main.c
>>>>>> @@ -535,6 +535,7 @@ struct kvm_mmu_notifier_range {
>>>>>>  	} arg;
>>>>>>  	gfn_handler_t handler;
>>>>>>  	on_lock_fn_t on_lock;
>>>>>> +	on_unlock_fn_t before_unlock;
>>>>>>  	on_unlock_fn_t on_unlock;
>>>>>
>>>>> Ugh, shame on my past me. Having on_lock and on_unlock be asymmetrical
>>>>> with respect to the lock is nasty.
>>>>>
>>>>> I would much rather we either (a) be explicit, e.g. before_(un)lock and
>>>>> after_(un)lock, or (b) have just on_(un)lock, make them symmetrical, and
>>>>> handle the SEV mess a different way.
>>>>>
>>>>> The SEV hook doesn't actually care about running immediately after unlock,
>>>>> it just wants to know if there was an overlapping memslot. It can run after
>>>>> SRCU is dropped, because even if we make the behavior more precise (right
>>>>> now it blasts WBINVD), just having a reference to memslots isn't sufficient,
>>>>> the code needs to guarantee memslots are *stable*. And that is already
>>>>> guaranteed by the notifier code, i.e. the SEV code could just reacquire SRCU.
>>>>
>>>> On a separate note here, the SEV hook blasting WBINVD is still causing
>>>> serious performance degradation issues with SNP, triggered via
>>>> AutoNUMA/numad/KSM, etc. With reference to previous discussions related
>>>> to it, we have plans to replace WBINVD with CLFLUSHOPT.
>>>
>>> Isn't the flush unnecessary when freeing shared memory? My recollection is
>>> that the problematic scenario is when encrypted memory is freed back to the
>>> host, because KVM already flushes when mapping potentially encrypted memory
>>> into the guest.
>>>
>>> With SNP+guest_memfd, private/encrypted memory should be unreachable via the
>>> hva-based mmu_notifiers. gmem should have full control of the page
>>> lifecycles, i.e. it can get the kernel virtual address as appropriate, and
>>> so SNP shouldn't need the nuclear option.
>>>
>>> E.g. something like this?
>>>
>>> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
>>> index 07756b7348ae..1c6828ae391d 100644
>>> --- a/arch/x86/kvm/svm/sev.c
>>> +++ b/arch/x86/kvm/svm/sev.c
>>> @@ -2328,7 +2328,7 @@ static void sev_flush_encrypted_page(struct kvm_vcpu *vcpu, void *va)
>>>
>>>  void sev_guest_memory_reclaimed(struct kvm *kvm)
>>>  {
>>> -	if (!sev_guest(kvm))
>>> +	if (!sev_guest(kvm) || sev_snp_guest(kvm))
>>>  		return;
>>>
>>>  	wbinvd_on_all_cpus();
>>
>> I hope this is the final solution :)
>>
>> So, short answer: no.
>>
>> SNP+guest_memfd prevents untrusted host user space from directly modifying
>> the data, which is good enough for CVE-2022-0171, but there is no guarantee
>> that the host kernel could not, in some scenarios, access the data and
>> generate dirty cache lines. In fact, AFAIC, an SNP VM does not track whether
>> each page was previously shared, does it? If a page was previously shared
>> and was written by the host kernel or devices before it was changed to
>> private, no one tracks that and the dirty cache lines are still there!
>>
>> So, to avoid any corner-case situations like the above, it seems we
>> currently have to retain the property: flush the cache when the guest memory
>> mapping leaves KVM's NPT.
>>
>> Of course, this is fundamentally because SME_COHERENT only applies to CPU
>> cores, but not DMA. If SME_COHERENT were complete, flushing would no longer
>> be needed. Alternatively, we need extra bookkeeping for KVM to know whether
>> each page has dirty cache lines. Another alternative is to filter on the
>> mmu_notifier reasons, which is the part that I am planning to take.
>> Thoughts?
>>
Additionally, looking at MMU notifier event filtering and the various code
paths (of interest) from which the MMU invalidation notifier gets invoked:
For NUMA load balancing during #PF handling, try_to_migrate_one() invokes the
MMU invalidation notifier with the MMU_NOTIFY_CLEAR event, and in the KSM code
paths, try_to_merge_one_page() -> write_protect_page() and
try_to_merge_one_page() -> replace_page() also invoke the MMU invalidation
notifier with the MMU_NOTIFY_CLEAR event.
Now, I remember from previous discussions that the CLEAR event is an
overloaded event used for page zapping, madvise, etc., so I don't think we
will be able to filter *out* this event, and this event is what triggers most
of the performance issues we are observing.
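
Just to make the filtering idea concrete, here is a minimal, purely
illustrative sketch (the helper name is made up, and it assumes the
mmu_notifier_range event were plumbed down to where the arch reclaim hook is
invoked, which is not the case today):

/*
 * Illustrative only: decide whether a notifier event can actually free
 * guest pages (and hence leave dirty cache lines behind) before calling
 * kvm_arch_guest_memory_reclaimed().  This would live in
 * virt/kvm/kvm_main.c, which already pulls in <linux/mmu_notifier.h>.
 */
static bool kvm_event_may_reclaim_memory(enum mmu_notifier_event event)
{
	switch (event) {
	case MMU_NOTIFY_PROTECTION_VMA:	/* e.g. NUMA hinting prot changes */
	case MMU_NOTIFY_SOFT_DIRTY:
		return false;		/* no pages are freed */
	default:
		return true;		/* MMU_NOTIFY_CLEAR, UNMAP, ... */
	}
}

But since MMU_NOTIFY_CLEAR would still have to be treated as "may reclaim
memory", such filtering would not help with the KSM, migration and madvise
paths above.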
So considering what Sean mentioned earlier:
> What I'm saying is that for guests whose private memory is backed by
> guest_memfd(), which is all SNP guests, it should be impossible for
> memory that is reachable via mmu_notifiers to be mapped in KVM's MMU
> as private. So yes, KVM needs to flush when memory is freed from
> guest_memfd(), but not for memory that is reclaimed by mmu_notifiers,
> i.e. not for sev_guest_memory_reclaimed().
I think the right solution for SNP guests should be:
>>> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
>>> index 07756b7348ae..1c6828ae391d 100644
>>> --- a/arch/x86/kvm/svm/sev.c
>>> +++ b/arch/x86/kvm/svm/sev.c
>>> @@ -2328,7 +2328,7 @@ static void sev_flush_encrypted_page(struct kvm_vcpu *vcpu, void *va)
>>>
>>>  void sev_guest_memory_reclaimed(struct kvm *kvm)
>>>  {
>>> -	if (!sev_guest(kvm))
>>> +	if (!sev_guest(kvm) || sev_snp_guest(kvm))
>>>  		return;
>>>
>>>  	wbinvd_on_all_cpus();
Thoughts?
Thanks,
Ashish
>
> Now running SNP+guest_memfd with discard=both option enabled:
>
> # bpftrace -e 'kprobe:sev_guest_memory_reclaimed {@[kstack]=count()}'
> Attaching 1 probe...
> ^C
>
> @[
> sev_guest_memory_reclaimed+5
> kvm_mmu_notifier_release+60
> __mmu_notifier_release+128
> exit_mmap+657
> __mmput+72
> mmput+49
> do_exit+752
> do_group_exit+57
> get_signal+2486
> arch_do_signal_or_restart+51
> exit_to_user_mode_prepare+257
> syscall_exit_to_user_mode+42
> do_syscall_64+109
> entry_SYSCALL_64_after_hwframe+114
> ]: 1
> @[
> sev_guest_memory_reclaimed+5
> kvm_mmu_notifier_invalidate_range_start+869
> __mmu_notifier_invalidate_range_start+152
> change_protection+4628
> change_prot_numa+93
> task_numa_work+588
> task_work_run+108
> exit_to_user_mode_prepare+337
> syscall_exit_to_user_mode+42
> do_syscall_64+109
> entry_SYSCALL_64_after_hwframe+114
> ]: 2
> @[
> sev_guest_memory_reclaimed+5
> kvm_mmu_notifier_invalidate_range_start+869
> __mmu_notifier_invalidate_range_start+152
> change_protection+4628
> change_prot_numa+93
> task_numa_work+588
> task_work_run+108
> xfer_to_guest_mode_handle_work+228
> kvm_arch_vcpu_ioctl_run+1572
> kvm_vcpu_ioctl+671
> __x64_sys_ioctl+153
> do_syscall_64+96
> entry_SYSCALL_64_after_hwframe+114
> ]: 2
> @[
> sev_guest_memory_reclaimed+5
> kvm_set_memslot+740
> __kvm_set_memory_region.part.0+411
> kvm_set_memory_region+89
> kvm_vm_ioctl+1482
> __x64_sys_ioctl+153
> do_syscall_64+96
> entry_SYSCALL_64_after_hwframe+114
> ]: 104
> @[
> sev_guest_memory_reclaimed+5
> kvm_mmu_notifier_invalidate_range_start+869
> __mmu_notifier_invalidate_range_start+152
> zap_page_range_single+384
> unmap_mapping_range+279
> shmem_fallocate+932
> vfs_fallocate+345
> __x64_sys_fallocate+71
> do_syscall_64+96
> entry_SYSCALL_64_after_hwframe+114
> ]: 5465
> @[
> sev_guest_memory_reclaimed+5
> kvm_mmu_notifier_invalidate_range_start+869
> __mmu_notifier_invalidate_range_start+152
> zap_page_range_single+384
> madvise_vma_behavior+1967
> madvise_walk_vmas+190
> do_madvise.part.0+264
> __x64_sys_madvise+98
> do_syscall_64+96
> entry_SYSCALL_64_after_hwframe+114
> ]: 69677
>
> The maximum hits are seen with shmem_fallocate and madvise, which we believe
> are in response to shared<->private GHCB page-state-change requests.
> discard=both handles discard for both private and shared memory, so freeing
> shared memory via fallocate(shared_memfd, FALLOC_FL_PUNCH_HOLE, ...) would
> trigger the notifiers when freeing shared pages after the guest converts a
> GPA to private.
>
> Now, since with SNP+guest_memfd guest private memory is no longer mapped in
> the host, I added a generic fix (instead of Sean's proposed patch of checking
> for an SNP guest inside sev_guest_memory_reclaimed()):
>
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -593,6 +593,9 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
>  			unsigned long hva_start, hva_end;
>
>  			slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]);
> +			if (kvm_slot_can_be_private(slot)) {
> +				continue;
> +			}
>  			hva_start = max(range->start, slot->userspace_addr);
>  			hva_end = min(range->end, slot->userspace_addr +
>  				      (slot->npages << PAGE_SHIFT));
>
> With this fix added, the traces are as follows:
>
> # bpftrace -e 'kprobe:sev_guest_memory_reclaimed {@[kstack]=count()}'
> Attaching 1 probe...
> ^C
>
> @[
> sev_guest_memory_reclaimed+5
> kvm_mmu_notifier_invalidate_range_start+812
> __mmu_notifier_invalidate_range_start+152
> change_protection+4628
> change_prot_numa+93
> task_numa_work+588
> task_work_run+108
> exit_to_user_mode_prepare+337
> syscall_exit_to_user_mode+42
> do_syscall_64+109
> entry_SYSCALL_64_after_hwframe+114
> ]: 1
> @[
> sev_guest_memory_reclaimed+5
> kvm_mmu_notifier_release+60
> __mmu_notifier_release+128
> exit_mmap+657
> __mmput+72
> mmput+49
> do_exit+752
> do_group_exit+57
> get_signal+2486
> arch_do_signal_or_restart+51
> exit_to_user_mode_prepare+257
> syscall_exit_to_user_mode+42
> do_syscall_64+109
> entry_SYSCALL_64_after_hwframe+114
> ]: 1
> @[
> sev_guest_memory_reclaimed+5
> kvm_mmu_notifier_invalidate_range_start+812
> __mmu_notifier_invalidate_range_start+152
> change_protection+4628
> change_prot_numa+93
> task_numa_work+588
> task_work_run+108
> xfer_to_guest_mode_handle_work+228
> kvm_arch_vcpu_ioctl_run+1572
> kvm_vcpu_ioctl+671
> __x64_sys_ioctl+153
> do_syscall_64+96
> entry_SYSCALL_64_after_hwframe+114
> ]:
> @[
> sev_guest_memory_reclaimed+5
> kvm_set_memslot+740
> __kvm_set_memory_region.part.0+411
> kvm_set_memory_region+89
> kvm_vm_ioctl+1482
> __x64_sys_ioctl+153
> do_syscall_64+96
> entry_SYSCALL_64_after_hwframe+114
> ]: 104
> #
>
> As expected, the SEV hook is not invoked for the guest private memory
> pages (no more invalidation from shmem_fallocate() + madvise()).
>
> Isn't it better to skip invoking the KVM MMU invalidation notifier when
> the invalidated range belongs to guest private memory?
>
> > In fact, AFAIC, an SNP VM does not track whether each page was previously
> > shared, does it? If a page was previously shared and was written by the
> > host kernel or devices before it was changed to private, no one tracks that
> > and the dirty cache lines are still there!
>
> The skipped invalidation here covers the case Mingwei mentioned above, where
> the pages are changed from private->shared and the subsequent freeing of the
> shared pages triggers the invalidation.
>
> But then why are we concerned about this? I thought our concern was the case
> where the dirty cache lines contain encrypted guest data?
>
> Thanks,
> Ashish