Message-ID: <df49bbb2-92c0-7792-ab90-e748be570b5d@amd.com>
Date: Mon, 21 Aug 2023 16:44:37 -0500
From: "Kalra, Ashish" <ashish.kalra@....com>
To: Mingwei Zhang <mizhang@...gle.com>,
Sean Christopherson <seanjc@...gle.com>,
Jacky Li <jackyli@...gle.com>
Cc: isaku.yamahata@...el.com, kvm@...r.kernel.org,
linux-kernel@...r.kernel.org, isaku.yamahata@...il.com,
Michael Roth <michael.roth@....com>,
Paolo Bonzini <pbonzini@...hat.com>, erdemaktas@...gle.com,
Sagi Shahar <sagis@...gle.com>,
David Matlack <dmatlack@...gle.com>,
Kai Huang <kai.huang@...el.com>,
Zhi Wang <zhi.wang.linux@...il.com>, chen.bo@...el.com,
linux-coco@...ts.linux.dev,
Chao Peng <chao.p.peng@...ux.intel.com>,
Ackerley Tng <ackerleytng@...gle.com>,
Vishal Annapurve <vannapurve@...gle.com>,
Yuan Yao <yuan.yao@...ux.intel.com>,
Jarkko Sakkinen <jarkko@...nel.org>,
Xu Yilun <yilun.xu@...el.com>,
Quentin Perret <qperret@...gle.com>, wei.w.wang@...el.com,
Fuad Tabba <tabba@...gle.com>
Subject: Re: [PATCH 4/8] KVM: gmem: protect kvm_mmu_invalidate_end()
Hello Mingwei & Sean,
On 8/18/2023 9:08 PM, Mingwei Zhang wrote:
> +Jacky Li
>
> On Fri, Aug 18, 2023 at 3:45 PM Sean Christopherson <seanjc@...gle.com> wrote:
>>
>> +Mingwei to correct me if I'm wrong
>>
>> On Fri, Aug 18, 2023, Ashish Kalra wrote:
>>>
>>> On 8/18/2023 12:55 PM, Sean Christopherson wrote:
>>>> On Tue, Aug 15, 2023, isaku.yamahata@...el.com wrote:
>>>>> From: Isaku Yamahata <isaku.yamahata@...el.com>
>>>>>
>>>>> kvm_mmu_invalidate_end() updates struct kvm::mmu_invalidate_in_progress,
>>>>> which is protected by kvm::mmu_lock. Call kvm_mmu_invalidate_end() before
>>>>> unlocking it, not after the unlock.
>>>>>
>>>>> Fixes: 8e9009ca6d14 ("KVM: Introduce per-page memory attributes")
>>>>
>>>> This Fixes: tag is wrong. It won't matter in the long run, but it makes my
>>>> life that much harder.
>>>>
>>>>> Signed-off-by: Isaku Yamahata <isaku.yamahata@...el.com>
>>>>> ---
>>>>> virt/kvm/kvm_main.c | 15 ++++++++++++++-
>>>>> 1 file changed, 14 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>>>>> index 8bfeb615fc4d..49380cd62367 100644
>>>>> --- a/virt/kvm/kvm_main.c
>>>>> +++ b/virt/kvm/kvm_main.c
>>>>> @@ -535,6 +535,7 @@ struct kvm_mmu_notifier_range {
>>>>>  	} arg;
>>>>>  	gfn_handler_t handler;
>>>>>  	on_lock_fn_t on_lock;
>>>>> +	on_unlock_fn_t before_unlock;
>>>>>  	on_unlock_fn_t on_unlock;
>>>>
>>>> Ugh, shame on my past me. Having on_lock and on_unlock be asymmetrical with
>>>> respect to the lock is nasty.
>>>>
>>>> I would much rather we either (a) be explicit, e.g. before_(un)lock and
>>>> after_(un)lock, or (b) have just on_(un)lock, make them symmetrical, and
>>>> handle the SEV mess a different way.
>>>>
>>>> The SEV hook doesn't actually care about running immediately after unlock, it
>>>> just wants to know if there was an overlapping memslot. It can run after SRCU
>>>> is dropped, because even if we make the behavior more precise (right now it
>>>> blasts WBINVD), just having a reference to memslots isn't sufficient; the code
>>>> needs to guarantee memslots are *stable*. And that is already guaranteed by
>>>> the notifier code, i.e. the SEV code could just reacquire SRCU.
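For reference, here is how I read that suggestion -- an untested sketch
(not a posted patch) that keeps the current, imprecise WBINVD behavior
and simply reacquires SRCU inside the hook; a more precise
implementation could walk memslots under the same SRCU read lock:

void sev_guest_memory_reclaimed(struct kvm *kvm)
{
	int idx;

	if (!sev_guest(kvm))
		return;

	/* Memslots are guaranteed stable while SRCU is held. */
	idx = srcu_read_lock(&kvm->srcu);
	wbinvd_on_all_cpus();
	srcu_read_unlock(&kvm->srcu, idx);
}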
>>>
>>> On a separate note here, the SEV hook blasting WBINVD is still causing
>>> serious performance degradation issues with SNP, triggered via
>>> AutoNUMA/numad/KSM, etc. With reference to the previous discussions on
>>> this, we plan to replace WBINVD with CLFLUSHOPT.
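To illustrate the direction, a rough, untested sketch is below. The
helper name is hypothetical; clflush_cache_range() already uses
CLFLUSHOPT when the CPU supports it, and 'va' is assumed to be a kernel
mapping of the page being reclaimed:

#include <asm/cacheflush.h>

/* Flush one guest page's cache lines instead of WBINVD on all CPUs. */
static void sev_flush_page_clflushopt(void *va)
{
	clflush_cache_range(va, PAGE_SIZE);
}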
>>
>> Isn't the flush unnecessary when freeing shared memory? My recollection is that
>> the problematic scenario is when encrypted memory is freed back to the host,
>> because KVM already flushes when mapping potentially encrypted memory into the
>> guest.
>>
>> With SNP+guest_memfd, private/encrypted memory should be unreachable via the
>> hva-based mmu_notifiers. gmem should have full control of the page lifecycles,
>> i.e. can get the kernel virtual address as appropriate, and so SNP shouldn't
>> need the nuclear option.
>>
>> E.g. something like this?
>>
>> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
>> index 07756b7348ae..1c6828ae391d 100644
>> --- a/arch/x86/kvm/svm/sev.c
>> +++ b/arch/x86/kvm/svm/sev.c
>> @@ -2328,7 +2328,7 @@ static void sev_flush_encrypted_page(struct kvm_vcpu *vcpu, void *va)
>>
>>  void sev_guest_memory_reclaimed(struct kvm *kvm)
>>  {
>> -	if (!sev_guest(kvm))
>> +	if (!sev_guest(kvm) || sev_snp_guest(kvm))
>>  		return;
>>
>>  	wbinvd_on_all_cpus();
>
> I hope this is the final solution :)
>
> So, short answer: no.
>
> SNP+guest_memfd prevents untrusted host userspace from directly
> modifying the data, which is good enough for CVE-2022-0171, but there
> is no guarantee that the host kernel could not, in some scenarios,
> access the data and generate dirty cache lines. In fact, AFAICT, an
> SNP VM does not track whether each page was previously shared, does
> it? If a page was previously shared and was written by the host kernel
> or devices before being converted to private, no one tracks that, and
> the dirty cache lines are still there!
>
> So, to avoid corner cases like the above, it seems we currently have
> to retain the property of flushing the cache when the guest memory
> mapping leaves KVM's NPT.
>
> Of course, this is fundamentally because SME_COHERENT only applies to
> CPU cores, not to DMA. If SME_COHERENT were complete, flushing would
> no longer be needed. Alternatively, we need extra bookkeeping for KVM
> to know whether each page has dirty cache lines. Another alternative
> is to filter on the mmu_notifier reason, which is the route I am
> planning to take. Thoughts?
>
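On that last alternative: filtering on the notifier reason could look
roughly like the untested sketch below, keyed off range->event. The
helper name is hypothetical, and the exact set of events that can
safely skip the flush would need auditing:

#include <linux/mmu_notifier.h>

static bool kvm_mmu_range_needs_cache_flush(const struct mmu_notifier_range *range)
{
	/*
	 * Protection changes (e.g. NUMA balancing via change_prot_numa())
	 * don't free the pages, so the SEV cache flush could be skipped.
	 */
	switch (range->event) {
	case MMU_NOTIFY_PROTECTION_VMA:
	case MMU_NOTIFY_PROTECTION_PAGE:
	case MMU_NOTIFY_SOFT_DIRTY:
		return false;
	default:
		return true;	/* unmap, clear, release, migrate, ... */
	}
}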
Now running SNP+guest_memfd with the discard=both option enabled:
# bpftrace -e 'kprobe:sev_guest_memory_reclaimed {@[kstack]=count()}'
Attaching 1 probe...
^C
@[
sev_guest_memory_reclaimed+5
kvm_mmu_notifier_release+60
__mmu_notifier_release+128
exit_mmap+657
__mmput+72
mmput+49
do_exit+752
do_group_exit+57
get_signal+2486
arch_do_signal_or_restart+51
exit_to_user_mode_prepare+257
syscall_exit_to_user_mode+42
do_syscall_64+109
entry_SYSCALL_64_after_hwframe+114
]: 1
@[
sev_guest_memory_reclaimed+5
kvm_mmu_notifier_invalidate_range_start+869
__mmu_notifier_invalidate_range_start+152
change_protection+4628
change_prot_numa+93
task_numa_work+588
task_work_run+108
exit_to_user_mode_prepare+337
syscall_exit_to_user_mode+42
do_syscall_64+109
entry_SYSCALL_64_after_hwframe+114
]: 2
@[
sev_guest_memory_reclaimed+5
kvm_mmu_notifier_invalidate_range_start+869
__mmu_notifier_invalidate_range_start+152
change_protection+4628
change_prot_numa+93
task_numa_work+588
task_work_run+108
xfer_to_guest_mode_handle_work+228
kvm_arch_vcpu_ioctl_run+1572
kvm_vcpu_ioctl+671
__x64_sys_ioctl+153
do_syscall_64+96
entry_SYSCALL_64_after_hwframe+114
]: 2
@[
sev_guest_memory_reclaimed+5
kvm_set_memslot+740
__kvm_set_memory_region.part.0+411
kvm_set_memory_region+89
kvm_vm_ioctl+1482
__x64_sys_ioctl+153
do_syscall_64+96
entry_SYSCALL_64_after_hwframe+114
]: 104
@[
sev_guest_memory_reclaimed+5
kvm_mmu_notifier_invalidate_range_start+869
__mmu_notifier_invalidate_range_start+152
zap_page_range_single+384
unmap_mapping_range+279
shmem_fallocate+932
vfs_fallocate+345
__x64_sys_fallocate+71
do_syscall_64+96
entry_SYSCALL_64_after_hwframe+114
]: 5465
@[
sev_guest_memory_reclaimed+5
kvm_mmu_notifier_invalidate_range_start+869
__mmu_notifier_invalidate_range_start+152
zap_page_range_single+384
madvise_vma_behavior+1967
madvise_walk_vmas+190
do_madvise.part.0+264
__x64_sys_madvise+98
do_syscall_64+96
entry_SYSCALL_64_after_hwframe+114
]: 69677
The maximum hits are seen with shmem_fallocate and madvise, which we
believe are in response to shared<->private GHCB page-state-change
requests. discard=both handles discard for both private and shared
memory, so freeing shared memory via fallocate(shared_memfd,
FALLOC_FL_PUNCH_HOLE, ...) triggers the notifiers when freeing shared
pages after the guest converts a GPA to private.

Now, since with SNP+guest_memfd guest private memory is no longer
mapped in the host, I added a generic fix (instead of Sean's proposed
patch of checking for an SNP guest inside sev_guest_memory_reclaimed()):
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -593,6 +593,9 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
 			unsigned long hva_start, hva_end;
 
 			slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]);
+			if (kvm_slot_can_be_private(slot)) {
+				continue;
+			}
 			hva_start = max(range->start, slot->userspace_addr);
 			hva_end = min(range->end, slot->userspace_addr +
 						  (slot->npages << PAGE_SHIFT));
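(kvm_slot_can_be_private() is the helper from the guest_memfd series
that checks whether the slot was created with the KVM_MEM_PRIVATE flag,
so this skips hva-range processing for any gmem-capable slot.)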
With this fix added, the traces are as follows:
# bpftrace -e 'kprobe:sev_guest_memory_reclaimed {@[kstack]=count()}'
Attaching 1 probe...
^C
@[
sev_guest_memory_reclaimed+5
kvm_mmu_notifier_invalidate_range_start+812
__mmu_notifier_invalidate_range_start+152
change_protection+4628
change_prot_numa+93
task_numa_work+588
task_work_run+108
exit_to_user_mode_prepare+337
syscall_exit_to_user_mode+42
do_syscall_64+109
entry_SYSCALL_64_after_hwframe+114
]: 1
@[
sev_guest_memory_reclaimed+5
kvm_mmu_notifier_release+60
__mmu_notifier_release+128
exit_mmap+657
__mmput+72
mmput+49
do_exit+752
do_group_exit+57
get_signal+2486
arch_do_signal_or_restart+51
exit_to_user_mode_prepare+257
syscall_exit_to_user_mode+42
do_syscall_64+109
entry_SYSCALL_64_after_hwframe+114
]: 1
@[
sev_guest_memory_reclaimed+5
kvm_mmu_notifier_invalidate_range_start+812
__mmu_notifier_invalidate_range_start+152
change_protection+4628
change_prot_numa+93
task_numa_work+588
task_work_run+108
xfer_to_guest_mode_handle_work+228
kvm_arch_vcpu_ioctl_run+1572
kvm_vcpu_ioctl+671
__x64_sys_ioctl+153
do_syscall_64+96
entry_SYSCALL_64_after_hwframe+114
]:
@[
sev_guest_memory_reclaimed+5
kvm_set_memslot+740
__kvm_set_memory_region.part.0+411
kvm_set_memory_region+89
kvm_vm_ioctl+1482
__x64_sys_ioctl+153
do_syscall_64+96
entry_SYSCALL_64_after_hwframe+114
]: 104
#
As expected, the SEV hook is not invoked for the guest private memory
pages (no more invalidations from shmem_fallocate() + madvise()).

Isn't it better to skip invoking the KVM MMU invalidation notifier when
the invalidated range belongs to guest private memory?
> In fact, AFAICT, an SNP VM does not track whether each page was
> previously shared, does it? If a page was previously shared and was
> written by the host kernel or devices before being converted to
> private, no one tracks that, and the dirty cache lines are still
> there!
The skipped invalidations here cover the case Mingwei mentioned above,
where pages are changed from private->shared and the subsequent freeing
of the shared pages triggers the invalidation.

But then why are we concerned about this? I thought our concern was the
case where the dirty cache lines contain encrypted guest data?
Thanks,
Ashish