Date:   Mon, 21 Aug 2023 16:44:37 -0500
From:   "Kalra, Ashish" <ashish.kalra@....com>
To:     Mingwei Zhang <mizhang@...gle.com>,
        Sean Christopherson <seanjc@...gle.com>,
        Jacky Li <jackyli@...gle.com>
Cc:     isaku.yamahata@...el.com, kvm@...r.kernel.org,
        linux-kernel@...r.kernel.org, isaku.yamahata@...il.com,
        Michael Roth <michael.roth@....com>,
        Paolo Bonzini <pbonzini@...hat.com>, erdemaktas@...gle.com,
        Sagi Shahar <sagis@...gle.com>,
        David Matlack <dmatlack@...gle.com>,
        Kai Huang <kai.huang@...el.com>,
        Zhi Wang <zhi.wang.linux@...il.com>, chen.bo@...el.com,
        linux-coco@...ts.linux.dev,
        Chao Peng <chao.p.peng@...ux.intel.com>,
        Ackerley Tng <ackerleytng@...gle.com>,
        Vishal Annapurve <vannapurve@...gle.com>,
        Yuan Yao <yuan.yao@...ux.intel.com>,
        Jarkko Sakkinen <jarkko@...nel.org>,
        Xu Yilun <yilun.xu@...el.com>,
        Quentin Perret <qperret@...gle.com>, wei.w.wang@...el.com,
        Fuad Tabba <tabba@...gle.com>
Subject: Re: [PATCH 4/8] KVM: gmem: protect kvm_mmu_invalidate_end()

Hello Mingwei & Sean,

On 8/18/2023 9:08 PM, Mingwei Zhang wrote:
> +Jacky Li
> 
> On Fri, Aug 18, 2023 at 3:45 PM Sean Christopherson <seanjc@...gle.com> wrote:
>>
>> +Mingwei to correct me if I'm wrong
>>
>> On Fri, Aug 18, 2023, Ashish Kalra wrote:
>>>
>>> On 8/18/2023 12:55 PM, Sean Christopherson wrote:
>>>> On Tue, Aug 15, 2023, isaku.yamahata@...el.com wrote:
>>>>> From: Isaku Yamahata <isaku.yamahata@...el.com>
>>>>>
>>>>> kvm_mmu_invalidate_end() updates struct kvm::mmu_invalidate_in_progress,
>>>>> which is protected by kvm::mmu_lock.  Call kvm_mmu_invalidate_end() before
>>>>> unlocking it, not after.
>>>>>
>>>>> Fixes: 8e9009ca6d14 ("KVM: Introduce per-page memory attributes")
>>>>
>>>> This Fixes tag is wrong.  It won't matter in the long run, but it makes my life that
>>>> much harder.
>>>>
>>>>> Signed-off-by: Isaku Yamahata <isaku.yamahata@...el.com>
>>>>> ---
>>>>>    virt/kvm/kvm_main.c | 15 ++++++++++++++-
>>>>>    1 file changed, 14 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>>>>> index 8bfeb615fc4d..49380cd62367 100644
>>>>> --- a/virt/kvm/kvm_main.c
>>>>> +++ b/virt/kvm/kvm_main.c
>>>>> @@ -535,6 +535,7 @@ struct kvm_mmu_notifier_range {
>>>>>            } arg;
>>>>>            gfn_handler_t handler;
>>>>>            on_lock_fn_t on_lock;
>>>>> +          on_unlock_fn_t before_unlock;
>>>>>            on_unlock_fn_t on_unlock;
>>>>
>>>> Ugh, shame on my past me.  Having on_lock and on_unlock be asymmetrical with respect
>>>> to the lock is nasty.
>>>>
>>>> I would much rather we either (a) be explicit, e.g. before_(un)lock and after_(un)lock,
>>>> or (b) have just on_(un)lock, make them symmetrical, and handle the SEV mess a
>>>> different way.
>>>>
>>>> The SEV hook doesn't actually care about running immediately after unlock, it just
>>>> wants to know if there was an overlapping memslot.  It can run after SRCU is dropped,
>>>> because even if we make the behavior more precise (right now it blasts WBINVD),
>>>> just having a reference to memslots isn't sufficient, the code needs to guarantee
>>>> memslots are *stable*.  And that is already guaranteed by the notifier code, i.e.
>>>> the SEV code could just reacquire SRCU.
>>>
>>> On a separate note here, the SEV hook blasting WBINVD is still causing
>>> serious performance degradation issues with SNP when triggered via
>>> AutoNUMA/numad/KSM, etc. With reference to previous discussions related
>>> to it, we plan to replace WBINVD with CLFLUSHOPT.
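
To illustrate the direction, a rough per-page sketch (hypothetical, not
posted code), assuming the page's host virtual address is at hand and
reusing the existing clflush_cache_range() helper:

static void sev_flush_guest_page(void *va)
{
        /*
         * clflush_cache_range() issues CLFLUSHOPT per cache line when
         * the CPU supports it (falling back to CLFLUSH) and fences the
         * flushes, so only the target page is written back instead of
         * the entire cache hierarchy as with WBINVD.
         */
        clflush_cache_range(va, PAGE_SIZE);
}
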
>>
>> Isn't the flush unnecessary when freeing shared memory?  My recollection is that
>> the problematic scenario is when encrypted memory is freed back to the host,
>> because KVM already flushes when mapping potentially encrypted memory into the
>> guest.
>>
>> With SNP+guest_memfd, private/encrypted memory should be unreachable via the
>> hva-based mmu_notifiers.  gmem should have full control of the page lifecycles,
>> i.e. can get the kernel virtual address as appropriate, and so SNP shouldn't
>> need the nuclear option.
>>
>> E.g. something like this?
>>
>> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
>> index 07756b7348ae..1c6828ae391d 100644
>> --- a/arch/x86/kvm/svm/sev.c
>> +++ b/arch/x86/kvm/svm/sev.c
>> @@ -2328,7 +2328,7 @@ static void sev_flush_encrypted_page(struct kvm_vcpu *vcpu, void *va)
>>
>>   void sev_guest_memory_reclaimed(struct kvm *kvm)
>>   {
>> -       if (!sev_guest(kvm))
>> +       if (!sev_guest(kvm) || sev_snp_guest(kvm))
>>                  return;
>>
>>          wbinvd_on_all_cpus();
> 
> I hope this is the final solution :)
> 
> So, short answer: no.
> 
> SNP+guest_memfd prevents untrusted host user space from directly
> modifying the data, which is good enough for CVE-2022-0171, but there
> is no guarantee that the host kernel could not, in some scenarios,
> access the data and generate dirty cache lines. In fact, AFAIK, an SNP
> VM does not track whether each page was previously shared, does it? If
> a page was previously shared and was written by the host kernel or
> devices before it was changed to private, no one tracks that, and the
> dirty cache lines are still there!
> 
> So, to avoid corner cases like the above, it seems we currently have
> to retain the property of flushing the cache whenever a guest memory
> mapping leaves the KVM NPT.
> 
> Of course, this is fundamentally because SME_COHERENT only applies to
> CPU cores, but not to DMA. If SME_COHERENT were complete, flushing
> would no longer be needed. Alternatively, we would need extra
> bookkeeping for KVM to know whether each page has dirty cache lines.
> Another alternative is to filter on the mmu_notifier reasons, which is
> the part I am planning to take on. Thoughts?
> 
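
As a strawman for filtering on the notifier reason (helper name
hypothetical, untested): protection-only invalidations such as NUMA
balancing change PTE permissions but do not free the page, so the SEV
hook could skip the flush for those, e.g.:

static bool kvm_reclaim_event_needs_flush(enum mmu_notifier_event event)
{
        /*
         * Protection-only updates leave the page contents in place
         * (e.g. NUMA balancing's PROT_NONE conversion arrives as
         * MMU_NOTIFY_PROTECTION_VMA via change_prot_numa()), so no
         * cache flush should be needed; actual unmap/free paths
         * (MMU_NOTIFY_UNMAP/MMU_NOTIFY_CLEAR) still require one.
         */
        return event != MMU_NOTIFY_PROTECTION_VMA &&
               event != MMU_NOTIFY_PROTECTION_PAGE;
}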

Now, running SNP+guest_memfd with the discard=both option enabled:

# bpftrace -e 'kprobe:sev_guest_memory_reclaimed {@[kstack]=count()}'
Attaching 1 probe...
^C

@[
     sev_guest_memory_reclaimed+5
     kvm_mmu_notifier_release+60
     __mmu_notifier_release+128
     exit_mmap+657
     __mmput+72
     mmput+49
     do_exit+752
     do_group_exit+57
     get_signal+2486
     arch_do_signal_or_restart+51
     exit_to_user_mode_prepare+257
     syscall_exit_to_user_mode+42
     do_syscall_64+109
     entry_SYSCALL_64_after_hwframe+114
]: 1
@[
     sev_guest_memory_reclaimed+5
     kvm_mmu_notifier_invalidate_range_start+869
     __mmu_notifier_invalidate_range_start+152
     change_protection+4628
     change_prot_numa+93
     task_numa_work+588
     task_work_run+108
     exit_to_user_mode_prepare+337
     syscall_exit_to_user_mode+42
     do_syscall_64+109
     entry_SYSCALL_64_after_hwframe+114
]: 2
@[
     sev_guest_memory_reclaimed+5
     kvm_mmu_notifier_invalidate_range_start+869
     __mmu_notifier_invalidate_range_start+152
     change_protection+4628
     change_prot_numa+93
     task_numa_work+588
     task_work_run+108
     xfer_to_guest_mode_handle_work+228
     kvm_arch_vcpu_ioctl_run+1572
     kvm_vcpu_ioctl+671
     __x64_sys_ioctl+153
     do_syscall_64+96
     entry_SYSCALL_64_after_hwframe+114
]: 2
@[
     sev_guest_memory_reclaimed+5
     kvm_set_memslot+740
     __kvm_set_memory_region.part.0+411
     kvm_set_memory_region+89
     kvm_vm_ioctl+1482
     __x64_sys_ioctl+153
     do_syscall_64+96
     entry_SYSCALL_64_after_hwframe+114
]: 104
@[
     sev_guest_memory_reclaimed+5
     kvm_mmu_notifier_invalidate_range_start+869
     __mmu_notifier_invalidate_range_start+152
     zap_page_range_single+384
     unmap_mapping_range+279
     shmem_fallocate+932
     vfs_fallocate+345
     __x64_sys_fallocate+71
     do_syscall_64+96
     entry_SYSCALL_64_after_hwframe+114
]: 5465
@[
     sev_guest_memory_reclaimed+5
     kvm_mmu_notifier_invalidate_range_start+869
     __mmu_notifier_invalidate_range_start+152
     zap_page_range_single+384
     madvise_vma_behavior+1967
     madvise_walk_vmas+190
     do_madvise.part.0+264
     __x64_sys_madvise+98
     do_syscall_64+96
     entry_SYSCALL_64_after_hwframe+114
]: 69677

The maximum hits are seen with shmem_fallocate and madvise, which we
believe are in response to shared<->private GHCB page-state-change
requests. discard=both handles discard for both private and shared
memory, so freeing shared memory via
fallocate(shared_memfd, FALLOC_FL_PUNCH_HOLE, ...) triggers the
notifiers when shared pages are freed after the guest converts a GPA to
private.
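
For reference, the VMM side of such a conversion presumably boils down
to something like this (a sketch; discard_shared_range() and its
parameters are placeholders, not actual VMM code):

#define _GNU_SOURCE
#include <fcntl.h>      /* fallocate(), FALLOC_FL_* */
#include <stdio.h>

/*
 * Hypothetical VMM-side handling of a shared->private page-state
 * change: punch out the shared backing for [offset, offset + len) so
 * the now-unused shared pages are freed, which is what fires the
 * hva-based mmu_notifiers (and thus sev_guest_memory_reclaimed())
 * seen in the traces above.
 */
static void discard_shared_range(int shared_memfd, off_t offset, off_t len)
{
        if (fallocate(shared_memfd,
                      FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                      offset, len) < 0)
                perror("fallocate(PUNCH_HOLE)");
}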

Now, since with SNP+guest_memfd guest private memory is no longer
mapped in the host, I added a generic fix (instead of Sean's proposed
patch checking for an SNP guest inside sev_guest_memory_reclaimed()):

--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -593,6 +593,9 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
                        unsigned long hva_start, hva_end;

                        slot = container_of(node, struct kvm_memory_slot,
                                            hva_node[slots->node_idx]);
+                       if (kvm_slot_can_be_private(slot)) {
+                               continue;
+                       }
                        hva_start = max(range->start, slot->userspace_addr);
                        hva_end = min(range->end, slot->userspace_addr +
                                                  (slot->npages << PAGE_SHIFT));

With this fix added, the traces are as follows:

# bpftrace -e 'kprobe:sev_guest_memory_reclaimed {@[kstack]=count()}'
Attaching 1 probe...
^C

@[
     sev_guest_memory_reclaimed+5
     kvm_mmu_notifier_invalidate_range_start+812
     __mmu_notifier_invalidate_range_start+152
     change_protection+4628
     change_prot_numa+93
     task_numa_work+588
     task_work_run+108
     exit_to_user_mode_prepare+337
     syscall_exit_to_user_mode+42
     do_syscall_64+109
     entry_SYSCALL_64_after_hwframe+114
]: 1
@[
     sev_guest_memory_reclaimed+5
     kvm_mmu_notifier_release+60
     __mmu_notifier_release+128
     exit_mmap+657
     __mmput+72
     mmput+49
     do_exit+752
     do_group_exit+57
     get_signal+2486
     arch_do_signal_or_restart+51
     exit_to_user_mode_prepare+257
     syscall_exit_to_user_mode+42
     do_syscall_64+109
     entry_SYSCALL_64_after_hwframe+114
]: 1
@[
     sev_guest_memory_reclaimed+5
     kvm_mmu_notifier_invalidate_range_start+812
     __mmu_notifier_invalidate_range_start+152
     change_protection+4628
     change_prot_numa+93
     task_numa_work+588
     task_work_run+108
     xfer_to_guest_mode_handle_work+228
     kvm_arch_vcpu_ioctl_run+1572
     kvm_vcpu_ioctl+671
     __x64_sys_ioctl+153
     do_syscall_64+96
     entry_SYSCALL_64_after_hwframe+114
]:
@[
     sev_guest_memory_reclaimed+5
     kvm_set_memslot+740
     __kvm_set_memory_region.part.0+411
     kvm_set_memory_region+89
     kvm_vm_ioctl+1482
     __x64_sys_ioctl+153
     do_syscall_64+96
     entry_SYSCALL_64_after_hwframe+114
]: 104
#

As expected, the SEV hook is no longer invoked for guest private memory
pages (no more invalidations from shmem_fallocate() + madvise()).

Isn't it better to skip invoking the KVM MMU invalidation notifier when
the invalidated range belongs to guest private memory?

> In fact, AFAIK, an SNP VM does
> not track whether each page was previously shared, does it? If a page
> was previously shared and was written by the host kernel or devices
> before it was changed to private, no one tracks that, and the dirty
> cache lines are still there!

The skipped invalidation here covers the case Mingwei mentioned above,
where pages are changed from shared->private and the subsequent freeing
of the shared pages triggers the invalidation.

But then why are we concerned about this? I thought the concern was the
case where the dirty cache lines contain encrypted guest data.

Thanks,
Ashish
