Date:   Tue, 22 Aug 2023 17:30:50 -0500
From:   "Kalra, Ashish" <ashish.kalra@....com>
To:     Mingwei Zhang <mizhang@...gle.com>,
        Sean Christopherson <seanjc@...gle.com>,
        Jacky Li <jackyli@...gle.com>
Cc:     isaku.yamahata@...el.com, kvm@...r.kernel.org,
        linux-kernel@...r.kernel.org, isaku.yamahata@...il.com,
        Michael Roth <michael.roth@....com>,
        Paolo Bonzini <pbonzini@...hat.com>, erdemaktas@...gle.com,
        Sagi Shahar <sagis@...gle.com>,
        David Matlack <dmatlack@...gle.com>,
        Kai Huang <kai.huang@...el.com>,
        Zhi Wang <zhi.wang.linux@...il.com>, chen.bo@...el.com,
        linux-coco@...ts.linux.dev,
        Chao Peng <chao.p.peng@...ux.intel.com>,
        Ackerley Tng <ackerleytng@...gle.com>,
        Vishal Annapurve <vannapurve@...gle.com>,
        Yuan Yao <yuan.yao@...ux.intel.com>,
        Jarkko Sakkinen <jarkko@...nel.org>,
        Xu Yilun <yilun.xu@...el.com>,
        Quentin Perret <qperret@...gle.com>, wei.w.wang@...el.com,
        Fuad Tabba <tabba@...gle.com>
Subject: Re: [PATCH 4/8] KVM: gmem: protect kvm_mmu_invalidate_end()



On 8/21/2023 4:44 PM, Kalra, Ashish wrote:
> Hello Mingwei & Sean,
> 
> On 8/18/2023 9:08 PM, Mingwei Zhang wrote:
>> +Jacky Li
>>
>> On Fri, Aug 18, 2023 at 3:45 PM Sean Christopherson 
>> <seanjc@...gle.com> wrote:
>>>
>>> +Mingwei to correct me if I'm wrong
>>>
>>> On Fri, Aug 18, 2023, Ashish Kalra wrote:
>>>>
>>>> On 8/18/2023 12:55 PM, Sean Christopherson wrote:
>>>>> On Tue, Aug 15, 2023, isaku.yamahata@...el.com wrote:
>>>>>> From: Isaku Yamahata <isaku.yamahata@...el.com>
>>>>>>
>>>>>> kvm_mmu_invalidate_end() updates struct kvm::mmu_invalidate_in_progress,
>>>>>> which is protected by kvm::mmu_lock.  Call kvm_mmu_invalidate_end()
>>>>>> before unlocking it, not after the unlock.
>>>>>>
>>>>>> Fixes: 8e9009ca6d14 ("KVM: Introduce per-page memory attributes")
>>>>>
>>>>> This Fixes tag is wrong.  It won't matter in the long run, but it
>>>>> makes my life that much harder.
>>>>>
>>>>>> Signed-off-by: Isaku Yamahata <isaku.yamahata@...el.com>
>>>>>> ---
>>>>>>    virt/kvm/kvm_main.c | 15 ++++++++++++++-
>>>>>>    1 file changed, 14 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>>>>>> index 8bfeb615fc4d..49380cd62367 100644
>>>>>> --- a/virt/kvm/kvm_main.c
>>>>>> +++ b/virt/kvm/kvm_main.c
>>>>>> @@ -535,6 +535,7 @@ struct kvm_mmu_notifier_range {
>>>>>>            } arg;
>>>>>>            gfn_handler_t handler;
>>>>>>            on_lock_fn_t on_lock;
>>>>>> +          on_unlock_fn_t before_unlock;
>>>>>>            on_unlock_fn_t on_unlock;
>>>>>
>>>>> Ugh, shame on past me.  Having on_lock and on_unlock be asymmetrical
>>>>> with respect to the lock is nasty.
>>>>>
>>>>> I would much rather we either (a) be explicit, e.g. before_(un)lock
>>>>> and after_(un)lock, or (b) have just on_(un)lock, make them
>>>>> symmetrical, and handle the SEV mess a different way.
>>>>>
>>>>> The SEV hook doesn't actually care about running immediately after
>>>>> unlock, it just wants to know if there was an overlapping memslot.  It
>>>>> can run after SRCU is dropped, because even if we make the behavior
>>>>> more precise (right now it blasts WBINVD), just having a reference to
>>>>> memslots isn't sufficient; the code needs to guarantee memslots are
>>>>> *stable*.  And that is already guaranteed by the notifier code, i.e.
>>>>> the SEV code could just reacquire SRCU.
>>>>
>>>> On a separate note here, the SEV hook blasting WBINVD is still causing
>>>> serious performance degradation issues with SNP triggered via
>>>> AutoNUMA/numad/KSM, etc.  With reference to previous discussions
>>>> related to it, we have plans to replace WBINVD with CLFLUSHOPT.
>>>
>>> Isn't the flush unnecessary when freeing shared memory?  My recollection
>>> is that the problematic scenario is when encrypted memory is freed back
>>> to the host, because KVM already flushes when mapping potentially
>>> encrypted memory into the guest.
>>>
>>> With SNP+guest_memfd, private/encrypted memory should be unreachable via
>>> the hva-based mmu_notifiers.  gmem should have full control of the page
>>> lifecycles, i.e. it can get the kernel virtual address as appropriate,
>>> and so SNP shouldn't need the nuclear option.
>>>
>>> E.g. something like this?
>>>
>>> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
>>> index 07756b7348ae..1c6828ae391d 100644
>>> --- a/arch/x86/kvm/svm/sev.c
>>> +++ b/arch/x86/kvm/svm/sev.c
>>> @@ -2328,7 +2328,7 @@ static void sev_flush_encrypted_page(struct kvm_vcpu *vcpu, void *va)
>>>
>>>   void sev_guest_memory_reclaimed(struct kvm *kvm)
>>>   {
>>> -       if (!sev_guest(kvm))
>>> +       if (!sev_guest(kvm) || sev_snp_guest(kvm))
>>>                  return;
>>>
>>>          wbinvd_on_all_cpus();
>>
>> I hope this is the final solution :)
>>
>> So, short answer: no.
>>
>> SNP+guest_memfd prevents untrusted host userspace from directly
>> modifying the data, which is good enough for CVE-2022-0171, but there is
>> no guarantee that the host kernel won't, in some scenarios, access the
>> data and generate dirty cache lines.  In fact, AFAIK, an SNP VM does not
>> track whether each page was previously shared, does it?  If a page was
>> previously shared and was written by the host kernel or devices before
>> it was changed to private, no one tracks that and the dirty cache lines
>> are still there!
>>
>> So, to avoid corner-case situations like the above, it seems we
>> currently have to retain the property of flushing the cache when the
>> guest memory mapping leaves the KVM NPT.
>>
>> Of course, this is fundamentally because SME_COHERENT only applies to
>> CPU cores, but not DMA.  If SME_COHERENT were complete, flushing would
>> no longer be needed.  Alternatively, we need extra bookkeeping for KVM
>> to know whether each page has dirty cache lines.  Another alternative is
>> to filter mmu_notifier reasons, which is the approach I am planning to
>> take.  Thoughts?
>>

Additionally, looking at MMU notifier event filtering and the various 
code paths (of interest) from which the MMU invalidation notifier gets 
invoked:

For NUMA load balancing during #PF fault handling, try_to_migrate_one() 
invokes the MMU invalidation notifier with the MMU_NOTIFY_CLEAR event, 
and in the KSM code paths, try_to_merge_one_page() -> 
write_protect_page() and try_to_merge_one_page() -> replace_page() also 
invoke the MMU invalidation notifier with the MMU_NOTIFY_CLEAR event.

Now, I remember from previous discussions that the CLEAR event is an 
overloaded event used for page zapping, madvise, etc., so I don't think 
we will be able to filter *out* this event, and this event is triggering 
most of the performance issues we are observing.
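
To illustrate (purely a hypothetical sketch, not a proposed patch): even 
if we plumbed the mmu_notifier event down into the SEV reclaim hook, an 
event-based filter would end up looking something like the helper below 
(the helper name and the plumbing are made up; the event values are the 
upstream enum mmu_notifier_event ones):

static bool sev_reclaim_needs_flush(enum mmu_notifier_event event)
{
	switch (event) {
	case MMU_NOTIFY_SOFT_DIRTY:
		/* Write-protect only, nothing is freed: could skip the flush. */
		return false;
	case MMU_NOTIFY_CLEAR:
		/*
		 * Overloaded: NUMA balancing and KSM report CLEAR, but so do
		 * the zap/madvise paths that really free memory, so CLEAR
		 * cannot simply be filtered out.
		 */
		return true;
	default:
		return true;
	}
}

i.e. even with such a filter in place, the madvise/fallocate-triggered 
CLEAR invalidations that dominate the traces further down would still hit 
the flush path.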

So considering what Sean mentioned earlier:

 > What I'm saying is that for guests whose private memory is backed by
 > guest_memfd(), which is all SNP guests, it should be impossible for
 > memory that is reachable via mmu_notifiers to be mapped in KVM's MMU as
 > private.  So yes, KVM needs to flush when memory is freed from
 > guest_memfd(), but not for memory that is reclaimed by mmu_notifiers,
 > i.e. not for sev_guest_memory_reclaimed().

I think the right solution for SNP guests should be:

 >>> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
 >>> index 07756b7348ae..1c6828ae391d 100644
 >>> --- a/arch/x86/kvm/svm/sev.c
 >>> +++ b/arch/x86/kvm/svm/sev.c
 >>> @@ -2328,7 +2328,7 @@ static void sev_flush_encrypted_page(struct kvm_vcpu *vcpu, void *va)
 >>>
 >>>   void sev_guest_memory_reclaimed(struct kvm *kvm)
 >>>   {
 >>> -       if (!sev_guest(kvm))
 >>> +       if (!sev_guest(kvm) || sev_snp_guest(kvm))
 >>>                  return;
 >>>
 >>>          wbinvd_on_all_cpus();
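
And if it turns out that some of these paths still need a flush, the 
CLFLUSHOPT-based alternative mentioned earlier would look roughly like the 
sketch below: flush only the affected range instead of blasting WBINVD on 
all CPUs.  This is purely illustrative (the wrapper name is made up and it 
assumes we have a kernel virtual address for the range, which gmem can 
provide for private pages); clflush_cache_range() is the existing arch/x86 
helper:

static void sev_flush_guest_memory_range(void *va, unsigned int size)
{
	/*
	 * With SME_COHERENT the CPU keeps cache lines for the different
	 * C-bit mappings coherent, so no flush is needed for CPU-originated
	 * writes (DMA is a separate question, as noted earlier in the
	 * thread).
	 */
	if (boot_cpu_has(X86_FEATURE_SME_COHERENT))
		return;

	/* Flush only this range with CLFLUSHOPT instead of WBINVD. */
	clflush_cache_range(va, size);
}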

Thoughts?

Thanks,
Ashish

> 
> Now running SNP+guest_memfd with discard=both option enabled:
> 
> # bpftrace -e 'kprobe:sev_guest_memory_reclaimed {@[kstack]=count()}'
> Attaching 1 probe...
> ^C
> 
> @[
>      sev_guest_memory_reclaimed+5
>      kvm_mmu_notifier_release+60
>      __mmu_notifier_release+128
>      exit_mmap+657
>      __mmput+72
>      mmput+49
>      do_exit+752
>      do_group_exit+57
>      get_signal+2486
>      arch_do_signal_or_restart+51
>      exit_to_user_mode_prepare+257
>      syscall_exit_to_user_mode+42
>      do_syscall_64+109
>      entry_SYSCALL_64_after_hwframe+114
> ]: 1
> @[
>      sev_guest_memory_reclaimed+5
>      kvm_mmu_notifier_invalidate_range_start+869
>      __mmu_notifier_invalidate_range_start+152
>      change_protection+4628
>      change_prot_numa+93
>      task_numa_work+588
>      task_work_run+108
>      exit_to_user_mode_prepare+337
>      syscall_exit_to_user_mode+42
>      do_syscall_64+109
>      entry_SYSCALL_64_after_hwframe+114
> ]: 2
> @[
>      sev_guest_memory_reclaimed+5
>      kvm_mmu_notifier_invalidate_range_start+869
>      __mmu_notifier_invalidate_range_start+152
>      change_protection+4628
>      change_prot_numa+93
>      task_numa_work+588
>      task_work_run+108
>      xfer_to_guest_mode_handle_work+228
>      kvm_arch_vcpu_ioctl_run+1572
>      kvm_vcpu_ioctl+671
>      __x64_sys_ioctl+153
>      do_syscall_64+96
>      entry_SYSCALL_64_after_hwframe+114
> ]: 2
> @[
>      sev_guest_memory_reclaimed+5
>      kvm_set_memslot+740
>      __kvm_set_memory_region.part.0+411
>      kvm_set_memory_region+89
>      kvm_vm_ioctl+1482
>      __x64_sys_ioctl+153
>      do_syscall_64+96
>      entry_SYSCALL_64_after_hwframe+114
> ]: 104
> @[
>      sev_guest_memory_reclaimed+5
>      kvm_mmu_notifier_invalidate_range_start+869
>      __mmu_notifier_invalidate_range_start+152
>      zap_page_range_single+384
>      unmap_mapping_range+279
>      shmem_fallocate+932
>      vfs_fallocate+345
>      __x64_sys_fallocate+71
>      do_syscall_64+96
>      entry_SYSCALL_64_after_hwframe+114
> ]: 5465
> @[
>      sev_guest_memory_reclaimed+5
>      kvm_mmu_notifier_invalidate_range_start+869
>      __mmu_notifier_invalidate_range_start+152
>      zap_page_range_single+384
>      madvise_vma_behavior+1967
>      madvise_walk_vmas+190
>      do_madvise.part.0+264
>      __x64_sys_madvise+98
>      do_syscall_64+96
>      entry_SYSCALL_64_after_hwframe+114
> ]: 69677
> 
> The maximum hits are seen with shmem_fallocate and madvise, which we 
> believe are in response to shared<->private GHCB page-state-change 
> requests.  discard=both handles discard for both private and shared 
> memory, so freeing shared memory via fallocate(shared_memfd, 
> FALLOC_FL_PUNCH_HOLE, ...) triggers the notifiers when freeing shared 
> pages after the guest converts a GPA to private.
> 
> Now, as with SNP+guest_memfd, guest private memory is no longer mapped 
> in the host, so I added a generic fix (instead of Sean's proposed patch 
> of checking for an SNP guest inside sev_guest_memory_reclaimed()):
> 
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -593,6 +593,9 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
>                          unsigned long hva_start, hva_end;
> 
>                          slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]);
> +                       if (kvm_slot_can_be_private(slot)) {
> +                               continue;
> +                       }
>                          hva_start = max(range->start, slot->userspace_addr);
>                          hva_end = min(range->end, slot->userspace_addr +
>                                                    (slot->npages << PAGE_SHIFT));
> 
> With this fix added, the traces are as follows:
> 
> # bpftrace -e 'kprobe:sev_guest_memory_reclaimed {@[kstack]=count()}'
> Attaching 1 probe...
> ^C
> 
> @[
>      sev_guest_memory_reclaimed+5
>      kvm_mmu_notifier_invalidate_range_start+812
>      __mmu_notifier_invalidate_range_start+152
>      change_protection+4628
>      change_prot_numa+93
>      task_numa_work+588
>      task_work_run+108
>      exit_to_user_mode_prepare+337
>      syscall_exit_to_user_mode+42
>      do_syscall_64+109
>      entry_SYSCALL_64_after_hwframe+114
> ]: 1
> @[
>      sev_guest_memory_reclaimed+5
>      kvm_mmu_notifier_release+60
>      __mmu_notifier_release+128
>      exit_mmap+657
>      __mmput+72
>      mmput+49
>      do_exit+752
>      do_group_exit+57
>      get_signal+2486
>      arch_do_signal_or_restart+51
>      exit_to_user_mode_prepare+257
>      syscall_exit_to_user_mode+42
>      do_syscall_64+109
>      entry_SYSCALL_64_after_hwframe+114
> ]: 1
> @[
>      sev_guest_memory_reclaimed+5
>      kvm_mmu_notifier_invalidate_range_start+812
>      __mmu_notifier_invalidate_range_start+152
>      change_protection+4628
>      change_prot_numa+93
>      task_numa_work+588
>      task_work_run+108
>      xfer_to_guest_mode_handle_work+228
>      kvm_arch_vcpu_ioctl_run+1572
>      kvm_vcpu_ioctl+671
>      __x64_sys_ioctl+153
>      do_syscall_64+96
>      entry_SYSCALL_64_after_hwframe+114
> ]:
> @[
>      sev_guest_memory_reclaimed+5
>      kvm_set_memslot+740
>      __kvm_set_memory_region.part.0+411
>      kvm_set_memory_region+89
>      kvm_vm_ioctl+1482
>      __x64_sys_ioctl+153
>      do_syscall_64+96
>      entry_SYSCALL_64_after_hwframe+114
> ]: 104
> #
> 
> As expected, the SEV hook is not invoked for the guest private memory 
> pages (no more invalidations from shmem_fallocate() + madvise()).
> 
> Isn't it better to skip invoking the KVM MMU invalidation notifier when 
> the invalidated range belongs to guest private memory?
> 
>  > In fact, AFAIC, SNP VM does
>  > not track whether each page is previously shared, isn't it? If a page
>  > was previously shared and was written by the host kernel or devices
>  > before it was changed to private. No one tracks it and dirty caches
>  > are there!
> 
> The skipped invalidation here covers the case Mingwei mentioned above, 
> where a page that was previously shared is converted to private and the 
> subsequent freeing of the shared backing triggers the invalidation.
> 
> But then why are we concerned about this?  I thought our concern was the 
> case where the dirty cache lines contain encrypted guest data?
> 
> Thanks,
> Ashish
