linux-kernel - Re: [RFC PATCH 0/9] kvm: implement atomic memslot updates

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <a412085f-9391-8a4c-916c-513c800c35b1@redhat.com>
Date:   Thu, 13 Oct 2022 10:44:20 +0200
From:   David Hildenbrand <david@...hat.com>
To:     Emanuele Giuseppe Esposito <eesposit@...hat.com>,
        Sean Christopherson <seanjc@...gle.com>,
        Paolo Bonzini <pbonzini@...hat.com>
Cc:     Maxim Levitsky <mlevitsk@...hat.com>, kvm@...r.kernel.org,
        Vitaly Kuznetsov <vkuznets@...hat.com>,
        Wanpeng Li <wanpengli@...cent.com>,
        Jim Mattson <jmattson@...gle.com>,
        Joerg Roedel <joro@...tes.org>,
        Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
        Dave Hansen <dave.hansen@...ux.intel.com>, x86@...nel.org,
        "H. Peter Anvin" <hpa@...or.com>, linux-kernel@...r.kernel.org,
        Like Xu <like.xu.linux@...il.com>
Subject: Re: [RFC PATCH 0/9] kvm: implement atomic memslot updates

On 13.10.22 09:43, Emanuele Giuseppe Esposito wrote:
> 
> 
> Am 29/09/2022 um 23:39 schrieb Sean Christopherson:
>> If we really want to provide a better experience for userspace, why not provide
>> more primitives to address those specific use cases?  E.g. fix KVM's historic wart
>> of disallowing toggling of KVM_MEM_READONLY, and then add one or more ioctls to:
>>
>>    - Merge 2+ memory regions that are contiguous in GPA and HVA
>>    - Split a memory region into 2+ memory regions
>>    - Truncate a memory region
>>    - Grow a memory region
> 
> I looked again at the code and specifically the use case that triggers
> the crash in bugzilla. I *think* (correct me if I am wrong), that the
> only operation that QEMU performs right now is "grow/shrink".

I remember that there were BUG reports where we'd actually split and run 
into that problem. Just don't have them at hand. I think they happened 
during early boot when the OS re-configured some PCI thingies.

> 
> So *if* we want to go this way, we could start with a simple grow/shrink
> API.
> 
> Even though we need to consider that this could bring additional
> complexity in QEMU. Currently, DELETE+CREATE (grow/shrink) is not
> performed one after the other (in my innocent fantasy I was expecting to
> find 2 subsequent ioctls in the code), but in QEMU's
> address_space_set_flatview(), which seems to first remove all regions
> and then add them when changing flatviews.
> 
> address_space_update_topology_pass(as, old_view2, new_view, adding=false);
> address_space_update_topology_pass(as, old_view2, new_view, adding=true);
> 
> I don't think we can change this, as other listeners also rely on such
> ordering, but we can still batch all callback requests like I currently
> do and process them in kvm_commit(), figuring there which operation is
> which.
> 
> In other words, we would have something like:
> 
> region_del() --> DELETE memslot X -> add it to the queue of operations
> region_del() --> DELETE memslot Y -> add it to the queue of operations
> region_add() --> CREATE memslot X (size doubled) -> add it to the queue
> of operations
> region_add() --> CREATE memslot Y (size halved) -> add it to the queue
> of operations
> ...
> commit() --> scan QUEUE and figure what to do -> GROW X (+size), SHRINK
> Y (-size) -> issue 2 ioctls, GROW and SHRINK.

I communicated resizes (region_resize()) to the notifier in patch #3 of 
https://lore.kernel.org/qemu-devel/20200312161217.3590-1-david@redhat.com/

You could use that independently of the remainder. It should be less 
controversial ;)

But I think it only tackles part of the more generic problem (split/merge).

> 
>> That would probably require more KVM code overall, but each operation would be more
>> tightly bounded and thus simpler to define.  And I think more precise APIs would
>> provide other benefits, e.g. growing a region wouldn't require first deleting the
>> current region, and so could avoid zapping SPTEs and destroying metadata.  Merge,
>> split, and truncate would have similar benefits, though they might be more
>> difficult to realize in practice.
> 
> So essentially grow would not require INVALIDATE. Makes sense, but would
> it work also with shrink? I guess so, as the memslot is still present
> (but shrinked) right?
> 
> Paolo, would you be ok with this smaller API? Probably just starting
> with grow and shrink first.
> 
> I am not against any of the two approaches:
> - my approach has the disadvantage that the list could be arbitrarily
> long, and it is difficult to rollback the intermediate changes if
> something breaks during the request processing (but could be simplified
> by making kvm exit or crash).
> 
> - Sean approach could potentially provide more burden to the userspace,
> as we need to figure which operation is which. Also from my
> understanding split and merge won't be really straightforward to
> implement, especially in userspace.
> 
> David, any concern from userspace prospective on this "CISC" approach?

In contrast to resizes in QEMU that only affect a single memory 
region/slot, splitting/merging is harder to factor out and communicate 
to a notifier. As an alternative, we could handle it in the commit stage 
in the notifier itself, similar to what my prototype does, and figure 
out what needs to be done in there and how to call the proper KVM 
interface (and which interface to call).

With virtio-mem (in the future) we might see merges of 2 slots into a 
single one, by closing a gap in-between them. In "theory" we could 
combine some updates into a single transaction. But it won't be 100s ...

I think I'd prefer an API that doesn't require excessive ad-hoc 
extensions later once something in QEMU changes.


I think in essence, all we care about is performing atomic changes that 
*have to be atomic*, because something we add during a transaction 
overlaps with something we remove during a transaction. Not necessarily 
all updates in a transaction!

My prototype essentially detects that scenario, but doesn't call new KVM 
interfaces to deal with these.

I assume once we take that into consideration, we can mostly assume that 
any such atomic updates (only of the regions that really have to be part 
of an atomic update) won't involve more than a handful of memory 
regions. We could add a sane KVM API limit.

And if we run into that API limit in QEMU, we can print a warning and do 
it non-atomically.

-- 
Thanks,

David / dhildenb