Message-ID: <F3B77ECE-8C70-47AA-98F8-0C032CB5F568@amd.com>
Date: Wed, 10 Mar 2021 03:42:04 +0000
From: "Kalra, Ashish" <Ashish.Kalra@....com>
To: Steve Rutherford <srutherford@...gle.com>
CC: "Singh, Brijesh" <brijesh.singh@....com>,
Sean Christopherson <seanjc@...gle.com>,
"pbonzini@...hat.com" <pbonzini@...hat.com>,
"joro@...tes.org" <joro@...tes.org>,
"Lendacky, Thomas" <Thomas.Lendacky@....com>,
"kvm@...r.kernel.org" <kvm@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"venu.busireddy@...cle.com" <venu.busireddy@...cle.com>,
Will Deacon <will@...nel.org>,
Quentin Perret <qperret@...gle.com>
Subject: Re: [PATCH v10 10/16] KVM: x86: Introduce KVM_GET_SHARED_PAGES_LIST ioctl
> On Mar 9, 2021, at 3:22 AM, Steve Rutherford <srutherford@...gle.com> wrote:
>
> On Mon, Mar 8, 2021 at 1:11 PM Brijesh Singh <brijesh.singh@....com> wrote:
>>
>>
>>> On 3/8/21 1:51 PM, Sean Christopherson wrote:
>>> On Mon, Mar 08, 2021, Ashish Kalra wrote:
>>>> On Fri, Feb 26, 2021 at 09:44:41AM -0800, Sean Christopherson wrote:
>>>>> +Will and Quentin (arm64)
>>>>>
>>>>> Moving the non-KVM x86 folks to bcc, I don't think they care about KVM
>>>>> details at this point.
>>>>>
>>>>> On Fri, Feb 26, 2021, Ashish Kalra wrote:
>>>>>> On Thu, Feb 25, 2021 at 02:59:27PM -0800, Steve Rutherford wrote:
>>>>>>> On Thu, Feb 25, 2021 at 12:20 PM Ashish Kalra <ashish.kalra@....com> wrote:
>>>>>>> Thanks for grabbing the data!
>>>>>>>
>>>>>>> I am fine with both paths. Sean has stated an explicit desire for
>>>>>>> hypercall exiting, so I think that would be the current consensus.
>>>>> Yep, though it'd be good to get Paolo's input, too.
>>>>>
>>>>>>> If we want to do hypercall exiting, this should be in a follow-up
>>>>>>> series where we implement something more generic, e.g. a hypercall
>>>>>>> exiting bitmap or hypercall exit list. If we are taking the hypercall
>>>>>>> exit route, we can drop the kvm side of the hypercall.
>>>>> I don't think this is a good candidate for arbitrary hypercall interception. Or
>>>>> rather, I think hypercall interception should be an orthogonal implementation.
>>>>>
>>>>> The guest, including guest firmware, needs to be aware that the hypercall is
>>>>> supported, and the ABI needs to be well-defined. Relying on userspace VMMs to
>>>>> implement a common ABI is an unnecessary risk.
>>>>>
>>>>> We could make KVM's default behavior be a nop, i.e. have KVM enforce the ABI but
>>>>> require further VMM intervention. But, I just don't see the point, it would
>>>>> save only a few lines of code. It would also limit what KVM could do in the
>>>>> future, e.g. if KVM wanted to do its own bookkeeping _and_ exit to userspace,
>>>>> then mandatory interception would essentially make it impossible for KVM to do
>>>>> bookkeeping while still honoring the interception request.
>>>>>
>>>>> However, I do think it would make sense to have the userspace exit be a generic
>>>>> exit type. But hey, we already have the necessary ABI defined for that! It's
>>>>> just not used anywhere.
>>>>>
>>>>>     /* KVM_EXIT_HYPERCALL */
>>>>>     struct {
>>>>>         __u64 nr;
>>>>>         __u64 args[6];
>>>>>         __u64 ret;
>>>>>         __u32 longmode;
>>>>>         __u32 pad;
>>>>>     } hypercall;
>>>>>
>>>>>
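To make the userspace side of that exit concrete, here is a minimal sketch of how a VMM might consume such an exit and do its own bookkeeping. It is only an illustration under stated assumptions: the struct is a local mirror of the layout quoted above, and the hypercall number, the argument order (gpa, npages, enc) and every demo_* name are hypothetical, not an ABI defined by KVM.

```c
#include <stdint.h>

/* Local mirror of the KVM_EXIT_HYPERCALL payload quoted above. */
struct demo_hypercall_exit {
    uint64_t nr;
    uint64_t args[6];
    uint64_t ret;
    uint32_t longmode;
    uint32_t pad;
};

#define DEMO_HC_PAGE_ENC_STATUS 12u   /* hypothetical hypercall number */
#define DEMO_PAGE_SHIFT         12
#define DEMO_MAX_PAGES          1024u /* hypothetical guest size in pages */

/* Hypothetical VMM state: one bit per guest page, set means "shared". */
struct demo_vmm {
    uint8_t shared[DEMO_MAX_PAGES / 8];
};

static int demo_is_shared(const struct demo_vmm *vmm, uint64_t gfn)
{
    return (vmm->shared[gfn / 8] >> (gfn % 8)) & 1;
}

/*
 * Assumed argument layout: args[0] = gpa, args[1] = npages,
 * args[2] = 1 for encrypted (private), 0 for shared.
 * Returns 0 on success and stores the guest-visible result in hc->ret.
 */
static int demo_handle_hypercall(struct demo_vmm *vmm,
                                 struct demo_hypercall_exit *hc)
{
    uint64_t gfn, npages, i;

    hc->ret = (uint64_t)-1;
    if (hc->nr != DEMO_HC_PAGE_ENC_STATUS)
        return -1;

    gfn = hc->args[0] >> DEMO_PAGE_SHIFT;
    npages = hc->args[1];
    if (gfn >= DEMO_MAX_PAGES || npages > DEMO_MAX_PAGES - gfn)
        return -1;

    for (i = gfn; i < gfn + npages; i++) {
        if (hc->args[2])
            vmm->shared[i / 8] &= (uint8_t)~(1u << (i % 8));
        else
            vmm->shared[i / 8] |= (uint8_t)(1u << (i % 8));
    }

    hc->ret = 0;
    return 0;
}
```

With the bookkeeping in the VMM like this, the migration code can consult its own bitmap instead of asking KVM for a list.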
>>>>>>> Userspace could also handle the MSR using MSR filters (would need to
>>>>>>> confirm that). Then userspace could also be in control of the cpuid bit.
>>>>> An MSR is not a great fit; it's x86 specific and limited to 64 bits of data.
>>>>> The data limitation could be fudged by shoving data into non-standard GPRs, but
>>>>> that will result in truly heinous guest code, and extensibility issues.
>>>>>
>>>>> The data limitation is a moot point, because the x86-only thing is a deal
>>>>> breaker. arm64's pKVM work has a near-identical use case for a guest to share
>>>>> memory with a host. I can't think of a clever way to avoid having to support
>>>>> TDX's and SNP's hypervisor-agnostic variants, but we can at least not have
>>>>> multiple KVM variants.
>>>>>
>>>> Potentially, there is another reason for in-kernel hypercall handling
>>>> considering SEV-SNP. In case of SEV-SNP the RMP table tracks the state
>>>> of each guest page, for instance pages in hypervisor state, i.e., pages
>>>> with C=0 and pages in guest valid state with C=1.
>>>>
>>>> Now, there shouldn't be a need for page encryption status hypercalls on
>>>> SEV-SNP as KVM can track & reference guest page status directly using
>>>> the RMP table.
>>> Relying on the RMP table itself would require locking the RMP table for an
>>> extended duration, and walking the entire RMP to find shared pages would be
>>> very inefficient.
>>>
>>>> As KVM maintains the RMP table, we will need SET/GET type of
>>>> interfaces to provide the guest page encryption status to userspace.
>>> Hrm, somehow I temporarily forgot about SNP and TDX adding their own hypercalls
>>> for converting between shared and private. And in the case of TDX, the hypercall
>>> can't be trusted, i.e. is just a hint, otherwise the guest could induce a #MC in
>>> the host.
>>>
>>> But, the different guest behavior doesn't require KVM to maintain a list/tree,
>>> e.g. adding a dedicated KVM_EXIT_* for notifying userspace of page encryption
>>> status changes would also suffice.
>>>
>>> Actually, that made me think of another argument against maintaining a list in
>>> KVM: there's no way to notify userspace that a page's status has changed.
>>> Userspace would need to query KVM to do GET_LIST after every GET_DIRTY.
>>> Obviously not a huge issue, but it does make migration slightly less efficient.
>>>
>>> On a related topic, there are fatal race conditions that will require careful
>>> coordination between guest and host, and will effectively be wired into the ABI.
>>> SNP and TDX don't suffer these issues because host awareness of status is atomic
>>> with respect to the guest actually writing the page with the new encryption
>>> status.
>>>
>>> For SEV live migration...
>>>
>>> If the guest does the hypercall after writing the page, then the guest is hosed
>>> if it gets migrated while writing the page (scenario #1):
>>>
>>> vCPU                  Userspace
>>> zero_bytes[0:N]
>>>                       <transfers written bytes as private instead of shared>
>>>                       <migrates vCPU>
>>> zero_bytes[N+1:4095]
>>> set_shared (dest)
>>> kaboom!
>>
>>
>> Maybe I am missing something, but this is not any different from a normal
>> operation inside a guest. Making a page shared/private in the page table
>> does not update the content of the page itself. In your above case, I
>> assume zero_bytes[N+1:4095] are written by the destination VM. The
>> memory region was private in the source VM page table, so those writes
>> will be performed encrypted. The destination VM later changed the memory
>> to shared, but nobody wrote to the memory after it was transitioned
>> to shared, so a reader of the memory should get ciphertext unless
>> there was a write after the set_shared (dest).
>>
>>
>>> If userspace does GET_DIRTY after GET_LIST, then the host would transfer bad
>>> data by consuming a stale list (scenario #2):
>>>
>>> vCPU                  Userspace
>>>                       get_list (from KVM or internally)
>>> set_shared (src)
>>> zero_page (src)
>>>                       get_dirty
>>>                       <transfers private data instead of shared>
>>>                       <migrates vCPU>
>>> kaboom!
>>
>>
>> I don't remember how things are done in Ashish's recent QEMU/KVM patches,
>> but in the previous series, get_dirty() happens before querying the
>> encrypted state. There was some logic in the VMM to resync the encrypted
>> bitmap during the final migration stage and perform any additional data
>> transfer since the last sync.
>>
>>
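For reference, the final-stage resync Brijesh mentions could look roughly like the sketch below. This is not the code from the QEMU series; it assumes the vCPUs are already paused, that the encryption status is queried after the final get_dirty, and it uses one byte per page for readability. All names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical final-stage resync.
 *   dirty[i]    != 0 : page written since the last dirty-log sync
 *   enc_sent[i]      : encryption status the page was last transferred with
 *   enc_now[i]       : status queried after the final get_dirty
 * resend[i] is set when page i must be transferred again, either because
 * its content changed or because it was sent with a stale status.
 * Returns the number of pages to resend.
 */
static size_t demo_final_resync(const uint8_t *dirty,
                                const uint8_t *enc_sent,
                                const uint8_t *enc_now,
                                uint8_t *resend, size_t npages)
{
    size_t count = 0;

    for (size_t i = 0; i < npages; i++) {
        int status_changed = (enc_sent[i] != 0) != (enc_now[i] != 0);

        resend[i] = (uint8_t)(dirty[i] != 0 || status_changed);
        count += resend[i];
    }
    return count;
}
```

The point of the ordering is that any page whose status changed after it was last sent gets picked up again, even if the dirty log alone would not flag it.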
>>> If both guest and host order things to avoid #1 and #2, the host can still
>>> migrate the wrong data (scenario #3):
>>>
>>> vCPU                  Userspace
>>> set_private
>>> zero_bytes[0:4096]
>>>                       get_dirty
>>> set_shared (src)
>>>                       get_list
>>>                       <transfers as shared instead of private>
>>>                       <migrates vCPU>
>>> set_private (dest)
>>> kaboom!
>>
>>
>> Since there was no write to the memory after the set_shared (src),
>> the content of the page should not have changed. After the set_private
>> (dest), the caller should be seeing the same content written by the
>> zero_bytes[0:4096].
> I think Sean was going for the situation where the VM has moved to the
> destination, which would have changed the VEK. That way the guest
> would be decrypting the old ciphertext with the new (wrong) key.
>>
But how can this happen? If a page is migrated as private, then when it is
received it is decrypted using the transport key (TEK) and re-encrypted
using the destination VM's VEK.
Thanks,
Ashish
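To spell out the two cases being discussed, here is a toy model. It is only a sketch: a one-byte XOR stands in for the real encryption, which is performed by the SEV firmware and not by code like this, and all names and key values are made up. It shows that a page sent as private round-trips through the TEK and reads back correctly under the destination VEK, while a page copied verbatim as shared and then read as private on the destination does not.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy stand-in for encryption: XOR with a one-byte key. */
static void demo_xor(uint8_t *buf, size_t len, uint8_t key)
{
    for (size_t i = 0; i < len; i++)
        buf[i] ^= key;
}

/* Source side, private page: unwrap from the source VEK, wrap with the TEK. */
static void demo_send_private(const uint8_t *src_mem, uint8_t *wire,
                              size_t len, uint8_t src_vek, uint8_t tek)
{
    memcpy(wire, src_mem, len);
    demo_xor(wire, len, src_vek);
    demo_xor(wire, len, tek);
}

/* Destination side, private page: unwrap the TEK, wrap with the dest VEK. */
static void demo_receive_private(const uint8_t *wire, uint8_t *dst_mem,
                                 size_t len, uint8_t tek, uint8_t dst_vek)
{
    memcpy(dst_mem, wire, len);
    demo_xor(dst_mem, len, tek);
    demo_xor(dst_mem, len, dst_vek);
}

/* Shared page: bytes in memory are copied as-is, no key is involved. */
static void demo_transfer_shared(const uint8_t *src_mem, uint8_t *dst_mem,
                                 size_t len)
{
    memcpy(dst_mem, src_mem, len);
}

/* What the destination guest sees when it reads the page with C=1. */
static void demo_guest_read_private(const uint8_t *mem, uint8_t *out,
                                    size_t len, uint8_t vek)
{
    memcpy(out, mem, len);
    demo_xor(out, len, vek);
}
```

In this model the failure only appears when a page that the guest still treats as private is transferred through the shared path, which is the case Steve is describing.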
>>
>>> Scenario #3 is unlikely, but plausible, e.g. if the guest bails from its
>>> conversion flow for whatever reason, after making the initial hypercall. Maybe
>>> it goes without saying, but to address #3, the guest must consider existing data
>>> as lost the instant it tells the host the page has been converted to a different
>>> type.
>>>
>>>> For the above reason if we do in-kernel hypercall handling for page
>>>> encryption status (which we probably won't require for SEV-SNP &
>>>> correspondingly there will be no hypercall exiting),
>>> As above, that doesn't preclude KVM from exiting to userspace on conversion.
>>>
>>>> then we can implement a standard GET/SET ioctl interface to get/set the guest
>>>> page encryption status for userspace, which will work across SEV, SEV-ES and
>>>> SEV-SNP.