Message-ID: <5b3b1b08-1dc2-4110-98d4-c3bb5f090437@amazon.com>
Date: Mon, 27 May 2024 09:08:00 +0200
From: Alexander Graf <graf@...zon.com>
To: Stefano Garzarella <sgarzare@...hat.com>, Alexander Graf <agraf@...raf.de>
CC: Dorjoy Chowdhury <dorjoychy111@...il.com>,
	<virtualization@...ts.linux.dev>, <kvm@...r.kernel.org>,
	<netdev@...r.kernel.org>, <stefanha@...hat.com>
Subject: Re: How to implement message forwarding from one CID to another in
 vhost driver

Hey Stefano,

On 23.05.24 10:45, Stefano Garzarella wrote:
> On Tue, May 21, 2024 at 08:50:22AM GMT, Alexander Graf wrote:
>> Howdy,
>>
>> On 20.05.24 14:44, Dorjoy Chowdhury wrote:
>>> Hey Stefano,
>>>
>>> Thanks for the reply.
>>>
>>>
>>> On Mon, May 20, 2024, 2:55 PM Stefano Garzarella 
>>> <sgarzare@...hat.com> wrote:
>>>> Hi Dorjoy,
>>>>
>>>> On Sat, May 18, 2024 at 04:17:38PM GMT, Dorjoy Chowdhury wrote:
>>>>> Hi,
>>>>>
>>>>> Hope you are doing well. I am working on adding AWS Nitro Enclave[1]
>>>>> emulation support in QEMU. Alexander Graf is mentoring me on this 
>>>>> work. A v1
>>>>> patch series has already been posted to the qemu-devel mailing 
>>>>> list[2].
>>>>>
>>>>> AWS Nitro Enclaves is an Amazon EC2[3] feature that allows
>>>>> creating isolated execution environments, called enclaves, from
>>>>> Amazon EC2 instances; enclaves are used for processing highly
>>>>> sensitive data. Enclaves have no persistent storage and no
>>>>> external networking. The enclave VMs are based on the Firecracker
>>>>> microVM and have a vhost-vsock device for communication with the
>>>>> parent EC2 instance that spawned them, and a Nitro Secure Module
>>>>> (NSM) device for cryptographic attestation. The parent instance VM
>>>>> always has CID 3 while the enclave VM gets a dynamic CID. The
>>>>> enclave VMs can communicate with the parent instance over various
>>>>> ports to CID 3; for example, the init process inside an enclave
>>>>> sends a heartbeat to port 9000 upon boot, expecting a heartbeat
>>>>> reply, letting the parent instance know that the enclave VM has
>>>>> successfully booted.
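
For concreteness, the parent side of that handshake is just a plain
AF_VSOCK server. A minimal sketch (port 9000 is from the description
above; the single echoed byte is an assumption on my part, not the
documented wire format):

  /* Minimal heartbeat-responder sketch: listen on vsock port 9000,
   * echo the first byte back. The one-byte payload is an assumption. */
  #include <unistd.h>
  #include <sys/socket.h>
  #include <linux/vm_sockets.h>

  int main(void)
  {
      int s = socket(AF_VSOCK, SOCK_STREAM, 0);
      struct sockaddr_vm addr = {
          .svm_family = AF_VSOCK,
          .svm_cid    = VMADDR_CID_ANY, /* any local CID */
          .svm_port   = 9000,           /* heartbeat port */
      };
      char b;

      if (s < 0 ||
          bind(s, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
          listen(s, 1) < 0)
          return 1;
      int c = accept(s, NULL, NULL);
      if (read(c, &b, 1) == 1)
          write(c, &b, 1); /* heartbeat reply */
      close(c);
      close(s);
      return 0;
  }
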
>>>>>
>>>>> The plan is to eventually make the Nitro Enclave emulation in QEMU
>>>>> standalone, i.e., without needing to run another VM with CID 3 with
>>>>> proper vsock
>>>> If you don't have to launch another VM, maybe we can avoid vhost-vsock
>>>> and emulate virtio-vsock in user-space, having complete control 
>>>> over the
>>>> behavior.
>>>>
>>>> So we could use this opportunity to implement virtio-vsock in QEMU [4]
>>>> or use vhost-user-vsock [5] and customize it somehow.
>>>> (Note: vhost-user-vsock already supports sibling communication, so 
>>>> maybe
>>>> with a few modifications it fits your case perfectly)
>>>>
>>>> [4] https://gitlab.com/qemu-project/qemu/-/issues/2095
>>>> [5] 
>>>> https://github.com/rust-vmm/vhost-device/tree/main/vhost-device-vsock
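
For anyone following along: if I recall the vhost-device-vsock README
correctly, the sibling setup looks roughly like this (treat the exact
option syntax as approximate, and note QEMU's vhost-user devices also
need a shared-memory backend, omitted here):

  # one daemon, two VMs that can reach each other as siblings
  vhost-device-vsock \
    --vm guest-cid=3,uds-path=/tmp/vm3.vsock,socket=/tmp/vhost3.socket \
    --vm guest-cid=4,uds-path=/tmp/vm4.vsock,socket=/tmp/vhost4.socket

  # per-VM QEMU side: vhost-user-vsock instead of vhost-vsock
  qemu-system-x86_64 ... \
    -chardev socket,id=vsock0,path=/tmp/vhost3.socket \
    -device vhost-user-vsock-pci,chardev=vsock0
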
>>>
>>>
>>> Thanks for letting me know. Right now I don't have a complete picture
>>> but I will look into them. Thank you.
>>>>
>>>>
>>>>> communication support. For this to work, one approach could be to
>>>>> teach the vhost driver in the kernel to forward CID 3 messages to
>>>>> another CID N
>>>> So in this case both CID 3 and N would be assigned to the same QEMU
>>>> process?
>>>
>>>
>>> CID N is assigned to the enclave VM. CID 3 was supposed to be the
>>> parent VM that spawns the enclave VM (this is how it is in AWS, where
>>> an EC2 instance VM spawns the enclave VM from inside it and that
>>> parent EC2 instance always has CID 3). But in the QEMU case as we
>>> don't want a parent VM (we want to run enclave VMs standalone) we
>>> would need to forward the CID 3 messages to the host CID. I don't know
>>> if that means CID 3 and CID N are assigned to the same QEMU process. Sorry.
>>
>>
>> There are 2 use cases here:
>>
>> 1) Enclave wants to treat host as parent (default). In this scenario,
>> the "parent instance" that shows up as CID 3 in the Enclave doesn't
>> really exist. Instead, when the Enclave attempts to talk to CID 3, it
>> should really land on CID 0 (hypervisor). When the hypervisor tries to
>> connect to the Enclave on port X, it should look as if it originates
>> from CID 3, not CID 0.
>>
>> 2) Multiple parent VMs. Think of an actual cloud hosting scenario.
>> Here, we have multiple "parent instances". Each of them thinks it's
>> CID 3. Each can spawn an Enclave that talks to CID 3 and reach the
>> parent. For this case, I think implementing all of virtio-vsock in
>> user space is the best path forward. But in theory, you could also
>> swizzle CIDs to make random "real" CIDs appear as CID 3.
>>
>
> Thank you for clarifying the use cases!
>
> Also for case 1, vhost-vsock doesn't support CID 0, so in my opinion
> it's easier to go into user-space with vhost-user-vsock or the built-in
> device.


Sorry, I believe I meant CID 2. Effectively, for case 1, when a process 
on the hypervisor listens on port 1234, it should be visible as 3:1234 
from the VM, and when the hypervisor process connects to <VM CID>:1234, 
it should look as if that connection came from CID 3.
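
To pin the semantics down, the case 1 rewrite would be something like
the following (hypothetical helpers, purely illustrative; nothing like
this exists in vhost-vsock today):

  /* Hypothetical CID swizzle for case 1, just to pin down the
   * semantics described above. */
  #include <sys/socket.h>
  #include <linux/vm_sockets.h>

  #define PARENT_CID 3 /* what the enclave expects its parent to be */

  /* Enclave -> host: traffic the enclave sends to CID 3 should
   * really land on the host (CID 2). */
  static unsigned int swizzle_dst_cid(unsigned int dst)
  {
      return dst == PARENT_CID ? VMADDR_CID_HOST : dst;
  }

  /* Host -> enclave: connections originating from the host (CID 2)
   * should appear to the enclave as coming from CID 3. */
  static unsigned int swizzle_src_cid(unsigned int src)
  {
      return src == VMADDR_CID_HOST ? PARENT_CID : src;
  }
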


> Maybe initially with vhost-user-vsock it's easier because we already
> have something that works and supports sibling communication (for case
> 2).


The problem with vhost-user-vsock is that you don't get to use AF_VSOCK 
as a host process.

A typical Nitro Enclaves application is split into 2 parts: an 
in-Enclave component that listens/connects to vsock and a parent process 
that listens/connects to vsock. The experience of launching an Enclave 
is very similar to launching a QEMU VM: you run nitro-cli and tell it to 
pop up the Enclave based on an EIF file. Nitro-cli then tells you the 
CID that was allocated for the Enclave and you communicate with it using 
that CID.

What I would ideally like to have as a development experience is that 
you run QEMU with unmodified Enclave components (the EIF file) and run 
your parent application unmodified on the host.

For that to work, the host application needs to be able to use AF_VSOCK.
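
That is, the parent code in question is ordinary AF_VSOCK socket code
along these lines (a sketch; the port is made up, and nitro-cli reports
the real enclave CID at launch):

  /* Sketch of typical unmodified parent-side code: connect to the
   * enclave over AF_VSOCK. Port 8000 is a made-up example value. */
  #include <unistd.h>
  #include <sys/socket.h>
  #include <linux/vm_sockets.h>

  int connect_to_enclave(unsigned int enclave_cid)
  {
      int s = socket(AF_VSOCK, SOCK_STREAM, 0);
      struct sockaddr_vm addr = {
          .svm_family = AF_VSOCK,
          .svm_cid    = enclave_cid, /* CID reported at launch */
          .svm_port   = 8000,        /* application-defined port */
      };

      if (s < 0)
          return -1;
      if (connect(s, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
          close(s);
          return -1;
      }
      return s; /* only works unmodified if the host has AF_VSOCK */
  }
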


I agree that for this conversation, we should just ignore case 2 and 
consider it as "solved" through vhost-user-vsock, as that can create its 
own CID namespace between different VMs.


>
>>
>>>
>>>> Do you have to allocate 2 separate virtio-vsock devices, one for the
>>>> parent and one for the enclave?
>>>
>>>
>>> If there is a parent VM, then I guess both parent and enclave VMs need
>>> virtio-vsock devices.
>>>
>>>>> (set to CID 2 for the host), i.e., it patches the CID from 3 to N on
>>>>> incoming messages
>>>>> and from N to 3 on responses. This will enable users of the
>>>> Will these messages have the VMADDR_FLAG_TO_HOST flag set?
>>>>
>>>> We don't support this in vhost-vsock yet; if supporting it helps, we
>>>> might, but we need to better understand how to avoid security issues,
>>>> so maybe each device needs to explicitly enable the feature and
>>>> specify from which CIDs it accepts packets.
>>>
>>>
>>> I don't know about the flag. So I don't know if it will be set. Sorry.
>>
>>
>> From the guest's point of view, the parent (CID 3) is just another VM.
>> Since Linux as of
>>
>>  https://patchwork.ozlabs.org/project/netdev/patch/20201204170235.84387-4-andraprs@amazon.com/#2594117 
>>
>>
>> always sets VMADDR_FLAG_TO_HOST when local_CID > 0 && remote_CID > 0, I
>> would say the message has the flag set.
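
The condition from that patch, paraphrased (not the literal kernel
code; VMADDR_CID_HYPERVISOR is 0):

  /* Paraphrase of the guest-side rule from the patch linked above:
   * traffic between two non-hypervisor CIDs gets flagged so the host
   * can route it (e.g. to a sibling VM). Not the literal code. */
  #include <stdbool.h>
  #include <sys/socket.h>
  #include <linux/vm_sockets.h>

  static bool should_set_to_host(unsigned int local_cid,
                                 unsigned int remote_cid)
  {
      return local_cid  > VMADDR_CID_HYPERVISOR && /* local_CID  > 0 */
             remote_cid > VMADDR_CID_HYPERVISOR;   /* remote_CID > 0 */
  }
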
>>
>> How would you envision the host implementing the flag? Would the host
>> allow user space to listen on any CID and hence receive the respective
>> target connections? And wouldn't listening on CID 0 then mean you're
>> effectively listening to "any" other CID? Thinking about that a bit
>> more, that may be just what we need, yes :)
>
> No, wait. I had intended that flag only for implementing sibling
> communication, so the host doesn't re-forward those packets to sockets
> opened by applications on the host, but only to other VMs on the same
> host. So the host would always have only CID 2 assigned (CID 0 is not
> supported by vhost-vsock).
>
>>
>>
>>>
>>>
>>>>> nitro-enclave machine type in QEMU to run the necessary vsock
>>>>> server/clients in the host machine (some defaults can be
>>>>> implemented in QEMU as well, for example, sending a reply to the
>>>>> heartbeat), which will rid them of the cumbersome way of running
>>>>> another whole VM with CID 3. This way, users of the nitro-enclave
>>>>> machine in QEMU could potentially also run multiple enclaves with
>>>>> their messages for CID 3 forwarded to different CIDs which, on the
>>>>> QEMU side, could then be specified using a new machine type option
>>>>> (parent-cid) if implemented. I guess on the QEMU side, this will
>>>>> be an ioctl call (or some other way) to indicate to the host
>>>>> kernel that the CID 3 messages need to be forwarded. Does this
>>>>> approach of
>>>> What if there is already a VM with CID = 3 in the system?
>>>
>>>
>>> Good question! I don't know what should happen in this case.
>>
>>
>> See case 2 above :). In a nutshell, I don't think it'd be legal to
>> have a real CID 3 in that scenario.
>
> Yeah, with vhost-vsock we can't, but with vhost-user-vsock I think it's
> fine since the guest CID is local to each instance. The host only sees
> the unix socket (like with Firecracker).


See above for why a unix socket is not really great CX :)


>
>>
>>
>>>
>>>
>>>>> forwarding CID 3 messages to another CID sound good?
>>>> This seems like too specific a case; if we can generalize it, maybe
>>>> we could make this change, but we would like to avoid complicating
>>>> vhost-vsock and keep it as simple as possible, so we don't end up
>>>> having to implement firewalls, etc.
>>>>
>>>> So first I would see if vhost-user-vsock or the QEMU built-in 
>>>> device is
>>>> right for this use-case.
>>> Thank you! I will check everything out and reach out if I need
>>> further guidance about what needs to be done. And sorry that I wasn't
>>> able to answer some of your questions.
>>
>>
>> As mentioned above, I think there is merit in both. I personally care
>> a lot more about case 1 than case 2: we already have a working
>> implementation of Nitro Enclaves in a cloud setup. What is missing is
>> a way to easily run a Nitro Enclave locally for development.
>
> If both are fine, then I would lean more toward modifying
> vhost-user-vsock or adding a built-in device in QEMU.
> We have more freedom and it's also easier to update/debug.


I agree on those points, but if we go down that route, users can't 
simply reuse their existing code, no? At that point, they're probably 
better off just spawning another (micro-)VM on CID 3, as that at least 
gives them the ability to reuse their existing parent code.


Alex





Amazon Web Services Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597
