Date:   Thu, 4 Aug 2022 11:27:35 +0200
From:   Alexandra Winter <wintera@...ux.ibm.com>
To:     Stephen Hemminger <stephen@...workplumber.org>,
        Matthew Rosato <mjrosato@...ux.ibm.com>
Cc:     Tony Lu <tonylu@...ux.alibaba.com>, kgraul@...ux.ibm.com,
        wenjia@...ux.ibm.com, davem@...emloft.net, edumazet@...gle.com,
        kuba@...nel.org, pabeni@...hat.com, netdev@...r.kernel.org,
        linux-s390@...r.kernel.org, zmlcc@...ux.alibaba.com,
        hans@...ux.alibaba.com, zhiyuan2048@...ux.alibaba.com,
        herongguang@...ux.alibaba.com
Subject: Re: [RFC net-next 1/1] net/smc: SMC for inter-VM communication



On 04.08.22 01:41, Stephen Hemminger wrote:
> On Wed, 3 Aug 2022 16:27:54 -0400
> Matthew Rosato <mjrosato@...ux.ibm.com> wrote:
> 
>> On 7/20/22 1:00 PM, Tony Lu wrote:
>>> Hi all,
>>>
>>> # Background
>>>
>>> We (Alibaba Cloud) already use SMC in our cloud environment to
>>> transparently accelerate TCP applications with ERDMA [1]. A common
>>> scenario today is deploying containers (whose runtime is based on
>>> lightweight virtual machines) on ECS (Elastic Compute Service). Such
>>> containers may want to be scheduled on the same host to get better
>>> network performance, for example for AI, big data or other workloads
>>> that are sensitive to bandwidth and latency. Currently, inter-VM
>>> performance is poor and CPU resources are wasted (see
>>> #Benchmark virtio). This scenario has been discussed many times, but a
>>> general solution for applications is still missing [2] [3] [4].
>>>
>>> # Design
>>>
>>> For the inter-VM scenario we use ivshmem (Inter-VM shared memory
>>> device), which is emulated by QEMU [5]. With it, multiple VMs can
>>> access one shared memory region. The shared memory device is statically
>>> created by the host and shared with the desired guests. The device
>>> exposes the memory as a PCI BAR and can interrupt its peers
>>> (ivshmem-doorbell).
>>>
>>> To use ivshmem with SMC, we wrote a draft device driver, named ivpci
>>> (see patch #1), as a bridge between SMC and the ivshmem PCI device. To
>>> keep things simple, this driver acts like an SMC-D device so that it
>>> fits into SMC without modifying SMC itself.
>>>
>>>    ┌───────────────────────────────────────┐
>>>    │  ┌───────────────┐ ┌───────────────┐  │
>>>    │  │      VM1      │ │      VM2      │  │
>>>    │  │┌─────────────┐│ │┌─────────────┐│  │
>>>    │  ││ Application ││ ││ Application ││  │
>>>    │  │├─────────────┤│ │├─────────────┤│  │
>>>    │  ││     SMC     ││ ││     SMC     ││  │
>>>    │  │├─────────────┤│ │├─────────────┤│  │
>>>    │  ││    ivpci    ││ ││    ivpci    ││  │
>>>    │  └└─────────────┘┘ └└─────────────┘┘  │
>>>    │        x  *               x  *        │
>>>    │        x  ****************x* *        │
>>>    │        x  xxxxxxxxxxxxxxxxx* *        │
>>>    │        x  x                * *        │
>>>    │  ┌───────────────┐ ┌───────────────┐  │
>>>    │  │shared memories│ │ivshmem-server │  │
>>>    │  └───────────────┘ └───────────────┘  │
>>>    │                HOST A                 │
>>>    └───────────────────────────────────────┘
>>>     *********** Control flow (interrupt)
>>>     xxxxxxxxxxx Data flow (memory access)
>>>
>>> The ivpci driver implements almost all the operations of an SMC-D
>>> device. It can be divided into two parts:
>>>
>>> - control flow: most of it is the same as in SMC-D; ivshmem is used to
>>>    trigger interrupts in ivpci and to process the CDC flow.
>>>
>>> - data flow: the shared memory of each connection is one large region
>>>    divided into two parts, for the local and the remote RMB. Every
>>>    write syscall copies data to the sndbuf and calls ISM's move_data()
>>>    to move the data to the remote RMB in ivshmem and interrupt the
>>>    remote side. The reader then receives the interrupt, checks the CDC
>>>    message, and consumes data if the cursor was updated.
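
For readers following along, the data flow described in the quoted text
can be modelled in a few lines of user-space Python. This is only a toy
sketch under stated assumptions, not the ivpci driver: the class names
(SharedRegion, Endpoint), the connect() helper, the byte-wise copy and the
boolean standing in for the ivshmem doorbell are all hypothetical, and the
producer cursor lives in a plain attribute instead of being carried in a
CDC message.

```python
class SharedRegion:
    """Toy stand-in for one connection's shared memory (hypothetical)."""

    def __init__(self, size):
        if size <= 0 or size % 2:
            raise ValueError("size must be a positive even number")
        self.mem = bytearray(size)
        self.half = size // 2  # each endpoint's RMB gets one half


class Endpoint:
    """One side of the connection; owns the RMB at half `index`."""

    def __init__(self, region, index):
        self.region = region
        self.base = index * region.half
        self.peer = None
        self.prod = 0             # bytes the peer has written into our RMB
        self.cons = 0             # bytes we have consumed from our RMB
        self.irq_pending = False  # stands in for the doorbell interrupt

    def send(self, data):
        """Copy as much of `data` as fits into the peer's RMB."""
        peer, half = self.peer, self.region.half
        free = half - (peer.prod - peer.cons)
        n = min(len(data), free)
        for i in range(n):
            self.region.mem[peer.base + (peer.prod + i) % half] = data[i]
        if n:
            peer.prod += n           # cursor update (a CDC message in SMC)
            peer.irq_pending = True  # "interrupt" the remote side
        return n

    def recv(self):
        """On a pending interrupt, consume everything up to the cursor."""
        if not self.irq_pending:
            return b""
        self.irq_pending = False
        half = self.region.half
        out = bytearray()
        while self.cons < self.prod:
            out.append(self.region.mem[self.base + self.cons % half])
            self.cons += 1
        return bytes(out)


def connect(region):
    """Create two endpoints that use opposite halves of one region."""
    a, b = Endpoint(region, 0), Endpoint(region, 1)
    a.peer, b.peer = b, a
    return a, b
```

With a 16-byte region each RMB is 8 bytes, so a 10-byte send() on a fresh
connection is truncated to 8 bytes; the remainder would have to be retried
once the reader has consumed data.
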
>>>
>>> # Benchmark
>>>
>>> The current PoC of ivpci is unstable and only works for a single SMC
>>> connection. Here is a brief summary of the data:
>>>
>>> Items         Latency (pingpong)    Throughput (64KB)
>>> TCP (virtio)   19.3 us                3794.185 MBps
>>> TCP (SR-IOV)   13.2 us                3948.792 MBps
>>> SMC (ivshmem)   6.3 us               11900.269 MBps
>>>
>>> Test environments:
>>>
>>> - CPU Intel Xeon Platinum 8 core, mem 32 GiB
>>> - NIC Mellanox CX4 with 2 VFs in two different guests
>>> - using virsh to set up virtio-net + vhost
>>> - using sockperf and a single connection
>>> - the SMC + ivshmem throughput test uses one copy (userspace -> kernel)
>>>    with an intrusive modification of SMC (see patch #1); the latency
>>>    (pingpong) test uses two copies (user -> kernel and move_data(),
>>>    patch version).
>>>
>>> In this comparison, SMC with ivshmem gets about 3x the bandwidth and
>>> at most half the latency.
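
The ratios can be recomputed directly from the quoted benchmark table. A
small Python sketch follows; the dictionary layout and the variable names
are illustrative, only the numbers are copied from the table.

```python
# Figures copied from the quoted benchmark table
# (latency in microseconds, throughput in MBps as reported).
results = {
    "TCP (virtio)": {"latency": 19.3, "throughput": 3794.185},
    "TCP (SR-IOV)": {"latency": 13.2, "throughput": 3948.792},
    "SMC (ivshmem)": {"latency": 6.3, "throughput": 11900.269},
}

smc = results["SMC (ivshmem)"]
ratios = {}
for name, r in results.items():
    if name == "SMC (ivshmem)":
        continue
    ratios[name] = {
        # How many times more data SMC moves per second.
        "throughput_gain": smc["throughput"] / r["throughput"],
        # SMC latency as a fraction of the baseline latency.
        "latency_fraction": smc["latency"] / r["latency"],
    }
    print(f"{name}: {ratios[name]['throughput_gain']:.2f}x throughput, "
          f"{ratios[name]['latency_fraction']:.2f} of the latency")
```

For these numbers this works out to roughly 3.1x (virtio) and 3.0x
(SR-IOV) the throughput, and about 0.33 and 0.48 of the latency.
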
>>>
>>> TCP + virtio is the most widely used solution for guests, but it
>>> performs worse. Moreover, it needs an extra thread on the host that
>>> fully occupies a CPU core to transfer data, which wastes CPU resources.
>>> If the host is very busy, performance will be even worse.
>>>   
>>
>> Hi Tony,
>>
>> Quite interesting!  FWIW for s390x we are also looking at passthrough of 
>> host ISM devices to enable SMC-D in QEMU guests:
>> https://lore.kernel.org/kvm/20220606203325.110625-1-mjrosato@linux.ibm.com/
>> https://lore.kernel.org/kvm/20220606203614.110928-1-mjrosato@linux.ibm.com/
>>
>> But seems to me an 'emulated ISM' of sorts could still be interesting 
>> even on s390x e.g. for scenarios where host device passthrough is not 
>> possible/desired.
>>
>> Out of curiosity I tried this ivpci module on s390x but the device won't 
>> probe -- This is possibly an issue with the s390x PCI emulation layer in 
>> QEMU, I'll have to look into that.
>>
>>> # Discussion
>>>
>>> This RFC and the solution are still at an early stage, so we want to
>>> bring them up as early as possible and discuss them thoroughly with IBM
>>> and the community. We would like to put some topics on the table:
>>>
>>> 1. SMC officially supports this scenario.
>>>
>>> SMC + ivshmem shows a huge improvement for communication between VMs.
>>> SMC-D with a mocked ISM device might not be the official solution;
>>> maybe another SMC extension besides SMC-R and SMC-D would be. So we are
>>> wondering whether SMC would accept this idea to address this scenario.
>>> Are there any other possibilities?
>>
>> I am curious about ivshmem and its current state though -- e.g. looking 
>> around I see mention of v2 which you also referenced but don't see any 
>> activity on it for a few years?  And as far as v1 ivshmem -- server "not 
>> for production use", etc.
>>
>> Thanks,
>> Matt
>>
>>>
>>> 2. Implementation of SMC for inter-VM.
>>>
>>> SMC is used in container and cloud environments; if possible, maybe we
>>> can propose a new device and a new protocol for these new scenarios to
>>> solve this problem.
>>>
>>> 3. Standardize this new protocol and device.
>>>
>>> SMC-R has an open specification, RFC 7609, so could this new device or
>>> protocol, like SMC-D, be standardized as well? One possible option is
>>> to propose a new device model in the QEMU + virtio ecosystem and have
>>> SMC support this standard virtio device, like [6].
>>>
>>> If there are any problems, please point them out.
>>>
>>> Hope to hear from you, thank you.
>>>
>>> [1] https://lwn.net/Articles/879373/
>>> [2] https://projectacrn.github.io/latest/tutorials/enable_ivshmem.html
>>> [3] https://dl.acm.org/doi/10.1145/2847562
>>> [4] https://hal.archives-ouvertes.fr/hal-00368622/document
>>> [5] https://github.com/qemu/qemu/blob/master/docs/specs/ivshmem-spec.txt
>>> [6] https://github.com/siemens/jailhouse/blob/master/Documentation/ivshmem-v2-specification.md
>>>
>>> Signed-off-by: Tony Lu <tonylu@...ux.alibaba.com>  
> 
> 
> Also looks a lot like existing VSOCK, which already has transports for virtio, Hyper-V and VMware.

To have it documented in this thread:
As Wenjia Zhang <wenjia@...ux.ibm.com> mentioned in
https://lore.kernel.org/netdev/Yt9Xfv0bN0AGMdGP@TonyMac-Alibaba/t/#mcfaa50f7142f923d2b570dc19b70c73ceddc1270
we are working on some patches to clean up the interface between the ISM device driver and the SMC-D protocol
layer. They may simplify a project like the one described in this RFC. Stay tuned.
