lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <bc4e55b9-ac35-3c71-104e-862fec958403@linux.ibm.com>
Date:   Mon, 16 Jan 2023 12:01:13 +0100
From:   Wenjia Zhang <wenjia@...ux.ibm.com>
To:     Wen Gu <guwen@...ux.alibaba.com>
Cc:     Alexandra Winter <wintera@...ux.ibm.com>,
        Niklas Schnelle <schnelle@...ux.ibm.com>, kgraul@...ux.ibm.com,
        jaka@...ux.ibm.com, davem@...emloft.net, edumazet@...gle.com,
        kuba@...nel.org, pabeni@...hat.com, linux-s390@...r.kernel.org,
        netdev@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH net-next v2 0/5] net/smc:Introduce SMC-D based
 loopback acceleration



On 12.01.23 13:12, Wen Gu wrote:
> 
> 
> On 2023/1/5 00:09, Alexandra Winter wrote:
>>
>>
>> On 21.12.22 14:14, Wen Gu wrote:
>>>
>>>
>>> On 2022/12/20 22:02, Niklas Schnelle wrote:
>>>
>>>> On Tue, 2022-12-20 at 11:21 +0800, Wen Gu wrote:
>>>>> Hi, all
>>>>>
>>>>> # Background
>>>>>
>>>>> As previously mentioned in [1], we (Alibaba Cloud) are trying to 
>>>>> use SMC
>>>>> to accelerate TCP applications in cloud environment, improving 
>>>>> inter-host
>>>>> or inter-VM communication.
>>>>>
>>>>> In addition of these, we also found the value of SMC-D in scenario 
>>>>> of local
>>>>> inter-process communication, such as accelerate communication 
>>>>> between containers
>>>>> within the same host. So this RFC tries to provide a SMC-D loopback 
>>>>> solution
>>>>> in such scenario, to bring a significant improvement in latency and 
>>>>> throughput
>>>>> compared to TCP loopback.
>>>>>
>>>>> # Design
>>>>>
>>>>> This patch set provides a kind of SMC-D loopback solution.
>>>>>
>>>>> Patch #1/5 and #2/5 provide an SMC-D based dummy device, preparing 
>>>>> for the
>>>>> inter-process communication acceleration. Except for loopback 
>>>>> acceleration,
>>>>> the dummy device can also meet the requirements mentioned in [2], 
>>>>> which is
>>>>> providing a way to test SMC-D logic for broad community without ISM 
>>>>> device.
>>>>>
>>>>>    +------------------------------------------+
>>>>>    |  +-----------+           +-----------+   |
>>>>>    |  | process A |           | process B |   |
>>>>>    |  +-----------+           +-----------+   |
>>>>>    |       ^                        ^         |
>>>>>    |       |    +---------------+   |         |
>>>>>    |       |    |   SMC stack   |   |         |
>>>>>    |       +--->| +-----------+ |<--|         |
>>>>>    |            | |   dummy   | |             |
>>>>>    |            | |   device  | |             |
>>>>>    |            +-+-----------+-+             |
>>>>>    |                   VM                     |
>>>>>    +------------------------------------------+
>>>>>
>>>>> Patch #3/5, #4/5, #5/5 provides a way to avoid data copy from 
>>>>> sndbuf to RMB
>>>>> and improve SMC-D loopback performance. Through extending smcd_ops 
>>>>> with two
>>>>> new semantic: attach_dmb and detach_dmb, sender's sndbuf shares the 
>>>>> same
>>>>> physical memory region with receiver's RMB. The data copied from 
>>>>> userspace
>>>>> to sender's sndbuf directly reaches the receiver's RMB without 
>>>>> unnecessary
>>>>> memory copy in the same kernel.
>>>>>
>>>>>    +----------+                     +----------+
>>>>>    | socket A |                     | socket B |
>>>>>    +----------+                     +----------+
>>>>>          |                               ^
>>>>>          |         +---------+           |
>>>>>     regard as      |         | ----------|
>>>>>     local sndbuf   |  B's    |     regard as
>>>>>          |         |  RMB    |     local RMB
>>>>>          |-------> |         |
>>>>>                    +---------+
>>>>
>>>> Hi Wen Gu,
>>>>
>>>> I maintain the s390 specific PCI support in Linux and would like to
>>>> provide a bit of background on this. You're surely wondering why we
>>>> even have a copy in there for our ISM virtual PCI device. To understand
>>>> why this copy operation exists and why we need to keep it working, one
>>>> needs a bit of s390 aka mainframe background.
>>>>
>>>> On s390 all (currently supported) native machines have a mandatory
>>>> machine level hypervisor. All OSs whether z/OS or Linux run either on
>>>> this machine level hypervisor as so called Logical Partitions (LPARs)
>>>> or as second/third/… level guests on e.g. a KVM or z/VM hypervisor that
>>>> in turn runs in an LPAR. Now, in terms of memory this machine level
>>>> hypervisor sometimes called PR/SM unlike KVM, z/VM, or VMWare is a
>>>> partitioning hypervisor without paging. This is one of the main reasons
>>>> for the very-near-native performance of the machine hypervisor as the
>>>> memory of its guests acts just like native RAM on other systems. It is
>>>> never paged out and always accessible to IOMMU translated DMA from
>>>> devices without the need for pinning pages and besides a trivial
>>>> offset/limit adjustment an LPAR's MMU does the same amount of work as
>>>> an MMU on a bare metal x86_64/ARM64 box.
>>>>
>>>> It also means however that when SMC-D is used to communicate between
>>>> LPARs via an ISM device there is  no way of mapping the DMBs to the
>>>> same physical memory as there exists no MMU-like layer spanning
>>>> partitions that could do such a mapping. Meanwhile for machine level
>>>> firmware including the ISM virtual PCI device it is still possible to
>>>> _copy_ memory between different memory partitions. So yeah while I do
>>>> see the appeal of skipping the memcpy() for loopback or even between
>>>> guests of a paging hypervisor such as KVM, which can map the DMBs on
>>>> the same physical memory, we must keep in mind this original use case
>>>> requiring a copy operation.
>>>>
>>>> Thanks,
>>>> Niklas
>>>>
>>>
>>> Hi Niklas,
>>>
>>> Thank you so much for the complete and detailed explanation! This 
>>> provides
>>> me a brand new perspective of s390 device that we hadn't dabbled in 
>>> before.
>>> Now I understand why shared memory is unavailable between different 
>>> LPARs.
>>>
>>> Our original intention of proposing loopback device and the incoming 
>>> device
>>> (virtio-ism) for inter-VM is to use SMC-D to accelerate communication 
>>> in the
>>> case with no existing s390 ISM devices. In our conception, s390 ISM 
>>> device,
>>> loopback device and virtio-ism device are parallel and are abstracted 
>>> by smcd_ops.
>>>
>>>   +------------------------+
>>>   |          SMC-D         |
>>>   +------------------------+
>>>   -------- smcd_ops ---------
>>>   +------+ +------+ +------+
>>>   | s390 | | loop | |virtio|
>>>   | ISM  | | back | | -ism |
>>>   | dev  | | dev  | | dev  |
>>>   +------+ +------+ +------+
>>>
>>> We also believe that keeping the existing design and behavior of s390 
>>> ISM
>>> device is unshaken. What we want to get support for is some smcd_ops 
>>> extension
>>> for devices with optional beneficial capability, such as nocopy here 
>>> (Let's call
>>> it this for now), which is really helpful for us in inter-process and 
>>> inter-VM
>>> scenario.
>>>
>>> And coincided with IBM's intention to add APIs between SMC-D and 
>>> devices to
>>> support various devices for SMC-D, as mentioned in [2], we send out 
>>> this RFC and
>>> the incoming virio-ism RFC, to provide some examples.
>>>
>>>>>
>>>>> # Benchmark Test
>>>>>
>>>>>    * Test environments:
>>>>>         - VM with Intel Xeon Platinum 8 core 2.50GHz, 16 GiB mem.
>>>>>         - SMC sndbuf/RMB size 1MB.
>>>>>
>>>>>    * Test object:
>>>>>         - TCP: run on TCP loopback.
>>>>>         - domain: run on UNIX domain.
>>>>>         - SMC lo: run on SMC loopback device with patch #1/5 ~ #2/5.
>>>>>         - SMC lo-nocpy: run on SMC loopback device with patch #1/5 
>>>>> ~ #5/5.
>>>>>
>>>>> 1. ipc-benchmark (see [3])
>>>>>
>>>>>    - ./<foo> -c 1000000 -s 100
>>>>>
>>>>>                          TCP              domain              
>>>>> SMC-lo             SMC-lo-nocpy
>>>>> Message
>>>>> rate (msg/s)         75140      129548(+72.41)    
>>>>> 152266(+102.64%)         151914(+102.17%)
>>>>
>>>> Interesting that it does beat UNIX domain sockets. Also, see my below
>>>> comment for nginx/wrk as this seems very similar.
>>>>
>>>>>
>>>>> 2. sockperf
>>>>>
>>>>>    - serv: <smc_run> taskset -c <cpu> sockperf sr --tcp
>>>>>    - clnt: <smc_run> taskset -c <cpu> sockperf { tp | pp } --tcp 
>>>>> --msg-size={ 64000 for tp | 14 for pp } -i 127.0.0.1 -t 30
>>>>>
>>>>>                          TCP                  SMC-lo             
>>>>> SMC-lo-nocpy
>>>>> Bandwidth(MBps)   4943.359        4936.096(-0.15%)        
>>>>> 8239.624(+66.68%)
>>>>> Latency(us)          6.372          3.359(-47.28%)            
>>>>> 3.25(-49.00%)
>>>>>
>>>>> 3. iperf3
>>>>>
>>>>>    - serv: <smc_run> taskset -c <cpu> iperf3 -s
>>>>>    - clnt: <smc_run> taskset -c <cpu> iperf3 -c 127.0.0.1 -t 15
>>>>>
>>>>>                          TCP                  SMC-lo             
>>>>> SMC-lo-nocpy
>>>>> Bitrate(Gb/s)         40.5            41.4(+2.22%)            
>>>>> 76.4(+88.64%)
>>>>>
>>>>> 4. nginx/wrk
>>>>>
>>>>>    - serv: <smc_run> nginx
>>>>>    - clnt: <smc_run> wrk -t 8 -c 500 -d 30 http://127.0.0.1:80
>>>>>
>>>>>                          TCP                  SMC-lo             
>>>>> SMC-lo-nocpy
>>>>> Requests/s       154643.22      220894.03(+42.84%)        
>>>>> 226754.3(+46.63%)
>>>>
>>>>
>>>> This result is very interesting indeed. So with the much more realistic
>>>> nginx/wrk workload it seems to copy hurts much less than the
>>>> iperf3/sockperf would suggest while SMC-D itself seems to help more.
>>>> I'd hope that this translates to actual applications as well. Maybe
>>>> this makes SMC-D based loopback interesting even while keeping the
>>>> copy, at least until we can come up with a sane way to work a no-copy
>>>> variant into SMC-D?
>>>>
>>>
>>> I agree, nginx/wrk workload is much more realistic for many 
>>> applications.
>>>
>>> But we also encounter many other cases similar to sockperf on the 
>>> cloud, which
>>> requires high throughput, such as AI training and big data.
>>>
>>> So avoidance of copying between DMBs can help these cases a lot :)
>>>
>>>>>
>>>>>
>>>>> # Discussion
>>>>>
>>>>> 1. API between SMC-D and ISM device
>>>>>
>>>>> As Jan mentioned in [2], IBM are working on placing an API between 
>>>>> SMC-D
>>>>> and the ISM device for easier use of different "devices" for SMC-D.
>>>>>
>>>>> So, considering that the introduction of attach_dmb or detach_dmb can
>>>>> effectively avoid data copying from sndbuf to RMB and brings obvious
>>>>> throughput advantages in inter-VM or inter-process scenarios, can the
>>>>> attach/detach semantics be taken into consideration when designing the
>>>>> API to make it a standard ISM device behavior?
>>>            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>>>
>>>> Due to the reasons explained above this behavior can't be emulated by
>>>> ISM devices at least not when crossing partitions. Not sure if we can
>>>> still incorporate it in the API and allow for both copying and
>>>> remapping SMC-D like devices, it definitely needs careful consideration
>>>> and I think also a better understanding of the benefit for real world
>>>> workloads.
>>>>
>>>
>>> Here I am not rigorous.
>>>
>>> Nocopy shouldn't be a standard ISM device behavior indeed. Actually 
>>> we hope it be a
>>> standard optional _SMC-D_ device behavior and defined by smcd_ops.
>>>
>>> For devices don't support these options, like ISM device on s390 
>>> architecture,
>>> .attach_dmb/.detach_dmb and other reasonable extensions (which will 
>>> be proposed to
>>> discuss in incoming virtio-ism RFC) can be set to NULL or return 
>>> invalid. And for
>>> devices do support, they may be used for improving performance in 
>>> some cases.
>>>
>>> In addition, can I know more latest news about the API design? :) , 
>>> like its scale, will
>>> it be a almost refactor of existing interface or incremental 
>>> patching? and its object,
>>> will it be tailored for exact ISM behavior or to reserve some options 
>>> for other devices,
>>> like nocopy here? From my understanding of [2], it might be the latter?
>>>
>>>>>
>>>>> Maybe our RFC of SMC-D based inter-process acceleration (this one) and
>>>>> inter-VM acceleration (will coming soon, which is the update of [1])
>>>>> can provide some examples for new API design. And we are very glad to
>>>>> discuss this on the mail list.
>>>>>
>>>>> 2. Way to select different ISM-like devices
>>>>>
>>>>> With the proposal of SMC-D loopback 'device' (this RFC) and incoming
>>>>> device used for inter-VM acceleration as update of [1], SMC-D has more
>>>>> options to choose from. So we need to consider that how to indicate
>>>>> supported devices, how to determine which one to use, and their 
>>>>> priority...
>>>>
>>>> Agree on this part, though it is for the SMC maintainers to decide, I
>>>> think we would definitely want to be able to use any upcoming inter-VM
>>>> devices on s390 possibly also in conjunction with ISM devices for
>>>> communication across partitions.
>>>>
>>>
>>> Yes, this part needs to be discussed with SMC maintainers. And thank 
>>> you, we are very glad
>>> if our devices can be applied on s390 through the efforts.
>>>
>>>
>>> Best Regards,
>>> Wen Gu
>>>
>>>>>
>>>>> IMHO, this may require an update of CLC message and negotiation 
>>>>> mechanism.
>>>>> Again, we are very glad to discuss this with you on the mailing list.
>>
>> As described in
>> SMC protocol (including SMC-D): 
>> https://www.ibm.com/support/pages/system/files/inline-files/IBM%20Shared%20Memory%20Communications%20Version%202_2.pdf
>> the CLC messages provide a list of up to 8 ISM devices to chose from.
>> So I would hope that we can use the existing protocol.
>>
>> The challenge will be to define GID (Global Interface ID) and CHID (a 
>> fabric ID) in
>> a meaningful way for the new devices.
>> There is always smcd_ops->query_remote_gid()  as a safety net. But the 
>> idea is that
>> a CHID mismatch is a fast way to tell that these 2 interfaces do match.
>>
>>
> 

FYI, we just sent the rest part of the API to the net-next 
https://lore.kernel.org/netdev/20230116092712.10176-1-jaka@linux.ibm.com/T/#t,
which should answer some questions in your patch series.


> Hi Winter and all,
> 
> Thanks for your reply and suggestions! And sorry for my late reply 
> because it took me
> some time to understand SMC-Dv2 protocol and implementation.
> 
> I agree with your opinion. The existing SMC-Dv2 protocol whose CLC 
> messages include
> ism_dev[] list can solve the devices negotiation problem. And I am very 
> willing to use
> the existing protocol, because we all know that the protocol update is a 
> long and complex
> process.
> 
> If I understand correctly, SMC-D loopback(dummy) device can coordinate 
> with existing
> SMC-Dv2 protocol as follows. If there is any mistake, please point out.
> 
> 
> # Initialization
> 
> - Initialize the loopback device with unique GID [Q-1].
> 
> - Register the loopback device as SMC-Dv2-capable device with a 
> system_eid whose 24th
>     or 28th byte is non-zero [Q-2], so that this system's 
> smc_ism_v2_capable will be set
>     to TRUE and SMC-Dv2 is available.
> 
The decision point is the VLAN_ID, if it is x1FFF, the device will 
support V2. i.e. If you can have subnet with VLAN_ID x1FFF, then the 
SEID is necessary, so that the series or types is non-zero. (*1)
> 
> # Proposal
> 
> - Find the loopback device from the smcd_dev_list in 
> smc_find_ism_v2_device_clnt();
> 
> - Record the SEID, GID and CHID[Q-3] of loopback device in the v2 
> extension part of CLC
>     proposal message.
> 
> 
> # Accept
> 
> - Check the GID/CHID list and SEID in CLC proposal message, and find 
> local matched ISM
>     device from smcd_dev_list in smc_find_ism_v2_device_serv(). If both 
> sides of the
>     communication are in the same VM and share the same loopback device, 
> the SEID, GID and
>     CHID will match and loopback device will be chosen [Q-4].
> 
> - Record the loopback device's GID/CHID and matched SEID into CLC accept 
> message.
> 
> 
> # Confirm
> 
> - Confirm the server-selected device (loopback device) accordingto CLC 
> accept messages.
> 
> - Record the loopback device's GID/CHID and server-selected SEID in CLC 
> confirm message.
> 
> 
> Follow the above process, I supplement a patch based on this RFC in the 
> email attachment.
> With the attachment patch, SMC-D loopback will switch to use SMC-Dv2 
> protocol.
> 
> 
> 
> And in the above process, there are something I want to consult and 
> discuss, which is marked
> with '[Q-*]' in the above description.
> 
> # [Q-1]:
> 
> The GID of loopback device is randomly generated in this RFC patch set, 
> but I will find a way
> to unique the GID in formal patches. Any suggestions are welcome.
> 
I think the randowmly generated GID is fine in your case, which is 
equivalent to the IP address.
> 
> # [Q-2]:
> 
> In Linux implementation, the system_eid of the first registered smcd 
> device will determinate
> system's smc_ism_v2_capable (see smcd_register_dev()).
> 
> And I wonder that
> 
> 1) How to define the system_eid? It can be inferred from the code that 
> the 24th and 28th byte
>      are special for SMC-Dv2. So in attachment patch, I define the 
> loopback device SEID as
> 
>      static struct smc_lo_systemeid LO_SYSTEM_EID = {
>              .seid_string = "SMC-SYSZ-LOSEID000000000",
>              .serial_number = "1000",
>              .type = "1000",
>      };
> 
>      Is there anything else I need to pay attention to?
> 
If you just want to use V2, such defination looks good.
e.g. you can use some unique information from "lshw"
> 
> 2) Seems only the first added smcd device determinate the system 
> smc_ism_v2_capable? If two
>      different smcd devices respectively with v1-indicated and 
> v2-indicated system_eid, will
>      the order in which they are registered affects the result of 
> smc_ism_v2_capable ?
> 
see (*1)
> 
> # [Q-3]:
> 
> In attachment patch, I define a special CHID (0xFFFF) for loopback 
> device, as a kind of
> 'unassociated ISM CHID' that not associated with any IP (OSA or 
> HiperSockets) interfaces.
> 
> What's your opinion about this?
> 
It looks good to me
> 
> # [Q-4]:
> 
> In current Linux implementation, server will select the first 
> successfully initialized device
> from the candidates as the final selected one in 
> smc_find_ism_v2_device_serv().
> 
> for (i = 0; i < matches; i++) {
>      ini->smcd_version = SMC_V2;
>      ini->is_smcd = true;
>      ini->ism_selected = i;
>      rc = smc_listen_ism_init(new_smc, ini);
>      if (rc) {
>          smc_find_ism_store_rc(rc, ini);
>          /* try next active ISM device */
>          continue;
>      }
>      return; /* matching and usable V2 ISM device found */
> }
> 
> IMHO, maybe candidate devices should have different priorities? For 
> example, the loopback device
> may be preferred to use if loopback is available.
> 
IMO, I'd prefer such a order: ISM -> loopback -> RoCE
Because ISM for SMC-D is our standard user case, not loopback.
> 
> Best Regards,
> Wen Gu
> 
>>>>>
>>>>> [1] 
>>>>> https://lore.kernel.org/netdev/20220720170048.20806-1-tonylu@linux.alibaba.com/
>>>>> [2] 
>>>>> https://lore.kernel.org/netdev/35d14144-28f7-6129-d6d3-ba16dae7a646@linux.ibm.com/
>>>>> [3] https://github.com/goldsborough/ipc-bench
>>>>>
>>>>> v1->v2
>>>>>    1. Fix some build WARNINGs complained by kernel test rebot
>>>>>       Reported-by: kernel test robot <lkp@...el.com>
>>>>>    2. Add iperf3 test data.
>>>>>
>>>>> Wen Gu (5):
>>>>>     net/smc: introduce SMC-D loopback device
>>>>>     net/smc: choose loopback device in SMC-D communication
>>>>>     net/smc: add dmb attach and detach interface
>>>>>     net/smc: avoid data copy from sndbuf to peer RMB in SMC-D loopback
>>>>>     net/smc: logic of cursors update in SMC-D loopback connections
>>>>>
>>>>>    include/net/smc.h      |   3 +
>>>>>    net/smc/Makefile       |   2 +-
>>>>>    net/smc/af_smc.c       |  88 +++++++++++-
>>>>>    net/smc/smc_cdc.c      |  59 ++++++--
>>>>>    net/smc/smc_cdc.h      |   1 +
>>>>>    net/smc/smc_clc.c      |   4 +-
>>>>>    net/smc/smc_core.c     |  62 +++++++++
>>>>>    net/smc/smc_core.h     |   2 +
>>>>>    net/smc/smc_ism.c      |  39 +++++-
>>>>>    net/smc/smc_ism.h      |   2 +
>>>>>    net/smc/smc_loopback.c | 358 
>>>>> +++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>    net/smc/smc_loopback.h |  63 +++++++++
>>>>>    12 files changed, 662 insertions(+), 21 deletions(-)
>>>>>    create mode 100644 net/smc/smc_loopback.c
>>>>>    create mode 100644 net/smc/smc_loopback.h
>>>>>

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ