Message-ID: <375d169e-5612-f75e-f219-ec981108dcbe@linux.alibaba.com>
Date: Fri, 24 Feb 2023 17:25:37 +0800
From: Wen Gu <guwen@...ux.alibaba.com>
To: Wenjia Zhang <wenjia@...ux.ibm.com>, kgraul@...ux.ibm.com,
jaka@...ux.ibm.com, davem@...emloft.net, edumazet@...gle.com,
kuba@...nel.org, pabeni@...hat.com
Cc: linux-s390@...r.kernel.org, netdev@...r.kernel.org,
linux-kernel@...r.kernel.org, Alexandra Winter <WINTERA@...ibm.com>
Subject: Re: [RFC PATCH net-next v3 0/9] net/smc: Introduce SMC-D-based OS
internal communication acceleration
On 2023/2/22 21:08, Wenjia Zhang wrote:
>
>
> On 22.02.23 13:00, Wen Gu wrote:
>>
>>
>> On 2023/2/16 00:18, Wen Gu wrote:
>>
>>> Hi, all
>>>
>>> # Background
>>>
>>> The background and previous discussion can be found in [1].
>>>
>>> We found SMC-D can be used to accelerate OS internal communication, such as
>>> loopback or between two containers within the same OS instance. So this patch
>>> set provides a kind of SMC-D dummy device (we call it the SMC-D loopback device)
>>> to emulate an ISM device, so that SMC-D can also be used on architectures
>>> other than s390. The SMC-D loopback device is designed as a system-global
>>> device, visible to all containers.
>>>
>>> This version is implemented on top of the generalized interface provided by [2].
>>> One open issue remains in this version; it is described under # Open issue.
>>>
>>> # Design
>>>
>>> This patch set basically follows the design of the previous version.
>>>
>>> Patch #1/9 ~ #3/9 attempt to decouple ISM-related structures from the SMC-D
>>> generalized code and extract some helpers to make the SMC-D protocol compatible
>>> with devices other than the s390 ISM device.
>>>
>>> Patch #4/9 introduces a kind of loopback device, which is defined as an SMC-D v2
>>> device and designed to provide communication between SMC sockets in the same OS
>>> instance.
>>>
>>>  +-------------------------------------------+
>>>  |  +--------------+       +--------------+  |
>>>  |  | SMC socket A |       | SMC socket B |  |
>>>  |  +--------------+       +--------------+  |
>>>  |       ^                         ^         |
>>>  |       |    +----------------+   |         |
>>>  |       |    |   SMC stack    |   |         |
>>>  |       +--->| +------------+ |<--|         |
>>>  |            | |   dummy    | |             |
>>>  |            | |   device   | |             |
>>>  |            +-+------------+-+             |
>>>  |                   OS                      |
>>>  +-------------------------------------------+
>>>
>>> Patch #5/9 ~ #8/9 extend the SMC-D protocol interface (smcd_ops) for scenarios where
>>> SMC-D is used to communicate within a VM (loopback here) or between VMs on the same
>>> host (based on the virtio-ism device, see [3]). What these scenarios have in common
>>> is that the local sndbuf and peer RMB can be mapped to the same physical memory region,
>>> so the data copy between the local sndbuf and peer RMB can be omitted. The performance
>>> improvement brought by this extension can be found in # Benchmark Test.
>>>
>>>     +----------+                     +----------+
>>>     | socket A |                     | socket B |
>>>     +----------+                     +----------+
>>>           |                               ^
>>>           |         +---------+           |
>>>      regard as      |         | ----------|
>>>      local sndbuf   |  B's    |     regard as
>>>           |         |  RMB    |     local RMB
>>>           |-------> |         |
>>>                     +---------+
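
A rough way to picture the no-copy path drawn above: the peer's RMB is registered under a token, and the sender attaches to it and writes in place. The sketch below is a user-space Python model, not kernel code. The names Dmb, LoopbackDevice, register_dmb and attach_dmb are hypothetical and only loosely mirror the DMB attach/detach interfaces mentioned for patch #6/9.

```python
# Illustrative model only: a bytearray stands in for a DMB (the peer's RMB).
# All class and method names here are hypothetical, not the kernel's smcd_ops.


class Dmb:
    """A receive buffer identified by a token."""

    def __init__(self, token: int, size: int):
        self.token = token
        self.buf = bytearray(size)


class LoopbackDevice:
    """Keeps a token -> DMB table so a sender can attach to a peer's RMB."""

    def __init__(self):
        self._dmbs = {}

    def register_dmb(self, dmb: Dmb) -> None:
        self._dmbs[dmb.token] = dmb

    def attach_dmb(self, token: int) -> memoryview:
        # Return a writable view of the peer's RMB; no second buffer is allocated.
        return memoryview(self._dmbs[token].buf)


dev = LoopbackDevice()
rmb_b = Dmb(token=0x1234, size=16)   # socket B's RMB
dev.register_dmb(rmb_b)

sndbuf_a = dev.attach_dmb(0x1234)    # socket A treats B's RMB as its sndbuf
sndbuf_a[0:5] = b"hello"             # A "sends" by writing in place

received = bytes(rmb_b.buf[0:5])     # B reads the very same memory
```

Because sndbuf_a is only a view onto rmb_b.buf, the write by A is visible to B without an intermediate copy, which is the effect the extended interface is aiming for.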
>>>
>>> Patch #9/9 implements the extended SMC-D protocol interface described above
>>> for the loopback device.
>>>
>>> # Benchmark Test
>>>
>>> * Test environments:
>>> - VM with Intel Xeon Platinum 8 core 2.50GHz, 16 GiB mem.
>>> - SMC sndbuf/RMB size 1MB.
>>>
>>> * Test object:
>>> - TCP lo: run on TCP loopback.
>>> - domain: run on UNIX domain.
>>> - SMC lo: run on SMC loopback device with patch #1/9 ~ #4/9.
>>> - SMC lo-nocpy: run on SMC loopback device with patch #1/9 ~ #9/9.
>>>
>>> 1. ipc-benchmark (see [4])
>>>
>>> - ./<foo> -c 1000000 -s 100
>>>
>>>                   TCP-lo   domain            SMC-lo            SMC-lo-nocpy
>>> Message
>>> rate (msg/s)      79025    115736(+46.45%)   146760(+85.71%)   149800(+89.56%)
>>>
>>> 2. sockperf
>>>
>>> - serv: <smc_run> taskset -c <cpu> sockperf sr --tcp
>>> - clnt: <smc_run> taskset -c <cpu> sockperf { tp | pp } --tcp --msg-size={ 64000 for tp | 14 for pp } -i 127.0.0.1
>>> -t 30
>>>
>>>                   TCP-lo     SMC-lo              SMC-lo-nocpy
>>> Bandwidth(MBps)   4822.388   4940.918(+2.56%)    8086.67(+67.69%)
>>> Latency(us)       6.298      3.352(-46.78%)      3.35(-46.81%)
>>>
>>> 3. iperf3
>>>
>>> - serv: <smc_run> taskset -c <cpu> iperf3 -s
>>> - clnt: <smc_run> taskset -c <cpu> iperf3 -c 127.0.0.1 -t 15
>>>
>>>                   TCP-lo   SMC-lo         SMC-lo-nocpy
>>> Bitrate(Gb/s)     40.7     40.5(-0.49%)   72.4(+77.89%)
>>>
>>> 4. nginx/wrk
>>>
>>> - serv: <smc_run> nginx
>>> - clnt: <smc_run> wrk -t 8 -c 500 -d 30 http://127.0.0.1:80
>>>
>>>                   TCP-lo      SMC-lo               SMC-lo-nocpy
>>> Requests/s        155994.57   214544.79(+37.53%)   215538.55(+38.17%)
>>>
>>>
>>> # Open issue
>>>
>>> The open issue that remains unresolved is how to detect that the source
>>> and target of a CLC proposal are within the same OS instance and can communicate
>>> through the SMC-D loopback device. A similar issue also exists when using virtio-ism
>>> devices (for the background and details of the virtio-ism device, see [3]).
>>> In previous discussions, multiple options were proposed (see [5]). Thanks again for
>>> the help of the community. cc Alexandra Winter :)
>>>
>>> But as we discussed, each of these solutions has some imperfections. So this version
>>> of the RFC continues to use the previous workaround, that is, a 64-bit random GID is
>>> generated for the SMC-D loopback device. If the GIDs of the devices found by two peers
>>> are the same, then they are considered to be in the same OS instance and can
>>> communicate with each other through the loopback device.
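
The workaround amounts to "draw a random 64-bit GID once per OS instance, then compare GIDs during the handshake". A minimal sketch of that check, with hypothetical function names (new_loopback_gid, peers_share_os_instance) that do not appear in the patches:

```python
import secrets


def new_loopback_gid() -> int:
    """Draw a random 64-bit GID for the loopback device (hypothetical helper)."""
    return secrets.randbits(64)


def peers_share_os_instance(local_gid: int, proposed_gid: int) -> bool:
    """Treat the peer as local when the GID in its CLC proposal equals ours."""
    return local_gid == proposed_gid


local_gid = new_loopback_gid()

# A peer on the same OS instance sees the same system-global device, hence the same GID.
same_instance = peers_share_os_instance(local_gid, local_gid)

# A peer on another instance drew its own GID, which almost certainly differs.
other_gid = (local_gid + 1) % 2**64
different_instance = peers_share_os_instance(local_gid, other_gid)
```

The check is only probabilistic: two independent instances pass it whenever their random GIDs happen to coincide.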
>>>
>>> This approach carries a very small risk. Consider the following situations:
>>>
>>> (1) Assume that the SMC-D loopback devices of the two OS instances happen to
>>> generate the same 64-bit GID.
>>>
>>> For convenience of description, we refer to the sockets on these two
>>> different OS instances as server A and client B.
>>>
>>> A will misjudge that the two are on the same OS instance because of the same GID
>>> in the CLC proposal message. Then A creates its RMB and sends the 64-bit token-A
>>> to B in the CLC accept message.
>>>
>>> B receives the CLC accept message. And according to patch #7/9, B tries to
>>> attach its sndbuf to A's RMB by token-A.
>>>
>>> (2) Assume that the OS instance where B is located happens to have an unattached
>>> RMB whose 64-bit token is the same as token-A.
>>>
>>> Then B successfully attaches its sndbuf to the wrong RMB, creates its RMB, and
>>> sends token-B to A in the CLC confirm message.
>>>
>>> Similarly, A receives the message and tries to attach its sndbuf to B's RMB by
>>> token-B.
>>>
>>> (3) Similar to (2), assume that the OS instance where A is located happens to have
>>> an unattached RMB whose 64-bit token is the same as token-B.
>>>
>>> Then A successfully attaches its sndbuf to the wrong RMB. Both sides mistakenly
>>> believe that an SMC-D connection based on the loopback device is established
>>> between them.
>>>
>>> If the above 3 coincidences all happen, that is, 64-bit random number conflicts occur
>>> 3 times, then an unreachable SMC-D connection will be established, which is nasty.
>>> If any one of the above does not hold, the connection safely falls back to TCP.
>>>
>>> Since the chances of these happening are very small, I wonder if this risk, with a
>>> probability of 1/2^(64*3), can be tolerated?
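
For a sense of scale, the estimate can be worked out exactly. The sketch below follows the simplification used in the text, namely that the three 64-bit collisions are independent and each has probability 1/2^64; it is an illustration of the arithmetic, not a measurement.

```python
from fractions import Fraction

BITS = 64  # width of the GID and of each RMB token, as described above

# Probability that one specific pair of random 64-bit values collides.
p_single = Fraction(1, 2**BITS)

# All three coincidences (GID, token-A, token-B), assumed independent.
p_all_three = p_single**3

# Floating-point view, roughly 1.6e-58.
p_as_float = float(p_all_three)
```

This treats each of (2) and (3) as a single comparison; an instance holding many unattached RMBs would raise those two factors in proportion to the number of candidate tokens.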
>>
>> Hi,
>>
>> Any comments about this open issue or other parts of this RFC patch set? :)
>>
>> Thanks,
>> Wen Gu
>>
> Hi Wen,
>
> I haven't forgotten it ;) I'm trying to run it myself. Please give us more time for testing and review.
>
> Thanks
> Wenjia
>
Sure, Wenjia. Thank you!
Please feel free to add comments. I will wait for you to complete the review before
deciding what to do next.
Regards,
Wen Gu
>>> Another way to solve this open issue is to use a 128-bit UUID to identify the SMC-D
>>> loopback device or virtio-ism device, because the probability of a 128-bit UUID
>>> collision is considered negligible. But this may require extending the CLC message
>>> to carry a longer GID, so it is the last option.
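
The size mismatch behind that concern is easy to see in a small sketch. GID_FIELD_BYTES is a name chosen here for illustration and stands for the 64-bit GID discussed in this thread; it is not a constant taken from the patches.

```python
import uuid

GID_FIELD_BYTES = 8  # 64-bit GID, as discussed in this thread (illustrative name)

# A random (version 4) UUID: 128 bits in total, 122 of them random.
device_id = uuid.uuid4()

uuid_bytes = len(device_id.bytes)
fits_in_gid_field = uuid_bytes <= GID_FIELD_BYTES
```

Since a UUID needs 16 bytes, it cannot be carried in an 8-byte GID without changing the message layout, which is why this option implies a CLC message extension.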
>>>
>>> v2->v3
>>> 1. Adapt to the new generalized interface provided by [2];
>>> 2. Select the loopback device through the SMC-D v2 protocol;
>>> 3. Split the loopback-related implementation and the generic implementation into
>>> separate patches more reasonably.
>>>
>>> v1->v2
>>> 1. Fix some build WARNINGs reported by kernel test robot
>>> Reported-by: kernel test robot <lkp@...el.com>
>>> 2. Add iperf3 test data.
>>>
>>> [1] https://lore.kernel.org/netdev/1671506505-104676-1-git-send-email-guwen@linux.alibaba.com/
>>> [2] https://lore.kernel.org/netdev/20230123181752.1068-1-jaka@linux.ibm.com/
>>> [3] https://lists.oasis-open.org/archives/virtio-comment/202302/msg00148.html
>>> [4] https://github.com/goldsborough/ipc-bench
>>> [5] https://lore.kernel.org/netdev/b9867c7d-bb2b-16fc-feda-b79579aa833d@linux.ibm.com/
>>>
>>> Wen Gu (9):
>>> net/smc: Decouple ism_dev from SMC-D device dump
>>> net/smc: Decouple ism_dev from SMC-D DMB registration
>>> net/smc: Extract v2 check helper from SMC-D device registration
>>> net/smc: Introduce SMC-D loopback device
>>> net/smc: Introduce an interface for getting DMB attribute
>>> net/smc: Introduce interfaces for DMB attach and detach
>>> net/smc: Avoid data copy from sndbuf to peer RMB in SMC-D
>>> net/smc: Modify cursor update logic when using mappable DMB
>>> net/smc: Add interface implementation of loopback device
>>>
>>> drivers/s390/net/ism_drv.c | 5 +-
>>> include/net/smc.h | 18 +-
>>> net/smc/Makefile | 2 +-
>>> net/smc/af_smc.c | 26 ++-
>>> net/smc/smc_cdc.c | 59 ++++--
>>> net/smc/smc_cdc.h | 1 +
>>> net/smc/smc_core.c | 70 ++++++-
>>> net/smc/smc_core.h | 1 +
>>> net/smc/smc_ism.c | 79 ++++++--
>>> net/smc/smc_ism.h | 4 +
>>> net/smc/smc_loopback.c | 442 +++++++++++++++++++++++++++++++++++++++++++++
>>> net/smc/smc_loopback.h | 55 ++++++
>>> 12 files changed, 725 insertions(+), 37 deletions(-)
>>> create mode 100644 net/smc/smc_loopback.c
>>> create mode 100644 net/smc/smc_loopback.h
>>>