Message-ID: <b25f56de-7913-2a56-950f-dfe0defd6936@linux.alibaba.com>
Date: Thu, 12 Jan 2023 20:12:06 +0800
From: Wen Gu <guwen@...ux.alibaba.com>
To: Alexandra Winter <wintera@...ux.ibm.com>,
Niklas Schnelle <schnelle@...ux.ibm.com>, kgraul@...ux.ibm.com,
wenjia@...ux.ibm.com, jaka@...ux.ibm.com, davem@...emloft.net,
edumazet@...gle.com, kuba@...nel.org, pabeni@...hat.com
Cc: linux-s390@...r.kernel.org, netdev@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH net-next v2 0/5] net/smc: Introduce SMC-D based
loopback acceleration
On 2023/1/5 00:09, Alexandra Winter wrote:
>
>
> On 21.12.22 14:14, Wen Gu wrote:
>>
>>
>> On 2022/12/20 22:02, Niklas Schnelle wrote:
>>
>>> On Tue, 2022-12-20 at 11:21 +0800, Wen Gu wrote:
>>>> Hi, all
>>>>
>>>> # Background
>>>>
>>>> As previously mentioned in [1], we (Alibaba Cloud) are trying to use SMC
>>>> to accelerate TCP applications in cloud environments, improving inter-host
>>>> and inter-VM communication.
>>>>
>>>> In addition to these, we also found value in SMC-D for local
>>>> inter-process communication, such as accelerating communication between
>>>> containers within the same host. So this RFC provides an SMC-D loopback
>>>> solution for such scenarios, bringing a significant improvement in latency
>>>> and throughput compared to TCP loopback.
>>>>
>>>> # Design
>>>>
>>>> This patch set provides a kind of SMC-D loopback solution.
>>>>
>>>> Patch #1/5 and #2/5 provide an SMC-D based dummy device, preparing for
>>>> inter-process communication acceleration. Besides loopback acceleration,
>>>> the dummy device also meets the requirement mentioned in [2]: providing
>>>> a way for the broader community to test SMC-D logic without an ISM device.
>>>>
>>>> +------------------------------------------+
>>>> | +-----------+ +-----------+ |
>>>> | | process A | | process B | |
>>>> | +-----------+ +-----------+ |
>>>> | ^ ^ |
>>>> | | +---------------+ | |
>>>> | | | SMC stack | | |
>>>> | +--->| +-----------+ |<--| |
>>>> | | | dummy | | |
>>>> | | | device | | |
>>>> | +-+-----------+-+ |
>>>> | VM |
>>>> +------------------------------------------+
>>>>
>>>> Patches #3/5, #4/5 and #5/5 provide a way to avoid the data copy from
>>>> sndbuf to RMB and improve SMC-D loopback performance. By extending
>>>> smcd_ops with two new semantics, attach_dmb and detach_dmb, the sender's
>>>> sndbuf shares the same physical memory region with the receiver's RMB.
>>>> Data copied from userspace to the sender's sndbuf directly reaches the
>>>> receiver's RMB without an extra memory copy within the same kernel.
>>>>
>>>> +----------+ +----------+
>>>> | socket A | | socket B |
>>>> +----------+ +----------+
>>>> | ^
>>>> | +---------+ |
>>>> regard as | | ----------|
>>>> local sndbuf | B's | regard as
>>>> | | RMB | local RMB
>>>> |-------> | |
>>>> +---------+
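>>>>
>>>> A rough sketch of the extension, for orientation only (the authoritative
>>>> signatures are the ones in patch #3/5):
>>>>
>>>>     struct smcd_ops {
>>>>             /* ... existing ops ... */
>>>>             int (*register_dmb)(struct smcd_dev *dev, struct smcd_dmb *dmb);
>>>>             int (*unregister_dmb)(struct smcd_dev *dev, struct smcd_dmb *dmb);
>>>>             /* new, optional: back the sender's sndbuf with the DMB the
>>>>              * peer has already registered, instead of copying into it
>>>>              */
>>>>             int (*attach_dmb)(struct smcd_dev *dev, struct smcd_dmb *dmb);
>>>>             int (*detach_dmb)(struct smcd_dev *dev, u64 token);
>>>>     };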
>>>
>>> Hi Wen Gu,
>>>
>>> I maintain the s390 specific PCI support in Linux and would like to
>>> provide a bit of background on this. You're surely wondering why we
>>> even have a copy in there for our ISM virtual PCI device. To understand
>>> why this copy operation exists and why we need to keep it working, one
>>> needs a bit of s390 aka mainframe background.
>>>
>>> On s390 all (currently supported) native machines have a mandatory
>>> machine level hypervisor. All OSs, whether z/OS or Linux, run either on
>>> this machine level hypervisor as so-called Logical Partitions (LPARs)
>>> or as second/third/… level guests on e.g. a KVM or z/VM hypervisor that
>>> in turn runs in an LPAR. Now, in terms of memory, this machine level
>>> hypervisor, sometimes called PR/SM, is a partitioning hypervisor without
>>> paging, unlike KVM, z/VM, or VMware. This is one of the main reasons
>>> for the very-near-native performance of the machine hypervisor as the
>>> memory of its guests acts just like native RAM on other systems. It is
>>> never paged out and always accessible to IOMMU translated DMA from
>>> devices without the need for pinning pages and besides a trivial
>>> offset/limit adjustment an LPAR's MMU does the same amount of work as
>>> an MMU on a bare metal x86_64/ARM64 box.
>>>
>>> It also means however that when SMC-D is used to communicate between
>>> LPARs via an ISM device there is no way of mapping the DMBs to the
>>> same physical memory as there exists no MMU-like layer spanning
>>> partitions that could do such a mapping. Meanwhile for machine level
>>> firmware including the ISM virtual PCI device it is still possible to
>>> _copy_ memory between different memory partitions. So yeah while I do
>>> see the appeal of skipping the memcpy() for loopback or even between
>>> guests of a paging hypervisor such as KVM, which can map the DMBs on
>>> the same physical memory, we must keep in mind this original use case
>>> requiring a copy operation.
>>>
>>> Thanks,
>>> Niklas
>>>
>>
>> Hi Niklas,
>>
>> Thank you so much for the complete and detailed explanation! This gives
>> me a brand new perspective on s390 devices, which we hadn't dabbled in before.
>> Now I understand why shared memory is unavailable between different LPARs.
>>
>> Our original intention in proposing the loopback device and the incoming
>> inter-VM device (virtio-ism) is to use SMC-D to accelerate communication in
>> cases with no existing s390 ISM devices. In our conception, the s390 ISM
>> device, loopback device and virtio-ism device are parallel, all abstracted by smcd_ops.
>>
>> +------------------------+
>> | SMC-D |
>> +------------------------+
>> -------- smcd_ops ---------
>> +------+ +------+ +------+
>> | s390 | | loop | |virtio|
>> | ISM | | back | | -ism |
>> | dev | | dev | | dev |
>> +------+ +------+ +------+
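>>
>> All three then sit behind the same ops table; roughly, abbreviated from
>> include/net/smc.h (the vlan and event helpers are elided here):
>>
>>     struct smcd_ops {
>>             int (*query_remote_gid)(struct smcd_dev *dev, u64 rgid,
>>                                     u32 vid_valid, u32 vid);
>>             int (*register_dmb)(struct smcd_dev *dev, struct smcd_dmb *dmb);
>>             int (*unregister_dmb)(struct smcd_dev *dev, struct smcd_dmb *dmb);
>>             int (*move_data)(struct smcd_dev *dev, u64 dmb_tok,
>>                              unsigned int idx, bool sf, unsigned int offset,
>>                              void *data, unsigned int size);
>>             u8* (*get_system_eid)(void);
>>             u16 (*get_chid)(struct smcd_dev *dev);
>>             /* ... */
>>     };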
>>
>> We also believe that the existing design and behavior of the s390 ISM
>> device must remain unshaken. What we want support for is some smcd_ops
>> extension for devices with optional beneficial capabilities, such as nocopy
>> here (let's call it this for now), which is really helpful for us in the
>> inter-process and inter-VM scenarios.
>>
>> And since this coincides with IBM's intention to add an API between SMC-D
>> and devices to support various devices for SMC-D, as mentioned in [2], we
>> sent out this RFC and the incoming virtio-ism RFC to provide some examples.
>>
>>>>
>>>> # Benchmark Test
>>>>
>>>> * Test environments:
>>>> - VM with Intel Xeon Platinum 8 core 2.50GHz, 16 GiB mem.
>>>> - SMC sndbuf/RMB size 1MB.
>>>>
>>>> * Test objects:
>>>> - TCP: runs on TCP loopback.
>>>> - domain: runs on a UNIX domain socket.
>>>> - SMC lo: runs on the SMC loopback device with patches #1/5 ~ #2/5.
>>>> - SMC lo-nocpy: runs on the SMC loopback device with patches #1/5 ~ #5/5.
>>>>
>>>> 1. ipc-benchmark (see [3])
>>>>
>>>> - ./<foo> -c 1000000 -s 100
>>>>
>>>>                        TCP     domain            SMC-lo             SMC-lo-nocpy
>>>> Message rate (msg/s)   75140   129548(+72.41%)   152266(+102.64%)   151914(+102.17%)
>>>
>>> Interesting that it does beat UNIX domain sockets. Also, see my below
>>> comment for nginx/wrk as this seems very similar.
>>>
>>>>
>>>> 2. sockperf
>>>>
>>>> - serv: <smc_run> taskset -c <cpu> sockperf sr --tcp
>>>> - clnt: <smc_run> taskset -c <cpu> sockperf { tp | pp } --tcp --msg-size={ 64000 for tp | 14 for pp } -i 127.0.0.1 -t 30
>>>>
>>>>                   TCP        SMC-lo             SMC-lo-nocpy
>>>> Bandwidth(MBps)   4943.359   4936.096(-0.15%)   8239.624(+66.68%)
>>>> Latency(us)       6.372      3.359(-47.28%)     3.25(-49.00%)
>>>>
>>>> 3. iperf3
>>>>
>>>> - serv: <smc_run> taskset -c <cpu> iperf3 -s
>>>> - clnt: <smc_run> taskset -c <cpu> iperf3 -c 127.0.0.1 -t 15
>>>>
>>>>                 TCP    SMC-lo         SMC-lo-nocpy
>>>> Bitrate(Gb/s)   40.5   41.4(+2.22%)   76.4(+88.64%)
>>>>
>>>> 4. nginx/wrk
>>>>
>>>> - serv: <smc_run> nginx
>>>> - clnt: <smc_run> wrk -t 8 -c 500 -d 30 http://127.0.0.1:80
>>>>
>>>>              TCP         SMC-lo               SMC-lo-nocpy
>>>> Requests/s   154643.22   220894.03(+42.84%)   226754.3(+46.63%)
>>>
>>>
>>> This result is very interesting indeed. So with the much more realistic
>>> nginx/wrk workload it seems the copy hurts much less than
>>> iperf3/sockperf would suggest, while SMC-D itself seems to help more.
>>> I'd hope that this translates to actual applications as well. Maybe
>>> this makes SMC-D based loopback interesting even while keeping the
>>> copy, at least until we can come up with a sane way to work a no-copy
>>> variant into SMC-D?
>>>
>>
>> I agree, nginx/wrk workload is much more realistic for many applications.
>>
>> But we also encounter many other cases on the cloud that are similar to
>> sockperf and require high throughput, such as AI training and big data.
>>
>> So avoidance of copying between DMBs can help these cases a lot :)
>>
>>>>
>>>>
>>>> # Discussion
>>>>
>>>> 1. API between SMC-D and ISM device
>>>>
>>>> As Jan mentioned in [2], IBM are working on placing an API between SMC-D
>>>> and the ISM device for easier use of different "devices" for SMC-D.
>>>>
>>>> So, considering that the introduction of attach_dmb or detach_dmb can
>>>> effectively avoid data copying from sndbuf to RMB and brings obvious
>>>> throughput advantages in inter-VM or inter-process scenarios, can the
>>>> attach/detach semantics be taken into consideration when designing the
>>>> API to make it a standard ISM device behavior?
>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>>
>>> Due to the reasons explained above this behavior can't be emulated by
>>> ISM devices, at least not when crossing partitions. Not sure if we can
>>> still incorporate it in the API and allow for both copying and
>>> remapping SMC-D like devices; it definitely needs careful consideration
>>> and, I think, also a better understanding of the benefit for real world
>>> workloads.
>>>
>>
>> Here I was not rigorous.
>>
>> Nocopy indeed shouldn't be a standard ISM device behavior. Actually we hope
>> it can be a standard optional _SMC-D_ device behavior, defined by smcd_ops.
>>
>> For devices that don't support these options, like the ISM device on the
>> s390 architecture, .attach_dmb/.detach_dmb and other reasonable extensions
>> (which will be proposed for discussion in the incoming virtio-ism RFC) can
>> be set to NULL or return an error. And devices that do support them may use
>> them to improve performance in some cases.
>>
>> In addition, may I know the latest news about the API design? :) For
>> example, its scale: will it be an almost complete refactor of the existing
>> interface, or incremental patching? And its target: will it be tailored to
>> exact ISM behavior, or will it reserve some options for other devices, like
>> nocopy here? From my understanding of [2], it might be the latter?
>>
>>>>
>>>> Maybe our RFCs for SMC-D based inter-process acceleration (this one) and
>>>> inter-VM acceleration (coming soon, as an update of [1])
>>>> can provide some examples for the new API design. And we are very glad to
>>>> discuss this on the mailing list.
>>>>
>>>> 2. Way to select different ISM-like devices
>>>>
>>>> With the proposal of the SMC-D loopback 'device' (this RFC) and the
>>>> incoming device for inter-VM acceleration as an update of [1], SMC-D has
>>>> more options to choose from. So we need to consider how to indicate the
>>>> supported devices, how to determine which one to use, and their priority...
>>>
>>> Agree on this part, though it is for the SMC maintainers to decide, I
>>> think we would definitely want to be able to use any upcoming inter-VM
>>> devices on s390 possibly also in conjunction with ISM devices for
>>> communication across partitions.
>>>
>>
>> Yes, this part needs to be discussed with the SMC maintainers. And thank
>> you, we would be very glad if our devices could be used on s390 through
>> these efforts.
>>
>>
>> Best Regards,
>> Wen Gu
>>
>>>>
>>>> IMHO, this may require an update of the CLC messages and negotiation mechanism.
>>>> Again, we are very glad to discuss this with you on the mailing list.
>
> As described in
> SMC protocol (including SMC-D): https://www.ibm.com/support/pages/system/files/inline-files/IBM%20Shared%20Memory%20Communications%20Version%202_2.pdf
> the CLC messages provide a list of up to 8 ISM devices to choose from.
> So I would hope that we can use the existing protocol.
>
> The challenge will be to define GID (Global Interface ID) and CHID (a fabric ID) in
> a meaningful way for the new devices.
> There is always smcd_ops->query_remote_gid() as a safety net. But the idea is that
> a CHID mismatch is a fast way to tell that these 2 interfaces do not match.
>
>
Hi Winter and all,
Thanks for your reply and suggestions! And sorry for my late reply; it took me
some time to understand the SMC-Dv2 protocol and implementation.

I agree with your opinion. The existing SMC-Dv2 protocol, whose CLC messages
include an ism_dev[] list, can solve the device negotiation problem. And I am
very willing to use the existing protocol, because we all know that a protocol
update is a long and complex process.
If I understand correctly, the SMC-D loopback (dummy) device can coordinate
with the existing SMC-Dv2 protocol as follows. If there is any mistake, please
point it out.
# Initialization
- Initialize the loopback device with a unique GID [Q-1].
- Register the loopback device as an SMC-Dv2-capable device with a system_eid
  whose 24th or 28th byte is non-zero [Q-2], so that the system's
  smc_ism_v2_capable will be set to TRUE and SMC-Dv2 is available.
# Proposal
- Find the loopback device from the smcd_dev_list in smc_find_ism_v2_device_clnt();
- Record the SEID, GID and CHID [Q-3] of the loopback device in the v2
  extension part of the CLC proposal message.
# Accept
- Check the GID/CHID list and SEID in the CLC proposal message, and find a
  matching local ISM device from the smcd_dev_list in
  smc_find_ism_v2_device_serv(). If both sides of the communication are in
  the same VM and share the same loopback device, the SEID, GID and CHID
  will match and the loopback device will be chosen [Q-4].
- Record the loopback device's GID/CHID and the matched SEID in the CLC
  accept message.
# Confirm
- Confirm the server-selected device (the loopback device) according to the
  CLC accept message.
- Record the loopback device's GID/CHID and the server-selected SEID in the
  CLC confirm message.
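
For reference, each per-device entry recorded in the CLC v2 extension pairs a
GID with a CHID; roughly, from net/smc/smc_clc.h (shown abbreviated, and up to
8 such entries fit in one proposal):

        struct smc_clc_smcd_gid_chid {
                __be64 gid;     /* ISM GID */
                __be16 chid;    /* ISMv2 CHID */
        } __packed;
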
Following the above process, I have supplemented a patch based on this RFC in
the email attachment. With the attached patch, SMC-D loopback switches to the
SMC-Dv2 protocol.

In the above process there are some points I want to consult and discuss,
marked with '[Q-*]' in the description above.
# [Q-1]:
The GID of the loopback device is randomly generated in this RFC patch set,
but I will find a way to make the GID unique in the formal patches. Any
suggestions are welcome.
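
What the RFC effectively does today is draw the GID from the kernel's random
pool; a minimal sketch, where smc_lo_generate_gid() is a hypothetical helper
name (random GIDs can collide, so the formal patches still need a
collision-free scheme):

        #include <linux/random.h>

        /* sketch only: fills the 64-bit GID from the random pool */
        static u64 smc_lo_generate_gid(void)
        {
                u64 gid;

                get_random_bytes(&gid, sizeof(gid));
                return gid;
        }
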
# [Q-2]:
In the Linux implementation, the system_eid of the first registered smcd
device determines the system's smc_ism_v2_capable (see smcd_register_dev()).
And I wonder:
1) How should the system_eid be defined? It can be inferred from the code
   that the 24th and 28th bytes are special for SMC-Dv2. So in the attached
   patch, I define the loopback device SEID as:
        static struct smc_lo_systemeid LO_SYSTEM_EID = {
                .seid_string = "SMC-SYSZ-LOSEID000000000",
                .serial_number = "1000",
                .type = "1000",
        };
Is there anything else I need to pay attention to?
2) It seems that only the first added smcd device determines the system's
   smc_ism_v2_capable? If two different smcd devices, one with a v1-indicating
   and one with a v2-indicating system_eid, are registered, will the order in
   which they are registered affect the result of smc_ism_v2_capable?
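
For reference, the check in question is roughly this (abridged from
smcd_register_dev() in net/smc/smc_ism.c, as I read it; only the first
registered device takes the branch):

        mutex_lock(&smcd_dev_list.mutex);
        if (list_empty(&smcd_dev_list.list)) {
                u8 *system_eid = NULL;

                system_eid = smcd->ops->get_system_eid();
                /* non-'0' bytes 24 and 28 indicate SMC-Dv2 support */
                if (system_eid[24] != '0' || system_eid[28] != '0')
                        smc_ism_v2_capable = true;
        }
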
# [Q-3]:
In the attached patch, I define a special CHID (0xFFFF) for the loopback
device, as a kind of 'unassociated ISM CHID' that is not associated with any
IP (OSA or HiperSockets) interface. What's your opinion on this?
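
In smcd_ops terms the attachment boils down to something like the following
sketch (SMC_LO_RESERVED_CHID is just my placeholder name for the 0xFFFF value):

        #define SMC_LO_RESERVED_CHID    0xFFFF  /* unassociated ISM CHID */

        /* ops callback: report the loopback device's CHID */
        static u16 smc_lo_get_chid(struct smcd_dev *smcd)
        {
                return SMC_LO_RESERVED_CHID;
        }
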
# [Q-4]:
In the current Linux implementation, the server selects the first successfully
initialized device from the candidates as the final one in
smc_find_ism_v2_device_serv():
        for (i = 0; i < matches; i++) {
                ini->smcd_version = SMC_V2;
                ini->is_smcd = true;
                ini->ism_selected = i;
                rc = smc_listen_ism_init(new_smc, ini);
                if (rc) {
                        smc_find_ism_store_rc(rc, ini);
                        /* try next active ISM device */
                        continue;
                }
                return; /* matching and usable V2 ISM device found */
        }
IMHO, maybe the candidate devices should have different priorities? For
example, the loopback device may be preferred whenever it is available.
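
As a strawman, a pre-pass over the matched candidates could try a loopback
device first and only then fall back to the existing loop; a sketch only,
where smc_ism_is_loopback() is a hypothetical helper:

        /* prefer a loopback candidate if one matched */
        for (i = 0; i < matches; i++) {
                if (!smc_ism_is_loopback(ini->ism_dev[i]))
                        continue;
                ini->smcd_version = SMC_V2;
                ini->is_smcd = true;
                ini->ism_selected = i;
                if (!smc_listen_ism_init(new_smc, ini))
                        return; /* usable loopback device found */
        }
        /* ... then the existing loop over all matches runs unchanged ... */
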
Best Regards,
Wen Gu
>>>>
>>>> [1] https://lore.kernel.org/netdev/20220720170048.20806-1-tonylu@linux.alibaba.com/
>>>> [2] https://lore.kernel.org/netdev/35d14144-28f7-6129-d6d3-ba16dae7a646@linux.ibm.com/
>>>> [3] https://github.com/goldsborough/ipc-bench
>>>>
>>>> v1->v2
>>>> 1. Fix some build WARNINGs reported by the kernel test robot
>>>> Reported-by: kernel test robot <lkp@...el.com>
>>>> 2. Add iperf3 test data.
>>>>
>>>> Wen Gu (5):
>>>> net/smc: introduce SMC-D loopback device
>>>> net/smc: choose loopback device in SMC-D communication
>>>> net/smc: add dmb attach and detach interface
>>>> net/smc: avoid data copy from sndbuf to peer RMB in SMC-D loopback
>>>> net/smc: logic of cursors update in SMC-D loopback connections
>>>>
>>>> include/net/smc.h | 3 +
>>>> net/smc/Makefile | 2 +-
>>>> net/smc/af_smc.c | 88 +++++++++++-
>>>> net/smc/smc_cdc.c | 59 ++++++--
>>>> net/smc/smc_cdc.h | 1 +
>>>> net/smc/smc_clc.c | 4 +-
>>>> net/smc/smc_core.c | 62 +++++++++
>>>> net/smc/smc_core.h | 2 +
>>>> net/smc/smc_ism.c | 39 +++++-
>>>> net/smc/smc_ism.h | 2 +
>>>> net/smc/smc_loopback.c | 358 +++++++++++++++++++++++++++++++++++++++++++++++++
>>>> net/smc/smc_loopback.h | 63 +++++++++
>>>> 12 files changed, 662 insertions(+), 21 deletions(-)
>>>> create mode 100644 net/smc/smc_loopback.c
>>>> create mode 100644 net/smc/smc_loopback.h
>>>>
[Attachment: "0001-net-smc-define-SEID-and-CHID-of-loopback-device.patch" (text/plain, 4891 bytes)]