Message-ID:
<GV2PR10MB8037B9F99338C2A59F26336FBB3B2@GV2PR10MB8037.EURPRD10.PROD.OUTLOOK.COM>
Date: Thu, 28 Mar 2024 12:18:39 +0000
From: "Goerlitz Andreas (SO/PAF1-Mb)" <Andreas.Goerlitz@...bosch.com>
To: Wen Gu <guwen@...ux.alibaba.com>
CC: "linux-s390@...r.kernel.org" <linux-s390@...r.kernel.org>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>
Subject: Re: SMC-R throughput drops for specific message sizes
Hello Wen Gu and community,
Our group performed further experiments with SMC-R. The results discussed below were obtained on two PCs with Mellanox ConnectX-5 (mlx5) NICs, using the following configuration:
Kernel 6.5.0-25-generic
MTU 9000
net.smc.wmem = $((256*1024))
net.smc.rmem = $((256*1024))
net.smc.autocorking_size = 65536
net.smc.smcr_buf_type = 1
Bandwidth ~ 3.2GB/s (25.0 Gbit/s)
We modified your server.c (consumer) and client.c (producer) to estimate the throughput and observed that the "msgsize" of the consumer seems to be mainly responsible for the throughput drops, as shown below.
Good cases (server/consumer msgsize <= RMBE/2):
-----------------------------------------------
server: smc_run ./server -p 12345 -m $((128*1024))
client: smc_run ./client -i 192.168.0.2 -p 12345 -m $((128*1024)) -c 1000
Sent 261881856 bytes in 82224.819000 us [3.184939 GB/s]
server: smc_run ./server -p 12345 -m $((128*1024))
client: smc_run ./client -i 192.168.0.2 -p 12345 -m $((256*1024)) -c 1000
Sent 261881856 bytes in 82097.127000 us [3.189892 GB/s]
Bad cases (server/consumer msgsize > RMBE/2):
-----------------------------------------------
server: smc_run ./server -p 12345 -m $((256*1024))
client: smc_run ./client -i 192.168.0.2 -p 12345 -m $((128*1024)) -c 1000
Sent 261881856 bytes in 130970.306000 us [1.999545 GB/s]
server: smc_run ./server -p 12345 -m $((256*1024))
client: smc_run ./client -i 192.168.0.2 -p 12345 -m $((256*1024)) -c 1000
Sent 130940928 bytes in 88172.887000 us [1.485037 GB/s]
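The reported rates can be cross-checked from the byte counts and elapsed times. The sketch below is a minimal recomputation, assuming that "GB/s" means 10^9 bytes per second and that the RMBE size equals net.smc.rmem (256 KiB); the names RUNS, throughput_gb_per_s and summarize are illustrative only and are not part of server.c or client.c. The recomputed values agree with the printed ones to about four significant digits.

```python
# Figures copied from the four runs above:
# (server msgsize, client msgsize, bytes sent, elapsed microseconds).
RMBE = 256 * 1024  # assumption: RMBE size equals net.smc.rmem

RUNS = [
    (128 * 1024, 128 * 1024, 261881856, 82224.819),
    (128 * 1024, 256 * 1024, 261881856, 82097.127),
    (256 * 1024, 128 * 1024, 261881856, 130970.306),
    (256 * 1024, 256 * 1024, 130940928, 88172.887),
]


def throughput_gb_per_s(nbytes, elapsed_us):
    """Convert bytes and microseconds to decimal GB/s (1 GB = 1e9 bytes)."""
    return nbytes / elapsed_us * 1e6 / 1e9


def summarize(runs, rmbe=RMBE):
    """Return one dict per run with its rate and the RMBE/2 classification."""
    rows = []
    for server_msg, client_msg, nbytes, elapsed_us in runs:
        rows.append({
            "server_msgsize": server_msg,
            "client_msgsize": client_msg,
            "gb_per_s": throughput_gb_per_s(nbytes, elapsed_us),
            "server_above_half_rmbe": server_msg > rmbe // 2,
        })
    return rows


if __name__ == "__main__":
    for row in summarize(RUNS):
        print(row)
```

In this table, both runs with a server msgsize above RMBE/2 are the slow ones, independent of the client msgsize.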
Our explanation is that in the "bad cases" the producer and consumer end up operating in lockstep: the producer sends messages (e.g., msgsize = RMBE on the producer side) until the consumer's RMBE is full, and must then wait until the consumer has processed some of its RMBE and replied with a CDC message.
During this time, the producer is blocked.
If the consumer then processes the entire RMBE (i.e., msgsize = RMBE on the consumer side), it is blocked as well, because nothing is left to process, and it must wait for the producer.
We suspect that this (unintended) synchronization causes the throughput drops.
To force the consumer to process smaller messages, reply to the producer (via CDC) sooner, and still have some remaining data to process (i.e., to avoid blocking), we cap the value of len to RMBE/2 in smc_rx_recvmsg:
--- a/net/smc/smc_rx.c 2024-03-25 12:31:32.264614422 +0100
+++ b/net/smc/smc_rx.c 2024-03-25 12:22:31.989913322 +0100
@@ -344,7 +344,7 @@
 int smc_rx_recvmsg(struct smc_sock *smc, struct msghdr *msg,
 		   struct pipe_inode_info *pipe, size_t len, int flags)
 {
-	size_t copylen, read_done = 0, read_remaining = len;
+	size_t copylen, read_remaining, read_done = 0;
 	size_t chunk_len, chunk_off, chunk_len_sum;
 	struct smc_connection *conn = &smc->conn;
 	int (*func)(struct smc_connection *conn);
@@ -363,6 +363,10 @@
 	sk = &smc->sk;
 	if (sk->sk_state == SMC_LISTEN)
 		return -ENOTCONN;
+
+	len = min_t(size_t, len, conn->rmb_desc->len / 2);
+	read_remaining = len;
+
 	if (flags & MSG_OOB)
 		return smc_rx_recv_urg(smc, msg, len, flags);
 	timeo = sock_rcvtimeo(sk, flags & MSG_DONTWAIT);
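For comparison, a similar cap can be approximated in user space without patching the kernel, by never requesting more than RMBE/2 per recv() call. The following is only a sketch of that idea, not the patch itself: recv_capped and rmbe_size are hypothetical names, and the application has to know the RMBE size (assumed here to match net.smc.rmem). This corresponds to the "good cases" above, where the server ran with -m $((128*1024)); we have not measured this variant separately.

```python
import socket


def recv_capped(sock, total, rmbe_size):
    """Read up to `total` bytes, asking for at most rmbe_size // 2 per recv().

    Returns the list of received chunks so the caller can inspect their sizes.
    """
    cap = max(1, rmbe_size // 2)  # hypothetical user-space analogue of the cap
    chunks = []
    remaining = total
    while remaining > 0:
        data = sock.recv(min(remaining, cap))
        if not data:  # peer closed the connection before `total` bytes arrived
            break
        chunks.append(data)
        remaining -= len(data)
    return chunks


if __name__ == "__main__":
    # Local demonstration over a socket pair; no SMC-R is involved here.
    left, right = socket.socketpair()
    left.sendall(b"x" * 1000)
    left.close()
    sizes = [len(c) for c in recv_capped(right, 1000, 256)]
    right.close()
    print(sizes)
```

The demonstration only shows that no chunk exceeds half of the given buffer size; it says nothing about SMC-R throughput by itself.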
We ran qperf experiments (as before) with the standard SMC-R module [std] (kernel 6.5.0-25-generic), Wen Gu’s proposal [wengu] (i.e., setting force = true), and our proposal [our] (i.e., capping len to RMBE/2).
The measured throughput is shown in subplot (a) of each attached figure.
Additionally, we traced the following tracepoints with bpftrace:
tracepoint:smc:smc_tx_sendmsg {
    @tx_ret = lhist(args->len, 0, 262144, 16384);
}
tracepoint:smc:smc_rx_recvmsg {
    @rx_ret = lhist(args->len, 0, 262144, 16384);
}
From these histograms we calculated the percentage of rx_ret and tx_ret values greater than RMBE/2, shown in subplots (b) and (c), respectively.
As can be observed, there seems to be a correlation between a drop in throughput and rx_ret being greater than RMBE/2.
This is avoided in our proposal, and full throughput is achieved.
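The percentage used in subplots (b) and (c) can be sketched as follows. This is an illustration under assumptions, not the script used for the figures: it assumes the per-call lengths are available as a plain list (for example, printed from the tracepoints rather than aggregated), and percent_above and the sample values are hypothetical. A strict comparison on raw lengths is used because, with 16384-byte buckets, a length of exactly RMBE/2 would presumably fall into the bucket that starts at 131072 and could not be told apart from larger values.

```python
RMBE = 256 * 1024  # assumption: RMBE size equals net.smc.rmem


def percent_above(lengths, threshold):
    """Percentage of entries in `lengths` that are strictly above `threshold`."""
    if not lengths:
        return 0.0
    above = sum(1 for n in lengths if n > threshold)
    return 100.0 * above / len(lengths)


if __name__ == "__main__":
    # Hypothetical sample of traced lengths, for illustration only.
    sample = [65536, 131072, 200000, 262144]
    print(percent_above(sample, RMBE // 2))
```

With the cap from the patch applied, every receive length is at most RMBE/2, so this percentage would be zero for rx_ret by construction.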
We hope that our analysis and interpretation can help to solve the issue with the throughput drops in SMC-R.
P.S. I would like to acknowledge everyone on our team who contributed to the analysis of SMC-R (sorted by last name):
Soumyadeep Debnath
Andreas Görlitz
Costin Iordache
Alexandros Nikolaou
Maik Riestock
Ievgen Tatolov
Best regards
Andreas Goerlitz (SO/PAF1-Mb)
Bosch Service Solutions Magdeburg GmbH | Otto-von-Guericke-Str. 13 | 39104 Magdeburg | GERMANY | www.boschservicesolutions.com
Andreas.Goerlitz@...bosch.com
Registered office: Magdeburg, Registration court: Amtsgericht Stendal, HRB 24039
Managing directors: Robert Mulatz, Georg Wessels
View attachment "client.c" of type "text/x-csrc" (2612 bytes)
Download attachment "results_our.png" of type "image/png" (147628 bytes)
Download attachment "results_std.png" of type "image/png" (165763 bytes)
Download attachment "results_wengu.png" of type "image/png" (171372 bytes)
View attachment "server.c" of type "text/x-csrc" (2403 bytes)