netdev - Re: SMC-R throughput drops for specific message sizes

Message-ID:
 <GV2PR10MB8037B9F99338C2A59F26336FBB3B2@GV2PR10MB8037.EURPRD10.PROD.OUTLOOK.COM>
Date: Thu, 28 Mar 2024 12:18:39 +0000
From: "Goerlitz Andreas (SO/PAF1-Mb)" <Andreas.Goerlitz@...bosch.com>
To: Wen Gu <guwen@...ux.alibaba.com>
CC: "linux-s390@...r.kernel.org" <linux-s390@...r.kernel.org>,
	"netdev@...r.kernel.org" <netdev@...r.kernel.org>
Subject: Re: SMC-R throughput drops for specific message sizes

Hello Wen Gu and community,

Our group performed more experiments with SMC-R. The experiments discussed below were run on two PCs with Mellanox ConnectX-5 (mlx5) adapters, using the following configuration:
Kernel 6.5.0-25-generic
MTU 9000
net.smc.wmem = $((256*1024))
net.smc.rmem = $((256*1024))
net.smc.autocorking_size = 65536
net.smc.smcr_buf_type = 1
Bandwidth ~ 3.2GB/s (25.0 Gbit/s)

We modified your server.c (consumer) and client.c (producer) to estimate the throughput and observed that the "msgsize" of the consumer seems to be mainly responsible for the throughput drops, as shown below.

Good cases (server/consumer msgsize <= RMBE/2):
-----------------------------------------------
server:  smc_run ./server -p 12345 -m $((128*1024))
client:  smc_run ./client -i 192.168.0.2 -p 12345 -m $((128*1024)) -c 1000
         Sent 261881856 bytes in 82224.819000 us [3.184939 GB/s]

server:  smc_run ./server -p 12345 -m $((128*1024))
client:  smc_run ./client -i 192.168.0.2 -p 12345 -m $((256*1024)) -c 1000
         Sent 261881856 bytes in 82097.127000 us [3.189892 GB/s]

 

Bad cases (server/consumer msgsize > RMBE/2):
-----------------------------------------------
server:   smc_run ./server -p 12345 -m $((256*1024))
client:   smc_run ./client -i 192.168.0.2 -p 12345 -m $((128*1024)) -c 1000
          Sent 261881856 bytes in 130970.306000 us [1.999545 GB/s]

server:   smc_run ./server -p 12345 -m $((256*1024))
client:   smc_run ./client -i 192.168.0.2 -p 12345 -m $((256*1024)) -c 1000
          Sent 130940928 bytes in 88172.887000 us [1.485037 GB/s]
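
For reference, the GB/s figures in the listings above appear consistent with dividing the byte count by the elapsed time in microseconds and scaling to decimal gigabytes. The following is a minimal sketch of that arithmetic; the function name throughput_gbytes_per_s is hypothetical and is not taken from the attached client.c.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not from the attached client.c): throughput in
 * decimal GB/s (1 GB = 1e9 bytes) from a byte count and an elapsed time
 * in microseconds. Returns 0.0 for a non-positive interval.
 */
double throughput_gbytes_per_s(uint64_t bytes, double elapsed_us)
{
    if (elapsed_us <= 0.0)
        return 0.0;
    /* bytes per microsecond equals MB/s; divide by 1000 to get GB/s. */
    return (double)bytes / elapsed_us / 1000.0;
}
```

Calling throughput_gbytes_per_s(261881856, 82224.819) gives roughly 3.18 GB/s, in line with the first good case, and throughput_gbytes_per_s(130940928, 88172.887) gives roughly 1.49 GB/s, in line with the last bad case.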


Our explanation is that in the "bad cases" the producer and consumer end up operating in lockstep: the producer sends messages (e.g., msgsize = RMBE on the producer side) and, at some point, must wait until the consumer has processed part of its RMBE and replied with a CDC message.
During this time, the producer is blocked (since the consumer's RMBE is full).
If the consumer processes the entire RMBE (i.e., msgsize = RMBE on the consumer side), it is then blocked as well, because there is nothing left to process; it must wait for the producer.
We suspect that this unintended synchronization causes the throughput drops.

To force the consumer to process smaller messages, reply to the producer sooner (via CDC), and still have some remaining data to process (i.e., to avoid being blocked), we cap the value of len to RMBE/2 in smc_rx_recvmsg:

--- a/net/smc/smc_rx.c  2024-03-25 12:31:32.264614422 +0100
+++ b/net/smc/smc_rx.c  2024-03-25 12:22:31.989913322 +0100
@@ -344,7 +344,7 @@
 int smc_rx_recvmsg(struct smc_sock *smc, struct msghdr *msg,
                    struct pipe_inode_info *pipe, size_t len, int flags)
 {
-       size_t copylen, read_done = 0, read_remaining = len;
+       size_t copylen, read_remaining, read_done = 0;
         size_t chunk_len, chunk_off, chunk_len_sum;
         struct smc_connection *conn = &smc->conn;
         int (*func)(struct smc_connection *conn);
@@ -363,6 +363,10 @@
         sk = &smc->sk;
         if (sk->sk_state == SMC_LISTEN)
                 return -ENOTCONN;
+
+       len = min_t(size_t, len, conn->rmb_desc->len / 2);
+       read_remaining = len;
+
         if (flags & MSG_OOB)
                 return smc_rx_recv_urg(smc, msg, len, flags);
         timeo = sock_rcvtimeo(sk, flags & MSG_DONTWAIT);
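
An application can apply the same cap in user space by never asking for more than RMBE/2 bytes per recv() call, which is what the server does in the good cases above with -m $((128*1024)). The following is a minimal sketch only, not the kernel change itself. The helper names cap_to_half_rmbe and recv_capped are hypothetical, and the caller is assumed to know the RMB size (for example, 256 KiB when it matches net.smc.rmem). We have not measured this variant separately.

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: limit a requested read length to half of the
 * receive buffer size, mirroring the min_t() cap in the patch above.
 * If rmbe_size is too small to halve (0 or 1), len is left unchanged.
 */
size_t cap_to_half_rmbe(size_t len, size_t rmbe_size)
{
    size_t half = rmbe_size / 2;

    return (half > 0 && len > half) ? half : len;
}

/*
 * Hypothetical wrapper around recv(): requests at most rmbe_size / 2
 * bytes per call. rmbe_size is an assumption supplied by the caller;
 * user space cannot read conn->rmb_desc->len directly.
 */
ssize_t recv_capped(int fd, void *buf, size_t len, size_t rmbe_size)
{
    return recv(fd, buf, cap_to_half_rmbe(len, rmbe_size), 0);
}
```

With rmbe_size = 262144, a request for 262144 bytes is reduced to 131072 bytes, while smaller requests pass through unchanged.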

We ran qperf experiments (as before) on the standard SMC-R module [std] (kernel 6.5.0-25-generic), Wen Gu’s proposal [wengu] (i.e. setting force = true), and our proposal [our] (i.e. capping len to RMBE/2).
The measured throughput is shown in subplots (a) in the appended figures.
Additionally, we traced the following tracepoints:

tracepoint:smc:smc_tx_sendmsg{
   @tx_ret = lhist(args->len,0,262144,16384);
}

tracepoint:smc:smc_rx_recvmsg{
   @rx_ret = lhist(args->len,0,262144,16384);
} 

From these traces we calculated the percentage of rx_ret and tx_ret values greater than RMBE/2, shown in subplots (b) and (c), respectively.
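
The percentage can be sketched as a simple count over recorded lengths. This is an illustration only: it assumes the individual len values are available as an array, whereas the traces above aggregate them into 16384-byte histogram buckets, and the function name percent_above is hypothetical.

```c
#include <stddef.h>

/*
 * Hypothetical helper: percentage of recorded lengths that are strictly
 * greater than a threshold (e.g. RMBE/2 = 131072 for a 256 KiB RMBE).
 * Returns 0.0 for an empty input.
 */
double percent_above(const size_t *lens, size_t n, size_t threshold)
{
    size_t above = 0;

    if (n == 0)
        return 0.0;

    for (size_t i = 0; i < n; i++) {
        if (lens[i] > threshold)
            above++;
    }
    return 100.0 * (double)above / (double)n;
}
```

Note that a value exactly equal to RMBE/2 is not counted. With bucketed data, the bucket starting at 131072 cannot distinguish "equal to" from "greater than" RMBE/2.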

As can be observed, there seems to be a correlation between a drop in throughput and rx_ret being greater than RMBE/2.
This is avoided in our proposal, and full throughput is achieved.

We hope that our analysis and interpretation can help to solve the issue with the throughput drops in SMC-R.

P.S. I would like to acknowledge everyone on our team who contributed to the analysis of SMC-R (sorted by last name):
Soumyadeep Debnath
Andreas Görlitz
Costin Iordache
Alexandros Nikolaou
Maik Riestock
Ievgen Tatolov



Best regards

Andreas Goerlitz (SO/PAF1-Mb)

Bosch Service Solutions Magdeburg GmbH | Otto-von-Guericke-Str. 13 | 39104 Magdeburg | GERMANY | www.boschservicesolutions.com
Andreas.Goerlitz@...bosch.com


Registered office: Magdeburg, Registration court: Amtsgericht Stendal, HRB 24039

Managing directors: Robert Mulatz, Georg Wessels
View attachment "client.c" of type "text/x-csrc" (2612 bytes)

Download attachment "results_our.png" of type "image/png" (147628 bytes)

Download attachment "results_std.png" of type "image/png" (165763 bytes)

Download attachment "results_wengu.png" of type "image/png" (171372 bytes)

View attachment "server.c" of type "text/x-csrc" (2403 bytes)