netdev - Re: [PATCH] net/smc: Add autocork support

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <68e9534b-7ff5-5a65-9017-124dbae0c74b@linux.ibm.com>
Date:   Wed, 16 Feb 2022 14:58:32 +0100
From:   Stefan Raspl <raspl@...ux.ibm.com>
To:     Dust Li <dust.li@...ux.alibaba.com>,
        Karsten Graul <kgraul@...ux.ibm.com>,
        Tony Lu <tonylu@...ux.alibaba.com>
Cc:     kuba@...nel.org, davem@...emloft.net, netdev@...r.kernel.org,
        linux-s390@...r.kernel.org, linux-rdma@...r.kernel.org
Subject: Re: [PATCH] net/smc: Add autocork support

On 2/16/22 04:49, Dust Li wrote:
> This patch adds autocork support for SMC which could improve
> throughput for small message by x2 ~ x4.
> 
> The main idea is borrowed from TCP autocork with some RDMA
> specific modification:
> 1. The first message should never cork to make sure we won't
>     bring extra latency
> 2. If we have posted any Tx WRs to the NIC that have not
>     completed, cork the new messages until:
>     a) Receive CQE for the last Tx WR
>     b) We have corked enough message on the connection
> 3. Try to push the corked data out when we receive CQE of
>     the last Tx WR to prevent the corked messages hang in
>     the send queue.
> 
> Both SMC autocork and TCP autocork check the TX completion
> to decide whether we should cork or not. The difference is
> when we got a SMC Tx WR completion, the data have been confirmed
> by the RNIC while TCP TX completion just tells us the data
> have been sent out by the local NIC.
> 
> Add an atomic variable tx_pushing in smc_connection to make
> sure only one can send to let it cork more and save CDC slot.
> 
> SMC autocork should not bring extra latency since the first
> message will always been sent out immediately.
> 
> The qperf tcp_bw test shows more than x4 increase under small
> message size with Mellanox connectX4-Lx, same result with other
> throughput benchmarks like sockperf/netperf.
> The qperf tcp_lat test shows SMC autocork has not increase any
> ping-pong latency.
> 
> BW test:
>   client: smc_run taskset -c 1 qperf smc-server -oo msg_size:1:64K:*2 \
> 			-t 30 -vu tcp_bw
>   server: smc_run taskset -c 1 qperf
> 
> MsgSize(Bytes)        TCP         SMC-NoCork           SMC-AutoCork
>        1         2.57 MB/s     698 KB/s(-73.5%)     2.98 MB/s(16.0% )
>        2          5.1 MB/s    1.41 MB/s(-72.4%)     5.82 MB/s(14.1% )
>        4         10.2 MB/s    2.83 MB/s(-72.3%)     11.7 MB/s(14.7% )
>        8         20.8 MB/s    5.62 MB/s(-73.0%)     22.9 MB/s(10.1% )
>       16         42.5 MB/s    11.5 MB/s(-72.9%)     45.5 MB/s(7.1%  )
>       32         80.7 MB/s    22.3 MB/s(-72.4%)     86.7 MB/s(7.4%  )
>       64          155 MB/s    45.6 MB/s(-70.6%)      160 MB/s(3.2%  )
>      128          295 MB/s    90.1 MB/s(-69.5%)      273 MB/s(-7.5% )
>      256          539 MB/s     179 MB/s(-66.8%)      610 MB/s(13.2% )
>      512          943 MB/s     360 MB/s(-61.8%)     1.02 GB/s(10.8% )
>     1024         1.58 GB/s     710 MB/s(-56.1%)     1.91 GB/s(20.9% )
>     2048         2.47 GB/s    1.34 GB/s(-45.7%)     2.92 GB/s(18.2% )
>     4096         2.86 GB/s     2.5 GB/s(-12.6%)      2.4 GB/s(-16.1%)
>     8192         3.89 GB/s    3.14 GB/s(-19.3%)     4.05 GB/s(4.1%  )
>    16384         3.29 GB/s    4.67 GB/s(41.9% )     5.09 GB/s(54.7% )
>    32768         2.73 GB/s    5.48 GB/s(100.7%)     5.49 GB/s(101.1%)
>    65536            3 GB/s    4.85 GB/s(61.7% )     5.24 GB/s(74.7% )
> 
> Latency test:
>   client: smc_run taskset -c 1 qperf smc-server -oo msg_size:1:64K:*2 \
> 			-t 30 -vu tcp_lat
>   server: smc_run taskset -c 1 qperf
> 
>   MsgSize              SMC-NoCork           SMC-AutoCork
>         1               9.7 us               9.6 us( -1.03%)
>         2              9.43 us              9.39 us( -0.42%)
>         4               9.6 us              9.35 us( -2.60%)
>         8              9.42 us               9.2 us( -2.34%)
>        16              9.13 us              9.43 us(  3.29%)
>        32              9.19 us               9.5 us(  3.37%)
>        64              9.38 us               9.5 us(  1.28%)
>       128               9.9 us              9.29 us( -6.16%)
>       256              9.42 us              9.26 us( -1.70%)
>       512                10 us              9.45 us( -5.50%)
>      1024              10.4 us               9.6 us( -7.69%)
>      2048              10.4 us              10.2 us( -1.92%)
>      4096                11 us              10.5 us( -4.55%)
>      8192              11.7 us              11.8 us(  0.85%)
>     16384              14.5 us              14.2 us( -2.07%)
>     32768              19.4 us              19.3 us( -0.52%)
>     65536              28.1 us              28.8 us(  2.49%)
> 
> With SMC autocork support, we can archive better throughput than
> TCP in most message sizes without any latency tradeoff.
> 
> Signed-off-by: Dust Li <dust.li@...ux.alibaba.com>
> ---
>   net/smc/smc.h     |   2 +
>   net/smc/smc_cdc.c |  11 +++--
>   net/smc/smc_tx.c  | 118 ++++++++++++++++++++++++++++++++++++++++------
>   3 files changed, 114 insertions(+), 17 deletions(-)
> 
> diff --git a/net/smc/smc.h b/net/smc/smc.h
> index a096d8af21a0..bc7df235281c 100644
> --- a/net/smc/smc.h
> +++ b/net/smc/smc.h
> @@ -192,6 +192,8 @@ struct smc_connection {
>   						 * - dec on polled tx cqe
>   						 */
>   	wait_queue_head_t	cdc_pend_tx_wq; /* wakeup on no cdc_pend_tx_wr*/
> +	atomic_t		tx_pushing;     /* nr_threads trying tx push */
> +
>   	struct delayed_work	tx_work;	/* retry of smc_cdc_msg_send */
>   	u32			tx_off;		/* base offset in peer rmb */
>   
> diff --git a/net/smc/smc_cdc.c b/net/smc/smc_cdc.c
> index 9d5a97168969..2b37bec90824 100644
> --- a/net/smc/smc_cdc.c
> +++ b/net/smc/smc_cdc.c
> @@ -48,9 +48,14 @@ static void smc_cdc_tx_handler(struct smc_wr_tx_pend_priv *pnd_snd,
>   		conn->tx_cdc_seq_fin = cdcpend->ctrl_seq;
>   	}
>   
> -	if (atomic_dec_and_test(&conn->cdc_pend_tx_wr) &&
> -	    unlikely(wq_has_sleeper(&conn->cdc_pend_tx_wq)))
> -		wake_up(&conn->cdc_pend_tx_wq);
> +	if (atomic_dec_and_test(&conn->cdc_pend_tx_wr)) {
> +		/* If this is the last pending WR complete, we must push to
> +		 * prevent hang when autocork enabled.
> +		 */
> +		smc_tx_sndbuf_nonempty(conn);
> +		if (unlikely(wq_has_sleeper(&conn->cdc_pend_tx_wq)))
> +			wake_up(&conn->cdc_pend_tx_wq);
> +	}
>   	WARN_ON(atomic_read(&conn->cdc_pend_tx_wr) < 0);
>   
>   	smc_tx_sndbuf_nonfull(smc);
> diff --git a/net/smc/smc_tx.c b/net/smc/smc_tx.c
> index 5df3940d4543..bc737ac79805 100644
> --- a/net/smc/smc_tx.c
> +++ b/net/smc/smc_tx.c
> @@ -31,6 +31,7 @@
>   #include "smc_tracepoint.h"
>   
>   #define SMC_TX_WORK_DELAY	0
> +#define SMC_DEFAULT_AUTOCORK_SIZE	(64 * 1024)

Probably a matter of taste, but why not use hex here?

>   
>   /***************************** sndbuf producer *******************************/
>   
> @@ -127,10 +128,52 @@ static int smc_tx_wait(struct smc_sock *smc, int flags)
>   static bool smc_tx_is_corked(struct smc_sock *smc)
>   {
>   	struct tcp_sock *tp = tcp_sk(smc->clcsock->sk);
> -
>   	return (tp->nonagle & TCP_NAGLE_CORK) ? true : false;
>   }

Can you drop this line elimination?


> +/* If we have pending CDC messages, do not send:
> + * Because CQE of this CDC message will happen shortly, it gives
> + * a chance to coalesce future sendmsg() payload in to one RDMA Write,
> + * without need for a timer, and with no latency trade off.
> + * Algorithm here:
> + *  1. First message should never cork
> + *  2. If we have pending CDC messages, wait for the first
> + *     message's completion
> + *  3. Don't cork to much data in a single RDMA Write to prevent burst,
> + *     total corked message should not exceed min(64k, sendbuf/2)

I assume the 64k is incurred from IP as used by RoCEv2?


> + */
> +static bool smc_should_autocork(struct smc_sock *smc, struct msghdr *msg,
> +				int size_goal)
> +{
> +	struct smc_connection *conn = &smc->conn;
> +
> +	if (atomic_read(&conn->cdc_pend_tx_wr) == 0 ||
> +	    smc_tx_prepared_sends(conn) > min(size_goal,
> +					      conn->sndbuf_desc->len >> 1))
> +		return false;
> +	return true;
> +}
> +
> +static bool smc_tx_should_cork(struct smc_sock *smc, struct msghdr *msg)
> +{
> +	struct smc_connection *conn = &smc->conn;
> +
> +	if (smc_should_autocork(smc, msg, SMC_DEFAULT_AUTOCORK_SIZE))
> +		return true;

Are there any fixed plans to make SMC_DEFAULT_AUTOCORK dynamic...? 'cause 
otherwise we could simply eliminate this parameter, and use the define within 
smc_should_autocork() instead.


Ciao,
Stefan