netdev - Re: [PATCH v2] tcp: Add memory barrier to tcp

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CANn89i+XkcQV6_=ysKACN+JQM=P7SqbfTvhxF+jSwd=MJ6t0sw@mail.gmail.com>
Date: Wed, 17 Jan 2024 22:58:57 +0100
From: Eric Dumazet <edumazet@...gle.com>
To: Salvatore Dipietro <dipiets@...zon.com>
Cc: alisaidi@...zon.com, benh@...zon.com, blakgeof@...zon.com, 
	davem@...emloft.net, dipietro.salvatore@...il.com, dsahern@...nel.org, 
	kuba@...nel.org, netdev@...r.kernel.org, pabeni@...hat.com
Subject: Re: [PATCH v2] tcp: Add memory barrier to tcp_push()

On Wed, Jan 17, 2024 at 10:29 PM Salvatore Dipietro <dipiets@...zon.com> wrote:
>
> On CPUs with weak memory models, reads and updates performed by tcp_push to the
> sk variables can get reordered leaving the socket throttled when it should not.
> The tasklet running tcp_wfree() may also not observe the memory updates in time
> and will skip flushing any packets throttled by tcp_push(), delaying the sending.
> This can pathologically cause 40ms extra latency due to bad interactions with
> delayed acks.
>
> Modeling the memory access behavior of tcp_push() (P0) and tcp_wfree() (P1)
> using the herd7 simulator, proves this behavior can occur. Below is the litmus
> model which describes the functions:
> ```
> C MP+tcp
> {
>   [flag] = 0;
>   [sk] = 5;
>   [corked] = 0;
> }
>
> P0(int *flag, int *sk, int *corked){
>     int r0;
>     int r1;
>     int r2;
>
>     r1 = READ_ONCE(*sk);
>     if (r1 == 5) {
>
>         r0 = READ_ONCE(*flag);
>         if (r0 == 0) {
>             WRITE_ONCE(*flag, 1);
>         }
>
>         // memory barrier added in this patch,
>         // original code does not order the reads/writes
>         smp_mb();
>
>         r2 = READ_ONCE(*sk);
>         if (r2 == 5 ) {
>             WRITE_ONCE(*corked,1);
>         }
>     }
> }
>

Interesting.

Quite frankly I doubt this herd7 stuff needs to be in a changelog,
this is too verbose/pedantic.

What about instead referring to similar commit bf06200e732d
("tcp: tsq: fix nonagle handling")

This was exactly the same issue, but at that time tcp_push() was not
re-reading sk->sk_wmem_alloc




> P1(int *flag, int *sk, int *corked){
>     int r0;
>     int r1;
>
>     r1 = READ_ONCE(*sk);
>     smp_store_release(sk, 0);
>
>     r0 = smp_load_acquire(flag);
>     if (r0 == 1) {
>         smp_store_release(flag, 0);
>     }
> }
> locations [0:r0; 0:r1; 0:r2; 1:r0; 1:r1; flag; sk; corked; ]
> exists ( flag=1 /\ corked=1 )
> ```
>
> Adding the memory barrier removes the positive witness from the memory model.
> smp_mb__after_atomic() is used to not incur in unnecessary overhead on x86
> since not affected.
> Patch has been tested using an AWS c7g.2xlarge instance with Ubuntu 22.04 and
> Apache Tomcat 9.0.83 running the basic servlet below:
> ```
> import java.io.IOException;
> import java.io.OutputStreamWriter;
> import java.io.PrintWriter;
> import javax.servlet.ServletException;
> import javax.servlet.http.HttpServlet;
> import javax.servlet.http.HttpServletRequest;
> import javax.servlet.http.HttpServletResponse;
>
> public class HelloWorldServlet extends HttpServlet {
>     @Override
>     protected void doGet(HttpServletRequest request, HttpServletResponse response)
>       throws ServletException, IOException {
>         response.setContentType("text/html;charset=utf-8");
>         OutputStreamWriter osw = new OutputStreamWriter(response.getOutputStream(),"UTF-8");
>         String s = "a".repeat(3096);
>         osw.write(s,0,s.length());
>         osw.flush();
>     }
> }
> ```
> Load was applied using wrk2 (https://github.com/kinvolk/wrk2) from an AWS
> c6i.8xlarge instance. Before the patch an additional 40ms latency from P99.99+
> values is observed while, with the patch, the extra latency disappears.
>
> # No patch and tcp_autocorking=1
> ./wrk -t32 -c128 -d40s --latency -R10000  http://172.31.60.173:8080/hello/hello
>   ...
>  50.000%    0.91ms
>  75.000%    1.13ms
>  90.000%    1.46ms
>  99.000%    1.74ms
>  99.900%    1.89ms
>  99.990%   41.95ms  <<< 40+ ms extra latency
>  99.999%   48.32ms
> 100.000%   48.96ms
>
> # With patch and tcp_autocorking=1
> ./wrk -t32 -c128 -d40s --latency -R10000  http://172.31.60.173:8080/hello/hello
>   ...
>  50.000%    0.90ms
>  75.000%    1.13ms
>  90.000%    1.45ms
>  99.000%    1.72ms
>  99.900%    1.83ms
>  99.990%    2.11ms  <<< no 40+ ms extra latency
>  99.999%    2.53ms
> 100.000%    2.62ms
>
> Patch has been also tested on x86 (m7i.2xlarge instance) which it is not
> affected by this issue and the patch doesn't introduce any additional
> delay.
>
> Fixes: f54b311142a9 ("tcp: auto corking")

Bug came with a181ceb501b3 ("tcp: autocork should not hold first
packet in write queue")



> Signed-off-by: Salvatore Dipietro <dipiets@...zon.com>
> ---
>  net/ipv4/tcp.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index ff6838ca2e58..ab9e3922393c 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -726,6 +726,7 @@ void tcp_push(struct sock *sk, int flags, int mss_now,
>                 /* It is possible TX completion already happened
>                  * before we set TSQ_THROTTLED.
>                  */
> +               smp_mb__after_atomic();
>                 if (refcount_read(&sk->sk_wmem_alloc) > skb->truesize)
>                         return;
>         }
> --
> 2.42.0
>