Date: Wed, 17 Apr 2024 09:33:37 +0200
From: Eric Dumazet <edumazet@...gle.com>
To: Geoff Blake <blakgeof@...zon.com>
Cc: Paolo Abeni <pabeni@...hat.com>, Salvatore Dipietro <dipiets@...zon.com>, alisaidi@...zon.com, 
	benh@...zon.com, davem@...emloft.net, dipietro.salvatore@...il.com, 
	dsahern@...nel.org, kuba@...nel.org, netdev@...r.kernel.org
Subject: Re: [PATCH v3] tcp: Add memory barrier to tcp_push()

On Thu, Jan 18, 2024 at 8:30 PM Geoff Blake <blakgeof@...zon.com> wrote:
>
>
>
> On Thu, 18 Jan 2024, Paolo Abeni wrote:
>
> > On Thu, 2024-01-18 at 11:46 -0600, Geoff Blake wrote:
> > >
> > > On Thu, 18 Jan 2024, Paolo Abeni wrote:
> > >
> > > > On Wed, 2024-01-17 at 15:16 -0800, Salvatore Dipietro wrote:
> > > > > On CPUs with weak memory models, the reads and updates tcp_push() performs on
> > > > > the sk variables can be reordered, leaving the socket throttled when it should
> > > > > not be. The tasklet running tcp_wfree() may also not observe the memory updates
> > > > > in time and will skip flushing any packets throttled by tcp_push(), delaying
> > > > > the send. This can pathologically add 40ms of extra latency due to bad
> > > > > interactions with delayed acks.
> > > > >
> > > > > Adding a memory barrier in tcp_push() before the sk_wmem_alloc read removes the
> > > > > bug, similarly to commit bf06200e732d ("tcp: tsq: fix nonagle
> > > > > handling"). smp_mb__after_atomic() is used to avoid unnecessary overhead
> > > > > on x86, which is not affected by this reordering.
> > > > >
> > > > > The patch has been tested on an AWS c7g.2xlarge instance with Ubuntu 22.04 and
> > > > > Apache Tomcat 9.0.83 running the basic servlet below:
> > > > >
> > > > > import java.io.IOException;
> > > > > import java.io.OutputStreamWriter;
> > > > > import java.io.PrintWriter;
> > > > > import javax.servlet.ServletException;
> > > > > import javax.servlet.http.HttpServlet;
> > > > > import javax.servlet.http.HttpServletRequest;
> > > > > import javax.servlet.http.HttpServletResponse;
> > > > >
> > > > > public class HelloWorldServlet extends HttpServlet {
> > > > >     @Override
> > > > >     protected void doGet(HttpServletRequest request, HttpServletResponse response)
> > > > >       throws ServletException, IOException {
> > > > >         response.setContentType("text/html;charset=utf-8");
> > > > >         OutputStreamWriter osw = new OutputStreamWriter(response.getOutputStream(),"UTF-8");
> > > > >         String s = "a".repeat(3096);
> > > > >         osw.write(s,0,s.length());
> > > > >         osw.flush();
> > > > >     }
> > > > > }
> > > > >
> > > > > Load was applied using wrk2 (https://github.com/kinvolk/wrk2) from an AWS
> > > > > c6i.8xlarge instance. Before the patch, an additional 40ms of latency is
> > > > > observed at P99.99+; with the patch, the extra latency disappears.
> > > > >
> > > > > # No patch and tcp_autocorking=1
> > > > > ./wrk -t32 -c128 -d40s --latency -R10000  http://172.31.60.173:8080/hello/hello
> > > > >   ...
> > > > >  50.000%    0.91ms
> > > > >  75.000%    1.13ms
> > > > >  90.000%    1.46ms
> > > > >  99.000%    1.74ms
> > > > >  99.900%    1.89ms
> > > > >  99.990%   41.95ms  <<< 40+ ms extra latency
> > > > >  99.999%   48.32ms
> > > > > 100.000%   48.96ms
> > > > >
> > > > > # With patch and tcp_autocorking=1
> > > > > ./wrk -t32 -c128 -d40s --latency -R10000  http://172.31.60.173:8080/hello/hello
> > > > >   ...
> > > > >  50.000%    0.90ms
> > > > >  75.000%    1.13ms
> > > > >  90.000%    1.45ms
> > > > >  99.000%    1.72ms
> > > > >  99.900%    1.83ms
> > > > >  99.990%    2.11ms  <<< no 40+ ms extra latency
> > > > >  99.999%    2.53ms
> > > > > 100.000%    2.62ms
> > > > >
> > > > > The patch has also been tested on x86 (m7i.2xlarge instance), which is not
> > > > > affected by this issue; there the patch does not introduce any additional
> > > > > delay.
> > > > >
> > > > > Fixes: a181ceb501b3 ("tcp: autocork should not hold first packet in write
> > > > > queue")
> > > >
> > > > Please read the process documentation under
> > > > Documentation/process/ carefully, specifically the netdev-specific bits:
> > > >
> > > > no resubmissions within the 24h grace period.
> > > >
> > > > Please double-check your patch with checkpatch for formal errors: the
> > > > Fixes tag must not be split across multiple lines.
> > > >
> > > > And please do not send a new version in reply to a previous one: it will
> > > > confuse the bot.
> > > >
> > > > > Signed-off-by: Salvatore Dipietro <dipiets@...zon.com>
> > > > > ---
> > > > >  net/ipv4/tcp.c | 1 +
> > > > >  1 file changed, 1 insertion(+)
> > > > >
> > > > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > > > > index ff6838ca2e58..ab9e3922393c 100644
> > > > > --- a/net/ipv4/tcp.c
> > > > > +++ b/net/ipv4/tcp.c
> > > > > @@ -726,6 +726,7 @@ void tcp_push(struct sock *sk, int flags, int mss_now,
> > > > >               /* It is possible TX completion already happened
> > > > >                * before we set TSQ_THROTTLED.
> > > > >                */
> > > > > +             smp_mb__after_atomic();
> > > >
> > > > Out of sheer ignorance I'm wondering if moving such barrier inside the
> > > > above 'if' just after 'set_bit' would suffice?
> > >
> > > According to the herd7 modeling tool, the answer is no for weak memory
> > > models.  If we put the smp_mb inside the if, it allows the machine to
> > > reorder the two reads to sk_wmem_alloc and we can get to the bad state
> > > this patch is fixing.  Placing it outside the if ensures
> > > the ordering between those two reads as well as ordering the write to the
> > > flags variable.
> >
> > For the record, I asked because, reading the documentation, I assumed
> > that smp_mb__after_atomic() enforces ordering WRT the previous RMW
> > operation (in this case 'set_bit'). Is that a wrong interpretation?
> >
> > Now I wonder if the patch is enough. Specifically, would it work
> > correctly when the TSQ_THROTTLED bit is already set and there is no RMW
> > operation between the two refcount_read() calls?
> >
>
> I fat-fingered the model in the previous reply, sorry for the confusion.
> It appears we can put the barrier inside the if check and be fine.  For
> x86, reads are not reordered with other reads, according to my
> understanding of its memory model, so the code works as is, without the
> barrier, whether TSQ_THROTTLED is already set or not.
>
> For arm64, we need to generate a dmb.ish instruction after the store.  I
> can run a test to make sure the model is correct, and we can then move the
> barrier to inside the if, after the atomic bit set.

I read this code again (in 'fixed' kernels) and could not convince
myself it is bug free.

Without a value-returning RMW or barriers, this all boils down to bare
reads, and all these reads could be ordered differently.

I am thinking of making the following change to be on the cautious side.

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index f23b97777ea5e93fc860d3f225e464f9376cfe09..2e768aab39128cc8aebc2dbd65406f3bccf449f7
100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -722,12 +722,9 @@ void tcp_push(struct sock *sk, int flags, int mss_now,

        if (tcp_should_autocork(sk, skb, size_goal)) {

-               /* avoid atomic op if TSQ_THROTTLED bit is already set */
-               if (!test_bit(TSQ_THROTTLED, &sk->sk_tsq_flags)) {
+               if (!test_and_set_bit(TSQ_THROTTLED, &sk->sk_tsq_flags))
                        NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPAUTOCORKING);
-                       set_bit(TSQ_THROTTLED, &sk->sk_tsq_flags);
-                       smp_mb__after_atomic();
-               }
+
                /* It is possible TX completion already happened
                 * before we set TSQ_THROTTLED.
                 */
