lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <82bba704bf0019cc551310c563623092bf01ef8d.camel@redhat.com>
Date: Thu, 18 Jan 2024 19:30:19 +0100
From: Paolo Abeni <pabeni@...hat.com>
To: Geoff Blake <blakgeof@...zon.com>
Cc: Salvatore Dipietro <dipiets@...zon.com>, edumazet@...gle.com, 
	alisaidi@...zon.com, benh@...zon.com, davem@...emloft.net, 
	dipietro.salvatore@...il.com, dsahern@...nel.org, kuba@...nel.org, 
	netdev@...r.kernel.org
Subject: Re: [PATCH v3] tcp: Add memory barrier to tcp_push()

On Thu, 2024-01-18 at 11:46 -0600, Geoff Blake wrote:
> 
> On Thu, 18 Jan 2024, Paolo Abeni wrote:
> 
> > CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
> > 
> > 
> > 
> > On Wed, 2024-01-17 at 15:16 -0800, Salvatore Dipietro wrote:
> > > On CPUs with weak memory models, reads and updates performed by tcp_push to the
> > > sk variables can get reordered leaving the socket throttled when it should not.
> > > The tasklet running tcp_wfree() may also not observe the memory updates in time
> > > and will skip flushing any packets throttled by tcp_push(), delaying the sending.
> > > This can pathologically cause 40ms extra latency due to bad interactions with
> > > delayed acks.
> > > 
> > > Adding a memory barrier in tcp_push before the sk_wmem_alloc read removes the
> > > bug, similarly to the previous commit bf06200e732d ("tcp: tsq: fix nonagle
> > > handling"). smp_mb__after_atomic() is used to not incur in unnecessary overhead
> > > on x86 since not affected.
> > > 
> > > Patch has been tested using an AWS c7g.2xlarge instance with Ubuntu 22.04 and
> > > Apache Tomcat 9.0.83 running the basic servlet below:
> > > 
> > > import java.io.IOException;
> > > import java.io.OutputStreamWriter;
> > > import java.io.PrintWriter;
> > > import javax.servlet.ServletException;
> > > import javax.servlet.http.HttpServlet;
> > > import javax.servlet.http.HttpServletRequest;
> > > import javax.servlet.http.HttpServletResponse;
> > > 
> > > public class HelloWorldServlet extends HttpServlet {
> > >     @Override
> > >     protected void doGet(HttpServletRequest request, HttpServletResponse response)
> > >       throws ServletException, IOException {
> > >         response.setContentType("text/html;charset=utf-8");
> > >         OutputStreamWriter osw = new OutputStreamWriter(response.getOutputStream(),"UTF-8");
> > >         String s = "a".repeat(3096);
> > >         osw.write(s,0,s.length());
> > >         osw.flush();
> > >     }
> > > }
> > > 
> > > Load was applied using wrk2 (https://github.com/kinvolk/wrk2) from an AWS
> > > c6i.8xlarge instance. Before the patch an additional 40ms latency from P99.99+
> > > values is observed while, with the patch, the extra latency disappears.
> > > 
> > > # No patch and tcp_autocorking=1
> > > ./wrk -t32 -c128 -d40s --latency -R10000  http://172.31.60.173:8080/hello/hello
> > >   ...
> > >  50.000%    0.91ms
> > >  75.000%    1.13ms
> > >  90.000%    1.46ms
> > >  99.000%    1.74ms
> > >  99.900%    1.89ms
> > >  99.990%   41.95ms  <<< 40+ ms extra latency
> > >  99.999%   48.32ms
> > > 100.000%   48.96ms
> > > 
> > > # With patch and tcp_autocorking=1
> > > ./wrk -t32 -c128 -d40s --latency -R10000  http://172.31.60.173:8080/hello/hello
> > >   ...
> > >  50.000%    0.90ms
> > >  75.000%    1.13ms
> > >  90.000%    1.45ms
> > >  99.000%    1.72ms
> > >  99.900%    1.83ms
> > >  99.990%    2.11ms  <<< no 40+ ms extra latency
> > >  99.999%    2.53ms
> > > 100.000%    2.62ms
> > > 
> > > Patch has been also tested on x86 (m7i.2xlarge instance) which it is not
> > > affected by this issue and the patch doesn't introduce any additional
> > > delay.
> > > 
> > > Fixes: a181ceb501b3 ("tcp: autocork should not hold first packet in write
> > > queue")
> > 
> > Please read carefully the process documentation under
> > Documentation/process/ and specifically the netdev specific bits:
> > 
> > no resubmissions within the 24h grace period.
> > 
> > Please double-check your patch with checkpatch for formal errors: the
> > fixes tag must not be split across multiple lines.
> > 
> > And please do not sent new version in reply to a previous one: it will
> > confuse the bot.
> > 
> > > Signed-off-by: Salvatore Dipietro <dipiets@...zon.com>
> > > ---
> > >  net/ipv4/tcp.c | 1 +
> > >  1 file changed, 1 insertion(+)
> > > 
> > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > > index ff6838ca2e58..ab9e3922393c 100644
> > > --- a/net/ipv4/tcp.c
> > > +++ b/net/ipv4/tcp.c
> > > @@ -726,6 +726,7 @@ void tcp_push(struct sock *sk, int flags, int mss_now,
> > >               /* It is possible TX completion already happened
> > >                * before we set TSQ_THROTTLED.
> > >                */
> > > +             smp_mb__after_atomic();
> > 
> > Out of sheer ignorance I'm wondering if moving such barrier inside the
> > above 'if' just after 'set_bit' would suffice?
> 
> According to the herd7 modeling tool, the answer is no for weak memory 
> models.  If we put the smp_mb inside the if, it allows the machine to 
> reorder the two reads to sk_wmem_alloc and we can get to the bad state 
> this patch is fixing.  Placing it outside the if ensures 
> the ordering between those two reads as well as ordering the write to the 
> flags variable.

For the records, I asked because reading the documentation  I assumed
that smp_mb__after_atomic() enforces the ordering WRT the previous RMW
operation (in this case 'set_bit'). Is this a wrong interpretation?

Now I wonder if the patch is enough. Specifically, would it work
correctly when the TSQ_THROTTLED bit is already set and there is no RMW
operation in between the two refcount_read()?

Thanks,

Paolo


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ