[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <41792c55-6018-f331-315e-912438812df7@amazon.com>
Date: Thu, 18 Jan 2024 13:30:21 -0600
From: Geoff Blake <blakgeof@...zon.com>
To: Paolo Abeni <pabeni@...hat.com>
CC: Salvatore Dipietro <dipiets@...zon.com>, <edumazet@...gle.com>,
<alisaidi@...zon.com>, <benh@...zon.com>, <davem@...emloft.net>,
<dipietro.salvatore@...il.com>, <dsahern@...nel.org>, <kuba@...nel.org>,
<netdev@...r.kernel.org>
Subject: RE: [PATCH v3] tcp: Add memory barrier to tcp_push()
On Thu, 18 Jan 2024, Paolo Abeni wrote:
> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
>
>
>
> On Thu, 2024-01-18 at 11:46 -0600, Geoff Blake wrote:
> >
> > On Thu, 18 Jan 2024, Paolo Abeni wrote:
> >
> > > CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
> > >
> > >
> > >
> > > On Wed, 2024-01-17 at 15:16 -0800, Salvatore Dipietro wrote:
> > > > On CPUs with weak memory models, reads and updates performed by tcp_push to the
> > > > sk variables can get reordered leaving the socket throttled when it should not.
> > > > The tasklet running tcp_wfree() may also not observe the memory updates in time
> > > > and will skip flushing any packets throttled by tcp_push(), delaying the sending.
> > > > This can pathologically cause 40ms extra latency due to bad interactions with
> > > > delayed acks.
> > > >
> > > > Adding a memory barrier in tcp_push before the sk_wmem_alloc read removes the
> > > > bug, similarly to the previous commit bf06200e732d ("tcp: tsq: fix nonagle
> > > > handling"). smp_mb__after_atomic() is used to not incur in unnecessary overhead
> > > > on x86 since not affected.
> > > >
> > > > Patch has been tested using an AWS c7g.2xlarge instance with Ubuntu 22.04 and
> > > > Apache Tomcat 9.0.83 running the basic servlet below:
> > > >
> > > > import java.io.IOException;
> > > > import java.io.OutputStreamWriter;
> > > > import java.io.PrintWriter;
> > > > import javax.servlet.ServletException;
> > > > import javax.servlet.http.HttpServlet;
> > > > import javax.servlet.http.HttpServletRequest;
> > > > import javax.servlet.http.HttpServletResponse;
> > > >
> > > > public class HelloWorldServlet extends HttpServlet {
> > > > @Override
> > > > protected void doGet(HttpServletRequest request, HttpServletResponse response)
> > > > throws ServletException, IOException {
> > > > response.setContentType("text/html;charset=utf-8");
> > > > OutputStreamWriter osw = new OutputStreamWriter(response.getOutputStream(),"UTF-8");
> > > > String s = "a".repeat(3096);
> > > > osw.write(s,0,s.length());
> > > > osw.flush();
> > > > }
> > > > }
> > > >
> > > > Load was applied using wrk2 (https://github.com/kinvolk/wrk2) from an AWS
> > > > c6i.8xlarge instance. Before the patch an additional 40ms latency from P99.99+
> > > > values is observed while, with the patch, the extra latency disappears.
> > > >
> > > > # No patch and tcp_autocorking=1
> > > > ./wrk -t32 -c128 -d40s --latency -R10000 http://172.31.60.173:8080/hello/hello
> > > > ...
> > > > 50.000% 0.91ms
> > > > 75.000% 1.13ms
> > > > 90.000% 1.46ms
> > > > 99.000% 1.74ms
> > > > 99.900% 1.89ms
> > > > 99.990% 41.95ms <<< 40+ ms extra latency
> > > > 99.999% 48.32ms
> > > > 100.000% 48.96ms
> > > >
> > > > # With patch and tcp_autocorking=1
> > > > ./wrk -t32 -c128 -d40s --latency -R10000 http://172.31.60.173:8080/hello/hello
> > > > ...
> > > > 50.000% 0.90ms
> > > > 75.000% 1.13ms
> > > > 90.000% 1.45ms
> > > > 99.000% 1.72ms
> > > > 99.900% 1.83ms
> > > > 99.990% 2.11ms <<< no 40+ ms extra latency
> > > > 99.999% 2.53ms
> > > > 100.000% 2.62ms
> > > >
> > > > Patch has been also tested on x86 (m7i.2xlarge instance) which it is not
> > > > affected by this issue and the patch doesn't introduce any additional
> > > > delay.
> > > >
> > > > Fixes: a181ceb501b3 ("tcp: autocork should not hold first packet in write
> > > > queue")
> > >
> > > Please read carefully the process documentation under
> > > Documentation/process/ and specifically the netdev specific bits:
> > >
> > > no resubmissions within the 24h grace period.
> > >
> > > Please double-check your patch with checkpatch for formal errors: the
> > > fixes tag must not be split across multiple lines.
> > >
> > > And please do not sent new version in reply to a previous one: it will
> > > confuse the bot.
> > >
> > > > Signed-off-by: Salvatore Dipietro <dipiets@...zon.com>
> > > > ---
> > > > net/ipv4/tcp.c | 1 +
> > > > 1 file changed, 1 insertion(+)
> > > >
> > > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > > > index ff6838ca2e58..ab9e3922393c 100644
> > > > --- a/net/ipv4/tcp.c
> > > > +++ b/net/ipv4/tcp.c
> > > > @@ -726,6 +726,7 @@ void tcp_push(struct sock *sk, int flags, int mss_now,
> > > > /* It is possible TX completion already happened
> > > > * before we set TSQ_THROTTLED.
> > > > */
> > > > + smp_mb__after_atomic();
> > >
> > > Out of sheer ignorance I'm wondering if moving such barrier inside the
> > > above 'if' just after 'set_bit' would suffice?
> >
> > According to the herd7 modeling tool, the answer is no for weak memory
> > models. If we put the smp_mb inside the if, it allows the machine to
> > reorder the two reads to sk_wmem_alloc and we can get to the bad state
> > this patch is fixing. Placing it outside the if ensures
> > the ordering between those two reads as well as ordering the write to the
> > flags variable.
>
> For the records, I asked because reading the documentation I assumed
> that smp_mb__after_atomic() enforces the ordering WRT the previous RMW
> operation (in this case 'set_bit'). Is this a wrong interpretation?
>
> Now I wonder if the patch is enough. Specifically, would it work
> correctly when the TSQ_THROTTLED bit is already set and there is no RMW
> operation in between the two refcount_read()?
>
I fat fingered the model in the previous reply, sorry for the confusion.
It appears we can put the barrier inside the if check and be fine. For
the x86 case reads won't pass other reads according to my understanding
of their memory model which makes this work as is without the barrier to
begin with for when TSQ_THROTTLED is already
set or not set.
For arm64, we need to generate a dmb.ish instruction after the store. Can
run a test to make sure the model is correct and can move the barrier
to inside the if after the atomic bitset.
Thanks,
Geoff
Powered by blists - more mailing lists