linux-kernel - Re: [PATCH] Add provision to busyloop for events in ep

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <CAMP57yWQGKnHcn3gkPvz1bvPO=+VTvyMJ5OHZpp=WYX=CBhZvA@mail.gmail.com>
Date: Tue, 10 Sep 2024 10:41:21 -0700
From: Naman Gulati <namangulati@...gle.com>
To: Martin Karsten <mkarsten@...terloo.ca>
Cc: Joe Damato <jdamato@...tly.com>, Alexander Viro <viro@...iv.linux.org.uk>, 
	Christian Brauner <brauner@...nel.org>, Jan Kara <jack@...e.cz>, 
	"David S. Miller" <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>, netdev@...r.kernel.org, 
	Stanislav Fomichev <sdf@...ichev.me>, linux-kernel@...r.kernel.org, skhawaja@...gle.com, 
	Willem de Bruijn <willemdebruijn.kernel@...il.com>
Subject: Re: [PATCH] Add provision to busyloop for events in ep_poll.

On Wed, Sep 4, 2024 at 5:46 AM Martin Karsten <mkarsten@...terloo.ca> wrote:
>
> On 2024-09-04 01:52, Naman Gulati wrote:
> > Thanks all for the comments and apologies for the delay in replying.
> > Stan and Joe I’ve addressed some of the common concerns below.
> >
> > On Thu, Aug 29, 2024 at 3:40 AM Joe Damato <jdamato@...tly.com> wrote:
> >>
> >> On Wed, Aug 28, 2024 at 06:10:11PM +0000, Naman Gulati wrote:
> >>> NAPI busypolling in ep_busy_loop loops on napi_poll and checks for new
> >>> epoll events after every napi poll. Checking just for epoll events in a
> >>> tight loop in the kernel context delivers latency gains to applications
> >>> that are not interested in napi busypolling with epoll.
> >>>
> >>> This patch adds an option to loop just for new events inside
> >>> ep_busy_loop, guarded by the EPIOCSPARAMS ioctl that controls epoll napi
> >>> busypolling.
> >>
> >> This makes an API change, so I think that linux-api@...r.kernel.org
> >> needs to be CC'd ?
> >>
> >>> A comparison with neper tcp_rr shows that busylooping for events in
> >>> epoll_wait boosted throughput by ~3-7% and reduced median latency by
> >>> ~10%.
> >>>
> >>> To demonstrate the latency and throughput improvements, a comparison was
> >>> made of neper tcp_rr running with:
> >>>      1. (baseline) No busylooping
> >>
> >> Is there NAPI-based steering to threads via SO_INCOMING_NAPI_ID in
> >> this case? More details, please, on locality. If there is no
> >> NAPI-based flow steering in this case, perhaps the improvements you
> >> are seeing are a result of both syscall overhead avoidance and data
> >> locality?
> >>
> >
> > The benchmarks were run with no NAPI steering.
> >
> > Regarding syscall overhead, I reproduced the above experiment with
> > mitigations=off
> > and found similar results as above. Pointing to the fact that the
> > above gains are
> > materialized from more than just avoiding syscall overhead.
>
> I suppose the natural follow-up questions are:
>
> 1) Where do the gains come from? and
>
> 2) Would they materialize with a realistic application?
>
> System calls have some overhead even with mitigations=off. In fact I
> understand on modern CPUs security mitigations are not that expensive to
> begin with? In a micro-benchmark that does nothing else but bouncing
> packets back and forth, this overhead might look more significant than
> in a realistic application?
>
> It seems your change does not eliminate any processing from each
> packet's path, but instead eliminates processing in between packet
> arrivals? This might lead to a small latency improvement, which might
> turn into a small throughput improvement in these micro-benchmarks, but
> that might quickly evaporate when an application has actual work to do
> in between packet arrivals.

This is a good point, and I was able to confirm this. I profiled the
changes in the
patch by fixing the number of threads and flows but scaling message sizes with
tcp_rr, using the notion that creating and processing large messages in tcp_rr
would take more time. As the message size increases from 1 B to MSS (4KB
in my setup), I found that the difference in latency and throughput diminishes
between looping inside epoll vs looping on nonblocking epoll_wait in userspace.

Understandably, as the message sizes increase the application becomes the
bottleneck and the syscall overhead becomes marginal to the whole cost of the
operation.

I also found that looping inside epoll yields latency and throughput
improvements again when message sizes increase past MSS. I believe this can
be rationalized as the cost of processing the packet in the application is then
amortized over the multiple transmitted segments and the system call overhead
becomes more prominent again.

This is some rough data showing the above
Setup: 5 threads on both client and server, 30 flows, mitigations=off,
both server
and client using the same request/response size

Looping inside epoll:
Message Size  Throughput  Latency P50  Latency P90  Latency P99  Latency P99.9
 1 B                   543971         57                 76
      93                  106
 250 B               501245         60                 77
    97                  109
 500 B               494467         60                 77
    93                  111
 1 KB                 486412         60                 77
     97                  114
 2 KB                 385125         77                 96
     114                123
 4 KB                 378612         78                 97
     119                129
 8 KB                 349214         83                109
    125                137
 16 KB               379276         156               202
  243                274
Looping in userspace:
Message Size  Throughput  Latency P50  Latency P90  Latency P99  Latency P99.9
 1 B                   496296         59                 76
      95                   109
 250 B               468840         67                 77
    97                   111
 500 B               476804         61                 78
    97                   110
 1 KB                 464273         65                 79
    100                  115
 2 KB                 388334         76                 97
    114                  122
 4 KB                 377851         79                 98
    118                  124
 8 KB                 333718         91                115
   128                  141
 16 KB               354708         157               253
 307                  343

I also examined the perf traces for both looping setups and compared the
overhead delta between the invocation of epoll_wait in glibc and the invocation
of do_epoll_wait in the kernel to measure just the overhead of calling the
system call. With 1 B messages, looping in userspace had a higher overhead
in CPU cycles for invoking the syscall compared to looping inside epoll, however
the overhead gap also shrinks as the message sizes increase and the syscall
overhead becomes increasingly marginal.

I believe testing with a benchmark like memcached and using napi steering
would confirm the same results, and I recognize now that most regular workloads
won’t benefit from this patch.

>
> It would be good to know a little more about your experiments. You are
> referring to 5 threads, but does that mean 5 cores were busy on both
> client and server during the experiment? Which of client or server is
> the bottleneck? In your baseline experiment, are all 5 server cores
> busy? How many RX queues are in play and how is interrupt routing
> configured?

Apologies, should have been clearer in the description. The server and client
were both using 5 threads to handle the connections without any CPU pinning.
I did however confirm that all threads used distinct cores from
scheduling traces
and there was no contention.
Both hosts had 32 queues with a napi instance per queue.

>
> Thanks,
> Martin
>
>

Given the above analysis it doesn’t make sense adding the extra knobs to the
epoll interface for an optimization that's not widely applicable, therefore this
patch can be considered as not needed.
Nonetheless, appreciate the feedback Joe and Martin.