Message-ID: <30ddb66a-aeea-480d-bf79-38fc06ea45b0@uwaterloo.ca>
Date: Wed, 4 Sep 2024 08:46:10 -0400
From: Martin Karsten <mkarsten@...terloo.ca>
To: Naman Gulati <namangulati@...gle.com>, Joe Damato <jdamato@...tly.com>,
Alexander Viro <viro@...iv.linux.org.uk>,
Christian Brauner <brauner@...nel.org>, Jan Kara <jack@...e.cz>,
"David S. Miller" <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>,
netdev@...r.kernel.org, Stanislav Fomichev <sdf@...ichev.me>,
linux-kernel@...r.kernel.org, skhawaja@...gle.com,
Willem de Bruijn <willemdebruijn.kernel@...il.com>
Subject: Re: [PATCH] Add provision to busyloop for events in ep_poll.

On 2024-09-04 01:52, Naman Gulati wrote:
> Thanks all for the comments, and apologies for the delay in replying.
> Stan and Joe, I’ve addressed some of the common concerns below.
>
> On Thu, Aug 29, 2024 at 3:40 AM Joe Damato <jdamato@...tly.com> wrote:
>>
>> On Wed, Aug 28, 2024 at 06:10:11PM +0000, Naman Gulati wrote:
>>> NAPI busy polling in ep_busy_loop loops on napi_poll and checks for new
>>> epoll events after every napi poll. Checking just for epoll events in a
>>> tight loop in kernel context delivers latency gains to applications
>>> that are not interested in NAPI busy polling with epoll.
>>>
>>> This patch adds an option to loop just for new events inside
>>> ep_busy_loop, guarded by the EPIOCSPARAMS ioctl that controls epoll
>>> NAPI busy polling.
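
For reference, EPIOCSPARAMS is the existing ioctl for configuring NAPI
busy polling on an epoll file descriptor. A minimal userspace sketch,
assuming the struct epoll_params / EPIOCSPARAMS definitions from the
uapi header <linux/eventpoll.h> (added in Linux 6.9) and purely
illustrative values; the extra "loop only for events" knob proposed by
this patch is not shown, since its exact shape is specific to the patch:

/* Sketch, not part of this patch: enable NAPI busy polling on an epoll fd. */
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/eventpoll.h>	/* struct epoll_params, EPIOCSPARAMS */

static int enable_epoll_busy_poll(int epfd)
{
	struct epoll_params params;

	memset(&params, 0, sizeof(params));	/* __pad must be zero */
	params.busy_poll_usecs  = 64;	/* spin up to 64us inside epoll_wait() */
	params.busy_poll_budget = 16;	/* packets handled per napi_poll() pass */
	params.prefer_busy_poll = 1;	/* keep device IRQs deferred while polling */

	if (ioctl(epfd, EPIOCSPARAMS, &params) == -1) {
		perror("ioctl(EPIOCSPARAMS)");
		return -1;
	}
	return 0;
}
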
>>
>> This makes an API change, so I think that linux-api@...r.kernel.org
>> needs to be CC'd?
>>
>>> A comparison with neper tcp_rr shows that busylooping for events in
>>> epoll_wait boosted throughput by ~3-7% and reduced median latency by
>>> ~10%.
>>>
>>> To demonstrate the latency and throughput improvements, a comparison was
>>> made of neper tcp_rr running with:
>>> 1. (baseline) No busylooping
>>
>> Is there NAPI-based steering to threads via SO_INCOMING_NAPI_ID in
>> this case? More details, please, on locality. If there is no
>> NAPI-based flow steering in this case, perhaps the improvements you
>> are seeing are a result of both syscall overhead avoidance and data
>> locality?
>>
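
For reference, the NAPI-based steering Joe refers to is usually done by
reading SO_INCOMING_NAPI_ID on each accepted socket and keeping sockets
that share a NAPI ID on the same epoll instance/thread. A hedged sketch
of just the query; the per-worker mapping is an application-level
detail and is omitted here:

/* Sketch: read the NAPI ID recorded on an accepted/connected socket
 * (SO_INCOMING_NAPI_ID, kernel 4.12+). Sockets sharing a NAPI ID
 * arrived on the same RX queue and can be handled by the same thread. */
#include <stdio.h>
#include <sys/socket.h>

static int napi_id_of(int fd)
{
	unsigned int napi_id = 0;
	socklen_t len = sizeof(napi_id);

	if (getsockopt(fd, SOL_SOCKET, SO_INCOMING_NAPI_ID, &napi_id, &len) == -1) {
		perror("getsockopt(SO_INCOMING_NAPI_ID)");
		return -1;
	}
	return (int)napi_id;	/* 0 means no NAPI ID has been recorded yet */
}
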
>
> The benchmarks were run with no NAPI steering.
>
> Regarding syscall overhead, I reproduced the above experiment with
> mitigations=off and found similar results, which points to the gains
> coming from more than just avoiding syscall overhead.

I suppose the natural follow-up questions are:
1) Where do the gains come from?
2) Would they materialize with a realistic application?

System calls have some overhead even with mitigations=off. In fact, I
understand that on modern CPUs security mitigations are not that
expensive to begin with? In a micro-benchmark that does nothing but
bounce packets back and forth, this overhead might look more significant
than it would in a realistic application?

It seems your change does not eliminate any processing from each
packet's path, but instead eliminates processing in between packet
arrivals? This might lead to a small latency improvement, which might
turn into a small throughput improvement in these micro-benchmarks, but
that might quickly evaporate when an application has actual work to do
in between packet arrivals.
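
To make the syscall-overhead part concrete: an application can already
busy-wait from userspace by calling epoll_wait() with a zero timeout in
a loop, paying one syscall boundary per empty iteration, whereas the
patch keeps an equivalent loop inside ep_busy_loop(). A rough sketch of
the userspace variant, for comparison only:

/* Sketch: userspace spin over epoll_wait() with a zero timeout.
 * Every empty iteration still crosses the syscall boundary; the patch
 * under discussion instead repeats the event check inside ep_busy_loop(). */
#include <sys/epoll.h>

static int spin_wait(int epfd, struct epoll_event *events, int maxevents)
{
	int n;

	do {
		n = epoll_wait(epfd, events, maxevents, 0);	/* returns immediately */
	} while (n == 0);

	return n;	/* number of ready events, or -1 with errno set */
}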

It would be good to know a little more about your experiments. You are
referring to 5 threads, but does that mean 5 cores were busy on both
client and server during the experiment? Which of the client or the
server is the bottleneck? In your baseline experiment, are all 5 server
cores busy? How many RX queues are in play, and how is interrupt
routing configured?

Thanks,
Martin