Date: Thu, 18 Jan 2024 12:51:18 +0100
From: Toke Høiland-Jørgensen <toke@...hat.com>
To: Jakub Kicinski <kuba@...nel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@...utronix.de>, Alexei Starovoitov
 <alexei.starovoitov@...il.com>, LKML <linux-kernel@...r.kernel.org>,
 Network Development <netdev@...r.kernel.org>, "David S. Miller"
 <davem@...emloft.net>, Boqun Feng <boqun.feng@...il.com>, Daniel Borkmann
 <daniel@...earbox.net>, Eric Dumazet <edumazet@...gle.com>, Frederic
 Weisbecker <frederic@...nel.org>, Ingo Molnar <mingo@...hat.com>, Paolo
 Abeni <pabeni@...hat.com>, Peter Zijlstra <peterz@...radead.org>, Thomas
 Gleixner <tglx@...utronix.de>, Waiman Long <longman@...hat.com>, Will
 Deacon <will@...nel.org>, Alexei Starovoitov <ast@...nel.org>, Andrii
 Nakryiko <andrii@...nel.org>, Cong Wang <xiyou.wangcong@...il.com>, Hao
 Luo <haoluo@...gle.com>, Jamal Hadi Salim <jhs@...atatu.com>, Jesper
 Dangaard Brouer <hawk@...nel.org>, Jiri Olsa <jolsa@...nel.org>, Jiri
 Pirko <jiri@...nulli.us>, John Fastabend <john.fastabend@...il.com>, KP
 Singh <kpsingh@...nel.org>, Martin KaFai Lau <martin.lau@...ux.dev>, Ronak
 Doshi <doshir@...are.com>, Song Liu <song@...nel.org>, Stanislav Fomichev
 <sdf@...gle.com>, VMware PV-Drivers Reviewers <pv-drivers@...are.com>,
 Yonghong Song <yonghong.song@...ux.dev>, bpf <bpf@...r.kernel.org>
Subject: Re: [PATCH net-next 15/24] net: Use nested-BH locking for XDP
 redirect.

Jakub Kicinski <kuba@...nel.org> writes:

> On Wed, 17 Jan 2024 17:37:29 +0100 Toke Høiland-Jørgensen wrote:
>> I am not contesting that latency is important, but it's a pretty
>> fundamental trade-off and we don't want to kill throughput entirely
>> either. Especially since this is global to the whole kernel; and there
>> are definitely people who want to use XDP on an RT kernel and still
>> achieve high PPS rates.
>> 
>> (Whether those people really strictly speaking need to be running an RT
>> kernel is maybe debatable, but it does happen).
>> 
>> > I expected the lock operation (under RT) to always succeed and not
>> > cause any delay because it should not be contended.
>> 
>> A lock does cause delay even when it's not contended. Bear in mind that
>> at 10 Gbps line rate, we have a budget of 64 nanoseconds to process each
>> packet (for 64-byte packets). So just the atomic op to figure out
>> whether there's any contention (around 10ns on the Intel processors I
>> usually test on) will blow a huge chunk of the total processing budget.
>> We can't actually do the full processing needed in those 64 nanoseconds
>> (not to mention the 6.4 nanoseconds we have available at 100Gbps), which
>> is why it's essential to amortise as much as we can over multiple
>> packets.
>> 
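To spell the arithmetic out (my own back-of-the-envelope helper, not
lifted from any kernel source; it counts preamble, SFD and inter-frame
gap as wire overhead, which lands within a few ns of the round figures
above):

#include <stdio.h>

/* A 64-byte frame occupies 64 + 7 (preamble) + 1 (SFD) + 12 (IFG) = 84
 * bytes on the wire. At N Gbit/s a bit takes 1/N ns, so the per-packet
 * budget is simply wire_bits / N. */
static double ns_per_packet(double gbps, unsigned int frame_bytes)
{
	const unsigned int wire_overhead = 7 + 1 + 12;

	return (frame_bytes + wire_overhead) * 8.0 / gbps;
}

int main(void)
{
	printf("10G:  %.1f ns/pkt\n", ns_per_packet(10.0, 64));  /* ~67 ns */
	printf("100G: %.1f ns/pkt\n", ns_per_packet(100.0, 64)); /* ~6.7 ns */
	return 0;
}
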
>> This is all back-of-the-envelope calculations, of course. Having some
>> actual numbers to look at would be great; I don't suppose you have a
>> setup where you can run xdp-bench and see how your patches affect the
>> throughput?
>
> A potentially stupid idea which I have been turning over in my head is
> how we could get away from having the driver handle the details of NAPI
> budgeting. It's a source of bugs and endless review comments.
>
> All drivers end up maintaining a counter of "how many packets have
> I processed" and comparing that against the budget. Would it be crazy
> if we put that inside napi_struct? Add a "budget" member inside
> napi_struct as well, and:
>
> struct napi_struct {
> ...
> 	// poll state
> 	unsigned int budget;
> 	unsigned int rx_used;
> ...
> };
>
> static inline bool napi_rx_has_budget(struct napi_struct *napi)
> {
> 	return napi->budget > napi->rx_used;
> }
>
> int poll(struct napi_struct *napi) // no budget argument
> {
> 	while (napi_rx_has_budget(napi)) {
> 		napi_gro_receive(napi, skb); /* does napi->rx_used++ */
> 		// maybe add an explicit napi_rx_count() if the
> 		// driver did something funny with the frame.
> 	}
> }
>
> We can also create napi_tx_has_budget() so that people stop being
> confused about whether the budget covers Tx or not. And
> napi_xdp_comp_has_budget() so that people stop completing XDP in hard
> irq context (budget==0)...
>
> And we can pass napi into napi_consume_skb() instead of passing in
> budget, which presumably looks inexplicable to a newcomer.
> And napi_complete_done() can lose the work_done argument, too.
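
If I'm reading the proposal right, a driver's poll function would shrink
to something like the below. Note this is just my extrapolation of the
sketch above: only napi_rx_has_budget(), budget and rx_used come from
it; the drv_* helpers and the return convention are made up for
illustration.

static int drv_napi_poll(struct napi_struct *napi, int budget /* unused */)
{
	struct drv_rx_queue *q = container_of(napi, struct drv_rx_queue, napi);

	while (napi_rx_has_budget(napi)) {
		struct sk_buff *skb = drv_next_rx_frame(q);

		if (!skb)
			break;			/* ring empty */
		napi_gro_receive(napi, skb);	/* bumps napi->rx_used */
	}

	if (napi_rx_has_budget(napi) && napi_complete(napi))
		drv_enable_irq(q);		/* drained; re-arm interrupts */

	return napi->rx_used;			/* no driver-local counter */
}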

I do agree that conceptually it makes a lot of sense to encapsulate the
budget like this so drivers don't have to do all this state tracking
themselves. It does appear that drivers are doing different things with
the budget as it is today, though. For instance, the Intel drivers seem
to divide the budget over all the enabled RX rings(?); so I'm wondering
if it'll be possible to unify drivers around a more opaque NAPI poll
API?
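
(The pattern I mean looks roughly like the below; paraphrased from
memory of the ixgbe/i40e poll loops rather than quoted from them, so
treat the drv_* names and details as approximate.)

/* One NAPI vector servicing several Rx rings: the budget is divided
 * evenly over the rings, with a floor of one packet so every ring makes
 * forward progress. */
static int intel_style_poll(struct napi_struct *napi, int budget)
{
	struct drv_vector *vec = container_of(napi, struct drv_vector, napi);
	int per_ring = max(budget / vec->num_rx_rings, 1);
	int work_done = 0;
	struct drv_ring *ring;

	for (ring = vec->rx_rings; ring; ring = ring->next)
		work_done += drv_clean_rx_irq(ring, per_ring);

	/* With several rings the sum can exceed the overall budget:
	 * exactly the sort of per-driver wrinkle an opaque
	 * napi_rx_has_budget() API would have to absorb. */
	if (work_done < budget && napi_complete_done(napi, work_done))
		drv_enable_irqs(vec);

	return min(work_done, budget);
}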

-Toke

