Message-ID: <f7jrsvdauqebendldnyvjjsjypyxoqozwr3awtvo2bjv5t7xzm@p3owykvczayu>
Date: Wed, 8 Jan 2025 18:26:30 -0700
From: Daniel Xu <dxu@...uu.xyz>
To: Jesper Dangaard Brouer <hawk@...nel.org>
Cc: Alexander Lobakin <aleksander.lobakin@...el.com>, 
	Andrew Lunn <andrew+netdev@...n.ch>, "David S. Miller" <davem@...emloft.net>, 
	Eric Dumazet <edumazet@...gle.com>, Jakub Kicinski <kuba@...nel.org>, 
	Paolo Abeni <pabeni@...hat.com>, Lorenzo Bianconi <lorenzo@...nel.org>, 
	Alexei Starovoitov <ast@...nel.org>, Daniel Borkmann <daniel@...earbox.net>, 
	Andrii Nakryiko <andrii@...nel.org>, John Fastabend <john.fastabend@...il.com>, 
	Toke Høiland-Jørgensen <toke@...nel.org>, Martin KaFai Lau <martin.lau@...ux.dev>, netdev@...r.kernel.org, 
	bpf@...r.kernel.org, linux-kernel@...r.kernel.org, 
	Jesse Brandeburg <jbrandeburg@...udflare.com>, kernel-team <kernel-team@...udflare.com>
Subject: Re: [PATCH net-next v2 0/8] bpf: cpumap: enable GRO for XDP_PASS frames

On Tue, Jan 07, 2025 at 06:17:06PM +0100, Jesper Dangaard Brouer wrote:
> Awesome work! - some questions below
> 
> On 07/01/2025 16.29, Alexander Lobakin wrote:
> > Several months ago, I had been looking through my old XDP hints tree[0]
> > to check whether some patches not directly related to hints could be sent
> > standalone. Roughly at the same time, Daniel appeared and asked[1] about
> > GRO for cpumap from that tree.
> > 
> > Currently, cpumap uses its own kthread, which processes cpumap-redirected
> > frames in batches of 8, without any weighting (but with rescheduling
> > points). The resulting skbs get passed to the stack via
> > netif_receive_skb_list(), which means no GRO happens.
> > Even though we can't currently pass checksum status from the drivers,
> > in many cases GRO still performs better than listified Rx without
> > aggregation, as confirmed by tests.
> > 
> > In order to enable GRO in cpumap, we need to do the following:
> > 
> > * patches 1-2: decouple the GRO struct from the NAPI struct and allow
> >    using it outside of a NAPI entity within the kernel core code;
> > * patch 3: switch cpumap from netif_receive_skb_list() to
> >    gro_receive_skb().
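
(If I read patches 1-3 right, the loop conceptually becomes something
like this -- illustrative only; "struct gro_node" and gro_flush() are my
guesses at the decoupled names, not necessarily the final API:)

    /* Each cpumap entry owns a standalone GRO instance (decoupled from
     * struct napi_struct by patches 1-2), so skbs get aggregated before
     * going up the stack instead of being list-received one by one.
     */
    struct gro_node gro;            /* assumed name */

    for (i = 0; i < n; i++)
            gro_receive_skb(&gro, cpumap_build_skb(frames[i]));

    gro_flush(&gro, false);         /* assumed flush entry point */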
> > 
> > Additional improvements:
> > 
> > * patch 4: optimize XDP_PASS in cpumap by using arrays instead of linked
> >    lists;
> > * patches 5-6: introduce and use a function to get skbs from the NAPI
> >    percpu caches in bulk rather than one at a time (sketched below);
> > * patches 7-8: use that function in veth as well and remove the one it
> >    now supersedes.
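
(A sketch of the bulk getter as I understand patches 5-6 -- hypothetical
name and signature:)

    /* Hypothetical shape: pull up to n pre-cached skbs out of the NAPI
     * per-CPU skb cache in a single call, instead of one allocation per
     * frame; returns how many were actually obtained.
     */
    u32 napi_skb_cache_get_bulk(void **skbs, u32 n);

    /* usage in the cpumap batch loop, replacing per-frame allocation: */
    void *skbs[8];
    u32 got = napi_skb_cache_get_bulk(skbs, n);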
> > 
> > My trafficgen UDP GRO tests, small frame sizes:
> > 
> 
> How does your trafficgen UDP test manage to get UDP GRO working?
> (Perhaps you can share the test?)
> 
> What is the "small frame" size being used?
> 
> Is the UDP benchmark avoiding (re)calculating the RX checksum?
> (via setting UDP csum to zero)
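
(Context for readers: RFC 768 lets an IPv4 sender transmit a UDP
checksum of zero, meaning "no checksum computed", so the receiver skips
verification entirely; plain UDP over IPv6 doesn't allow this.
Illustrative snippet:)

    #include <netinet/udp.h>

    /* Illustrative: mark an IPv4 UDP header as carrying no checksum so
     * receivers don't verify it (not valid for plain UDP over IPv6).
     */
    static void udp_csum_off(struct udphdr *uh)
    {
            uh->check = 0;
    }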
> 
> >                  GRO off    GRO on
> > baseline        2.7        N/A       Mpps
> > patch 3         2.3        4         Mpps
> > patch 8         2.4        4.7       Mpps
> > 
> > 1...3 diff      -17        +48       %
> > 1...8 diff      -11        +74       %
> > 
> > Daniel reported between +14%[2] and +18%[3] higher throughput in neper's
> > TCP RR tests. On my system, however, the same test gave me up to +100%.
> > 
> 
> I can imagine that the TCP throughput tests will yield a huge
> performance boost.
> 
> > Note that there's a series from Lorenzo[4] which achieves the same, but
> > in a different way. During the discussions, the approach using a
> > standalone GRO instance was preferred over the threaded NAPI.
> > 
> 
> It looks like you are keeping the "remote" CPUMAP kthread process design
> intact in this series, right?
> 
> I think this design works for our use-case, where we want to give the
> "remote" CPU-thread a higher scheduling priority.  It doesn't matter
> whether this is a kthread or a threaded-NAPI thread, as long as we can
> see it as a PID from userspace (by which we adjust the sched priority).
> 

Similar for us as well - having a schedulable entity helps. I might
have mentioned it on an earlier thread, but with sched-ext, I think
things could get interesting for dynamically tuning the system. We've
got some vague ideas. Probably not at this upcoming one, but if any of
the ideas work out, maybe we'll share them at netdev or something.
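
For reference, giving that PID a higher priority from userspace is just
sched_setscheduler() (what chrt does under the hood); a minimal sketch,
with an arbitrary example priority:

    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Usage: ./prio <pid> -- needs CAP_SYS_NICE. Moves the given
     * (k)thread to SCHED_FIFO at an example priority of 50.
     */
    int main(int argc, char **argv)
    {
            struct sched_param sp = { .sched_priority = 50 };

            if (argc < 2)
                    return 1;

            if (sched_setscheduler(atoi(argv[1]), SCHED_FIFO, &sp)) {
                    perror("sched_setscheduler");
                    return 1;
            }

            return 0;
    }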

> Great to see this work progressing again :-)))

Agreed, thanks for continuing!

Daniel
