Message-ID: <aK2LZCedKkXuG1I_@localhost.localdomain>
Date: Tue, 26 Aug 2025 12:24:36 +0200
From: Michal Kubiak <michal.kubiak@...el.com>
To: Jesper Dangaard Brouer <hawk@...nel.org>
CC: Jacob Keller <jacob.e.keller@...el.com>, Anthony Nguyen
<anthony.l.nguyen@...el.com>, Intel Wired LAN
<intel-wired-lan@...ts.osuosl.org>, <netdev@...r.kernel.org>, "Christoph
Petrausch" <christoph.petrausch@...pl.com>, Jaroslav Pulchart
<jaroslav.pulchart@...ddata.com>, kernel-team <kernel-team@...udflare.com>
Subject: Re: [PATCH iwl-net v2] ice: fix Rx page leak on multi-buffer frames
On Tue, Aug 26, 2025 at 10:35:30AM +0200, Jesper Dangaard Brouer wrote:
>
>
> On 26/08/2025 01.00, Jacob Keller wrote:
> > XDP_DROP performance has been tested for this version, thanks to work from
> > Michal Kubiak. The results are quite promising, with 3 versions being
> > compared:
> >
> > * baseline net-next tree
> > * v1 applied
> > * v2 applied
> >
> > Michal said:
> >
> > I ran the XDP_DROP performance comparison tests on my setup in the way I
> > usually do. I didn't have pktgen configured on my link partner, but I
> > used 6 instances of xdpsock running in Tx-only mode to generate
> > high-bandwidth traffic. Also, I tried to replicate the conditions according
> > to Jesper's description, making sure that all the traffic was directed to a
> > single Rx queue and one CPU was 100% loaded.
> >
>
> Thank you for replicating the test setup. Using xdpsock as a traffic
> generator is fine, as long as we make sure that the generator's TX speed
> exceeds the Device Under Test's RX XDP_DROP speed. It is also important
> for the test that packets hit a single RX queue and we verify one CPU is
> 100% loaded, as you describe.
>
> As a reminder, the pktgen kernel module comes with ready-to-use sample
> shell scripts [1].
>
> [1] https://elixir.bootlin.com/linux/v6.16.3/source/samples/pktgen
>
Thank you! I am aware of that and also use those scripts.
The xdpsock solution was just the quickest option at that specific
moment, so I decided not to change my link partner setup (since I
successfully reproduced the performance drop from v1 with it anyway).
> > The performance hit from v1 is replicated, and shown to be gone in v2, with
> > our results even showing an increase over baseline instead of a drop. I've
> > included the relative packets-per-second deltas measured against a
> > baseline test with neither v1 nor v2 applied.
> >
>
> Thanks for also replicating the performance hit from v1 as I did in [2].
>
> To Michal: What CPU did you use?
> - I used CPU: AMD EPYC 9684X (with SRSO=IBPB)
In my test I used: Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz
>
> One of the reasons I saw a larger percentage drop is that this CPU
> doesn't have DDIO/DCA, which delivers the packet into the L3 cache (an L2
> cache miss served from L3 obviously takes less time than a miss that goes
> all the way to main memory). (Details: newer AMD CPUs will get something
> called PCIe TLP Processing Hints (TPH), which resembles DDIO.)
>
> The point is that I see some opportunities in the driver to move some of
> the prefetches earlier. But we want to make sure it benefits both CPU
> types, and I can test on the AMD platform. (This CPU makes up a large part
> of our fleet, so it makes sense for us to optimize for it.)
>
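That sounds interesting. Just to make sure we mean the same thing, here is
a rough sketch of the kind of change I read into that (kernel context
assumed; the struct and function names below are made up for illustration
and are not the actual ice Rx path -- only net_prefetch() and
page_address() are the real helpers):

/* Illustrative sketch only -- not the real ice Rx code. The idea is to
 * start the payload prefetch as early as possible after the descriptor
 * has been read, instead of just before the data is first touched.
 */
#include <linux/netdevice.h>	/* net_prefetch() */
#include <linux/mm.h>		/* page_address() */

struct example_rx_buf {
	struct page *page;
	unsigned int page_offset;
};

static void example_rx_process_buf(struct example_rx_buf *buf)
{
	void *va = page_address(buf->page) + buf->page_offset;

	/* Kick off the cache fill for the first lines of the packet now... */
	net_prefetch(va);

	/* ...so it overlaps with the remaining per-descriptor work
	 * (DMA sync, xdp_buff setup, ring bookkeeping) that happens
	 * before the XDP program or the stack reads the data.
	 */
}

Whether issuing the prefetch that early pays off will of course depend on
the CPU (DDIO vs. no DDIO), which is exactly why testing on both platforms
makes sense.
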
> > baseline to v1, no-touch:
> > -8,387,677 packets per second (17%) decrease.
> >
> > baseline to v2, no-touch:
> > +4,057,000 packets per second (8%) increase!
> >
> > baseline to v1, read data:
> > -411,709 packets per second (1%) decrease.
> >
> > baseline to v2, read data:
> > +4,331,857 packets per second (11%) increase!
>
> Thanks for providing these numbers.
> I would also like to know the raw throughput numbers in PPS before and
> after, as this allows me to calculate the nanosecond difference. Using
> percentages is usually useful, but it can be misleading when dealing
> with XDP_DROP speeds, because a small nanosecond change gets
> "magnified" too much.
>
I was usually told to share percentage data, because absolute numbers may
depend on various circumstances.
However, I understand your point regarding XDP_DROP; in such a case the raw
numbers may well be justified. Please see my results (from the xdp-bench
summary) below, with a rough per-packet nanosecond conversion after them:
net-next (main) (drop, no touch)
  Duration          : 105.7s
  Packets received  : 4,960,778,583
  Average packets/s : 46,951,873
  Rx dropped        : 4,960,778,583

net-next (main) (drop, read data)
  Duration          : 94.5s
  Packets received  : 3,524,346,352
  Average packets/s : 37,295,056
  Rx dropped        : 3,524,346,352

net-next (main+v1) (drop, no touch)
  Duration          : 122.5s
  Packets received  : 4,722,510,839
  Average packets/s : 38,564,196
  Rx dropped        : 4,722,510,839

net-next (main+v1) (drop, read data)
  Duration          : 115.7s
  Packets received  : 4,265,991,147
  Average packets/s : 36,883,347
  Rx dropped        : 4,265,991,147

net-next (main+v2) (drop, no touch)
  Duration          : 130.6s
  Packets received  : 6,664,104,907
  Average packets/s : 51,008,873
  Rx dropped        : 6,664,104,907

net-next (main+v2) (drop, read data)
  Duration          : 143.6s
  Packets received  : 5,975,991,044
  Average packets/s : 41,626,913
  Rx dropped        : 5,975,991,044
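
For reference, those averages correspond to roughly 21.3 ns per packet for
baseline no-touch (1e9 / 46,951,873) versus about 19.6 ns with v2 applied
(1e9 / 51,008,873), i.e. on the order of 1.7 ns saved per packet.

In case it helps anyone following the thread, the "no touch" and "read
data" packet operations above correspond roughly to the two XDP programs
below. This is a minimal hand-written sketch with made-up program names,
not the actual xdp-bench implementation:

/* Minimal sketch of the two "drop" packet operations used above.
 * Illustration only -- not the xdp-bench source.
 */
#include <linux/bpf.h>
#include <linux/types.h>
#include <bpf/bpf_helpers.h>

/* "no touch": drop immediately without reading the payload, so the
 * packet data never has to be pulled into the CPU cache.
 */
SEC("xdp")
int xdp_drop_no_touch(struct xdp_md *ctx)
{
	return XDP_DROP;
}

/* "read data": load one byte of the payload before dropping, which
 * forces a cache fill for the Rx buffer and makes the test sensitive
 * to where the DMA'd data lands (L3 vs. main memory).
 */
SEC("xdp")
int xdp_drop_read_data(struct xdp_md *ctx)
{
	void *data = (void *)(long)ctx->data;
	void *data_end = (void *)(long)ctx->data_end;
	volatile __u8 first_byte;

	if (data + 1 > data_end)
		return XDP_DROP;

	first_byte = *(__u8 *)data;
	(void)first_byte;

	return XDP_DROP;
}

char _license[] SEC("license") = "GPL";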
Thanks,
Michal
> > ---
> > Changes in v2:
> > - Only access shared info for fragmented frames
> > - Link to v1: https://lore.kernel.org/netdev/20250815204205.1407768-4-anthony.l.nguyen@intel.com/
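For anyone skimming the thread, my reading of the v2 changelog item above
is roughly the sketch below (illustrative only, not the actual diff; the
function name is made up, while xdp_buff_has_frags() and
xdp_get_shared_info_from_buff() are the real helpers from <net/xdp.h>):

/* Illustrative sketch (not the actual patch): only dereference the
 * skb_shared_info when the xdp_buff really carries fragments, so the
 * common single-buffer path never touches that extra cache line.
 */
#include <net/xdp.h>

static u32 example_rx_frag_count(struct xdp_buff *xdp)
{
	struct skb_shared_info *sinfo;

	if (!xdp_buff_has_frags(xdp))
		return 0;

	sinfo = xdp_get_shared_info_from_buff(xdp);
	return sinfo->nr_frags;
}
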
>
> [2] https://lore.kernel.org/netdev/6e2cbea1-8c70-4bfa-9ce4-1d07b545a705@kernel.org/
>
> > ---
> > drivers/net/ethernet/intel/ice/ice_txrx.h | 1 -
> > drivers/net/ethernet/intel/ice/ice_txrx.c | 80 +++++++++++++------------------
> > 2 files changed, 34 insertions(+), 47 deletions(-)
>
> Acked-by: Jesper Dangaard Brouer <hawk@...nel.org>