Message-ID: <IA1PR11MB624104B8FACCCD9DC31493248B14A@IA1PR11MB6241.namprd11.prod.outlook.com>
Date: Tue, 16 Sep 2025 04:19:38 +0000
From: "Rinitha, SX" <sx.rinitha@...el.com>
To: "Keller, Jacob E" <jacob.e.keller@...el.com>, "Kubiak, Michal"
<michal.kubiak@...el.com>, "Nguyen, Anthony L" <anthony.l.nguyen@...el.com>,
Intel Wired LAN <intel-wired-lan@...ts.osuosl.org>, "netdev@...r.kernel.org"
<netdev@...r.kernel.org>
CC: Christoph Petrausch <christoph.petrausch@...pl.com>, "Jesper Dangaard
Brouer" <hawk@...nel.org>, Jaroslav Pulchart
<jaroslav.pulchart@...ddata.com>, "Keller, Jacob E"
<jacob.e.keller@...el.com>
Subject: RE: [Intel-wired-lan] [PATCH iwl-net v2] ice: fix Rx page leak on
multi-buffer frames
> -----Original Message-----
> From: Intel-wired-lan <intel-wired-lan-bounces@...osl.org> On Behalf Of Jacob Keller
> Sent: 26 August 2025 04:30
> To: Kubiak, Michal <michal.kubiak@...el.com>; Nguyen, Anthony L <anthony.l.nguyen@...el.com>; Intel Wired LAN <intel-wired-lan@...ts.osuosl.org>; netdev@...r.kernel.org
> Cc: Christoph Petrausch <christoph.petrausch@...pl.com>; Jesper Dangaard Brouer <hawk@...nel.org>; Jaroslav Pulchart <jaroslav.pulchart@...ddata.com>; Keller, Jacob E <jacob.e.keller@...el.com>
> Subject: [Intel-wired-lan] [PATCH iwl-net v2] ice: fix Rx page leak on multi-buffer frames
>
> The ice_put_rx_mbuf() function handles calling ice_put_rx_buf() for each buffer in the current frame. This function was introduced as part of handling multi-buffer XDP support in the ice driver.
>
> It works by iterating over nr_frags + 1 buffers starting at first_desc, where the fragment count was cached before the XDP program was executed.
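>
> As a simplified sketch (illustrative only, not the literal driver code),
> the old logic amounted to:
>
>	/* nr_frags was cached before the XDP program ran */
>	for (i = 0; i < nr_frags + 1; i++) {
>		buf = &rx_ring->rx_buf[(first_desc + i) % cnt];
>		ice_put_rx_buf(rx_ring, buf);
>	}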
>
> If the hardware posts a descriptor with a size of 0, the logic used in
> ice_put_rx_mbuf() breaks. Such descriptors get skipped and are never added as fragments in ice_add_xdp_frag(). Since the buffer isn't counted as a fragment, we do not iterate over it in ice_put_rx_mbuf(), and thus never call ice_put_rx_buf() for it.
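>
> The early return in ice_add_xdp_frag() is roughly (simplified):
>
>	if (!size)
>		return 0;	/* buffer is never counted in sinfo->nr_frags */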
>
> Because we don't call ice_put_rx_buf(), we don't attempt to re-use the page or free it. This leaves a stale page in the ring, as we don't increment next_to_alloc.
>
> ice_reuse_rx_page() assumes that next_to_alloc has been incremented properly, and that it always points to a buffer with a NULL page. Since this function doesn't check, it will happily recycle a page over the top of the next_to_alloc buffer, losing track of the old page.
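>
> Conceptually, the recycle path does something like this (a sketch, not
> the exact function body):
>
>	new_buf = &rx_ring->rx_buf[rx_ring->next_to_alloc];
>	/* if new_buf->page was not NULL here, that page is leaked */
>	*new_buf = *old_buf;
>	old_buf->page = NULL;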
>
> Note that this leak only occurs for multi-buffer frames. The
> ice_put_rx_mbuf() function always handles at least one buffer, so a single-buffer frame will always get handled correctly. It is not clear precisely why the hardware sometimes hands us descriptors with a size of 0, but it happens somewhat regularly with jumbo frames at 9K MTU.
>
> To fix ice_put_rx_mbuf(), we need to make sure to call ice_put_rx_buf() on all buffers between first_desc and next_to_clean. Borrow the logic of a similar function in i40e used for this same purpose. Use the same logic also in ice_get_pgcnts().
>
> Instead of iterating over just the number of fragments, use a loop which iterates until the current index reaches the next_to_clean element just past the current frame. Unlike i40e, the ice_put_rx_mbuf() function does call ice_put_rx_buf() on the last buffer of the frame, the one indicating the end of packet.
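>
> The resulting iteration is index-based rather than count-based; a minimal
> sketch of the loop shape (variable names illustrative):
>
>	u32 idx = rx_ring->first_desc;
>	u32 cnt = rx_ring->count;
>
>	while (idx != ntc) {
>		struct ice_rx_buf *buf = &rx_ring->rx_buf[idx];
>
>		if (++idx == cnt)
>			idx = 0;
>
>		ice_put_rx_buf(rx_ring, buf);
>	}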
>
> For non-linear (multi-buffer) frames, we need to take care when adjusting the pagecnt_bias. An XDP program might release fragments from the tail of the frame, in which case those fragment pages have already been freed. Only update the pagecnt_bias for the first descriptor and for fragments still remaining after the XDP program ran. Take care to only access the shared info for fragmented buffers, as this avoids a significant cache miss.
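>
> One way to express the fragment check and the bias rule (illustrative;
> the exact conditions in the patch may differ):
>
>	/* only touch the shared info when the frame is actually
>	 * fragmented, avoiding a cache miss for linear frames
>	 */
>	if (unlikely(xdp_buff_has_frags(xdp)))
>		nr_frags = xdp_get_shared_info_from_buff(xdp)->nr_frags;
>
>	/* inside the put loop: tail fragments the XDP program already
>	 * released keep their bias as-is
>	 */
>	if (verdict != ICE_XDP_CONSUMED)
>		ice_rx_buf_adjust_pg_offset(buf, xdp->frame_sz);
>	else if (buf_idx <= nr_frags)
>		buf->pagecnt_bias++;	/* return the reference we took */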
>
> The xdp_xmit value only needs to be updated if an XDP program is run, and only once per packet. Drop the xdp_xmit pointer argument from ice_put_rx_mbuf(). Instead, set xdp_xmit in the ice_clean_rx_irq() function directly. This avoids needing to pass the argument and avoids an extra bitwise OR for each buffer in the frame.
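>
> A sketch of the hoisted update (one bitwise OR per frame rather than per
> buffer; exact placement in the patch may differ):
>
>	/* in ice_clean_rx_irq(), once per completed frame */
>	if (verdict & (ICE_XDP_TX | ICE_XDP_REDIR))
>		xdp_xmit |= verdict;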
>
> Move the increment of the ntc local variable to ensure it is updated *before* all calls to ice_get_pgcnts() or ice_put_rx_mbuf(), as the loop logic requires the index of the element just after the current frame.
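>
> Roughly the required ordering in ice_clean_rx_irq() (simplified):
>
>	/* advance ntc past this descriptor first ... */
>	if (++ntc == cnt)
>		ntc = 0;
>
>	if (ice_is_non_eop(rx_ring, rx_desc))
>		continue;
>
>	/* ... so ntc indexes the element just past the frame here */
>	ice_put_rx_mbuf(rx_ring, xdp, ntc, verdict);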
>
> Now that we use an index pointer in the ring to identify the packet, we no longer need to track or cache the number of fragments in the rx_ring.
>
> Cc: Christoph Petrausch <christoph.petrausch@...pl.com>
> Cc: Jesper Dangaard Brouer <hawk@...nel.org>
> Reported-by: Jaroslav Pulchart <jaroslav.pulchart@...ddata.com>
> Closes: https://lore.kernel.org/netdev/CAK8fFZ4hY6GUJNENz3wY9jaYLZXGfpr7dnZxzGMYoE44caRbgw@mail.gmail.com/
> Fixes: 743bbd93cf29 ("ice: put Rx buffers after being done with current frame")
> Tested-by: Michal Kubiak <michal.kubiak@...el.com>
> Signed-off-by: Jacob Keller <jacob.e.keller@...el.com>
> ---
> I've tested this in a setup with MTU 9000, using a combination of iperf3 and wrk generated traffic.
>
> I tested this in a couple of ways. First, I checked memory allocations using
> /proc/allocinfo:
>
> awk '/ice_alloc_mapped_page/ { printf("%s %s\n", $1, $2) }' /proc/allocinfo | numfmt --to=iec
>
> Second, I ported some stats from i40e, written by Joe Damato, to track the page allocation and busy counts. I consistently saw the allocate stat increase without the busy or waive stats increasing. I also added a stat to track when we overwrote a non-NULL page pointer in ice_reuse_rx_page(), and saw it increment consistently.
>
> With this fix, all of these indicators are clean. I've tested both 1500-byte and 9000-byte MTUs and no longer see the leak. With the counters, the leak was visible within a few minutes of iperf3 traffic, so I am confident that this fix resolves it.
>
> I've now also tested with xdp-bench and confirm that XDP_TX and XDP_REDIR work properly with the fix for updating xdp_xmit.
>
> This version (v2) avoids the cache miss regression reported by Jesper. I refactored a bit to only check the shared info if the XDP buffer is fragmented. I considered adding a helper function for this to the XDP header file. However, I scanned several drivers and noticed that only ice and i40e access nr_frags in this way. I believe the ice variation will be removed with the conversion to page pool, so I don't think a helper is warranted.
>
> XDP_DROP performance has been tested for this version, thanks to work from Michal Kubiak. The results are quite promising, with 3 versions being
> compared:
>
> * baseline net-next tree
> * v1 applied
> * v2 applied
>
> Michal said:
>
> I ran the XDP_DROP performance comparison tests on my setup in the way I
> usually do. I didn't have pktgen configured on my link partner, so I
> used 6 instances of xdpsock running in Tx-only mode to generate
> high-bandwidth traffic. Also, I tried to replicate the conditions according
> to Jesper's description, making sure that all the traffic is directed to a
> single Rx queue and one CPU is 100% loaded.
>
> The performance hit from v1 is replicated, and shown to be gone in v2, with the results even showing an increase over baseline instead of a drop. I've included the relative packets-per-second deltas compared against a baseline test with neither v1 nor v2 applied.
>
> baseline to v1, no-touch:
> -8,387,677 packets per second (17%) decrease.
>
> baseline to v2, no-touch:
> +4,057,000 packets per second (8%) increase!
>
> baseline to v1, read data:
> -411,709 packets per second (1%) decrease.
>
> baseline to v2, read data:
> +4,331,857 packets per second (11%) increase!
> ---
> Changes in v2:
> - Only access shared info for fragmented frames
> - Link to v1: https://lore.kernel.org/netdev/20250815204205.1407768-4-anthony.l.nguyen@intel.com/
> ---
> drivers/net/ethernet/intel/ice/ice_txrx.h |  1 -
> drivers/net/ethernet/intel/ice/ice_txrx.c | 80 +++++++++++++------------
> 2 files changed, 34 insertions(+), 47 deletions(-)
>
Tested-by: Rinitha S <sx.rinitha@...el.com> (A Contingent worker at Intel)