netdev - Re: [PATCH iwl-net] ice: fix Rx page leak on multi-buffer frames

Message-ID: <2615a858-4b4b-422c-aa7c-a5e9f78dbabe@intel.com>
Date: Fri, 11 Jul 2025 16:41:13 -0700
From: Jacob Keller <jacob.e.keller@...el.com>
To: Maciej Fijalkowski <maciej.fijalkowski@...el.com>
CC: Intel Wired LAN <intel-wired-lan@...ts.osuosl.org>, Przemek Kitszel
	<przemyslaw.kitszel@...el.com>, Alexander Lobakin
	<aleksander.lobakin@...el.com>, Joe Damato <jdamato@...tly.com>, "Anthony
 Nguyen" <anthony.l.nguyen@...el.com>, <netdev@...r.kernel.org>, "Christoph
 Petrausch" <christoph.petrausch@...pl.com>, Jaroslav Pulchart
	<jaroslav.pulchart@...ddata.com>
Subject: Re: [PATCH iwl-net] ice: fix Rx page leak on multi-buffer frames



On 7/11/2025 8:33 AM, Maciej Fijalkowski wrote:
> On Wed, Jul 09, 2025 at 12:07:30PM -0700, Jacob Keller wrote:
>> The ice_put_rx_mbuf() function handles calling ice_put_rx_buf() for each
>> buffer in the current frame. This function was introduced as part of
>> handling multi-buffer XDP support in the ice driver.
>>
>> It works by iterating over the buffers from first_desc up to 1 plus the
>> total number of fragments in the frame, cached from before the XDP program
>> was executed.
>>
>> If the hardware posts a descriptor with a size of 0, the logic used in
>> ice_put_rx_mbuf() breaks. Such descriptors get skipped and don't get added
>> as fragments in ice_add_xdp_frag. Since the buffer isn't counted as a
>> fragment, we do not iterate over it in ice_put_rx_mbuf(), and thus we don't
>> call ice_put_rx_buf().
>>
>> Because we don't call ice_put_rx_buf(), we don't attempt to re-use the
>> page or free it. This leaves a stale page in the ring, as we don't
>> increment next_to_alloc.
>>
>> The ice_reuse_rx_page() assumes that the next_to_alloc has been incremented
>> properly, and that it always points to a buffer with a NULL page. Since
>> this function doesn't check, it will happily recycle a page over the top
>> of the next_to_alloc buffer, losing track of the old page.
>>
>> Note that this leak only occurs for multi-buffer frames. The
>> ice_put_rx_mbuf() function always handles at least one buffer, so a
>> single-buffer frame will always get handled correctly. It is not clear
>> precisely why the hardware hands us descriptors with a size of 0 sometimes,
>> but it happens somewhat regularly with "jumbo frames" used by 9K MTU.
>>
>> To fix ice_put_rx_mbuf(), we need to make sure to call ice_put_rx_buf() on
>> all buffers between first_desc and next_to_clean. Borrow the logic of a
>> similar function in i40e used for this same purpose. Use the same logic
>> also in ice_get_pgcnts().
>>
>> Instead of iterating over just the number of fragments, use a loop which
>> iterates until the current index reaches the next_to_clean element just
>> past the current frame. Check the current number of fragments (post XDP
>> program). For all buffers up to 1 more than the number of fragments, we'll
>> update the pagecnt_bias. For any buffers past this, pagecnt_bias is left
>> as-is. This ensures that fragments released by the XDP program, as well as
>> any buffers with zero-size won't have their pagecnt_bias updated
>> incorrectly. Unlike i40e, the ice_put_rx_mbuf() function does call
>> ice_put_rx_buf() on the last buffer of the frame indicating end of packet.
>>
>> Move the increment of the ntc local variable to ensure it's updated *before*
>> all calls to ice_get_pgcnts() or ice_put_rx_mbuf(), as the loop logic
>> requires the index of the element just after the current frame.
>>
>> This has the advantage that we also no longer need to track or cache the
>> number of fragments in the rx_ring, which saves a few bytes in the ring.
>>
>> Cc: Christoph Petrausch <christoph.petrausch@...pl.com>
>> Reported-by: Jaroslav Pulchart <jaroslav.pulchart@...ddata.com>
>> Closes: https://lore.kernel.org/netdev/CAK8fFZ4hY6GUJNENz3wY9jaYLZXGfpr7dnZxzGMYoE44caRbgw@mail.gmail.com/
>> Fixes: 743bbd93cf29 ("ice: put Rx buffers after being done with current frame")
>> Signed-off-by: Jacob Keller <jacob.e.keller@...el.com>
>> ---
>> I've tested this in a setup with MTU 9000, using a combination of iperf3
>> and wrk generated traffic.
>>
>> I tested this in a couple of ways. First, I check memory allocations using
>> /proc/allocinfo:
>>
>>   awk '/ice_alloc_mapped_page/ { printf("%s %s\n", $1, $2) }' /proc/allocinfo | numfmt --to=iec
>>
>> Second, I ported some stats from i40e written by Joe Damato to track the
>> page allocation and busy counts. I consistently saw that the allocate stat
>> increased without the busy or waive stats increasing. I also added a stat
>> to track directly when we overwrote a page pointer that was non-NULL in
>> ice_reuse_rx_page(), and saw it increment consistently.
>>
>> With this fix, all of these indicators are fixed. I've tested both 1500
>> byte and 9000 byte MTU and no longer see the leak. With the counters I was
>> able to immediately see a leak within a few minutes of iperf3, so I am
>> confident that I've resolved the leak with this fix.
>> ---
>>  drivers/net/ethernet/intel/ice/ice_txrx.h |  1 -
>>  drivers/net/ethernet/intel/ice/ice_txrx.c | 71 ++++++++++++-------------------
>>  2 files changed, 28 insertions(+), 44 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.h b/drivers/net/ethernet/intel/ice/ice_txrx.h
>> index a4b1e9514632..07155e615f75 100644
>> --- a/drivers/net/ethernet/intel/ice/ice_txrx.h
>> +++ b/drivers/net/ethernet/intel/ice/ice_txrx.h
>> @@ -358,7 +358,6 @@ struct ice_rx_ring {
>>  	struct ice_tx_ring *xdp_ring;
>>  	struct ice_rx_ring *next;	/* pointer to next ring in q_vector */
>>  	struct xsk_buff_pool *xsk_pool;
>> -	u32 nr_frags;
>>  	u16 max_frame;
>>  	u16 rx_buf_len;
>>  	dma_addr_t dma;			/* physical address of ring */
>> diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
>> index 0e5107fe62ad..b139066b6f0d 100644
>> --- a/drivers/net/ethernet/intel/ice/ice_txrx.c
>> +++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
>> @@ -865,10 +865,6 @@ ice_add_xdp_frag(struct ice_rx_ring *rx_ring, struct xdp_buff *xdp,
>>  	__skb_fill_page_desc_noacc(sinfo, sinfo->nr_frags++, rx_buf->page,
>>  				   rx_buf->page_offset, size);
>>  	sinfo->xdp_frags_size += size;
>> -	/* remember frag count before XDP prog execution; bpf_xdp_adjust_tail()
>> -	 * can pop off frags but driver has to handle it on its own
>> -	 */
>> -	rx_ring->nr_frags = sinfo->nr_frags;
>>  
>>  	if (page_is_pfmemalloc(rx_buf->page))
>>  		xdp_buff_set_frag_pfmemalloc(xdp);
>> @@ -939,20 +935,20 @@ ice_get_rx_buf(struct ice_rx_ring *rx_ring, const unsigned int size,
>>  /**
>>   * ice_get_pgcnts - grab page_count() for gathered fragments
>>   * @rx_ring: Rx descriptor ring to store the page counts on
>> + * @ntc: the next to clean element (not included in this frame!)
>>   *
>>   * This function is intended to be called right before running XDP
>>   * program so that the page recycling mechanism will be able to take
>>   * a correct decision regarding underlying pages; this is done in such
>>   * way as XDP program can change the refcount of page
>>   */
>> -static void ice_get_pgcnts(struct ice_rx_ring *rx_ring)
>> +static void ice_get_pgcnts(struct ice_rx_ring *rx_ring, unsigned int ntc)
>>  {
>> -	u32 nr_frags = rx_ring->nr_frags + 1;
>>  	u32 idx = rx_ring->first_desc;
>>  	struct ice_rx_buf *rx_buf;
>>  	u32 cnt = rx_ring->count;
>>  
>> -	for (int i = 0; i < nr_frags; i++) {
>> +	while (idx != ntc) {
>>  		rx_buf = &rx_ring->rx_buf[idx];
>>  		rx_buf->pgcnt = page_count(rx_buf->page);
>>  
>> @@ -1125,62 +1121,49 @@ ice_put_rx_buf(struct ice_rx_ring *rx_ring, struct ice_rx_buf *rx_buf)
>>  }
>>  
>>  /**
>> - * ice_put_rx_mbuf - ice_put_rx_buf() caller, for all frame frags
>> + * ice_put_rx_mbuf - ice_put_rx_buf() caller, for all buffers in frame
>>   * @rx_ring: Rx ring with all the auxiliary data
>>   * @xdp: XDP buffer carrying linear + frags part
>>   * @xdp_xmit: XDP_TX/XDP_REDIRECT verdict storage
>> - * @ntc: a current next_to_clean value to be stored at rx_ring
>> + * @ntc: the next to clean element (not included in this frame!)
>>   * @verdict: return code from XDP program execution
>>   *
>> - * Walk through gathered fragments and satisfy internal page
>> - * recycle mechanism; we take here an action related to verdict
>> - * returned by XDP program;
>> + * Called after XDP program is completed, or on error with verdict set to
>> + * ICE_XDP_CONSUMED.
>> + *
>> + * Walk through buffers from first_desc to the end of the frame, releasing
>> + * buffers and satisfying internal page recycle mechanism. The action depends
>> + * on verdict from XDP program.
>>   */
>>  static void ice_put_rx_mbuf(struct ice_rx_ring *rx_ring, struct xdp_buff *xdp,
>>  			    u32 *xdp_xmit, u32 ntc, u32 verdict)
>>  {
>> -	u32 nr_frags = rx_ring->nr_frags + 1;
>> +	u32 nr_frags = xdp_get_shared_info_from_buff(xdp)->nr_frags;
>>  	u32 idx = rx_ring->first_desc;
>>  	u32 cnt = rx_ring->count;
>> -	u32 post_xdp_frags = 1;
>>  	struct ice_rx_buf *buf;
>> -	int i;
>> +	int i = 0;
>>  
>> -	if (unlikely(xdp_buff_has_frags(xdp)))
>> -		post_xdp_frags += xdp_get_shared_info_from_buff(xdp)->nr_frags;
>> -
>> -	for (i = 0; i < post_xdp_frags; i++) {
>> +	while (idx != ntc) {
>>  		buf = &rx_ring->rx_buf[idx];
>> +		if (++idx == cnt)
>> +			idx = 0;
>>  
>> -		if (verdict & (ICE_XDP_TX | ICE_XDP_REDIR)) {
>> +		/* An XDP program could release fragments from the end of the
>> +		 * buffer. For these, we need to keep the pagecnt_bias as-is.
>> +		 * To do this, only adjust pagecnt_bias for fragments up to
>> +		 * the total remaining after the XDP program has run.
>> +		 */
>> +		if (verdict != ICE_XDP_CONSUMED)
>>  			ice_rx_buf_adjust_pg_offset(buf, xdp->frame_sz);
>> -			*xdp_xmit |= verdict;
> 
> Hi Jake,
> 
> you're likely breaking XDP_REDIRECT/XDP_TX workloads. I believe you need
> to give this patch a spin against xdp-bench and test all actions...
> 
> anyways thanks for great analysis and bugfix!
> 

Maciej is right, we need to update xdp_xmit somewhere. I think doing so
in ice_put_rx_mbuf() is wrong, though. We bitwise OR the same verdict
into xdp_xmit once per buffer, and we only need to do this if an XDP
program was run, so 2 of the 3 callers don't need it at all. One of them
even passes NULL for xdp_xmit, which is never checked, but happens to
work because that caller never passes ICE_XDP_TX or ICE_XDP_REDIR as the
verdict.

I'll fix this by dropping the xdp_xmit parameter to this function and
updating the xdp_xmit outside in the one place where we need to do so.
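
A minimal standalone sketch of that shape, not the actual v2 diff. All
sk_* names, the struct layouts, and the verdict bit values below are
hypothetical stand-ins for illustration and are not taken from
ice_txrx.[ch]; the "released" counter only models the call to
ice_put_rx_buf().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical verdict bits; values are assumptions for this sketch. */
#define SK_XDP_PASS	0u
#define SK_XDP_CONSUMED	(1u << 0)
#define SK_XDP_TX	(1u << 1)
#define SK_XDP_REDIR	(1u << 2)

/* Hypothetical, heavily reduced models of the Rx buffer and Rx ring. */
struct sk_rx_buf {
	unsigned int released;	/* how many times the buffer was put */
};

struct sk_rx_ring {
	struct sk_rx_buf *rx_buf;
	uint32_t count;		/* number of entries in the ring */
	uint32_t first_desc;	/* first buffer of the current frame */
};

/* Put every buffer from first_desc up to, but not including, ntc.
 * There is no xdp_xmit parameter: this function only releases buffers,
 * so callers that never ran an XDP program have nothing extra to pass.
 */
static void sk_put_rx_mbuf(struct sk_rx_ring *ring, uint32_t ntc)
{
	uint32_t idx = ring->first_desc;

	assert(ntc < ring->count);

	while (idx != ntc) {
		ring->rx_buf[idx].released++;
		if (++idx == ring->count)
			idx = 0;	/* wrap around the end of the ring */
	}

	ring->first_desc = ntc;
}

/* The one caller that ran an XDP program records the verdict exactly
 * once per frame, outside of the per-buffer loop.
 */
static void sk_finish_xdp_frame(struct sk_rx_ring *ring, uint32_t ntc,
				uint32_t verdict, uint32_t *xdp_xmit)
{
	if (verdict & (SK_XDP_TX | SK_XDP_REDIR))
		*xdp_xmit |= verdict;

	sk_put_rx_mbuf(ring, ntc);
}
```

In this model, the callers that have no XDP verdict to report call
sk_put_rx_mbuf() directly, so there is no pointer left for anyone to
pass as NULL.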

I'm building that version to test against xdp-bench to verify.

Thanks for spotting this oversight!

Hopefully I'll have v2 out soon but it might not make it before the weekend.

