Message-ID: <6e2cbea1-8c70-4bfa-9ce4-1d07b545a705@kernel.org>
Date: Tue, 19 Aug 2025 18:44:56 +0200
From: Jesper Dangaard Brouer <hawk@...nel.org>
To: Jacob Keller <jacob.e.keller@...el.com>,
 Tony Nguyen <anthony.l.nguyen@...el.com>, ast@...nel.org,
 davem@...emloft.net, kuba@...nel.org, pabeni@...hat.com,
 netdev@...r.kernel.org
Cc: maciej.fijalkowski@...el.com, magnus.karlsson@...el.com,
 andrew+netdev@...n.ch, daniel@...earbox.net, john.fastabend@...il.com,
 sdf@...ichev.me, bpf@...r.kernel.org, horms@...nel.org,
 przemyslaw.kitszel@...el.com, aleksander.lobakin@...el.com,
 jaroslav.pulchart@...ddata.com, jdamato@...tly.com,
 christoph.petrausch@...pl.com, Rinitha S <sx.rinitha@...el.com>,
 Priya Singh <priyax.singh@...el.com>, Eelco Chaudron <echaudro@...hat.com>,
 edumazet@...gle.com
Subject: Re: [PATCH net 3/6] ice: fix Rx page leak on multi-buffer frames



On 19/08/2025 02.38, Jacob Keller wrote:
> 
> 
> On 8/18/2025 4:05 AM, Jesper Dangaard Brouer wrote:
>> On 15/08/2025 22.41, Tony Nguyen wrote:
>>> This has the advantage that we also no longer need to track or cache the
>>> number of fragments in the rx_ring, which saves a few bytes in the ring.
>>>
>>
>> Has anyone tested the performance impact for XDP_DROP?
>> (with standard non-multi-buffer frames)
>>
>> The code change below will always touch cache-lines in the shared_info
>> area. Before, it was guarded by an xdp_buff_has_frags() check.
>>
> 
> I did some basic testing with XDP_DROP previously using the xdp-bench
> tool, and do not recall noticing an issue. I don't recall the actual
> numbers now though, so I did some quick tests again.
> 
> without patch...
> 
> Client:
> $ iperf3 -u -c 192.168.93.1 -t86400 -l1200 -P20 -b5G
> ...
> [SUM]  10.00-10.33  sec   626 MBytes  16.0 Gbits/sec  546909
> 
> Server:
> $ iperf3 -s -B 192.168.93.1%ens260f0
> [SUM]   0.00-10.00  sec  17.7 GBytes  15.2 Gbits/sec  0.011 ms  9712/15888183 (0.061%)  receiver
> 
> $ xdp-bench drop ens260f0
> Summary                 1,778,935 rx/s                  0 err/s
> Summary                 2,041,087 rx/s                  0 err/s
> Summary                 2,005,052 rx/s                  0 err/s
> Summary                 1,918,967 rx/s                  0 err/s
> 
> with patch...
> 
> Client:
> $ iperf3 -u -c 192.168.93.1 -t86400 -l1200 -P20 -b5G
> ...
> [SUM]  78.00-78.90  sec  2.01 GBytes  19.1 Gbits/sec  1801284
> 
> Server:
> $ iperf3 -s -B 192.168.93.1%ens260f0
> [SUM]  77.00-78.00  sec  2.14 GBytes  18.4 Gbits/sec  0.012 ms  9373/1921186 (0.49%)
> 
> xdp-bench:
> $ xdp-bench drop ens260f0
> Dropping packets on ens260f0 (ifindex 8; driver ice)
> Summary                 1,910,918 rx/s                  0 err/s
> Summary                 1,866,562 rx/s                  0 err/s
> Summary                 1,901,233 rx/s                  0 err/s
> Summary                 1,859,854 rx/s                  0 err/s
> Summary                 1,593,493 rx/s                  0 err/s
> Summary                 1,891,426 rx/s                  0 err/s
> Summary                 1,880,673 rx/s                  0 err/s
> Summary                 1,866,043 rx/s                  0 err/s
> Summary                 1,872,845 rx/s                  0 err/s
> 
> 
> I ran it a few times and it seemed to waffle a bit between 15 Gbit/sec
> and 20 Gbit/sec, with throughput varying regardless of which patch was
> applied. I actually tended to see slightly higher numbers with this fix
> applied, but it was not consistent and was hard to measure.
> 

The above testing is not a valid XDP_DROP test.

The packet generator needs to be much, much faster, as XDP_DROP is for
DDoS-protection use-cases (one of Cloudflare's main products).

I recommend using the pktgen script in the kernel tree:
  samples/pktgen/pktgen_sample03_burst_single_flow.sh

Example:
  ./pktgen_sample03_burst_single_flow.sh -vi mlx5p2 -d 198.18.100.1 -m b4:96:91:ad:0b:09 -t $(nproc)
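
(Here -v is verbose, -i selects the egress interface, -d the destination
IP, -m the destination MAC, and -t the number of generator threads, if I
remember the script's options correctly.)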


> without the patch:

On my test lab with an AMD EPYC 9684X CPU (SRSO=IBPB), running:
  - sudo ./xdp-bench drop ice4  # (defaults to no-touch)

XDP_DROP (with no-touch)
  Without patch :  54,052,300 rx/s = 18.50 nanosec/packet
  With the patch:  33,420,619 rx/s = 29.92 nanosec/packet
  Diff: 11.42 nanosec
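  (The per-packet cost is simply 10^9 / pps, e.g. 10^9 / 54,052,300 ≈ 18.50 ns.)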

Using perf stat I can see an increase in cache-misses.
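(E.g. perf stat -e cache-misses,cache-references -C <RX CPU> sleep 10,
with <RX CPU> being the core that runs ksoftirqd for this queue.)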

The difference is smaller if we read packet data, running:
  - sudo ./xdp-bench drop ice4 --packet-operation read-data

XDP_DROP (with read-data)
  Without patch :  27,200,683 rx/s = 36.76 nanosec/packet
  With the patch:  24,348,751 rx/s = 41.07 nanosec/packet
  Diff: 4.31 nanosec

On this CPU we don't have DDIO/DCA, so we take a big hit reading the
packet data in XDP.  Reading the packet data will be needed by our DDoS
bpf_prog.  The nanosec diff is smaller here, so it seems this change can
hide a little behind the cache-misses in the XDP bpf_prog.


> Without xdp-bench running the XDP program, top showed a CPU usage of
> 740% and an idle score of ~86.
> 

We don't want a scaling test for this. For this XDP_DROP/DDoS test we
want to target a single CPU. This is most easily done by generating a
single flow (hint: the pktgen script is called _single_flow). We want to
see a single CPU running ksoftirqd 100% of the time.

> With xdp-bench running, the iperf cpu drops off the top listing and the
> CPU idle score goes up to 99.9
> 

With the single-flow generator DoS "attacking" a single CPU, we want to
see a single CPU running ksoftirqd 100% of the time for the XDP_DROP case.

> 
> with the patch:
> 
> The iperf3 CPU use seems to go up, but so does the throughput. It is
> hard to get an isolated measurement. I don't have a setup for fine-tuned
> performance testing immediately available to do anything more rigorous.
> 
> Personally, I think it's overall in the noise, as I saw the same peak
> performance and CPU usage with and without the patch.
> 
> I also tried testing TCP and didn't see a significant difference with
> or without the patch. Though testing xdp-bench with TCP is not that
> useful, since the client stops transmitting once the packets are
> dropped instead of handled.
> 
> $ iperf3 -c 192.168.93.1 -t86400 -l8000 -P5
> 
> Without patch:
> [SUM]  24.00-25.00  sec  7.80 GBytes  67.0 Gbits/sec
> 
> With patch:
> [SUM]  28.00-29.00  sec  7.85 GBytes  67.4 Gbits/sec
> 
> Again, it ranges from 60 to 68 Gbit/sec in both cases, though I think
> the peak is slightly higher with the fix applied; sometimes I saw it
> spike up to 70 Gbit/sec, but it mostly hovers around 67 Gbit/sec.
> 
> I'm sure there are a lot of factors impacting the performance here, but
> I think there's not much evidence that it's significantly different.
>>> Cc: Christoph Petrausch <christoph.petrausch@...pl.com>
>>> Reported-by: Jaroslav Pulchart <jaroslav.pulchart@...ddata.com>
>>> Closes: https://lore.kernel.org/netdev/CAK8fFZ4hY6GUJNENz3wY9jaYLZXGfpr7dnZxzGMYoE44caRbgw@mail.gmail.com/
>>> Fixes: 743bbd93cf29 ("ice: put Rx buffers after being done with current frame")
>>> Signed-off-by: Jacob Keller <jacob.e.keller@...el.com>
>>> Tested-by: Rinitha S <sx.rinitha@...el.com> (A Contingent worker at Intel)
>>> Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@...el.com>
>>> Tested-by: Maciej Fijalkowski <maciej.fijalkowski@...el.com>
>>> Tested-by: Priya Singh <priyax.singh@...el.com>
>>> Signed-off-by: Tony Nguyen <anthony.l.nguyen@...el.com>
>>> ---
>>>    drivers/net/ethernet/intel/ice/ice_txrx.c | 81 +++++++++--------------
>>>    drivers/net/ethernet/intel/ice/ice_txrx.h |  1 -
>>>    2 files changed, 33 insertions(+), 49 deletions(-)
>>>
>>> diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
>>> index 29e0088ab6b2..93907ab2eac7 100644
>>> --- a/drivers/net/ethernet/intel/ice/ice_txrx.c
>>> +++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
>>> @@ -894,10 +894,6 @@ ice_add_xdp_frag(struct ice_rx_ring *rx_ring, struct xdp_buff *xdp,
>>>    	__skb_fill_page_desc_noacc(sinfo, sinfo->nr_frags++, rx_buf->page,
>>>    				   rx_buf->page_offset, size);
>>>    	sinfo->xdp_frags_size += size;
>>> -	/* remember frag count before XDP prog execution; bpf_xdp_adjust_tail()
>>> -	 * can pop off frags but driver has to handle it on its own
>>> -	 */
>>> -	rx_ring->nr_frags = sinfo->nr_frags;
>>>    
>>>    	if (page_is_pfmemalloc(rx_buf->page))
>>>    		xdp_buff_set_frag_pfmemalloc(xdp);
>>> @@ -968,20 +964,20 @@ ice_get_rx_buf(struct ice_rx_ring *rx_ring, const unsigned int size,
>>>    /**
>>>     * ice_get_pgcnts - grab page_count() for gathered fragments
>>>     * @rx_ring: Rx descriptor ring to store the page counts on
>>> + * @ntc: the next to clean element (not included in this frame!)
>>>     *
>>>     * This function is intended to be called right before running XDP
>>>     * program so that the page recycling mechanism will be able to take
>>>     * a correct decision regarding underlying pages; this is done in such
>>>     * way as XDP program can change the refcount of page
>>>     */
>>> -static void ice_get_pgcnts(struct ice_rx_ring *rx_ring)
>>> +static void ice_get_pgcnts(struct ice_rx_ring *rx_ring, unsigned int ntc)
>>>    {
>>> -	u32 nr_frags = rx_ring->nr_frags + 1;
>>>    	u32 idx = rx_ring->first_desc;
>>>    	struct ice_rx_buf *rx_buf;
>>>    	u32 cnt = rx_ring->count;
>>>    
>>> -	for (int i = 0; i < nr_frags; i++) {
>>> +	while (idx != ntc) {
>>>    		rx_buf = &rx_ring->rx_buf[idx];
>>>    		rx_buf->pgcnt = page_count(rx_buf->page);
>>>    
>>> @@ -1154,62 +1150,48 @@ ice_put_rx_buf(struct ice_rx_ring *rx_ring, struct ice_rx_buf *rx_buf)
>>>    }
>>>    
>>>    /**
>>> - * ice_put_rx_mbuf - ice_put_rx_buf() caller, for all frame frags
>>> + * ice_put_rx_mbuf - ice_put_rx_buf() caller, for all buffers in frame
>>>     * @rx_ring: Rx ring with all the auxiliary data
>>>     * @xdp: XDP buffer carrying linear + frags part
>>> - * @xdp_xmit: XDP_TX/XDP_REDIRECT verdict storage
>>> - * @ntc: a current next_to_clean value to be stored at rx_ring
>>> + * @ntc: the next to clean element (not included in this frame!)
>>>     * @verdict: return code from XDP program execution
>>>     *
>>> - * Walk through gathered fragments and satisfy internal page
>>> - * recycle mechanism; we take here an action related to verdict
>>> - * returned by XDP program;
>>> + * Called after XDP program is completed, or on error with verdict set to
>>> + * ICE_XDP_CONSUMED.
>>> + *
>>> + * Walk through buffers from first_desc to the end of the frame, releasing
>>> + * buffers and satisfying internal page recycle mechanism. The action depends
>>> + * on verdict from XDP program.
>>>     */
>>>    static void ice_put_rx_mbuf(struct ice_rx_ring *rx_ring, struct xdp_buff *xdp,
>>> -			    u32 *xdp_xmit, u32 ntc, u32 verdict)
>>> +			    u32 ntc, u32 verdict)
>>>    {
>>> -	u32 nr_frags = rx_ring->nr_frags + 1;
>>> +	u32 nr_frags = xdp_get_shared_info_from_buff(xdp)->nr_frags;
>>
>> Here we unconditionally access the skb_shared_info area.
>>
>>>    	u32 idx = rx_ring->first_desc;
>>>    	u32 cnt = rx_ring->count;
>>> -	u32 post_xdp_frags = 1;
>>>    	struct ice_rx_buf *buf;
>>> -	int i;
>>> -
>>> -	if (unlikely(xdp_buff_has_frags(xdp)))
>>
>> Previously we only touched the shared_info area if this was a multi-buff frame.
>>
> 
> I'm not certain, but reading the helpers it might be correct to do
> something like this:
> 
> if (unlikely(xdp_buff_has_frags(xdp)))
>    nr_frags = xdp_get_shared_info_from_buff(xdp)->nr_frags;
> else
>    nr_frags = 1;

Yes, that looks like a correct pattern.
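E.g. as a small driver-local (or generic XDP) helper. Untested sketch,
and ice_xdp_nr_frags() is just a name I made up here:

  /* Only touch the skb_shared_info cache-lines when the xdp_buff
   * actually carries frags; single-buffer frames take the cheap path.
   */
  static u32 ice_xdp_nr_frags(const struct xdp_buff *xdp)
  {
          if (unlikely(xdp_buff_has_frags(xdp)))
                  return xdp_get_shared_info_from_buff(xdp)->nr_frags;

          return 0;
  }

(Whether the no-frags case should report 0 or 1 depends on whether the
caller counts the linear buffer.)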

> either in the driver code or by adding a new xdp helper function.
> 
> I'm not sure it's worth it though. We have pending work from our
> development team to refactor ice to use page pool and switch to the
> libeth XDP helpers, which eliminates all of this driver-specific logic.

Please do proper testing of XDP_DROP case when doing this change.

> I don't personally think it's worth holding up this series and this
> important memory-leak fix for a minor potential code change from which
> I can't measure an obvious improvement.

IMHO you included an optimization (that wasn't a gain) in a fix patch.
I think you can fix the memory leak without the "optimization" part.

pw-bot: cr

--Jesper

