Message-ID: <67a5ef2a-83bc-4b35-9eac-5b9297dfeb2d@intel.com>
Date: Mon, 8 Dec 2025 16:05:27 -0800
From: Jacob Keller <jacob.e.keller@...el.com>
To: Marcus Wichelmann <marcus.wichelmann@...zner-cloud.de>, Tony Nguyen
<anthony.l.nguyen@...el.com>, Przemek Kitszel <przemyslaw.kitszel@...el.com>,
Andrew Lunn <andrew+netdev@...n.ch>, "David S. Miller" <davem@...emloft.net>,
Eric Dumazet <edumazet@...gle.com>, Jakub Kicinski <kuba@...nel.org>, "Paolo
Abeni" <pabeni@...hat.com>, <intel-wired-lan@...ts.osuosl.org>, Netdev
<netdev@...r.kernel.org>, <linux-kernel@...r.kernel.org>
CC: <sdn@...zner-cloud.de>
Subject: Re: [Intel-wired-lan] [BUG] ice: Temporary packet processing overload
causes permanent RX drops
On 12/5/2025 6:01 AM, Marcus Wichelmann wrote:
> Hi there, I broke some network cards again. This time I noticed continuous RX packet drops with an Intel E810-XXV.
>
> When such a card temporarily (just for a few seconds) receives a large flood of packets and the kernel cannot keep
> up with processing them, the following appears in the kernel log:
>
> kernel: ice 0000:c7:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x002b address=0x4000180000 flags=0x0020]
> kernel: ice 0000:c7:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x002b address=0x4000180000 flags=0x0020]
> kernel: workqueue: ice_rx_dim_work [ice] hogged CPU for >10000us 4 times, consider switching to WQ_UNBOUND
> kernel: workqueue: drm_fb_helper_damage_work hogged CPU for >10000us 4 times, consider switching to WQ_UNBOUND
> kernel: workqueue: drm_fb_helper_damage_work hogged CPU for >10000us 5 times, consider switching to WQ_UNBOUND
> kernel: workqueue: ice_rx_dim_work [ice] hogged CPU for >10000us 5 times, consider switching to WQ_UNBOUND
> kernel: workqueue: psi_avgs_work hogged CPU for >10000us 4 times, consider switching to WQ_UNBOUND
> kernel: ice 0000:c7:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x002b address=0x4000180000 flags=0x0020]
> kernel: workqueue: drm_fb_helper_damage_work hogged CPU for >10000us 7 times, consider switching to WQ_UNBOUND
> kernel: workqueue: ice_rx_dim_work [ice] hogged CPU for >10000us 7 times, consider switching to WQ_UNBOUND
I am a bit curious why ice_rx_dim_work hogs so much CPU here...
> kernel: workqueue: psi_avgs_work hogged CPU for >10000us 5 times, consider switching to WQ_UNBOUND
> kernel: ice 0000:c7:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x002b address=0x4000180000 flags=0x0020]
> ...
>
> After that, the NIC seems to be in a permanently broken state and continues to drop a few percent of the received
> packets, even at low data rates. When reducing the incoming packet rate to just 10,000 pps, I can see over 500 pps
> of that being dropped. After reinitializing the NIC (e.g. by changing the channel count using ethtool), the error
> goes away and it's rock solid again. Until the next packet flood.
>
A reset likely causes a bunch of stuff to get flushed and reconfigured.
> We have reproduced this with:
> Linux 6.8.0-88-generic (Ubuntu 24.04)
> Linux 6.14.0-36-generic (Ubuntu 24.04 HWE)
> Linux 6.18.0-061800-generic (Ubuntu Mainline PPA)
>
I think we recently merged a bunch of work on the Rx path as part of our
conversion to page pool. It would be interesting to see if those changes
impact this. Clearly the issue goes back some time, though, since it
reproduces on v6.8 at least.
> CPU: AMD EPYC 9825 144-Core Processor (288 threads)
>
> lspci | grep Ethernet
> c7:00.0 Ethernet controller: Intel Corporation Ethernet Controller E810-XXV for SFP (rev 02)
> c7:00.1 Ethernet controller: Intel Corporation Ethernet Controller E810-XXV for SFP (rev 02)
>
> ethtool -i eth0
> driver: ice
> version: 6.18.0-061800-generic
> firmware-version: 4.90 0x80020ef6 1.3863.0
> expansion-rom-version:
> bus-info: 0000:c7:00.0
> supports-statistics: yes
> supports-test: yes
> supports-eeprom-access: yes
> supports-register-dump: yes
> supports-priv-flags: yes
>
> ethtool -l eth0
> Channel parameters for eth0:
> Pre-set maximums:
> RX: 288
> TX: 288
> Other: 1
> Combined: 288
> Current hardware settings:
> RX: 0
> TX: 32
> Other: 1
> Combined: 256
> These are the defaults after boot.
>
> ethtool -S eth0 | grep rx_dropped
> rx_dropped: 7206525
> rx_dropped.nic: 0
> ethtool -S eth1 | grep rx_dropped
> rx_dropped: 6889634
> rx_dropped.nic: 0
>
Interesting. From reviewing the code, the rx_dropped counter appears to
be the hardware Rx discard counter read from GLV_RDPC, which means it's
definitely the hardware that is dropping the packets. Possibly because
the rings are full and somehow don't get cleared even after the
traffic stops...
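For context, the driver folds that 32-bit discard register into the stat
the same way it handles any rollover-prone hardware counter: keep the last
raw snapshot and accumulate the unsigned delta. A simplified user-space
sketch of that read pattern (the struct and helper names below are made up
for illustration, not the actual ice symbols):

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-VSI stat state: last raw snapshot plus the
 * accumulated, rollover-corrected total.
 */
struct vsi_stat {
	bool offset_loaded;	/* false right after (re)initialization */
	uint32_t prev;		/* last raw register snapshot */
	uint64_t total;		/* what ends up in rx_dropped */
};

/* Fold a fresh read of a GLV_RDPC-style register into the running total.
 * The register is a free-running 32-bit counter, so the delta has to be
 * taken modulo 2^32 against the previous snapshot.
 */
static void stat_update32(struct vsi_stat *s, uint32_t reg_val)
{
	if (!s->offset_loaded) {
		/* First read after init: just establish the baseline. */
		s->prev = reg_val;
		s->offset_loaded = true;
		return;
	}

	s->total += (uint32_t)(reg_val - s->prev);	/* wraps correctly */
	s->prev = reg_val;
}

Either way, since the value is read straight from the device, a rising
rx_dropped means the packets are discarded before the stack ever sees them.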
> How to reproduce:
>
> 1. Use another host to flood the host with the E810 NIC with 64-byte UDP packets. I used trafgen for that and
> made sure that the source ports are randomized so that RSS spreads the load over all channels. The packet rate must
> be high enough to overload the packet processing on the receiving host.
> In my case, 4 Mpps was already enough to make the errors show up in the kernel log and trigger the permanent packet
> loss, but the required packet rate may depend on how CPU-intensive the processing of each packet is. Dropping packets
> early (e.g. using iptables) makes reproducing harder.
>
> 2. Monitor the rx_dropped counter and the kernel log. After a few seconds, the warnings/errors shown above should appear in
> the kernel log.
>
> 3. Stop the traffic generator and re-run it with a much lower packet rate, e.g. 10,000 pps. Now it can be seen that
> a good part of these packets is being dropped, even though the kernel could easily keep up with this small packet rate.
>
I assume the rx_dropped counter is still incrementing here?
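Side note on step 1: the key part is randomizing the UDP source port so the
RSS hash spreads the flood across all Rx queues. A minimal raw-socket sketch
of that idea is below; the addresses are placeholders, it needs root, and it
won't come anywhere near the multi-Mpps rates required, so trafgen or pktgen
remains the right tool for the actual reproduction.

/* Tiny UDP flooder with randomized source ports (placeholder addresses).
 * 18 bytes of payload gives a 64-byte frame on the wire.
 * Build: gcc -O2 -o flood flood.c    Run: sudo ./flood
 */
#include <arpa/inet.h>
#include <netinet/ip.h>
#include <netinet/udp.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define SRC_IP   "192.0.2.1"	/* placeholder sender address */
#define DST_IP   "192.0.2.2"	/* placeholder target address */
#define DST_PORT 9		/* discard port on the target */

int main(void)
{
	/* IPPROTO_RAW implies IP_HDRINCL: we provide the full IPv4 header. */
	int fd = socket(AF_INET, SOCK_RAW, IPPROTO_RAW);
	if (fd < 0) {
		perror("socket (needs root)");
		return 1;
	}

	unsigned char pkt[sizeof(struct iphdr) + sizeof(struct udphdr) + 18];
	struct iphdr *ip = (struct iphdr *)pkt;
	struct udphdr *udp = (struct udphdr *)(pkt + sizeof(*ip));

	memset(pkt, 0, sizeof(pkt));
	ip->version  = 4;
	ip->ihl      = 5;
	ip->ttl      = 64;
	ip->protocol = IPPROTO_UDP;
	ip->tot_len  = htons(sizeof(pkt));
	ip->saddr    = inet_addr(SRC_IP);
	ip->daddr    = inet_addr(DST_IP);
	/* The kernel fills in the IP checksum for IPPROTO_RAW sockets. */

	udp->dest = htons(DST_PORT);
	udp->len  = htons(sizeof(struct udphdr) + 18);
	/* UDP checksum left at 0, which is allowed for IPv4. */

	struct sockaddr_in dst = {
		.sin_family = AF_INET,
		.sin_addr.s_addr = ip->daddr,
	};

	for (;;) {
		/* New source port per packet -> new RSS hash -> new queue. */
		udp->source = htons(1024 + (rand() % 60000));
		if (sendto(fd, pkt, sizeof(pkt), 0,
			   (struct sockaddr *)&dst, sizeof(dst)) < 0) {
			perror("sendto");
			break;
		}
	}

	close(fd);
	return 0;
}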
> In my case the two ports of the E810 NIC were part of a bond, but I don't think this is required to reproduce the
> issue.
>
> Please let me know if there is more information I could provide.
>
> Thanks,
> Marcus