Message-ID: <f2e43212-dc49-4f87-9bbc-53a77f3523e5@intel.com>
Date: Mon, 30 Jun 2025 11:59:16 -0700
From: Jacob Keller <jacob.e.keller@...el.com>
To: Jaroslav Pulchart <jaroslav.pulchart@...ddata.com>, Maciej Fijalkowski
	<maciej.fijalkowski@...el.com>
CC: Jakub Kicinski <kuba@...nel.org>, Przemek Kitszel
	<przemyslaw.kitszel@...el.com>, "intel-wired-lan@...ts.osuosl.org"
	<intel-wired-lan@...ts.osuosl.org>, "Damato, Joe" <jdamato@...tly.com>,
	"netdev@...r.kernel.org" <netdev@...r.kernel.org>, "Nguyen, Anthony L"
	<anthony.l.nguyen@...el.com>, Michal Swiatkowski
	<michal.swiatkowski@...ux.intel.com>, "Czapnik, Lukasz"
	<lukasz.czapnik@...el.com>, "Dumazet, Eric" <edumazet@...gle.com>, "Zaki,
 Ahmed" <ahmed.zaki@...el.com>, Martin Karsten <mkarsten@...terloo.ca>, "Igor
 Raits" <igor@...ddata.com>, Daniel Secik <daniel.secik@...ddata.com>, "Zdenek
 Pesek" <zdenek.pesek@...ddata.com>
Subject: Re: [Intel-wired-lan] Increased memory usage on NUMA nodes with ICE
 driver after upgrade to 6.13.y (regression in commit 492a044508ad)



On 6/30/2025 10:24 AM, Jaroslav Pulchart wrote:
>>
>>
>>
>> On 6/30/2025 12:35 AM, Jaroslav Pulchart wrote:
>>>>
>>>>>
>>>>> On Wed, 25 Jun 2025 19:51:08 +0200 Jaroslav Pulchart wrote:
>>>>>> Great, please send me a link to the related patch set. I can apply them in
>>>>>> our kernel build and try them ASAP!
>>>>>
>>>>> Sorry if I'm repeating the question - have you tried
>>>>> CONFIG_MEM_ALLOC_PROFILING? Reportedly the overhead in recent kernels
>>>>> is low enough to use it for production workloads.
>>>>
>>>> I'm trying it now; from the freshly booted server:
>>>>
>>>> # sort -g /proc/allocinfo| tail -n 15
>>>>     45409728   236509 fs/dcache.c:1681 func:__d_alloc
>>>>     71041024    17344 mm/percpu-vm.c:95 func:pcpu_alloc_pages
>>>>     71524352    11140 kernel/dma/direct.c:141 func:__dma_direct_alloc_pages
>>>>     85098496     4486 mm/slub.c:2452 func:alloc_slab_page
>>>>    115470992   101647 fs/ext4/super.c:1388 [ext4] func:ext4_alloc_inode
>>>>    134479872    32832 kernel/events/ring_buffer.c:811 func:perf_mmap_alloc_page
>>>>    141426688    34528 mm/filemap.c:1978 func:__filemap_get_folio
>>>>    191594496    46776 mm/memory.c:1056 func:folio_prealloc
>>>>    360710144      172 mm/khugepaged.c:1084 func:alloc_charge_folio
>>>>    444076032    33790 mm/slub.c:2450 func:alloc_slab_page
>>>>    530579456   129536 mm/page_ext.c:271 func:alloc_page_ext
>>>>    975175680      465 mm/huge_memory.c:1165 func:vma_alloc_anon_folio_pmd
>>>>   1022427136   249616 mm/memory.c:1054 func:folio_prealloc
>>>>   1105125376   139252 drivers/net/ethernet/intel/ice/ice_txrx.c:681 [ice] func:ice_alloc_mapped_page
>>>>   1621598208   395848 mm/readahead.c:186 func:ractl_alloc_folio
>>>>
>>>
>>> The "drivers/net/ethernet/intel/ice/ice_txrx.c:681 [ice]
>>> func:ice_alloc_mapped_page" is just growing...
>>>
>>> # uptime ; sort -g /proc/allocinfo| tail -n 15
>>>  09:33:58 up 4 days, 6 min,  1 user,  load average: 6.65, 8.18, 9.81
>>>
>>> # sort -g /proc/allocinfo| tail -n 15
>>>     85216896   443838 fs/dcache.c:1681 func:__d_alloc
>>>    106156032    25917 mm/shmem.c:1854 func:shmem_alloc_folio
>>>    116850096   102861 fs/ext4/super.c:1388 [ext4] func:ext4_alloc_inode
>>>    134479872    32832 kernel/events/ring_buffer.c:811 func:perf_mmap_alloc_page
>>>    143556608     6894 mm/slub.c:2452 func:alloc_slab_page
>>>    186793984    45604 mm/memory.c:1056 func:folio_prealloc
>>>    362807296    88576 mm/percpu-vm.c:95 func:pcpu_alloc_pages
>>>    530579456   129536 mm/page_ext.c:271 func:alloc_page_ext
>>>    598237184    51309 mm/slub.c:2450 func:alloc_slab_page
>>>    838860800      400 mm/huge_memory.c:1165 func:vma_alloc_anon_folio_pmd
>>>    929083392   226827 mm/filemap.c:1978 func:__filemap_get_folio
>>>   1034657792   252602 mm/memory.c:1054 func:folio_prealloc
>>>   1262485504      602 mm/khugepaged.c:1084 func:alloc_charge_folio
>>>   1335377920   325970 mm/readahead.c:186 func:ractl_alloc_folio
>>>   2544877568   315003 drivers/net/ethernet/intel/ice/ice_txrx.c:681 [ice] func:ice_alloc_mapped_page
>>>
>> ice_alloc_mapped_page is the function used to allocate the pages for the
>> Rx ring buffers.
>>
>> There were a number of fixes for the hot path from Maciej which might
>> be related. Although those fixes were primarily for XDP, they do impact
>> the regular hot path as well.
>>
>> These were fixes on top of work he did which landed in v6.13, so it
>> seems plausible they might be related. In particular, one of them
>> mentions a missing buffer put:
>>
>> 743bbd93cf29 ("ice: put Rx buffers after being done with current frame")
>>
>> It says the following:
>>>     While at it, address an error path of ice_add_xdp_frag() - we were
>>>     missing buffer putting from day 1 there.
>>>
>>
>> It seems to me the issue must be somehow related to the buffer cleanup
>> logic for the Rx ring, since that's the only thing allocated by
>> ice_alloc_mapped_page.
>>
>> It might be something fixed by the work Maciej did, but it seems very
>> weird that 492a044508ad ("ice: Add support for persistent NAPI config")
>> would affect that logic at all...
> 
> I believe there were/are at least two separate issues. Regarding
> commit 492a044508ad (“ice: Add support for persistent NAPI config”):
> * On 6.13.y and 6.14.y kernels, this change prevented us from lowering
> the driver's initial, large memory allocation immediately after server
> power-up. A few hours (at most a few days) later, this inevitably led
> to an out-of-memory condition.
> * Reverting the commit in those series only delayed the OOM: it
> allowed the queue size (and thus the memory footprint) to shrink on
> boot, just as it did in 6.12.y, but it didn't eliminate the underlying
> 'leak'.
> * In 6.15.y, however, that revert isn't required (and isn't even
> applicable): the after-boot allocation can once again be tuned down
> without patching. Still, we observe the same increase in memory use
> over time, as shown in the /proc/allocinfo output.
> Thus, commit 492a044508ad led us down a false trail, or at the very
> least hastened the inevitable OOM.

That seems reasonable. I'm still surprised the specific commit leads to
any large increase in memory, since it should only be a few bytes per
NAPI. But there may be some related driver-specific issues.

Either way, we clearly need to isolate how we're leaking memory in the
hot path. I think it might be related to the fixes from Maciej, which are
pretty recent and so might not be in 6.13 or 6.14.
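In case it helps narrow this down, one way to spot the growing call sites
between two samples is to diff snapshots of /proc/allocinfo rather than
eyeballing the sorted tails. A rough sketch (not an official tool; it
assumes the "<bytes> <calls> <site>" line format shown above and skips
anything else, such as the version header):

```python
#!/usr/bin/env python3
# Diff two /proc/allocinfo snapshots and report the call sites whose
# byte totals grew the most. Assumed line format:
#   <bytes> <calls> <file:line> [module] func:<name>
import sys

def parse(text):
    """Map each allocation site to its (bytes, calls) totals."""
    sites = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 3 or not parts[0].isdigit():
            continue  # skip the version header and blank lines
        sites[" ".join(parts[2:])] = (int(parts[0]), int(parts[1]))
    return sites

def diff(old_text, new_text, top=15):
    """Return the top sites by byte growth as (dbytes, dcalls, site)."""
    old, new = parse(old_text), parse(new_text)
    deltas = []
    for site, (nbytes, ncalls) in new.items():
        obytes, ocalls = old.get(site, (0, 0))
        deltas.append((nbytes - obytes, ncalls - ocalls, site))
    deltas.sort(reverse=True)
    return deltas[:top]

if __name__ == "__main__" and len(sys.argv) >= 3:
    # e.g. cp /proc/allocinfo snap1; sleep 3600; cp /proc/allocinfo snap2
    with open(sys.argv[1]) as f1, open(sys.argv[2]) as f2:
        for dbytes, dcalls, site in diff(f1.read(), f2.read()):
            print(f"{dbytes:>12} {dcalls:>8} {site}")
```

Run against a snapshot taken right after boot and one taken hours later,
a steady positive delta that never drains for the ice_txrx.c:681 site
would be consistent with leaked Rx buffer pages rather than normal ring
occupancy.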

