Message-ID: <d50ac1a9-f1e2-49ee-b89b-05dac9bc6ee1@huawei.com>
Date: Thu, 5 Sep 2024 18:47:22 +0800
From: Yunsheng Lin <linyunsheng@...wei.com>
To: Jakub Kicinski <kuba@...nel.org>, <netdev@...r.kernel.org>
CC: <davem@...emloft.net>, <edumazet@...gle.com>, <pabeni@...hat.com>,
<ilias.apalodimas@...aro.org>, Jesper Dangaard Brouer <hawk@...nel.org>,
Alexander Duyck <alexander.duyck@...il.com>, Yonglong Liu
<liuyonglong@...wei.com>, <fanghaiqing@...wei.com>, "Zhangkun(Ken,Turing)"
<zhangkun09@...wei.com>
Subject: Re: [RFC net] net: make page pool stall netdev unregistration to
avoid IOMMU crashes
On 2024/8/6 23:16, Jakub Kicinski wrote:
> There appears to be no clean way to hold onto the IOMMU, so page pool
> cannot outlast the driver which created it. We have no way to stall
> the driver unregister, but we can use netdev unregistration as a proxy.
>
> Note that page pool pages may last forever, we have seen it happen
> e.g. when application leaks a socket and page is stuck in its rcv queue.
I am assuming the page will be released when the application dies or
exits, right?
Also, is the above application a privileged one or not?
If it is not privileged, perhaps we need to fix the above problem in
the kernel, as it does not seem to make sense for an unprivileged
application to make the kernel leak pages and stall the unregistering
of devices.
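
To make the leaked-socket case concrete, below is a minimal user-space
sketch (my own illustration, not from the original report): a process
binds a socket, lets the peer fill its receive queue, and then never
reads or closes it. On a real NIC rx path, the skbs sitting in that
receive queue would keep their page_pool pages pinned until the process
reads the data, closes the socket, or exits:

/* Illustration only: a socket whose owner never reads or closes it. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
	struct sockaddr_in addr = {
		.sin_family = AF_INET,
		.sin_port = htons(9000),	/* arbitrary port */
		.sin_addr.s_addr = htonl(INADDR_ANY),
	};
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	if (fd < 0 || bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		perror("socket/bind");
		return 1;
	}

	/* A peer now sends datagrams to port 9000.  We never call
	 * recvmsg(), so the skbs pile up in sk_receive_queue (up to the
	 * rcvbuf limit) and stay there for the lifetime of the process.
	 */
	pause();
	return 0;
}
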
> Hopefully this is fine in this particular case, as we will only stall
> unregistering of devices which want the page pool to manage the DMA
> mapping for them, i.e. HW backed netdevs. And obviously keeping
> the netdev around is preferable to a crash.
From our internal testing and debugging, there seem to be at least
two cases where pages are not released quickly enough:
1. IPv4 packet defragmentation timeout: this seems to cause a delay
of up to 30 secs:
#define IP_FRAG_TIME (30 * HZ) /* fragment lifetime */
2. skb_defer_free_flush(): this may cause an indefinite delay if there
is nothing to trigger net_rx_action(). Below is the dump_stack() output
taken when the page is returned to the page_pool after reloading the
driver, which triggers net_rx_action() (see the sketch after the trace):
[ 515.286580] Call trace:
[ 515.289012] dump_backtrace+0x9c/0x100
[ 515.292748] show_stack+0x20/0x38
[ 515.296049] dump_stack_lvl+0x78/0x90
[ 515.299699] dump_stack+0x18/0x28
[ 515.303001] page_pool_put_unrefed_netmem+0x2c4/0x3d0
[ 515.308039] napi_pp_put_page+0xb4/0xe0
[ 515.311863] skb_release_data+0xf8/0x1e0
[ 515.315772] kfree_skb_list_reason+0xb4/0x2a0
[ 515.320115] skb_release_data+0x148/0x1e0
[ 515.324111] napi_consume_skb+0x64/0x190
[ 515.328021] net_rx_action+0x110/0x2a8
[ 515.331758] handle_softirqs+0x120/0x368
[ 515.335668] __do_softirq+0x1c/0x28
[ 515.339143] ____do_softirq+0x18/0x30
[ 515.342792] call_on_irq_stack+0x24/0x58
[ 515.346701] do_softirq_own_stack+0x24/0x38
[ 515.350871] irq_exit_rcu+0x94/0xd0
[ 515.354347] el1_interrupt+0x38/0x68
[ 515.357910] el1h_64_irq_handler+0x18/0x28
[ 515.361994] el1h_64_irq+0x64/0x68
[ 515.365382] default_idle_call+0x34/0x140
[ 515.369378] do_idle+0x20c/0x270
[ 515.372593] cpu_startup_entry+0x40/0x50
[ 515.376503] secondary_start_kernel+0x138/0x160
[ 515.381021] __secondary_switched+0xb8/0xc0
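
For reference, a rough paraphrase of the deferred-free path as I
understand it (not the verbatim kernel code): skbs freed on a CPU other
than the one that allocated them are queued per-CPU by
skb_attempt_defer_free() and are only actually released from
net_rx_action()/softirq context, so their page_pool pages stay pinned
until something schedules the NET_RX softirq on that CPU:

static void skb_defer_free_flush(struct softnet_data *sd)
{
	struct sk_buff *skb, *next;

	/* Nothing queued by skb_attempt_defer_free() on this CPU */
	if (!READ_ONCE(sd->defer_list))
		return;

	spin_lock(&sd->defer_lock);
	skb = sd->defer_list;
	sd->defer_list = NULL;
	sd->defer_count = 0;
	spin_unlock(&sd->defer_lock);

	/* Only at this point do the skbs -- and the page_pool pages they
	 * hold -- get freed; if nothing schedules the NET_RX softirq on
	 * this CPU, the pages can stay pinned indefinitely.
	 */
	while (skb) {
		next = skb->next;
		napi_consume_skb(skb, 1);
		skb = next;
	}
}
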
>
> More work is needed for weird drivers which share one pool among
> multiple netdevs, as they are not allowed to set the pp->netdev
> pointer. We probably need to add a bit that says "don't expose
> to uAPI for them".
Which drivers are we talking about here that share one pool among
multiple netdevs? Is the sharing done to save memory?
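
If such a bit is added, I guess it could look something like the sketch
below (the flag and helper names here are purely hypothetical, made up
for illustration, not existing kernel API):

/* Hypothetical: a pool shared by several netdevs has no single owner
 * to put in pp->netdev, so it would opt out of the netlink/uAPI
 * listing instead.
 */
#define PP_FLAG_SHARED_NO_UAPI	BIT(8)	/* made-up flag name */

static bool pp_visible_to_uapi(const struct net_device *pp_netdev,
			       unsigned int pp_flags)
{
	/* Only pools with a single owning netdev and without the
	 * opt-out bit would be reported via the page pool netlink API.
	 */
	return pp_netdev && !(pp_flags & PP_FLAG_SHARED_NO_UAPI);
}
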
>