Message-ID: <d50ac1a9-f1e2-49ee-b89b-05dac9bc6ee1@huawei.com>
Date: Thu, 5 Sep 2024 18:47:22 +0800
From: Yunsheng Lin <linyunsheng@...wei.com>
To: Jakub Kicinski <kuba@...nel.org>, <netdev@...r.kernel.org>
CC: <davem@...emloft.net>, <edumazet@...gle.com>, <pabeni@...hat.com>,
<ilias.apalodimas@...aro.org>, Jesper Dangaard Brouer <hawk@...nel.org>,
Alexander Duyck <alexander.duyck@...il.com>, Yonglong Liu
<liuyonglong@...wei.com>, <fanghaiqing@...wei.com>, "Zhangkun(Ken,Turing)"
<zhangkun09@...wei.com>
Subject: Re: [RFC net] net: make page pool stall netdev unregistration to
avoid IOMMU crashes
On 2024/8/6 23:16, Jakub Kicinski wrote:
> There appears to be no clean way to hold onto the IOMMU, so page pool
> cannot outlast the driver which created it. We have no way to stall
> the driver unregister, but we can use netdev unregistration as a proxy.
>
> Note that page pool pages may last forever, we have seen it happen
> e.g. when application leaks a socket and page is stuck in its rcv queue.
I am assuming the page will be released when the application dies or
exits, right?
Also, is the above application a privileged one or not?
If it is not privileged, perhaps we need to fix the above problem in
the kernel, as it does not seem to make sense for an unprivileged
application to make the kernel leak pages and stall the unregistering
of devices.
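
To make the leaked-socket case concrete, below is a minimal user-space
sketch (my own illustration, not from the original report): a process
binds a socket, lets the peer fill its receive queue, and then never
reads or closes it. On a real NIC rx path, the skbs sitting in that
receive queue would keep their page_pool pages pinned until the process
reads the data, closes the socket, or exits:

/* Illustration only: a socket whose owner never reads or closes it. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
	struct sockaddr_in addr = {
		.sin_family = AF_INET,
		.sin_port = htons(9000),	/* arbitrary port */
		.sin_addr.s_addr = htonl(INADDR_ANY),
	};
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	if (fd < 0 || bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		perror("socket/bind");
		return 1;
	}

	/* A peer now sends datagrams to port 9000.  We never call
	 * recvmsg(), so the skbs pile up in sk_receive_queue (up to the
	 * rcvbuf limit) and stay there for the lifetime of the process.
	 */
	pause();
	return 0;
}
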
> Hopefully this is fine in this particular case, as we will only stall
> unregistering of devices which want the page pool to manage the DMA
> mapping for them, i.e. HW backed netdevs. And obviously keeping
> the netdev around is preferable to a crash.
From our internal testing and debugging, there seem to be at least
two cases where pages are not released quickly enough:
1. IPv4 packet defragmentation timeout: this seems to cause a delay
of up to 30 secs:
#define IP_FRAG_TIME (30 * HZ) /* fragment lifetime */
2. skb_defer_free_flush(): this may cause an indefinite delay if there
is nothing to trigger net_rx_action(). Below is the dump_stack() output
taken when the page is returned to the page_pool after reloading the
driver, which triggers net_rx_action() (see the sketch after the trace):
[ 515.286580] Call trace:
[ 515.289012] dump_backtrace+0x9c/0x100
[ 515.292748] show_stack+0x20/0x38
[ 515.296049] dump_stack_lvl+0x78/0x90
[ 515.299699] dump_stack+0x18/0x28
[ 515.303001] page_pool_put_unrefed_netmem+0x2c4/0x3d0
[ 515.308039] napi_pp_put_page+0xb4/0xe0
[ 515.311863] skb_release_data+0xf8/0x1e0
[ 515.315772] kfree_skb_list_reason+0xb4/0x2a0
[ 515.320115] skb_release_data+0x148/0x1e0
[ 515.324111] napi_consume_skb+0x64/0x190
[ 515.328021] net_rx_action+0x110/0x2a8
[ 515.331758] handle_softirqs+0x120/0x368
[ 515.335668] __do_softirq+0x1c/0x28
[ 515.339143] ____do_softirq+0x18/0x30
[ 515.342792] call_on_irq_stack+0x24/0x58
[ 515.346701] do_softirq_own_stack+0x24/0x38
[ 515.350871] irq_exit_rcu+0x94/0xd0
[ 515.354347] el1_interrupt+0x38/0x68
[ 515.357910] el1h_64_irq_handler+0x18/0x28
[ 515.361994] el1h_64_irq+0x64/0x68
[ 515.365382] default_idle_call+0x34/0x140
[ 515.369378] do_idle+0x20c/0x270
[ 515.372593] cpu_startup_entry+0x40/0x50
[ 515.376503] secondary_start_kernel+0x138/0x160
[ 515.381021] __secondary_switched+0xb8/0xc0
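
For reference, a rough paraphrase of the deferred-free path as I
understand it (not the verbatim kernel code): skbs freed on a CPU other
than the one that allocated them are queued per-CPU by
skb_attempt_defer_free() and are only actually released from
net_rx_action()/softirq context, so their page_pool pages stay pinned
until something schedules the NET_RX softirq on that CPU:

static void skb_defer_free_flush(struct softnet_data *sd)
{
	struct sk_buff *skb, *next;

	/* Nothing queued by skb_attempt_defer_free() on this CPU */
	if (!READ_ONCE(sd->defer_list))
		return;

	spin_lock(&sd->defer_lock);
	skb = sd->defer_list;
	sd->defer_list = NULL;
	sd->defer_count = 0;
	spin_unlock(&sd->defer_lock);

	/* Only at this point do the skbs -- and the page_pool pages they
	 * hold -- get freed; if nothing schedules the NET_RX softirq on
	 * this CPU, the pages can stay pinned indefinitely.
	 */
	while (skb) {
		next = skb->next;
		napi_consume_skb(skb, 1);
		skb = next;
	}
}
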
>
> More work is needed for weird drivers which share one pool among
> multiple netdevs, as they are not allowed to set the pp->netdev
> pointer. We probably need to add a bit that says "don't expose
> to uAPI for them".
Which drivers are we talking about here that share one pool among
multiple netdevs? Is the sharing done to save memory?
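
If such a bit is added, I guess it could look something like the sketch
below (the flag and helper names here are purely hypothetical, made up
for illustration, not existing kernel API):

/* Hypothetical: a pool shared by several netdevs has no single owner
 * to put in pp->netdev, so it would opt out of the netlink/uAPI
 * listing instead.
 */
#define PP_FLAG_SHARED_NO_UAPI	BIT(8)	/* made-up flag name */

static bool pp_visible_to_uapi(const struct net_device *pp_netdev,
			       unsigned int pp_flags)
{
	/* Only pools with a single owning netdev and without the
	 * opt-out bit would be reported via the page pool netlink API.
	 */
	return pp_netdev && !(pp_flags & PP_FLAG_SHARED_NO_UAPI);
}
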
>