linux-kernel - Re: [PATCH net-next v3 3/3] page_pool: fix IOMMU crash when driver has already unbound

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <cf1911c5-622f-484c-9ee5-11e1ac83da24@huawei.com>
Date: Sat, 26 Oct 2024 15:32:35 +0800
From: Yunsheng Lin <linyunsheng@...wei.com>
To: Toke Høiland-Jørgensen <toke@...hat.com>, Jesper
 Dangaard Brouer <hawk@...nel.org>, <davem@...emloft.net>, <kuba@...nel.org>,
	<pabeni@...hat.com>
CC: <zhangkun09@...wei.com>, <fanghaiqing@...wei.com>,
	<liuyonglong@...wei.com>, Robin Murphy <robin.murphy@....com>, Alexander
 Duyck <alexander.duyck@...il.com>, IOMMU <iommu@...ts.linux.dev>, Andrew
 Morton <akpm@...ux-foundation.org>, Eric Dumazet <edumazet@...gle.com>, Ilias
 Apalodimas <ilias.apalodimas@...aro.org>, <linux-mm@...ck.org>,
	<linux-kernel@...r.kernel.org>, <netdev@...r.kernel.org>, kernel-team
	<kernel-team@...udflare.com>
Subject: Re: [PATCH net-next v3 3/3] page_pool: fix IOMMU crash when driver
 has already unbound

On 2024/10/25 19:16, Toke Høiland-Jørgensen wrote:

...

>>
>> You and Jesper seems to be mentioning a possible fact that there might
>> be 'hundreds of gigs of memory' needed for inflight pages, it would be nice
>> to provide more info or reasoning above why 'hundreds of gigs of memory' is
>> needed here so that we don't do a over-designed thing to support recording
>> unlimited in-flight pages if the driver unbound stalling turns out impossible
>> and the inflight pages do need to be recorded.
> 
> I don't have a concrete example of a use that will blow the limit you
> are setting (but maybe Jesper does), I am simply objecting to the
> arbitrary imposing of any limit at all. It smells a lot of "640k ought
> to be enough for anyone".

>From Jesper's link, it seems there is still some limit for those memory and
the limit seems to be configured dynamically.

Currently the number of supporting in-flight pages is depending on the rx
queue number and their queue depth in this patch, as mentioned before, it
always can be dynamic at some cost of complexity and maybe some overhead.

> 
>> I guess it is common sense to start with easy one until someone complains
>> with some testcase and detailed reasoning if we need to go the hard way as
>> you and Jesper are also prefering waiting over having to record the inflight
>> pages.
> 
> AFAIU Jakub's comment on his RFC patch for waiting, he was suggesting
> exactly this: Add the wait, and see if the cases where it can stall turn
> out to be problems in practice.

I should mention that jakub seemed to prefer to only check some condition
to decide if the waiting is needed in [1].

If the below 'kick' need to be done during the waiting, it might be
going the hard way instead of going the easy here comparing to recording
the inflight pages IHMO.

1. https://lore.kernel.org/all/20241014171406.43e730c9@kernel.org/

> 
>> More detailed about why we might need to go the hard way of having to record
>> the inflight pages as below.
>>
>>>
>>>>>
>>>>> The page_pool already have a system for waiting for these outstanding /
>>>>> inflight packets to get returned.  As I suggested before, the page_pool
>>>>> should simply take over the responsability (from net_device) to free the
>>>>> struct device (after inflight reach zero), where AFAIK the DMA device is
>>>>> connected via.
>>>>
>>>> It seems you mentioned some similar suggestion in previous version,
>>>> it would be good to show some code about the idea in your mind, I am sure
>>>> that Yonglong Liu Cc'ed will be happy to test it if there some code like
>>>> POC/RFC is provided.
>>>
>>> I believe Jesper is basically referring to Jakub's RFC that you
>>> mentioned below.
>>>
>>>> I should mention that it seems that DMA device is not longer vaild when
>>>> remove() function of the device driver returns, as mentioned in [2], which
>>>> means dma API is not allowed to called after remove() function of the device
>>>> driver returns.
>>>>
>>>> 2. https://www.spinics.net/lists/netdev/msg1030641.html
>>>>
>>>>>
>>>>> The alternative is what Kuba suggested (and proposed an RFC for),  that
>>>>> the net_device teardown waits for the page_pool inflight packets.
>>>>
>>>> As above, the question is how long does the waiting take here?
>>>> Yonglong tested Kuba's RFC, see [3], the waiting took forever due to
>>>> reason as mentioned in commit log:
>>>> "skb_defer_free_flush(): this may cause infinite delay if there is no
>>>> triggering for net_rx_action()."
>>>
>>> Honestly, this just seems like a bug (the "no triggering of
>>> net_rx_action()") that should be root caused and fixed; not a reason
>>> that waiting can't work.
>>
>> I would prefer the waiting too if simple waiting fixed the test cases that
>> Youglong and Haiqing were reporting and I did not look into the rabbit hole
>> of possible caching in networking.
>>
>> As mentioned in commit log and [1]:
>> 1. ipv4 packet defragmentation timeout: this seems to cause delay up to 30
>>    secs, which was reported by Haiqing.
>> 2. skb_defer_free_flush(): this may cause infinite delay if there is no
>>    triggering for net_rx_action(), which was reported by Yonglong.
>>
>> For case 1, is it really ok to stall the driver unbound up to 30 secs for the
>> default setting of defragmentation timeout?
>>
>> For case 2, it is possible to add timeout for those kind of caching like the
>> defragmentation timeout too, but as mentioned in [2], it seems to be a normal
>> thing for this kind of caching in networking:
> 
> Both 1 and 2 seem to be cases where the netdev teardown code can just
> make sure to kick the respective queues and make sure there's nothing
> outstanding (for (1), walk the defrag cache and clear out anything
> related to the netdev going away, for (2) make sure to kick
> net_rx_action() as part of the teardown).

It would be good to be more specific about the 'kick' here, does it mean
taking the lock and doing one of below action for each cache instance?
1. flush all the cache of each cache instance.
2. scan for the page_pool owned page and do the finegrained flushing.

> 
>> "Eric pointed out/predicted there's no guarantee that applications will
>> read / close their sockets so a page pool page may be stuck in a socket
>> (but not leaked) forever."
> 
> As for this one, I would put that in the "well, let's see if this
> becomes a problem in practice" bucket.

As the commit log in [2], it seems it is already happening.

Those cache are mostly per-cpu and per-socket, and there may be hundreds of
CPUs and thousands of sockets in one system, are you really sure we need
to take the lock of each cache instance, which may be thousands of them,
and do the flushing/scaning of memory used in networking, which may be as
large as '24 GiB' mentioned by Jesper?

2. https://lore.kernel.org/all/20231126230740.2148636-13-kuba@kernel.org/

> 
> -Toke
> 
>