Date:   Wed, 19 Apr 2023 16:02:41 +0200
From:   Jesper Dangaard Brouer <jbrouer@...hat.com>
To:     Eric Dumazet <edumazet@...gle.com>,
        Jesper Dangaard Brouer <jbrouer@...hat.com>
Cc:     brouer@...hat.com, Lorenzo Bianconi <lorenzo@...nel.org>,
        Jakub Kicinski <kuba@...nel.org>, netdev@...r.kernel.org,
        hawk@...nel.org, ilias.apalodimas@...aro.org, davem@...emloft.net,
        pabeni@...hat.com, bpf@...r.kernel.org,
        lorenzo.bianconi@...hat.com, nbd@....name,
        Toke Hoiland Jorgensen <toke@...hat.com>
Subject: Re: issue with inflight pages from page_pool


On 19/04/2023 14.09, Eric Dumazet wrote:
> On Wed, Apr 19, 2023 at 1:08 PM Jesper Dangaard Brouer
>>
>>
>> On 18/04/2023 09.36, Lorenzo Bianconi wrote:
>>>> On Mon, 17 Apr 2023 23:31:01 +0200 Lorenzo Bianconi wrote:
>>>>>> If it's that then I'm with Eric. There are many ways to keep the pages
>>>>>> in use, no point working around one of them and not the rest :(
>>>>>
>>>>> I was not clear here, my fault. What I mean is that I can see the
>>>>> returned-pages counter increasing from time to time, but in most of
>>>>> the tests, even 2h after the TCP traffic has stopped,
>>>>> page_pool_release_retry() still complains that not all the pages have
>>>>> been returned to the pool, and so the pool has not been deallocated
>>>>> yet.  The chunk of code in my first email is just to demonstrate the
>>>>> issue and I am completely fine with a better solution :)
>>>>
>>>> Your problem is perhaps made worse by threaded NAPI: you have
>>>> defer-free skbs sprayed across all cores and no NAPI there to
>>>> flush them :(
>>>
>>> yes, exactly :)
>>>
>>>>
>>>>> I guess we just need a way to free the pool in a reasonable amount
>>>>> of time. Agree?
>>>>
>>>> Whether we need to guarantee the release is the real question.
>>>
>>> yes, this is the main goal of my email. The defer-free skb behaviour
>>> seems to conflict with the page_pool pending-pages monitoring
>>> mechanism, or at least they do not work well together.
>>>
>>> @Jesper, Ilias: any input on it?
>>>
>>>> Maybe it's more of a false-positive warning.
>>>>
>>>> Flushing the defer list is probably fine as a hack, but it's not
>>>> a full fix, as Eric explained. False positives can still happen.
>>>
>>> agree, it was just a way to give an idea of the issue, not a proper solution.
>>>
>>> Regards,
>>> Lorenzo
>>>
>>>>
>>>> I'm ambivalent. My only real request would be to make the flushing
>>>> a helper in net/core/dev.c rather than open-coded in page_pool.c.
>>
>> I agree. We need a central defer_list flushing helper.
>>
>> It is too easy to say this is a false-positive warning.
>> IMHO this exposes an issue with the sd->defer_list system.
>>
>> Lorenzo's test is adding+removing veth devices, which creates and runs
>> NAPI processing on random CPUs.  After the veth netdevices (+NAPI) are
>> removed, nothing will naturally invoke net_rx_softirq on those CPUs.
>> Thus, we have SKBs waiting on those CPUs' sd->defer_list.  Furthermore,
>> we will not create new SKBs with this skb->alloc_cpu to trigger the RX
>> softirq IPI call (trigger_rx_softirq), even if this CPU processes and
>> frees SKBs.
>>
>> I see two solutions:
>>
>>    (1) When netdevice/NAPI unregister happens, call the defer_list
>> flushing helper.
>>
>>    (2) Use napi_watchdog to detect if the defer_list is (many jiffies)
>> old, and then call the defer_list flushing helper.
>>
>>
>>>>
>>>> Somewhat related - Eric, do we need to handle defer_list in dev_cpu_dead()?
>>
>> Looks to me like dev_cpu_dead() also needs this flushing helper for
>> sd->defer_list, or at least to move the sd->defer_list to an sd that
>> will run eventually.
> 
> I think I just considered that having a few skbs in a per-cpu list
> would not be an issue,
> especially considering skbs can sit for hours in tcp receive queues.
>

It was the first thing I said to Lorenzo when he first reported the
problem to me (over chat): it is likely packets sitting in a TCP queue.
I then instructed him to look at the netstat output to check the queues
and look for TIME-WAIT, FIN-WAIT etc.


> Do we expect to have some kind of callback/shrinker to instruct TCP or
> pipes to release all the pages that prevent
> a page_pool from being freed?
> 

This is *not* what I'm asking for.

With TCP sockets (pipes etc) we can take care of closing the sockets
(and programs etc) to free up the SKBs (and perhaps wait for timeouts)
to make sure the page_pool shutdown doesn't hang.

The problem arises for all the selftests that use veth and bpf_test_run
(using bpf_test_run_xdp_live / xdp_test_run_setup).  For the selftests
we obviously take care of closing sockets and removing veth interfaces
again.  Problem: the defer_list corner case isn't under our control.
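
To spell out why: the defer-free path parks the skb on the *allocating*
CPU's softnet_data, and that CPU only gets poked occasionally.  Roughly
(a simplified sketch from memory of the skb_attempt_defer_free() logic,
not the exact upstream code):

/* Simplified sketch -- not the exact upstream skb_attempt_defer_free(). */
void skb_attempt_defer_free_sketch(struct sk_buff *skb)
{
	int cpu = skb->alloc_cpu;
	struct softnet_data *sd;

	/* Free locally if the allocating CPU is gone or is ourselves. */
	if (!cpu_online(cpu) || cpu == raw_smp_processor_id()) {
		__kfree_skb(skb);
		return;
	}

	sd = &per_cpu(softnet_data, cpu);

	spin_lock_bh(&sd->defer_lock);
	skb->next = sd->defer_list;	/* park on the remote CPU's list */
	sd->defer_list = skb;
	sd->defer_count++;
	spin_unlock_bh(&sd->defer_lock);

	/* Only when defer_count crosses a threshold does the real code send
	 * an IPI (smp_call_function_single_async -> trigger_rx_softirq) so
	 * that the remote CPU raises NET_RX_SOFTIRQ and flushes its
	 * defer_list from net_rx_action().  With the veth NAPI gone and no
	 * new traffic steered to that CPU, a short defer_list can sit there
	 * "forever", keeping page_pool pages inflight.
	 */
}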


> Here, we are talking of hundreds of thousands of skbs, compared to at
> most 32 skbs per cpu.
> 

It is not a memory usage concern.

> Perhaps set sysctl_skb_defer_max to zero by default, so that admins
> can opt in
> 

I really like the sd->defer_list system and I think it should be enabled
by default.  Even if it were disabled by default, we would still need to
handle these corner cases, as the selftests shouldn't start to cause
issues when it gets enabled.

The simple solution is: (1) when netdevice/NAPI unregister happens, call
a defer_list flushing helper.  And perhaps we also need to call it in
xdp_test_run_teardown().  How do you feel about that?
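
To make that concrete, here is a rough, untested sketch of such a helper
for net/core/dev.c (the names are mine; the walk mirrors what
net_rx_action() already does for the local CPU's defer_list):

static void skb_defer_list_flush(struct softnet_data *sd)
{
	struct sk_buff *skb, *next;

	if (!READ_ONCE(sd->defer_list))
		return;

	/* Detach the whole list under the lock, then free outside it. */
	spin_lock_bh(&sd->defer_lock);
	skb = sd->defer_list;
	sd->defer_list = NULL;
	sd->defer_count = 0;
	spin_unlock_bh(&sd->defer_lock);

	while (skb) {
		next = skb->next;
		consume_skb(skb);	/* may run outside softirq context */
		skb = next;
	}
}

/* Hypothetical call-sites: netdevice/NAPI unregister and possibly
 * xdp_test_run_teardown(), so no CPU keeps page_pool pages pinned on
 * its defer_list after the driver is gone.
 */
void skb_defer_list_flush_all(void)
{
	int cpu;

	for_each_possible_cpu(cpu)
		skb_defer_list_flush(&per_cpu(softnet_data, cpu));
}

An alternative that avoids freeing skbs cross-CPU would be to only kick
the remote CPU's NET_RX_SOFTIRQ (like trigger_rx_softirq() does) and let
each CPU flush its own defer_list; either way the helper belongs in
net/core/dev.c, as Jakub requested.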

--Jesper
