netdev - Re: issue with inflight pages from page

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <e8df2654-6a5b-3c92-489d-2fe5e444135f@redhat.com>
Date:   Wed, 19 Apr 2023 17:36:20 +0200
From:   Jesper Dangaard Brouer <jbrouer@...hat.com>
To:     Lorenzo Bianconi <lorenzo@...nel.org>,
        Jesper Dangaard Brouer <jbrouer@...hat.com>
Cc:     brouer@...hat.com, Eric Dumazet <edumazet@...gle.com>,
        Jakub Kicinski <kuba@...nel.org>, netdev@...r.kernel.org,
        hawk@...nel.org, ilias.apalodimas@...aro.org, davem@...emloft.net,
        pabeni@...hat.com, bpf@...r.kernel.org,
        lorenzo.bianconi@...hat.com, nbd@....name,
        Toke Hoiland Jorgensen <toke@...hat.com>
Subject: Re: issue with inflight pages from page_pool



On 19/04/2023 16.21, Lorenzo Bianconi wrote:
>>
>> On 19/04/2023 14.09, Eric Dumazet wrote:
>>> On Wed, Apr 19, 2023 at 1:08 PM Jesper Dangaard Brouer
>>>>
>>>>
>>>> On 18/04/2023 09.36, Lorenzo Bianconi wrote:
>>>>>> On Mon, 17 Apr 2023 23:31:01 +0200 Lorenzo Bianconi wrote:
>>>>>>>> If it's that then I'm with Eric. There are many ways to keep the pages
>>>>>>>> in use, no point working around one of them and not the rest :(
>>>>>>>
>>>>>>> I was not clear here, my fault. What I mean is I can see the returned
>>>>>>> pages counter increasing from time to time, but during most of tests,
>>>>>>> even after 2h the tcp traffic has stopped, page_pool_release_retry()
>>>>>>> still complains not all the pages are returned to the pool and so the
>>>>>>> pool has not been deallocated yet.
>>>>>>> The chunk of code in my first email is just to demonstrate the issue
>>>>>>> and I am completely fine to get a better solution :)
>>>>>>
>>>>>> Your problem is perhaps made worse by threaded NAPI, you have
>>>>>> defer-free skbs sprayed across all cores and no NAPI there to
>>>>>> flush them :(
>>>>>
>>>>> yes, exactly :)
>>>>>
>>>>>>
>>>>>>> I guess we just need a way to free the pool in a reasonable amount
>>>>>>> of time. Agree?
>>>>>>
>>>>>> Whether we need to guarantee the release is the real question.
>>>>>
>>>>> yes, this is the main goal of my email. The defer-free skbs behaviour seems in
>>>>> contrast with the page_pool pending pages monitor mechanism or at least they
>>>>> do not work well together.
>>>>>
>>>>> @Jesper, Ilias: any input on it?
>>>>>
>>>>>> Maybe it's more of a false-positive warning.
>>>>>>
>>>>>> Flushing the defer list is probably fine as a hack, but it's not
>>>>>> a full fix as Eric explained. False positive can still happen.
>>>>>
>>>>> agree, it was just a way to give an idea of the issue, not a proper solution.
>>>>>
>>>>> Regards,
>>>>> Lorenzo
>>>>>
>>>>>>
>>>>>> I'm ambivalent. My only real request wold be to make the flushing
>>>>>> a helper in net/core/dev.c rather than open coded in page_pool.c.
>>>>
>>>> I agree. We need a central defer_list flushing helper
>>>>
>>>> It is too easy to say this is a false-positive warning.
>>>> IHMO this expose an issue with the sd->defer_list system.
>>>>
>>>> Lorenzo's test is adding+removing veth devices, which creates and runs
>>>> NAPI processing on random CPUs.  After veth netdevices (+NAPI) are
>>>> removed, nothing will naturally invoking net_rx_softirq on this CPU.
>>>> Thus, we have SKBs waiting on CPUs sd->defer_list.  Further more we will
>>>> not create new SKB with this skb->alloc_cpu, to trigger RX softirq IPI
>>>> call (trigger_rx_softirq), even if this CPU process and frees SKBs.
>>>>
>>>> I see two solutions:
>>>>
>>>>     (1) When netdevice/NAPI unregister happens call defer_list flushing
>>>> helper.
>>>>
>>>>     (2) Use napi_watchdog to detect if defer_list is (many jiffies) old,
>>>> and then call defer_list flushing helper.
>>>>
>>>>
>>>>>>
>>>>>> Somewhat related - Eric, do we need to handle defer_list in dev_cpu_dead()?
>>>>
>>>> Looks to me like dev_cpu_dead() also need this flushing helper for
>>>> sd->defer_list, or at least moving the sd->defer_list to an sd that will
>>>> run eventually.
>>>
>>> I think I just considered having a few skbs in per-cpu list would not
>>> be an issue,
>>> especially considering skbs can sit hours in tcp receive queues.
>>>
>>
>> It was the first thing I said to Lorenzo when he first reported the
>> problem to me (over chat): It is likely packets sitting in a TCP queue.
>> Then I instructed him to look at output from netstat to see queues and
>> look for TIME-WAIT, FIN-WAIT etc.
>>
>>
>>> Do we expect hacing some kind of callback/shrinker to instruct TCP or
>>> pipes to release all pages that prevent
>>> a page_pool to be freed ?
>>>
>>
>> This is *not* what I'm asking for.
>>
>> With TCP sockets (pipes etc) we can take care of closing the sockets
>> (and programs etc) to free up the SKBs (and perhaps wait for timeouts)
>> to make sure the page_pool shutdown doesn't hang.
>>
>> The problem arise for all the selftests that uses veth and bpf_test_run
>> (using bpf_test_run_xdp_live / xdp_test_run_setup).  For the selftests
>> we obviously take care of closing sockets and removing veth interfaces
>> again.  Problem: The defer_list corner-case isn't under our control.
>>
>>
>>> Here, we are talking of hundreds of thousands of skbs, compared to at
>>> most 32 skbs per cpu.
>>>
>>
>> It is not a memory usage concern.
>>
>>> Perhaps sets sysctl_skb_defer_max to zero by default, so that admins
>>> can opt-in
>>>
>>
>> I really like the sd->defer_list system and I think is should be enabled
>> by default.  Even if disabled by default, we still need to handle these
>> corner cases, as the selftests shouldn't start to cause-issues when this
>> gets enabled.
>>
>> The simple solution is: (1) When netdevice/NAPI unregister happens call
>> defer_list flushing helper.  And perhaps we also need to call it in
>> xdp_test_run_teardown().  How do you feel about that?
>>
>> --Jesper
>>
> 
> Today I was discussing with Toke about this issue, and we were wondering,
> if we just consider the page_pool use-case, what about moving the real pool
> destroying steps when we return a page to the pool in page_pool_put_full_page()
> if the pool has marked to be destroyed and there are no inflight pages instead
> of assuming we have all the pages in the pool when we run page_pool_destroy()?

It sounds like you want to add a runtime check to the fast-path to
handle these corner cases?

For performance reason we should not call page_pool_inflight() check in 
fast-path, please!

Details: You hopefully mean running/calling page_pool_release(pool) and 
not page_pool_destroy().

I'm not totally against the idea, as long as someone is willing to do
extensive benchmarking that it doesn't affect fast-path performance.
Given we already read pool->p.flags in fast-path, it might be possible
to hide the extra branch (in the CPU pipeline).


> Maybe this means just get rid of the warn in page_pool_release_retry() :)
> 

Sure, we can remove the print statement, but it feels like closing our
eyes and ignoring the problem.  We can remove the print statement, and
still debug the problem, as I have added tracepoints (to debug this).
But users will not report these issue early... on the other hand most of
these reports will likely be false-positives.

This reminds me that Jakub's recent defer patches returning pages
'directly' to the page_pool alloc-cache, will actually result in this
kind of bug.  This is because page_pool_destroy() assumes that pages
cannot be returned to alloc-cache, as driver will have "disconnected" RX
side.  We need to address this bug separately.  Lorenzo you didn't
happen to use a kernel with Jakub's patches included, do you?

--Jesper