netdev - Re: issue with inflight pages from page

Message-ID: <CANn89iKAVERmJjTyscwjRTjTeWBUgA9COz+8HVH09Q0ehHL9Gw@mail.gmail.com>
Date:   Wed, 19 Apr 2023 14:09:20 +0200
From:   Eric Dumazet <edumazet@...gle.com>
To:     Jesper Dangaard Brouer <jbrouer@...hat.com>
Cc:     Lorenzo Bianconi <lorenzo@...nel.org>,
        Jakub Kicinski <kuba@...nel.org>, brouer@...hat.com,
        netdev@...r.kernel.org, hawk@...nel.org,
        ilias.apalodimas@...aro.org, davem@...emloft.net,
        pabeni@...hat.com, bpf@...r.kernel.org,
        lorenzo.bianconi@...hat.com, nbd@....name
Subject: Re: issue with inflight pages from page_pool

On Wed, Apr 19, 2023 at 1:08 PM Jesper Dangaard Brouer
<jbrouer@...hat.com> wrote:
>
>
> On 18/04/2023 09.36, Lorenzo Bianconi wrote:
> >> On Mon, 17 Apr 2023 23:31:01 +0200 Lorenzo Bianconi wrote:
> >>>> If it's that then I'm with Eric. There are many ways to keep the pages
> >>>> in use, no point working around one of them and not the rest :(
> >>>
> >>> I was not clear here, my fault. What I mean is I can see the returned
> >>> pages counter increasing from time to time, but during most of tests,
> >>> even after 2h the tcp traffic has stopped, page_pool_release_retry()
> >>> still complains not all the pages are returned to the pool and so the
> >>> pool has not been deallocated yet.
> >>> The chunk of code in my first email is just to demonstrate the issue
> >>> and I am completely fine to get a better solution :)
> >>
> >> Your problem is perhaps made worse by threaded NAPI, you have
> >> defer-free skbs sprayed across all cores and no NAPI there to
> >> flush them :(
> >
> > yes, exactly :)
> >
> >>
> >>> I guess we just need a way to free the pool in a reasonable amount
> >>> of time. Agree?
> >>
> >> Whether we need to guarantee the release is the real question.
> >
> > yes, this is the main goal of my email. The defer-free skbs behaviour seems in
> > contrast with the page_pool pending pages monitor mechanism or at least they
> > do not work well together.
> >
> > @Jesper, Ilias: any input on it?
> >
> >> Maybe it's more of a false-positive warning.
> >>
> >> Flushing the defer list is probably fine as a hack, but it's not
> >> a full fix as Eric explained. False positive can still happen.
> >
> > agree, it was just a way to give an idea of the issue, not a proper solution.
> >
> > Regards,
> > Lorenzo
> >
> >>
> >> I'm ambivalent. My only real request would be to make the flushing
> >> a helper in net/core/dev.c rather than open coded in page_pool.c.
>
> I agree. We need a central defer_list flushing helper
>
> It is too easy to say this is a false-positive warning.
> IMHO this exposes an issue with the sd->defer_list system.
>
> Lorenzo's test is adding+removing veth devices, which creates and runs
> NAPI processing on random CPUs.  After veth netdevices (+NAPI) are
> removed, nothing will naturally invoke net_rx_softirq on this CPU.
> Thus, we have SKBs waiting on the CPU's sd->defer_list.  Furthermore, we
> will not create new SKBs with this skb->alloc_cpu to trigger the RX softirq
> IPI call (trigger_rx_softirq), even if this CPU processes and frees SKBs.
>
> I see two solutions:
>
>   (1) When netdevice/NAPI unregister happens call defer_list flushing
> helper.
>
>   (2) Use napi_watchdog to detect if defer_list is (many jiffies) old,
> and then call defer_list flushing helper.
>
>
> >>
> >> Somewhat related - Eric, do we need to handle defer_list in dev_cpu_dead()?
>
> Looks to me like dev_cpu_dead() also needs this flushing helper for
> sd->defer_list, or at least moving the sd->defer_list to an sd that will
> run eventually.
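
The stranded-skb scenario described above can be modeled in a few lines of
userspace C. This is only a toy sketch, not kernel code: the toy_* names, the
napi_active flag and the pages_inflight counter are invented for illustration,
and the real sd->defer_list handling involves locking and IPIs that are
omitted here. It shows why skbs freed on a remote CPU stay parked when nothing
runs RX softirq on the allocating CPU, and how a central flushing helper (in
the spirit of option (1)) would release them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define NR_CPUS 4 /* illustrative CPU count, not taken from the thread */

/* Toy stand-ins for the kernel structures discussed in the thread. */
struct toy_skb {
    int alloc_cpu;           /* CPU that allocated this skb */
    struct toy_skb *next;
};

struct toy_softnet {
    struct toy_skb *defer_list; /* skbs waiting to be freed on this CPU */
    int defer_count;
    int napi_active;            /* hypothetical: does RX softirq still run here? */
};

struct toy_softnet sd[NR_CPUS];
int pages_inflight;             /* what the pool is still waiting for */

struct toy_skb *toy_alloc_skb(int cpu)
{
    struct toy_skb *skb;

    assert(cpu >= 0 && cpu < NR_CPUS);
    skb = malloc(sizeof(*skb));
    if (!skb)
        return NULL;
    skb->alloc_cpu = cpu;
    skb->next = NULL;
    pages_inflight++;
    return skb;
}

/* Freeing on a remote CPU only queues the skb on the allocating CPU's list. */
void toy_free_skb(struct toy_skb *skb, int this_cpu)
{
    struct toy_softnet *owner;

    if (skb->alloc_cpu == this_cpu) {
        free(skb);
        pages_inflight--;
        return;
    }
    owner = &sd[skb->alloc_cpu];
    skb->next = owner->defer_list;
    owner->defer_list = skb;
    owner->defer_count++;
}

/* Drain one CPU's deferred list; returns the number of skbs released. */
int toy_flush_defer_list(int cpu)
{
    struct toy_skb *skb = sd[cpu].defer_list;
    int freed = 0;

    sd[cpu].defer_list = NULL;
    sd[cpu].defer_count = 0;
    while (skb) {
        struct toy_skb *next = skb->next;

        free(skb);
        pages_inflight--;
        freed++;
        skb = next;
    }
    return freed;
}

/* Normal path: a list is drained only where RX softirq still runs. */
void toy_run_softirqs(void)
{
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        if (sd[cpu].napi_active)
            toy_flush_defer_list(cpu);
}

/* Central helper: drain every CPU's list regardless of NAPI state. */
int toy_flush_all_defer_lists(void)
{
    int freed = 0;

    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        freed += toy_flush_defer_list(cpu);
    return freed;
}
```

In this model, with NAPI active only on CPU 1, two skbs allocated on CPU 2 and
freed on CPU 0 stay on CPU 2's list across any number of toy_run_softirqs()
calls; only toy_flush_all_defer_lists() brings pages_inflight back to zero.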

I think I just considered that having a few skbs in a per-cpu list would
not be an issue, especially considering skbs can sit for hours in TCP
receive queues.

Do we expect to have some kind of callback/shrinker to instruct TCP or
pipes to release all pages that prevent a page_pool from being freed?

Here, we are talking of hundreds of thousands of skbs, compared to at
most 32 skbs per cpu.

Perhaps set sysctl_skb_defer_max to zero by default, so that admins can opt in
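
The effect of that limit can be sketched as below. This is a hedged model, not
the kernel implementation: it assumes the limit acts as a simple per-CPU cap
(as "at most 32 skbs per cpu" suggests), and the model_* function names are
hypothetical. Under that assumption, a limit of zero means no skb is ever
deferred, and the worst-case number of parked skbs is the CPU count times the
limit.

```c
#include <stdbool.h>

/*
 * Assumed rule: defer a remote free only while the target CPU's list
 * is below the limit. With defer_max == 0 this is never true.
 */
bool model_should_defer(unsigned int defer_count, unsigned int defer_max)
{
    return defer_count < defer_max;
}

/* Upper bound on skbs parked on defer lists across the whole machine. */
unsigned long model_max_parked(unsigned int nr_cpus, unsigned int defer_max)
{
    return (unsigned long)nr_cpus * defer_max;
}
```

For example, with the limit of 32 mentioned above, an 8-CPU machine could park
at most 8 * 32 = 256 skbs in this model, which is small next to the hundreds of
thousands of skbs that can sit in TCP receive queues.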