Message-ID: <CAHS8izNt8pfBwGnRNWphN4vJJ=1yJX=++-RmGVHrVOvy59=13Q@mail.gmail.com>
Date: Tue, 15 Oct 2024 01:58:52 +0300
From: Mina Almasry <almasrymina@...gle.com>
To: David Wei <dw@...idwei.uk>
Cc: Pavel Begunkov <asml.silence@...il.com>, io-uring@...r.kernel.org, 
	netdev@...r.kernel.org, Jens Axboe <axboe@...nel.dk>, Jakub Kicinski <kuba@...nel.org>, 
	Paolo Abeni <pabeni@...hat.com>, "David S. Miller" <davem@...emloft.net>, 
	Eric Dumazet <edumazet@...gle.com>, Jesper Dangaard Brouer <hawk@...nel.org>, David Ahern <dsahern@...nel.org>
Subject: Re: [PATCH v1 06/15] net: page_pool: add ->scrub mem provider callback

On Sun, Oct 13, 2024 at 8:25 PM David Wei <dw@...idwei.uk> wrote:
>
> On 2024-10-10 10:54, Mina Almasry wrote:
> > On Wed, Oct 9, 2024 at 2:58 PM Pavel Begunkov <asml.silence@...il.com> wrote:
> >>
> >> On 10/9/24 22:00, Mina Almasry wrote:
> >>> On Mon, Oct 7, 2024 at 3:16 PM David Wei <dw@...idwei.uk> wrote:
> >>>>
> >>>> From: Pavel Begunkov <asml.silence@...il.com>
> >>>>
> >>>> page pool is now waiting for all ppiovs to return before destroying
> >>>> itself, and for that to happen the memory provider might need to push
> >>>> some buffers, flush caches and so on.
> >>>>
> >>>> todo: we'll try to get by without it before the final release
> >>>>
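(For anyone skimming the thread: the hook in question is roughly the
following -- this is a sketch from memory, and the exact struct name and
call site in the series may differ.)

struct memory_provider_ops {
	...
	/* Invoked from the page_pool destroy path so the provider can
	 * push out buffers it still caches / drop references it holds,
	 * letting the wait for outstanding ppiovs actually complete. */
	void (*scrub)(struct page_pool *pool);
};
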
> >>>
> >>> Is the intention to drop this todo and stick with this patch, or to
> >>> move ahead with this patch?
> >>
> >> Heh, I overlooked this todo. The plan is actually to leave it
> >> as is; it's by far the simplest way and doesn't really get in
> >> anyone's way since it's a slow path.
> >>
> >>> To be honest, I think I read in a follow-up patch that you want to
> >>> unref all the memory on page_pool_destroy, which is not how the
> >>> page_pool is used today. Today page_pool_destroy does not reclaim
> >>> memory. Changing that may be OK.
> >>
> >> It doesn't because it can't (without breaking anything), which is a
> >> problem, as the page pool might never get destroyed. io_uring
> >> doesn't change that: a buffer can't be reclaimed while anything
> >> in the kernel stack holds it. It's only once it's been given to the
> >> user that we can force it back out of there.
>
> The page pool will definitely be destroyed: the call to
> netdev_rx_queue_restart() with mp_ops/mp_priv set to null, together
> with netdev core, will ensure that.
>
> >>
> >> And it has to happen one way or another; we can't trust the
> >> user to put buffers back. It's just that devmem does that by
> >> temporarily attaching the lifetime of such buffers to a socket.
> >>
> >
> > (noob question) Does io_uring not have a socket equivalent that you
> > can tie the lifetime of the buffers to? I'm thinking there must be
> > one, because in your patches IIRC you have the fill queues and the
> > memory you bind from userspace; there should be something that
> > tells you that userspace has exited/crashed and it's time to
> > destroy the fill queue and unbind the memory, right?
> >
> > I'm thinking you may want to bind the lifetime of the buffers to that,
> > instead of the lifetime of the pool. The pool will not be destroyed
> > until the next driver reset/reconfiguration happens, right? That could
> > be long after userspace has stopped using the memory.
> >
>
> Yes, there are io_uring objects, e.g. the interface queue, that hold
> everything together. IIRC page pool destroy doesn't unref pages; it
> waits for all pages that were handed out to skbs to be returned. So
> for us, something like the below might work:
>
> 1. Call netdev_rx_queue_restart() which allocates a new pp for the rx
>    queue and tries to free the old pp
> 2. At this point we're guaranteed that any packets hitting this rx queue
>    will not go to user pages from our memory provider
> 3. Assume userspace is gone (either crashed or gracefully terminated),
>    and drop the uref on all pages, same as what scrub() is doing today
> 4. Any pages that are still in skb frags will get freed when the sockets
>    etc. are closed
> 5. Rely on the pp delay release to eventually terminate and clean up
>
> Let me know what you think Pavel.

Something roughly along those lines sounds more reasonable to me.
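
To make sure we're talking about the same sequence, this is roughly
what I understand the teardown to look like (pseudocode; apart from
netdev_rx_queue_restart(), the names here are illustrative, not the
actual functions/fields in your series):

/* Hypothetical io_uring-side teardown sketch, error handling omitted. */
static void io_zc_ifq_teardown(struct io_zc_ifq *ifq)
{
	/* 1-2. Detach the memory provider and restart the rx queue;
	 * new packets on this queue can no longer land in user memory
	 * and the old pp is queued for release. */
	ifq->rxq->mp_params.mp_ops = NULL;
	ifq->rxq->mp_params.mp_priv = NULL;
	netdev_rx_queue_restart(ifq->dev, ifq->rxq_idx);

	/* 3. Userspace is gone (crash or clean exit): drop the uref on
	 * every page, as scrub() does today. */
	io_zc_drop_urefs(ifq);

	/* 4-5. Pages still sitting in skb frags are freed as the sockets
	 * are closed; the pp's delayed release then lets the old pool
	 * terminate and clean up on its own. */
}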

The critical point is, as I said above: if you free the memory only
when the pp is destroyed, then the memory lives on from one io_uring
ZC instance to the next. The next instance will see a reduced address
space because the previously destroyed io_uring ZC connection did not
free the memory. You could have users in production opening thousands
of io_uring ZC connections between rxq resets without cleaning up
those connections. In that case I think they'll eventually run out of
memory, as the memory leaks until it's cleaned up by a pp destroy
(driver reset?).


-- 
Thanks,
Mina
