Message-ID: <6acf95a6-2ddc-4eee-a6e1-257ac8d41285@gmail.com>
Date: Wed, 16 Oct 2024 18:42:26 +0100
From: Pavel Begunkov <asml.silence@...il.com>
To: Mina Almasry <almasrymina@...gle.com>, David Wei <dw@...idwei.uk>
Cc: io-uring@...r.kernel.org, netdev@...r.kernel.org,
 Jens Axboe <axboe@...nel.dk>, Jakub Kicinski <kuba@...nel.org>,
 Paolo Abeni <pabeni@...hat.com>, "David S. Miller" <davem@...emloft.net>,
 Eric Dumazet <edumazet@...gle.com>, Jesper Dangaard Brouer
 <hawk@...nel.org>, David Ahern <dsahern@...nel.org>
Subject: Re: [PATCH v1 06/15] net: page_pool: add ->scrub mem provider
 callback

On 10/14/24 23:58, Mina Almasry wrote:
> On Sun, Oct 13, 2024 at 8:25 PM David Wei <dw@...idwei.uk> wrote:
>>
>> On 2024-10-10 10:54, Mina Almasry wrote:
>>> On Wed, Oct 9, 2024 at 2:58 PM Pavel Begunkov <asml.silence@...il.com> wrote:
>>>>
>>>> On 10/9/24 22:00, Mina Almasry wrote:
>>>>> On Mon, Oct 7, 2024 at 3:16 PM David Wei <dw@...idwei.uk> wrote:
>>>>>>
>>>>>> From: Pavel Begunkov <asml.silence@...il.com>
>>>>>>
>>>>>> page pool is now waiting for all ppiovs to return before destroying
>>>>>> itself, and for that to happen the memory provider might need to push
>>>>>> some buffers, flush caches and so on.
>>>>>>
>>>>>> todo: we'll try to get by without it before the final release
>>>>>>
>>>>>
>>>>> Is the intention to drop this todo and stick with this patch, or to
>>>>> move ahead with this patch?
>>>>
>>>> Heh, I overlooked this todo. The plan is actually to leave it
>>>> as is; it's by far the simplest way and doesn't really get
>>>> in anyone's way, as it's a slow path.
>>>>
>>>>> To be honest, I think I read in a follow-up patch that you want to
>>>>> unref all the memory on page_pool_destroy, which is not how the
>>>>> page_pool is used today. Today page_pool_destroy does not reclaim
>>>>> memory. Changing that may be OK.
>>>>
>>>> It doesn't because it can't (without breaking anything), which is a
>>>> problem as the page pool might never get destroyed. io_uring
>>>> doesn't change that: a buffer can't be reclaimed while anything
>>>> in the kernel stack holds it. It's only once it's given to the
>>>> user that we can force it back out of there.
>>
>> The page pool will definitely be destroyed; the call to
>> netdev_rx_queue_restart() with mp_ops/mp_priv set to null, together
>> with netdev core, will ensure that.
>>
>>>>
>>>> And it has to happen one way or another; we can't trust the
>>>> user to put buffers back. It's just that devmem does that by
>>>> temporarily attaching the lifetime of such buffers to a socket.
>>>>
>>>
>>> (noob question) does io_uring not have a socket equivalent that you
>>> can tie the lifetime of the buffers to? I'm thinking there must be

You can say it is bound to io_uring / io_uring's object
representing the queue.

>>> one, because in your patches IIRC you have the fill queues and the
>>> memory you bind from userspace; there should be something that
>>> tells you that userspace has exited/crashed and it's time to
>>> destroy the fill queue and unbind the memory, right?
>>>
>>> I'm thinking you may want to bind the lifetime of the buffers to that,
>>> instead of the lifetime of the pool. The pool will not be destroyed
>>> until the next driver reset/reconfiguration happens, right? That could
>>> be long, long after userspace has stopped using the memory.

io_uring will reset the queue if it dies or is asked to release
the queue.
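
As a rough illustration of the point above about buffers being bound to
io_uring's queue object, here is a toy userspace model (invented names,
not the actual zcrx code): every buffer handed out pins the io_uring-side
queue object, so the object can only go away once the last buffer comes
back, whether the release is graceful or forced by the task dying.

#include <stdio.h>
#include <stdlib.h>
#include <stdatomic.h>

struct zcrx_queue {
	atomic_int refs;		/* 1 owner ref + 1 per outstanding buffer */
};

struct zcrx_buf {
	struct zcrx_queue *q;		/* each buffer pins the queue it came from */
};

static void zcrx_queue_put(struct zcrx_queue *q)
{
	if (atomic_fetch_sub(&q->refs, 1) == 1) {
		printf("queue torn down\n");
		free(q);
	}
}

static struct zcrx_buf *zcrx_buf_get(struct zcrx_queue *q)
{
	struct zcrx_buf *b = malloc(sizeof(*b));

	b->q = q;
	atomic_fetch_add(&q->refs, 1);
	return b;
}

static void zcrx_buf_put(struct zcrx_buf *b)
{
	struct zcrx_queue *q = b->q;

	free(b);
	zcrx_queue_put(q);		/* returning the buffer may drop the last ref */
}

int main(void)
{
	struct zcrx_queue *q = malloc(sizeof(*q));
	struct zcrx_buf *b;

	atomic_init(&q->refs, 1);	/* owner (io_uring instance) reference */
	b = zcrx_buf_get(q);
	zcrx_queue_put(q);		/* io_uring instance goes away...             */
	zcrx_buf_put(b);		/* ...queue survives until the buffer returns */
	return 0;
}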

>> Yes, there are io_uring objects, e.g. the interface queue, that hold
>> everything together. IIRC page pool destroy doesn't unref, but waits
>> for all pages that are handed out to skbs to be returned. So for us,
>> the below might work:
>>
>> 1. Call netdev_rx_queue_restart() which allocates a new pp for the rx
>>     queue and tries to free the old pp
>> 2. At this point we're guaranteed that any packets hitting this rx queue
>>     will not go to user pages from our memory provider

It's reasonable to assume that the driver will start destroying
the page pool, but I wouldn't rely on it when it comes to kernel
state correctness, i.e. not crashing the kernel. It's a bit
fragile; drivers always tend to do all kinds of interesting stuff,
and I'd rather deal with a loud io_uring / page pool leak in case
of some weirdness. That means we can't really guarantee the above
and need to take care not to race with allocations.
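
To make that race concern concrete, a minimal sketch (plain pthread
userspace code with invented names; the real page pool has its own
locking rules): teardown flips a "closing" flag under the same lock the
allocation path takes, so once the forced unref starts, no new buffers
can be handed out.

#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct mp_state {
	pthread_mutex_t lock;
	bool closing;			/* set once teardown has started */
};

/* Allocation path: refuse to hand out buffers once teardown began. */
static void *mp_alloc(struct mp_state *mp)
{
	void *buf = NULL;

	pthread_mutex_lock(&mp->lock);
	if (!mp->closing) {
		/* normal allocation would happen here */
	}
	pthread_mutex_unlock(&mp->lock);
	return buf;
}

/* Teardown path: block further allocations, then force-reclaim. */
static void mp_scrub(struct mp_state *mp)
{
	pthread_mutex_lock(&mp->lock);
	mp->closing = true;
	/* walk user-held buffers and drop their references here */
	pthread_mutex_unlock(&mp->lock);
}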

>> 3. Assume userspace is gone (either crashed or terminated gracefully),
>>     drop the uref for all pages, same as what scrub() is doing today
>> 4. Any pages that are still in skb frags will get freed when the sockets
>>     etc. are closed

And we need to prevent requests from receiving netmem that has
already been pushed to sockets.
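
Roughly, that check amounts to per-buffer ownership state, something
like this toy enum (entirely made up, only to illustrate the condition):

#include <stdbool.h>

/* Toy ownership states for a net_iov-like buffer; names are invented. */
enum niov_owner {
	NIOV_FREE,		/* available to the provider */
	NIOV_SOCKET,		/* sitting in an skb / pushed to a socket */
	NIOV_USER,		/* handed to userspace via the refill ring */
};

struct niov_toy {
	enum niov_owner owner;
};

/* Only buffers the provider still owns may be given to a new request. */
static bool niov_can_serve_request(const struct niov_toy *n)
{
	return n->owner == NIOV_FREE;
}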

>> 5. Rely on the pp delayed release to eventually terminate and clean up
>>
>> Let me know what you think, Pavel.

I think it's reasonable to leave it as is for now; I don't believe
anyone cares much about a simple slow-path, memory-provider-only
callback. And we can always kill it later on if we find a good way
to synchronise the pieces, which will become more apparent when we
add some more registration dynamism on top, when/if this patchset
is merged.

In short, let's resend the series with the callback, see if
maintainers have a strong opinion, and otherwise I'd say it
should be fine as is.
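
For readers following the series, the hook under discussion has roughly
this shape (approximate; the names and signature are illustrative, not
copied from the patch): an optional provider callback that the page pool
invokes on destroy, before it waits for outstanding buffers.

/* Illustrative only: an optional per-provider hook called at pool
 * destroy time, so the provider can push out / unref what it still
 * tracks before the pool waits for outstanding buffers.
 */
struct mp_ops_toy {
	/* alloc/free hooks elided */
	void (*scrub)(void *mp_priv);
};

static void pool_destroy_notify(const struct mp_ops_toy *ops, void *mp_priv)
{
	if (ops && ops->scrub)
		ops->scrub(mp_priv);
	/* ...then wait for all outstanding buffers to return... */
}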

> Something roughly along those lines sounds more reasonable to me.
> 
> The critical point is, as I said above, that if you free the memory only
> when the pp is destroyed, then the memory lives on from one io_uring ZC
> instance to the next. The next instance will see a reduced address
> space because the previously destroyed io_uring ZC connection did not
> free the memory. You could have users in production opening thousands
> of io_uring ZC connections between rxq resets and not cleaning up
> those connections. In that case I think they'll eventually run out of
> memory, as the memory leaks until it's cleaned up with a pp destroy
> (driver reset?).

Not sure what giving memory from one io_uring zc instance to
another means. And it's perfectly valid to receive a buffer, close
the socket and only use the data afterwards; it logically belongs
to the user, not the socket. It's only bound to the io_uring
zcrx/queue object for cleanup purposes in case io_uring goes down,
which is different from devmem TCP.
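
A toy userspace model of that lifetime rule (invented types; nothing
here is the kernel API): closing the socket the data arrived on does not
touch the buffer, only the user giving it back (or the whole zcrx
instance going down) ends its life.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct user_buf {
	char *data;
	size_t len;
};

struct toy_sock {
	int open;
};

static void toy_sock_close(struct toy_sock *s)
{
	s->open = 0;		/* note: does not touch any user_buf */
}

int main(void)
{
	struct toy_sock s = { .open = 1 };
	struct user_buf b = { .data = strdup("payload"), .len = 7 };

	toy_sock_close(&s);
	/* still valid: the data belongs to the user, not the socket */
	printf("%.*s\n", (int)b.len, b.data);
	free(b.data);		/* "returning" the buffer ends its life */
	return 0;
}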

-- 
Pavel Begunkov
