netdev - Re: [PATCH v1 06/15] net: page_pool: add ->scrub mem provider callback

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <4a33bf2c-49ef-4cfd-97af-8341cdb977d3@gmail.com>
Date: Mon, 14 Oct 2024 14:37:26 +0100
From: Pavel Begunkov <asml.silence@...il.com>
To: David Wei <dw@...idwei.uk>, Mina Almasry <almasrymina@...gle.com>
Cc: io-uring@...r.kernel.org, netdev@...r.kernel.org,
 Jens Axboe <axboe@...nel.dk>, Jakub Kicinski <kuba@...nel.org>,
 Paolo Abeni <pabeni@...hat.com>, "David S. Miller" <davem@...emloft.net>,
 Eric Dumazet <edumazet@...gle.com>, Jesper Dangaard Brouer
 <hawk@...nel.org>, David Ahern <dsahern@...nel.org>
Subject: Re: [PATCH v1 06/15] net: page_pool: add ->scrub mem provider
 callback

On 10/13/24 18:25, David Wei wrote:
> On 2024-10-10 10:54, Mina Almasry wrote:
>> On Wed, Oct 9, 2024 at 2:58 PM Pavel Begunkov <asml.silence@...il.com> wrote:
>>>
>>> On 10/9/24 22:00, Mina Almasry wrote:
>>>> On Mon, Oct 7, 2024 at 3:16 PM David Wei <dw@...idwei.uk> wrote:
>>>>>
>>>>> From: Pavel Begunkov <asml.silence@...il.com>
>>>>>
>>>>> page pool is now waiting for all ppiovs to return before destroying
>>>>> itself, and for that to happen the memory provider might need to push
>>>>> some buffers, flush caches and so on.
>>>>>
>>>>> todo: we'll try to get by without it before the final release
>>>>>
>>>>
>>>> Is the intention to drop this todo and stick with this patch, or to
>>>> move ahead with this patch?
>>>
>>> Heh, I overlooked this todo. The plan is to actually leave it
>>> as is, it's by far the simplest way and doesn't really gets
>>> into anyone's way as it's a slow path.
>>>
>>>> To be honest, I think I read in a follow up patch that you want to
>>>> unref all the memory on page_pool_destory, which is not how the
>>>> page_pool is used today. Tdoay page_pool_destroy does not reclaim
>>>> memory. Changing that may be OK.
>>>
>>> It doesn't because it can't (not breaking anything), which is a
>>> problem as the page pool might never get destroyed. io_uring
>>> doesn't change that, a buffer can't be reclaimed while anything
>>> in the kernel stack holds it. It's only when it's given to the
>>> user we can force it back out of there.
> 
> The page pool will definitely be destroyed, the call to
> netdev_rx_queue_restart() with mp_ops/mp_priv set to null and netdev
> core will ensure that.
> 
>>>
>>> And it has to happen one way or another, we can't trust the
>>> user to put buffers back, it's just devmem does that by temporarily
>>> attaching the lifetime of such buffers to a socket.
>>>
>>
>> (noob question) does io_uring not have a socket equivalent that you
>> can tie the lifetime of the buffers to? I'm thinking there must be
>> one, because in your patches IIRC you have the fill queues and the
>> memory you bind from the userspace, there should be something that
>> tells you that the userspace has exited/crashed and it's time to now
>> destroy the fill queue and unbind the memory, right?
>>
>> I'm thinking you may want to bind the lifetime of the buffers to that,
>> instead of the lifetime of the pool. The pool will not be destroyed
>> until the next driver/reset reconfiguration happens, right? That could
>> be long long after the userspace has stopped using the memory.
>>
> 
> Yes, there are io_uring objects e.g. interface queue that hold
> everything together. IIRC page pool destroy doesn't unref but it waits
> for all pages that are handed out to skbs to be returned. So for us,
> below might work:
> 
> 1. Call netdev_rx_queue_restart() which allocates a new pp for the rx
>     queue and tries to free the old pp
> 2. At this point we're guaranteed that any packets hitting this rx queue
>     will not go to user pages from our memory provider
> 3. Assume userspace is gone (either crash or gracefully terminating),
>     unref the uref for all pages, same as what scrub() is doing today
> 4. Any pages that are still in skb frags will get freed when the sockets
>     etc are closed
> 5. Rely on the pp delay release to eventually terminate and clean up
> 
> Let me know what you think Pavel.

I'll get to this comment a bit later when I get some time to
remember what races we have to deal with without the callback.

-- 
Pavel Begunkov