Message-ID: <af9a249a-1577-40fd-b1ba-be3737e86b18@gmail.com>
Date: Fri, 1 Nov 2024 21:09:14 +0000
From: Pavel Begunkov <asml.silence@...il.com>
To: Mina Almasry <almasrymina@...gle.com>, David Wei <dw@...idwei.uk>
Cc: io-uring@...r.kernel.org, netdev@...r.kernel.org,
Jens Axboe <axboe@...nel.dk>, Jakub Kicinski <kuba@...nel.org>,
Paolo Abeni <pabeni@...hat.com>, "David S. Miller" <davem@...emloft.net>,
Eric Dumazet <edumazet@...gle.com>, Jesper Dangaard Brouer
<hawk@...nel.org>, David Ahern <dsahern@...nel.org>,
Stanislav Fomichev <stfomichev@...il.com>, Joe Damato <jdamato@...tly.com>,
Pedro Tammela <pctammela@...atatu.com>
Subject: Re: [PATCH v7 11/15] io_uring/zcrx: implement zerocopy receive pp
memory provider
On 11/1/24 20:06, Mina Almasry wrote:
...
>> +__maybe_unused
>> +static const struct memory_provider_ops io_uring_pp_zc_ops;
>> +
>> +static inline struct io_zcrx_area *io_zcrx_iov_to_area(const struct net_iov *niov)
>> +{
>> + struct net_iov_area *owner = net_iov_owner(niov);
>> +
>> + return container_of(owner, struct io_zcrx_area, nia);
>> +}
>> +
>
> We discussed this before I disappeared on vacation but I'm not too
> convinced to be honest, sorry.
>
> It's invalid to call io_zcrx_iov_to_area on a devmem niov and vice
> versa, right? So current and future code has to be very careful to
Yes
> call the right helpers on the right niovs.
>
> At the very least there needs to be a comment above all these
> container_of helpers:
>
> /* caller must have verified that this niov is devmem/io_zcrx */.
>
> However I feel like even a comment is extremely error prone. These
> container_of's are inside of the call stack of some helpers. I would
> say we need a check. If we're concerned about performance, the check
> can be behind DEBUG_NET_WARN_ON(), although even that is a bit iffy,
> but could be fine. Doing this without a check seems too risky to me.
No, it doesn't need a check, nor does it need a comment. The very
essence of virtual function tables is that they're coupled
together with the objects for which those functions make sense and
are called only for those objects. The only way to get here with
invalid net_iovs is to take one page pool and feed it with
net_iovs from another page pool, which wouldn't be sane in
the first place.
That would be the equivalent of:
struct file *f1 = ...;
struct file *f2 = ...;
f1->f_op->read(f2, ...);
It may look strange in C, but it's the same as adding a comment
to a virtual function saying that it should only be called for
objects of that class:
struct A {
        virtual void foo() = 0;
};

struct B : public A {
        void foo() override {
                // we should only be called with objects of type
                // struct B (or anything inheriting it), check that
                if (!reinterpret_cast<struct B*>(this))
                        throw;
                ...
        }
};
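
In kernel C the same coupling is usually expressed with container_of()
on the object that carries the ops pointer; a rough sketch with
made-up names, not code from the series:

#include <linux/container_of.h>

struct base_ops;

struct base {
        const struct base_ops *ops;
};

struct base_ops {
        void (*release)(struct base *b);
};

struct derived {
        struct base base;       /* always embedded, so the downcast is well defined */
        int extra;
};

static void derived_release(struct base *b)
{
        /*
         * ->ops only ever points at derived_ops for objects embedded in
         * struct derived, so by construction this callback cannot see
         * anything else; no runtime check is needed.
         */
        struct derived *d = container_of(b, struct derived, base);

        d->extra = 0;
}

static const struct base_ops derived_ops = {
        .release = derived_release,
};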
>> static int io_allocate_rbuf_ring(struct io_zcrx_ifq *ifq,
>> struct io_uring_zcrx_ifq_reg *reg)
>> {
>> @@ -99,6 +114,9 @@ static int io_zcrx_create_area(struct io_ring_ctx *ctx,
>> goto err;
>>
>> for (i = 0; i < nr_pages; i++) {
>> + struct net_iov *niov = &area->nia.niovs[i];
>> +
>> + niov->owner = &area->nia;
>> area->freelist[i] = i;
>> }
>>
>> @@ -230,3 +248,200 @@ void io_shutdown_zcrx_ifqs(struct io_ring_ctx *ctx)
>> {
>> lockdep_assert_held(&ctx->uring_lock);
>> }
>> +
>> +static bool io_zcrx_niov_put(struct net_iov *niov, int nr)
>> +{
>> + return atomic_long_sub_and_test(nr, &niov->pp_ref_count);
>> +}
>> +
>> +static bool io_zcrx_put_niov_uref(struct net_iov *niov)
>> +{
>> + if (atomic_long_read(&niov->pp_ref_count) < IO_ZC_RX_UREF)
>> + return false;
>> +
>> + return io_zcrx_niov_put(niov, IO_ZC_RX_UREF);
>> +}
>> +
>
> Sorry, I have to push back a bit against this. The refcounting of
> netmem is already complicated. the paged netmem has 2 refcounts and
> care needs to be taken when acquiring and dropping refcounts. net_iov
> inherited the pp_ref_count but not the paged refcount, and again need
> some special handling. skb_frag_unref takes very special care checking
Which is why it's using net_iovs.
> pp->recycle, is_pp_netmem, and others to figure out the correct
pp->recycle has nothing to do with this series. We don't do anything
special with it, and if it's broken it's broken even
for non-provider buffers.
> refcount to put based on the type of the netmem and skb flag.
Just the same as with the ->[un]readable flag, which is not
functionally needed, and if it gets screwed up many things can
go very wrong.
> This code ignores all these generic code
> skb_frag_unref/napi_pp_put_page/etc paths and uses raw access to
I don't see the point; they are not used because they're not
needed. Instead of checking whether a buffer came from a page pool
and whether it's a net_iov or not, on the path where io_uring returns
it we already know a priori that it's a net_iov from a specific page
pool and from the current provider.
The same goes for the optimisations provided by those helpers: they are
useful when you're transferring buffers from one context to
another, e.g. task receive path -> napi / page_pool. In this
case they're already fetched in the right context without any
need to additionally jump through hoops. If anything, it'd
be odd to jump out of a window to climb a rope on the
other side of the building when you could've just walked 5
meters to the other room.
> niv->pp_ref_count. If this is merged as-is, for posterity any changes
OK, let's add a helper then.
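Something along these lines should do; the name and placement below are
illustrative only, it just wraps the raw pp_ref_count access the patch
currently open-codes in io_zcrx_niov_put():

#include <linux/atomic.h>
#include <net/netmem.h>

/* hypothetical helper, mirroring io_zcrx_niov_put() from the patch */
static inline bool net_iov_pp_ref_sub_and_test(struct net_iov *niov, long nr)
{
        return atomic_long_sub_and_test(nr, &niov->pp_ref_count);
}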
> in netmem refcounting need to also account for this use case opting
> out of these generic code paths that handle all other skb reffing
> including devmem.
>
> Additionally since you're opting out of the generic unreffing paths
> you're also (as mentioned before) bypassing the pp recycling. AFAICT
> that may be hurting your performance. IIUC you refill
> PP_ALLOC_CACHE_REFILL (64) entries everytime _alloc_netmems is
> entered, and you don't recycle netmem any other way, so your slow path
> is entered 1/64 of the page_pool_alloc calls? That should be much
One virtual call per 64 buffers gives enough amortisation. The
cache size can be extended if necessary.
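To illustrate the amortisation (purely a sketch, the helper names are
made up): one indirect call into the provider refills up to
PP_ALLOC_CACHE_REFILL (64) entries of pp->alloc.cache, so the callback
cost is paid once per batch rather than per buffer:

#include <net/page_pool/types.h>

/* stand-in for pulling a net_iov off the provider's freelist */
static netmem_ref example_get_free_netmem(struct page_pool *pp);

static netmem_ref example_zc_alloc_netmems(struct page_pool *pp, gfp_t gfp)
{
        /* refill the fast cache in one go */
        while (pp->alloc.count < PP_ALLOC_CACHE_REFILL) {
                netmem_ref netmem = example_get_free_netmem(pp);

                if (!netmem)
                        break;
                pp->alloc.cache[pp->alloc.count++] = netmem;
        }
        if (!pp->alloc.count)
                return 0;
        /* hand one buffer back, the rest stay cached */
        return pp->alloc.cache[--pp->alloc.count];
}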
> worse than what the normal pp recycling does, which returns all freed
> netmem into its alloc.cache or the ptr_ring and hits *_alloc_netmems
You return it from a syscall (a special sockopt), and I'm pretty sure
the overhead of just that syscall without any handling would be more
expensive than one virtual function call. Then you need to hit the
fast cache, and that's not unconditional: it has to be lucky enough
that napi is not running or scheduled, and even then it has to
be very careful to avoid races. That's the best case for <64 entries
of recycling, otherwise it's the ptr_ring and spinlocks.
Note, the normal (non-zc) recycling happens in the receive
syscall, but that's not the normal path here: just like devmem, we
have to give the buffer to the user and wait until it's returned.
> much more rarely. There are also regular perf improvements and testing
> to the generic pool recycling paths you're also opting out of.
For performance, see above. As for testing, tests come after code
functionality, not the other way around. Why are we even adding any
zero copy interface when it could be good old, well-tested
non-zerocopy recv(2)?
> I see a lot of downsides to opting out of the generic use cases. Is
> there any reason the normal freeing paths are not applicable to your
> use case?
>
>> +static inline void io_zc_add_pp_cache(struct page_pool *pp,
>> + struct net_iov *niov)
>> +{
>> +}
>> +
>
> Looks unused/empty.
Indeed, slipped through.
...
--
Pavel Begunkov