Message-ID: <8837c96b-f764-4ba7-ae9b-40299f8c266c@gmail.com>
Date: Mon, 4 Nov 2024 21:14:05 +0000
From: Pavel Begunkov <asml.silence@...il.com>
To: Mina Almasry <almasrymina@...gle.com>
Cc: David Wei <dw@...idwei.uk>, io-uring@...r.kernel.org,
 netdev@...r.kernel.org, Jens Axboe <axboe@...nel.dk>,
 Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
 "David S. Miller" <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>,
 Jesper Dangaard Brouer <hawk@...nel.org>, David Ahern <dsahern@...nel.org>,
 Stanislav Fomichev <stfomichev@...il.com>, Joe Damato <jdamato@...tly.com>,
 Pedro Tammela <pctammela@...atatu.com>
Subject: Re: [PATCH v7 11/15] io_uring/zcrx: implement zerocopy receive pp
 memory provider

On 11/4/24 19:54, Mina Almasry wrote:
> On Fri, Nov 1, 2024 at 2:09 PM Pavel Begunkov <asml.silence@...il.com> wrote:
...
>>> However I feel like even a comment is extremely error prone. These
>>> container_of's are inside of the call stack of some helpers. I would
>>> say we need a check. If we're concerned about performance, the check
>>> can be behind DEBUG_NET_WARN_ON(), although even that is a bit iffy,
>>> but could be fine. Doing this without a check seems too risky to me.
>>
>> No, it doesn't need a check, nor does it need a comment. The very
>> essence of virtual function tables is that they're coupled with
>> the objects for which those functions make sense and are called
>> only for those objects. The only way to get here with invalid
>> net_iovs is to take one page pool and feed it net_iovs from
>> another page pool, which wouldn't be sane in the first place.
>>
> 
> That could happen. In fact the devmem TCP paths as a whole are very
> carefully written to handle that.

What could happen? Calling ops of one page pool with net iovs
of another? Right, you can force yourself to write it this way,
but it's not sane code.
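
Just so we're talking about the same thing, the kind of debug check
being suggested would look roughly like the sketch below. The function
name is made up, it's not something from the series, and it only
illustrates the idea:

#include <net/netmem.h>
#include <net/page_pool/types.h>
#include <net/net_debug.h>

/* sketch only: warn if a netmem handed to this provider's ops didn't
 * come from this page pool in the first place
 */
static void io_zcrx_check_niov_pool(struct page_pool *pp, netmem_ref netmem)
{
        /* only net_iov backed netmems can belong to this provider */
        DEBUG_NET_WARN_ON_ONCE(!netmem_is_net_iov(netmem));
        /* and they must have been allocated from this very pool */
        DEBUG_NET_WARN_ON_ONCE(netmem_to_net_iov(netmem)->pp != pp);
}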

> net_iovs are allocated from the page_pool, put in skbs, and then sit
> in the sk receive queue. In pathological cases (user is
> re/misconfiguring flow steering) we can have 1 sk receive queue that
> has a mix of page skbs, devmem skbs, and io_uring skbs, and other
> skbs.
> 
> Code that is processing the skbs in the receive queue has no idea
> what kind of skb it is. That's why that code needs to check
> whether the skb has readable frags, and that's why in this very series
> you needed to add a check in tcp_recvmsg_dmabuf to make sure that it's
> a dmabuf skb, and you need to add a check to io_zcrx_recv_frag that
> the frag inside it is an io_uring niov. The code would be wrong without
> it.

Right, it's expanded to support multiple possible types instead of
"it's a devmem TCP thing and nothing else can ever use it". And it's
not new: devmem already forks off the generic path, it just does it
based on skb->readable, which is no more than an optimisation and
could've been based on the type of the buffer, e.g. is_net_iov(netmem).
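
To illustrate the difference, dispatching on the buffer itself rather
than on an skb-wide flag is no more than the following sketch (not
code from the series, and it assumes the netmem frag helpers are
available in the tree):

#include <linux/skbuff.h>
#include <net/netmem.h>

/* sketch: decide per frag based on what actually backs it */
static bool frag_is_unreadable(const skb_frag_t *frag)
{
        /* net_iov backed frags (devmem, io_uring zcrx) have no kernel
         * mapping the generic copy path could use
         */
        return netmem_is_net_iov(skb_frag_netmem(frag));
}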

> All I'm trying to say is that it's very error prone to rely on folks

It's really not, especially compared to lots of other bits that
are much easier to screw up; skb->readable would be a stark
example, and one we did catch failing many times.

> writing and reviewing code to check that whenever dmabuf/io_zcrx/etc
> handling is done, somewhere in the call stack a type verification
> check has been made, and a DEBUG_NET_WARN could help avoid some subtle
> memory corruption bugs.
> 
>> That would be an equivalent of:
>>
>> struct file *f1 = ...;
>> struct file *f2 = ...;
>>
>> f1->f_op->read(f2, ...);
>>
>> Maybe it looks strange to you in C, but it's the same as putting
>> a comment on a virtual function saying that it should be called
>> only for objects of that class:
>>
>> struct A {
>>          virtual void foo() = 0;
>> };
>> struct B: public A {
>>          void foo() override {
>>                  // we should only be called with objects of type
>>                  // struct B (or anything inheriting it), check that
>>                  if (!reinterpret_cast<struct B*>(this))
>>                          throw;
>>                  ...
>>          }
>> };
>>
>>
> 
> I'm not really sure I follow here. We do not get any compiler or
> type safety from this code because the dma-buf niovs and io_uring
> niovs are the same net_iov type.

Right, because it's C, but the basic idea of virtual dispatch
is in there.

> We can get type safety by defining new types for dmabuf_net_iov and
> io_uring_net_iov, then providing helpers:
> 
> dmabuf_net_iov *net_iov_to_dmabuf();
> io_uring_net_iov *net_iov_to_io_uring();

Directly aliasing it to parts of struct page doesn't leave
much room for extending types. The only option is for all
those types to have exactly the same layout, but then there is
not much point in doing so.
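
For reference, the aliasing constraint is spelled out in netmem.h
roughly as below (quoting from memory, the exact field list may
differ). Any niov "subtype" would have to preserve exactly this
layout:

#define NET_IOV_ASSERT_OFFSET(pg, iov)             \
        static_assert(offsetof(struct page, pg) == \
                      offsetof(struct net_iov, iov))
NET_IOV_ASSERT_OFFSET(pp_magic, pp_magic);
NET_IOV_ASSERT_OFFSET(pp, pp);
NET_IOV_ASSERT_OFFSET(dma_addr, dma_addr);
NET_IOV_ASSERT_OFFSET(pp_ref_count, pp_ref_count);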

> The helpers can check that the niov is of the right type once and do
> a cast, then the object with the specific type can be passed to all
> future helpers without additional checks. This is one way to do it I
> guess.
> 
...
>> Same for the optimisations provided by those helpers: they are
>> useful when you're transferring buffers from one context to
>> another, e.g. task receive path -> napi / page_pool. In this
>> case they're already fetched in the right context without any
>> need to additionally jump through hoops. If anything, it'd be
>> odd to jump out of a window to climb a rope on the other side
>> of the building when you could've just walked 5 meters to the
>> other room.
>>
> 
> For me, "they are not used because they're not needed." is not enough
> justification to ignore the generic code paths that support generic
> use cases and add your own freeing path and recycling that needs to
> work adjacent to generic paths for posterity. You need to provide
> concrete reasons why the current code paths don't work for you and
> can't be made to work for you.

No, it more than justifies it: it's neither needed nor does it make
sense for the chosen API, and the API is chosen so that it avoids
those extra steps of crossing contexts.

What you're saying is that it should work in a less efficient way and
(perhaps arguably) be less convenient to the user, as it now needs to
care about batching, because that's how devmem TCP does it. It's not
really a good argument.

Let me give you a devmem TCP example of what you're saying. Why can't
you use the generic (copy) TCP path for devmem TCP? It's well tested.
The reason (that it's about zero copy, and copying adds... hmm... a
"copy") doesn't justify avoiding the generic path.

> Is it very complicated to napi_pp_put_page() niovs as the user puts
> them in the refill queue without adding a new syscall? If so, is it
> possible to do a niov equivalent of page_pool_put_page_bulk() of the
> refill queue while/as you process the RX path?

That adds an extra jump from one context to another for no apparent
reason, just as mentioned above.
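
Spelled out, what you're suggesting would look roughly like the sketch
below, i.e. the receive path draining the refill ring and returning
buffers one by one through the generic helper. io_zcrx_ring_pop() is
made up for illustration, and it assumes napi_pp_put_page() takes a
netmem_ref at this point in the tree:

static void io_zcrx_return_from_recv(struct io_zcrx_ifq *ifq)
{
        netmem_ref netmem;

        /* hop back towards the page pool from the recv context */
        while (io_zcrx_ring_pop(ifq, &netmem))
                napi_pp_put_page(netmem);
}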

> If you've tested the generic code paths to be performance deficient
> and your recycling is indeed better, you could improve the page_pool
> to pull netmems when it needs to like you're doing here, but in a
> generic way that applies to the page allocator and other providers.
> Not a one-off implementation that only applies to your provider.

If I read it right, you're saying that devmem TCP should be improved
instead of adding an io_uring API, just as you indirectly stated at
the very beginning a couple of weeks ago. Again, if you're against
having an io_uring user API in general, or against some particular
aspects of the API, then please state it clearly. If not, I can leave
the idea to you to entertain once it's merged.

> If you're absolutely set on ignoring the currently supported reffing
> and implementing your own reffing and recycling for your use case,
> sure, that could work, but please don't overload
> niov->pp_ref_count, which is reserved for the generic use cases, for
> this. Add io_zcrx_area->io_uring_ref or something and do whatever you
> want with it. Since it wouldn't share pp_ref_count with the generic
> code paths, I don't see them conflicting in the future.

That would be a performance problem, and I don't believe they can't
live together.
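
For completeness, the separate counter you're describing would be
shaped roughly like the sketch below (all names are made up), and it's
exactly that extra atomic per buffer on the return path that makes it
a performance problem:

#include <linux/atomic.h>

/* sketch: an io_uring only refcount kept next to the area instead of
 * reusing niov->pp_ref_count
 */
struct io_zcrx_user_refs {
        atomic_t *refs;         /* one entry per net_iov in the area */
        unsigned int nr;
};

static bool io_zcrx_put_user_ref(struct io_zcrx_user_refs *urefs,
                                 unsigned int idx)
{
        /* an extra atomic on top of the pool's own pp_ref_count */
        return atomic_dec_and_test(&urefs->refs[idx]);
}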

-- 
Pavel Begunkov
