linux-kernel - Re: [PATCH net-next v19 06/13] memory-provider: dmabuf devmem memory provider

Message-ID: <CAHS8izPMC+XhXKbJOQ3ymizyKuARSOv_cO_xO+q1EG4zoy6Gig@mail.gmail.com>
Date: Wed, 14 Aug 2024 10:55:49 -0400
From: Mina Almasry <almasrymina@...gle.com>
To: Pavel Begunkov <asml.silence@...il.com>
Cc: netdev@...r.kernel.org, linux-kernel@...r.kernel.org, 
	linux-doc@...r.kernel.org, linux-alpha@...r.kernel.org, 
	linux-mips@...r.kernel.org, linux-parisc@...r.kernel.org, 
	sparclinux@...r.kernel.org, linux-trace-kernel@...r.kernel.org, 
	linux-arch@...r.kernel.org, linux-kselftest@...r.kernel.org, 
	bpf@...r.kernel.org, linux-media@...r.kernel.org, 
	dri-devel@...ts.freedesktop.org, "David S. Miller" <davem@...emloft.net>, 
	Eric Dumazet <edumazet@...gle.com>, Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>, 
	Donald Hunter <donald.hunter@...il.com>, Jonathan Corbet <corbet@....net>, 
	Richard Henderson <richard.henderson@...aro.org>, Ivan Kokshaysky <ink@...assic.park.msu.ru>, 
	Matt Turner <mattst88@...il.com>, Thomas Bogendoerfer <tsbogend@...ha.franken.de>, 
	"James E.J. Bottomley" <James.Bottomley@...senpartnership.com>, Helge Deller <deller@....de>, 
	Andreas Larsson <andreas@...sler.com>, Jesper Dangaard Brouer <hawk@...nel.org>, 
	Ilias Apalodimas <ilias.apalodimas@...aro.org>, Steven Rostedt <rostedt@...dmis.org>, 
	Masami Hiramatsu <mhiramat@...nel.org>, Mathieu Desnoyers <mathieu.desnoyers@...icios.com>, 
	Arnd Bergmann <arnd@...db.de>, Steffen Klassert <steffen.klassert@...unet.com>, 
	Herbert Xu <herbert@...dor.apana.org.au>, David Ahern <dsahern@...nel.org>, 
	Willem de Bruijn <willemdebruijn.kernel@...il.com>, Shuah Khan <shuah@...nel.org>, 
	Alexei Starovoitov <ast@...nel.org>, Daniel Borkmann <daniel@...earbox.net>, 
	John Fastabend <john.fastabend@...il.com>, Sumit Semwal <sumit.semwal@...aro.org>, 
	Christian König <christian.koenig@....com>, 
	Bagas Sanjaya <bagasdotme@...il.com>, Christoph Hellwig <hch@...radead.org>, 
	Nikolay Aleksandrov <razor@...ckwall.org>, Taehee Yoo <ap420073@...il.com>, David Wei <dw@...idwei.uk>, 
	Jason Gunthorpe <jgg@...pe.ca>, Yunsheng Lin <linyunsheng@...wei.com>, 
	Shailend Chand <shailend@...gle.com>, Harshitha Ramamurthy <hramamurthy@...gle.com>, 
	Shakeel Butt <shakeel.butt@...ux.dev>, Jeroen de Borst <jeroendb@...gle.com>, 
	Praveen Kaligineedi <pkaligineedi@...gle.com>, Willem de Bruijn <willemb@...gle.com>, 
	Kaiyuan Zhang <kaiyuanz@...gle.com>
Subject: Re: [PATCH net-next v19 06/13] memory-provider: dmabuf devmem memory provider

On Wed, Aug 14, 2024 at 10:11 AM Pavel Begunkov <asml.silence@...il.com> wrote:
...
> > diff --git a/net/core/devmem.c b/net/core/devmem.c
> > index 301f4250ca82..2f2a7f4dee4c 100644
> > --- a/net/core/devmem.c
> > +++ b/net/core/devmem.c
> > @@ -17,6 +17,7 @@
> >   #include <linux/genalloc.h>
> >   #include <linux/dma-buf.h>
> >   #include <net/devmem.h>
> > +#include <net/mp_dmabuf_devmem.h>
> >   #include <net/netdev_queues.h>
> >
> >   #include "page_pool_priv.h"
> > @@ -153,6 +154,10 @@ int net_devmem_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
> >       if (err)
> >               goto err_xa_erase;
> >
> > +     err = page_pool_check_memory_provider(dev, rxq, binding);
>
> Frankly, I pretty much don't like it.
>
> 1. We do it after reconfiguring the queue just to fail and reconfigure
> it again.
>

I don't see an issue with that? Or is it just me?

> 2. It should be a part of the common path like netdev_rx_queue_restart(),
> not specific to devmem TCP.
>
> These two can be fixed by moving the check into
> netdev_rx_queue_restart() just after ->ndo_queue_mem_alloc, assuming
> that the callback where we init page pools.
>

The only reason is that page_pool_check_memory_provider() needs to
know which memory provider to check for. Keeping the two separate keeps
netdev_rx_queue_restart() usable for future use cases that don't
expect a memory provider to be bound. You are correct, though, that
this is easily resolved by passing the binding to
netdev_rx_queue_restart() and doing the
page_pool_check_memory_provider() check inside that function.

> 3. That implicit check gives me bad feeling, instead of just getting
> direct feedback from the driver, either it's a flag or an error
> returned, we have to try to figure what exactly the driver did, with
> a high chance this inference will fail us at some point.
>

This is where I get a bit confused. Jakub did mention that it is
desirable for core to verify that the driver did the right thing,
instead of trusting that a driver did the right thing without
verifying. Relying on a flag from the driver opens the door for the
driver to say "I support this" but actually not create the mp
page_pool. In my mind the explicit check is superior to getting
feedback from the driver.

Additionally this approach lets us detect support in core using 10
lines of code or so, rather than ask every driver that wants to
support mp to add boilerplate code to declare support (and run into
subtle bugs when this boilerplate is missing). There are minor pros
and cons to each approach; I don't see a showstopping reason to go
with one over the other.

> And page_pool_check_memory_provider() is not that straightforward,
> it doesn't walk through pools of a queue.

Right, we don't save the pp of a queue, only a netdev. The outer loop
checks all the pps of the netdev to find one with the correct binding,
and the inner loop checks that this binding is attached to the correct
queue.
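
To make the shape of that walk concrete, here is a rough userspace
sketch. The struct names, fields, and the -ENOENT return value are
simplified stand-ins I made up for illustration, not the kernel's real
definitions or the code in the patch; only the outer/inner loop
structure follows the description above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a dmabuf binding: the rx queues it is bound to. */
struct pp_sketch_binding {
	const unsigned int *bound_rxqs;
	size_t n_bound;
};

/* Hypothetical stand-in for a page pool; mp_priv is NULL without a provider. */
struct pp_sketch_pool {
	const struct pp_sketch_binding *mp_priv;
};

/* Hypothetical stand-in for a netdev that only tracks its page pools. */
struct pp_sketch_dev {
	const struct pp_sketch_pool *pools;
	size_t n_pools;
};

/*
 * Return 0 if some pool of the device uses 'binding' and that binding
 * is attached to rxq_idx; otherwise return -ENOENT (error code chosen
 * for this sketch only).
 */
static int pp_sketch_check_memory_provider(const struct pp_sketch_dev *dev,
					   unsigned int rxq_idx,
					   const struct pp_sketch_binding *binding)
{
	assert(dev && binding);

	/* Outer loop: find a pool of this device created with the binding. */
	for (size_t i = 0; i < dev->n_pools; i++) {
		if (dev->pools[i].mp_priv != binding)
			continue;

		/* Inner loop: confirm the binding is attached to this queue. */
		for (size_t j = 0; j < binding->n_bound; j++) {
			if (binding->bound_rxqs[j] == rxq_idx)
				return 0;
		}
	}

	return -ENOENT;
}
```

Since the inner loop depends only on the binding and the queue index,
it could equally be hoisted out of the outer loop in this sketch, which
is the restructuring suggested in the quoted text below.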

> Not looking too deep,
> but it seems like the nested loop can be moved out with the same
> effect, so it first looks for a pool in the device and the follows
> with the bound_rxqs. And seems the bound_rxqs check would always turn
> true, you set the binding into the map in
> net_devmem_bind_dmabuf_to_queue() before the restart and it'll be there
> after restart for page_pool_check_memory_provider(). Maybe I missed
> something, but it's not super clear.
>
> 4. And the last thing Jakub mentioned is that we need to be prepared
> to expose a flag to the userspace for whether a queue supports
> netiov. Not really doable in a sane manner with such implicit
> post configuration checks.
>

I don't see a very strong reason to expose the flag to userspace
now. Userspace can try to bind a dmabuf and get an EOPNOTSUPP if the
operation is not supported, right? If passing the flag to userspace
becomes necessary for some use case in the future, we will need
feedback from the driver, and it would be trivial to add along the
lines of what you suggested.

> And that brings us back to the first approach I mentioned, where
> we have a flag in the queue structure, drivers set it, and
> netdev_rx_queue_restart() checks it before any callback. That's
> where the thread with Jakub stopped, and it reads like at least
> he's not against the idea.

Hmm, the netdev_rx_queue array is created in core, not by the driver.
Would the driver set this flag during initialization? We could run into
subtle bugs with races if a code path checks for support after core
has allocated the netdev_rx_queue array but before the driver has had
a chance to declare support, right? Maybe a minor issue. Instead we
could add an ndo to the queue API that lets the driver tell us that it
could support binding on a given rx queue, and check that in
net_devmem_bind_dmabuf_to_queue() right before we do the bind?
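
Roughly what I have in mind, as a userspace sketch only. The callback
name queue_supports_mp, the ops struct, and the helper names are all
hypothetical; nothing like this exists in the series today, and the
-ERANGE return for a bad queue index is just a choice for the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct ndo_sketch_dev;

/* Hypothetical per-queue ops; the callback may be NULL if unsupported. */
struct ndo_sketch_queue_ops {
	bool (*queue_supports_mp)(const struct ndo_sketch_dev *dev,
				  unsigned int rxq_idx);
};

/* Hypothetical stand-in for a netdev with a fixed number of rx queues. */
struct ndo_sketch_dev {
	const struct ndo_sketch_queue_ops *queue_ops;
	unsigned int n_rxqs;
};

/* Example driver callback: pretend only even-numbered queues qualify. */
static bool ndo_sketch_even_queues_only(const struct ndo_sketch_dev *dev,
					unsigned int rxq_idx)
{
	(void)dev;
	return (rxq_idx % 2) == 0;
}

/*
 * Check run by core right before binding: ask the driver at bind time
 * instead of reading a flag it may not have set yet.
 */
static int ndo_sketch_bind_precheck(const struct ndo_sketch_dev *dev,
				    unsigned int rxq_idx)
{
	assert(dev);

	if (rxq_idx >= dev->n_rxqs)
		return -ERANGE;

	/* A missing ops table or callback means the driver has no support. */
	if (!dev->queue_ops || !dev->queue_ops->queue_supports_mp)
		return -EOPNOTSUPP;

	if (!dev->queue_ops->queue_supports_mp(dev, rxq_idx))
		return -EOPNOTSUPP;

	return 0;
}
```

Because the driver is queried when the bind is attempted, there is no
window in which core reads a flag the driver has not yet initialized.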

But this is only if declaring support to userspace becomes needed for
some use case. At the moment I'm under the impression that verifying
in core that the driver did the right thing is preferred, and I'd like
to minimize the boilerplate the driver needs to implement if possible.

Additionally, this series is big and blocks several interesting pieces
of follow-up work; going forward with an approach that works, and that
can easily be iterated on later if we run into issues, could be wise. I
do not see an issue with adding a driver signal in the future (if
needed) and deprecating the core check (if needed), right?

--
Thanks,
Mina