Message-ID: <CAHS8izPsQ2XoJy-vYWkn051Yc=D_kSprtQcG4mmPutf1G3+-aw@mail.gmail.com>
Date:   Fri, 8 Dec 2023 11:22:08 -0800
From:   Mina Almasry <almasrymina@...gle.com>
To:     David Ahern <dsahern@...nel.org>
Cc:     Shailend Chand <shailend@...gle.com>, netdev@...r.kernel.org,
        linux-kernel@...r.kernel.org, linux-doc@...r.kernel.org,
        linux-arch@...r.kernel.org, linux-kselftest@...r.kernel.org,
        bpf@...r.kernel.org, linux-media@...r.kernel.org,
        dri-devel@...ts.freedesktop.org,
        "David S. Miller" <davem@...emloft.net>,
        Eric Dumazet <edumazet@...gle.com>,
        Jakub Kicinski <kuba@...nel.org>,
        Paolo Abeni <pabeni@...hat.com>,
        Jonathan Corbet <corbet@....net>,
        Jeroen de Borst <jeroendb@...gle.com>,
        Praveen Kaligineedi <pkaligineedi@...gle.com>,
        Jesper Dangaard Brouer <hawk@...nel.org>,
        Ilias Apalodimas <ilias.apalodimas@...aro.org>,
        Arnd Bergmann <arnd@...db.de>,
        Willem de Bruijn <willemdebruijn.kernel@...il.com>,
        Shuah Khan <shuah@...nel.org>,
        Sumit Semwal <sumit.semwal@...aro.org>,
        Christian König <christian.koenig@....com>,
        Yunsheng Lin <linyunsheng@...wei.com>,
        Harshitha Ramamurthy <hramamurthy@...gle.com>,
        Shakeel Butt <shakeelb@...gle.com>,
        Willem de Bruijn <willemb@...gle.com>,
        Kaiyuan Zhang <kaiyuanz@...gle.com>
Subject: Re: [net-next v1 06/16] netdev: support binding dma-buf to netdevice

On Fri, Dec 8, 2023 at 9:48 AM David Ahern <dsahern@...nel.org> wrote:
>
> On 12/7/23 5:52 PM, Mina Almasry wrote:
...
> > +
> > +     xa_for_each(&binding->bound_rxq_list, xa_idx, rxq) {
> > +             if (rxq->binding == binding) {
> > +                     /* We hold the rtnl_lock while binding/unbinding
> > +                      * dma-buf, so we can't race with another thread that
> > +                      * is also modifying this value. However, the driver
> > +                      * may read this config while it's creating its
> > +                      * rx-queues. WRITE_ONCE() here to match the
> > +                      * READ_ONCE() in the driver.
> > +                      */
> > +                     WRITE_ONCE(rxq->binding, NULL);
> > +
> > +                     rxq_idx = get_netdev_rx_queue_index(rxq);
> > +
> > +                     netdev_restart_rx_queue(binding->dev, rxq_idx);
>
> Blindly restarting a queue when a dmabuf is unbound is heavy handed.
> If the dmabuf has no outstanding references (i.e., no references in
> the RxQ), then no restart is needed.
>

I think I need to stop the queue while binding to a dmabuf for the
sake of concurrency, no? I.e. the softirq thread may be delivering a
packet, and in parallel a separate thread holds rtnl_lock and tries to
bind the dma-buf. At that point the page_pool recreation will race
with the driver doing page_pool_alloc_pages(). I don't think I can
insert a lock into the rx fast path to handle this, no?
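
For concreteness, the pairing I have in mind on the driver side is
roughly the below. This is only a sketch: the drv_* helpers are made
up, the real point is the READ_ONCE() on rxq->binding matching the
WRITE_ONCE() done under rtnl_lock in the bind/unbind path.

/* Sketch only: the driver samples the binding once while (re)creating
 * an rx-queue. The bind/unbind path writes rxq->binding under
 * rtnl_lock with WRITE_ONCE(), so a plain READ_ONCE() here suffices.
 */
static int drv_setup_rx_queue(struct net_device *dev, u32 rxq_idx)
{
	struct netdev_rx_queue *rxq = __netif_get_rx_queue(dev, rxq_idx);
	struct netdev_dmabuf_binding *binding = READ_ONCE(rxq->binding);

	if (binding)
		/* hypothetical helper: build a devmem-backed page_pool */
		return drv_create_devmem_pool(dev, rxq_idx, binding);

	/* hypothetical helper: build a normal host-memory page_pool */
	return drv_create_host_pool(dev, rxq_idx);
}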

Also, this sounds like it requires (lots of) more changes. The
page_pool + driver would need to track how many pending references
there are (with locking so we don't race with incoming packets) and
report that via an ndo so that we can skip restarting the queue.
Implementing the changes isn't in itself a huge issue, but handling
the concurrency may be a genuine blocker. Not sure it's worth the
upside of not restarting the single rx queue?
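
If we did go that route, the shape of the change I'd imagine is a new
ndo along these lines (entirely hypothetical, just to illustrate):

	/* Hypothetical addition to struct net_device_ops, not in this
	 * series: ask the driver how many dma-buf backed buffers an
	 * rx-queue still holds, so the core could skip the restart when
	 * it returns 0. Making this count stable against incoming
	 * packets is the concurrency problem above.
	 */
	int	(*ndo_queue_outstanding_bufs)(struct net_device *dev,
					      u32 rxq_idx);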

> > +             }
> > +     }
> > +
> > +     xa_erase(&netdev_dmabuf_bindings, binding->id);
> > +
> > +     netdev_dmabuf_binding_put(binding);
> > +}
> > +
> > +int netdev_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
> > +                             struct netdev_dmabuf_binding *binding)
> > +{
> > +     struct netdev_rx_queue *rxq;
> > +     u32 xa_idx;
> > +     int err;
> > +
> > +     rxq = __netif_get_rx_queue(dev, rxq_idx);
> > +
> > +     if (rxq->binding)
> > +             return -EEXIST;
> > +
> > +     err = xa_alloc(&binding->bound_rxq_list, &xa_idx, rxq, xa_limit_32b,
> > +                    GFP_KERNEL);
> > +     if (err)
> > +             return err;
> > +
> > +     /* We hold the rtnl_lock while binding/unbinding dma-buf, so we can't
> > +      * race with another thread that is also modifying this value. However,
> > +      * the driver may read this config while it's creating its rx-queues.
> > +      * WRITE_ONCE() here to match the READ_ONCE() in the driver.
> > +      */
> > +     WRITE_ONCE(rxq->binding, binding);
> > +
> > +     err = netdev_restart_rx_queue(dev, rxq_idx);
>
> Similarly, here binding a dmabuf to a queue. I was expecting the dmabuf
> binding to add entries to the page pool for the queue.

To be honest, I think maybe there's a slight disconnect between how
you think the page_pool works, and my primitive understanding of how
it works. Today, I see a 1:1 mapping between rx-queue and page_pool in
the code. I don't see 1:many or many:1 mappings.

In theory mapping 1 rx-queue to n page_pools is trivial: the driver
can call page_pool_create() multiple times to create n page_pools and
decide, for each incoming packet, which one to use.
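
Something like the below, say (sketch only, error handling trimmed;
struct drv_rx_queue and the split between the two pools are made up):

#include <net/page_pool/helpers.h>

/* Sketch: one rx-queue backed by two page_pools; the driver decides
 * per packet which pool to allocate from.
 */
static int drv_setup_rxq_pools(struct drv_rx_queue *rxq,
			       struct device *dma_dev, int ring_size)
{
	struct page_pool_params pp = {
		.flags		= PP_FLAG_DMA_MAP,
		.order		= 0,
		.pool_size	= ring_size,
		.nid		= NUMA_NO_NODE,
		.dev		= dma_dev,
		.dma_dir	= DMA_FROM_DEVICE,
	};

	rxq->pool_a = page_pool_create(&pp);	/* e.g. small buffers */
	rxq->pool_b = page_pool_create(&pp);	/* e.g. large buffers */
	if (IS_ERR(rxq->pool_a) || IS_ERR(rxq->pool_b))
		return -ENOMEM;
	return 0;
}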

However, mapping n rx-queues to 1 page_pool seems like a can of worms.
I see code in the page_pool that looks to me (and Willem) like it's
safe only because the page_pool is used from the same napi context.
With an n rx-queues : 1 page_pool mapping, that is no longer true, no?
There is a long tail of issues to resolve to be able to map 1
page_pool to n queues as I understand it, and even if they were
resolved I'm not sure the maintainers are interested in taking the
code.
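
The kind of code I mean is the usual per-queue pattern, which only
looks safe because one napi context owns the pool (sketch only; struct
drv_rx_queue is made up):

/* The lockless "direct" recycle below assumes the pool is only touched
 * from this queue's napi poll. With n rx-queues sharing one pool,
 * several napi contexts would hit these paths concurrently.
 */
static int drv_napi_poll_rx(struct drv_rx_queue *rxq, int budget)
{
	int done = 0;

	while (done < budget) {
		struct page *page = page_pool_dev_alloc_pages(rxq->pool);

		if (!page)
			break;
		/* ... refill the hw ring, process completions ... */
		page_pool_recycle_direct(rxq->pool, page);
		done++;
	}
	return done;
}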

So, per my humble understanding, there is no such thing as "adding
entries to the page pool for a specific queue"; the page_pool is
always used by exactly 1 queue.

Note that even though this limitation exists, we still support binding
1 dma-buf to multiple queues, because multiple page pools can use the
same netdev_dmabuf_binding. I should add that to the docs.

> If the pool was
> previously empty, then maybe the queue needs to be "started" in the
> sense of creating with h/w or just pushing buffers into the queue and
> moving the pidx.
>
>

I don't think it's enough to add buffers to the page_pool, no? The
existing buffers in the page_pool (host mem) must be purged. I think
maybe the queue needs to be stopped as well so that we don't race with
incoming packets and end up with skbs that mix devmem and non-devmem
frags (unless you're thinking it becomes a requirement to support
that; I think things are complicated enough as-is and this is a good
simplification). Since we already purge the existing buffers & restart
the queue, it's little extra effort to bring this in line with Jakub's
queue-api, which he also wants to use for per-queue configuration &
ndo_stop/open.
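
Roughly, the sequence I picture is the below (pseudo-code, all the
drv_* helpers are made up; only the re-read of rxq->binding ties back
to this patch):

static int drv_restart_rx_queue(struct net_device *dev, u32 rxq_idx)
{
	int err;

	/* quiesce so no packets are in flight for this queue */
	err = drv_stop_rx_queue(dev, rxq_idx);
	if (err)
		return err;

	/* purge the existing host-memory buffers with the old pool */
	drv_destroy_rxq_pool(dev, rxq_idx);

	/* re-create: the driver re-reads rxq->binding here and builds a
	 * devmem-backed pool instead (see the earlier sketch)
	 */
	return drv_start_rx_queue(dev, rxq_idx);
}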

-- 
Thanks,
Mina
