netdev - Re: [PATCH vhost v2 4/7] virtio_net: big mode support premapped

Message-ID: <1713937418.3984292-1-xuanzhuo@linux.alibaba.com>
Date: Wed, 24 Apr 2024 13:43:38 +0800
From: Xuan Zhuo <xuanzhuo@...ux.alibaba.com>
To: Jason Wang <jasowang@...hat.com>
Cc: virtualization@...ts.linux.dev,
 "Michael S. Tsirkin" <mst@...hat.com>,
 "David S. Miller" <davem@...emloft.net>,
 Eric Dumazet <edumazet@...gle.com>,
 Jakub Kicinski <kuba@...nel.org>,
 Paolo Abeni <pabeni@...hat.com>,
 netdev@...r.kernel.org
Subject: Re: [PATCH vhost v2 4/7] virtio_net: big mode support premapped

On Wed, 24 Apr 2024 11:50:44 +0800, Jason Wang <jasowang@...hat.com> wrote:
> On Wed, Apr 24, 2024 at 10:58 AM Xuan Zhuo <xuanzhuo@...ux.alibaba.com> wrote:
> >
> > On Wed, 24 Apr 2024 10:45:49 +0800, Jason Wang <jasowang@...hat.com> wrote:
> > > On Wed, Apr 24, 2024 at 10:42 AM Xuan Zhuo <xuanzhuo@...ux.alibaba.com> wrote:
> > > >
> > > > On Wed, 24 Apr 2024 10:34:56 +0800, Jason Wang <jasowang@...hat.com> wrote:
> > > > > On Wed, Apr 24, 2024 at 9:10 AM Xuan Zhuo <xuanzhuo@...ux.alibaba.com> wrote:
> > > > > >
> > > > > > On Wed, 24 Apr 2024 08:43:21 +0800, Jason Wang <jasowang@...hat.com> wrote:
> > > > > > > On Tue, Apr 23, 2024 at 8:38 PM Xuan Zhuo <xuanzhuo@...ux.alibaba.com> wrote:
> > > > > > > >
> > > > > > > > On Tue, 23 Apr 2024 12:36:42 +0800, Jason Wang <jasowang@...hat.com> wrote:
> > > > > > > > > On Mon, Apr 22, 2024 at 3:24 PM Xuan Zhuo <xuanzhuo@...ux.alibaba.com> wrote:
> > > > > > > > > >
> > > > > > > > > > In big mode, pre-mapping DMA is beneficial because if the pages are not
> > > > > > > > > > used, we can reuse them without needing to unmap and remap.
> > > > > > > > > >
> > > > > > > > > > We require space to store the DMA address. I use the page.dma_addr to
> > > > > > > > > > store the DMA address from the pp structure inside the page.
> > > > > > > > > >
> > > > > > > > > > Every page retrieved from get_a_page() is mapped, and its DMA address is
> > > > > > > > > > stored in page.dma_addr. When a page is returned to the chain, we check
> > > > > > > > > > the DMA status; if it is not mapped (potentially having been unmapped),
> > > > > > > > > > we remap it before returning it to the chain.
> > > > > > > > > >
> > > > > > > > > > Based on the following points, we do not use page pool to manage these
> > > > > > > > > > pages:
> > > > > > > > > >
> > > > > > > > > > 1. virtio-net uses the DMA APIs wrapped by virtio core. Therefore,
> > > > > > > > > >    we can only prevent the page pool from performing DMA operations, and
> > > > > > > > > >    let the driver perform DMA operations on the allocated pages.
> > > > > > > > > > 2. But when the page pool releases the page, we have no chance to
> > > > > > > > > >    execute dma unmap.
> > > > > > > > > > 3. A solution to #2 is to execute dma unmap every time before putting
> > > > > > > > > >    the page back to the page pool. (This is actually a waste, we don't
> > > > > > > > > >    execute unmap so frequently.)
> > > > > > > > > > 4. But there is another problem, we still need to use page.dma_addr to
> > > > > > > > > >    save the dma address. Using page.dma_addr while using page pool is
> > > > > > > > > >    unsafe behavior.
> > > > > > > > > >
> > > > > > > > > > More:
> > > > > > > > > >     https://lore.kernel.org/all/CACGkMEu=Aok9z2imB_c5qVuujSh=vjj1kx12fy9N7hqyi+M5Ow@mail.gmail.com/
> > > > > > > > > >
> > > > > > > > > > Signed-off-by: Xuan Zhuo <xuanzhuo@...ux.alibaba.com>
> > > > > > > > > > ---
> > > > > > > > > >  drivers/net/virtio_net.c | 123 ++++++++++++++++++++++++++++++++++-----
> > > > > > > > > >  1 file changed, 108 insertions(+), 15 deletions(-)
> > > > > > > > > >
> > > > > > > > > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > > > > > > > > > index 2c7a67ad4789..d4f5e65b247e 100644
> > > > > > > > > > --- a/drivers/net/virtio_net.c
> > > > > > > > > > +++ b/drivers/net/virtio_net.c
> > > > > > > > > > @@ -439,6 +439,81 @@ skb_vnet_common_hdr(struct sk_buff *skb)
> > > > > > > > > >         return (struct virtio_net_common_hdr *)skb->cb;
> > > > > > > > > >  }
> > > > > > > > > >
> > > > > > > > > > +static void sg_fill_dma(struct scatterlist *sg, dma_addr_t addr, u32 len)
> > > > > > > > > > +{
> > > > > > > > > > +       sg->dma_address = addr;
> > > > > > > > > > +       sg->length = len;
> > > > > > > > > > +}
> > > > > > > > > > +
> > > > > > > > > > +/* For pages submitted to the ring, we need to record its dma for unmap.
> > > > > > > > > > + * Here, we use the page.dma_addr and page.pp_magic to store the dma
> > > > > > > > > > + * address.
> > > > > > > > > > + */
> > > > > > > > > > +static void page_chain_set_dma(struct page *p, dma_addr_t addr)
> > > > > > > > > > +{
> > > > > > > > > > +       if (sizeof(dma_addr_t) > sizeof(unsigned long)) {
> > > > > > > > >
> > > > > > > > > Need a macro like PAGE_POOL_32BIT_ARCH_WITH_64BIT_DMA.
> > > > > > > > >
> > > > > > > > > > +               p->dma_addr = lower_32_bits(addr);
> > > > > > > > > > +               p->pp_magic = upper_32_bits(addr);
> > > > > > > > >
> > > > > > > > > And this uses three fields on page_pool which I'm not sure the other
> > > > > > > > > > maintainers are happy with. For example, re-using pp_magic might be
> > > > > > > > > dangerous. See c07aea3ef4d40 ("mm: add a signature in struct page").
> > > > > > > > >
> > > > > > > > > > I think a safer way is to reuse the page pool, for example by introducing
> > > > > > > > > > a new flag with DMA callbacks?
> > > > > > > >
> > > > > > > > If we use page pool, how can we chain the pages allocated for a packet?
> > > > > > >
> > > > > > > I'm not sure I get this, it is chained via the descriptor flag.
> > > > > >
> > > > > >
> > > > > > In the big mode, we will commit many pages to the virtio core by
> > > > > > virtqueue_add_inbuf().
> > > > > >
> > > > > > virtqueue_get_buf_ctx() returns the data, which is the first page.
> > > > > > The other pages are chained through the "private" field.
> > > > > >
> > > > > > If we use the page pool, how can we chain the pages?
> > > > > > After virtqueue_add_inbuf(), we need to get the pages back to fill the skb.
> > > > >
> > > > > Right, technically it could be solved by providing helpers in the
> > > > > virtio core, but considering it's an optimization for big mode which
> > > > > is not popular, it's not worth the bother.
> > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > You know the "private" field cannot be used.
> > > > > > > >
> > > > > > > >
> > > > > > > > If the pp struct inside the page is not safe, how about:
> > > > > > > >
> > > > > > > >                 struct {        /* Page cache and anonymous pages */
> > > > > > > >                         /**
> > > > > > > >                          * @lru: Pageout list, eg. active_list protected by
> > > > > > > >                          * lruvec->lru_lock.  Sometimes used as a generic list
> > > > > > > >                          * by the page owner.
> > > > > > > >                          */
> > > > > > > >                         union {
> > > > > > > >                                 struct list_head lru;
> > > > > > > >
> > > > > > > >                                 /* Or, for the Unevictable "LRU list" slot */
> > > > > > > >                                 struct {
> > > > > > > >                                         /* Always even, to negate PageTail */
> > > > > > > >                                         void *__filler;
> > > > > > > >                                         /* Count page's or folio's mlocks */
> > > > > > > >                                         unsigned int mlock_count;
> > > > > > > >                                 };
> > > > > > > >
> > > > > > > >                                 /* Or, free page */
> > > > > > > >                                 struct list_head buddy_list;
> > > > > > > >                                 struct list_head pcp_list;
> > > > > > > >                         };
> > > > > > > >                         /* See page-flags.h for PAGE_MAPPING_FLAGS */
> > > > > > > >                         struct address_space *mapping;
> > > > > > > >                         union {
> > > > > > > >                                 pgoff_t index;          /* Our offset within mapping. */
> > > > > > > >                                 unsigned long share;    /* share count for fsdax */
> > > > > > > >                         };
> > > > > > > >                         /**
> > > > > > > >                          * @private: Mapping-private opaque data.
> > > > > > > >                          * Usually used for buffer_heads if PagePrivate.
> > > > > > > >                          * Used for swp_entry_t if PageSwapCache.
> > > > > > > >                          * Indicates order in the buddy system if PageBuddy.
> > > > > > > >                          */
> > > > > > > >                         unsigned long private;
> > > > > > > >                 };
> > > > > > > >
> > > > > > > > Or, we can map the private space of the page as a new structure.
> > > > > > >
> > > > > > > It could be a way. But such an allocation might be huge if we are using
> > > > > > > indirect descriptors, or I may be missing something.
> > > > > >
> > > > > > No, we only need to store the "chain next" pointer and the DMA address, as
> > > > > > this patch set does. The size of the private space inside the page is
> > > > > > 20 bytes (32-bit) or 40 bytes (64-bit). That is enough for us.
> > > > > >
> > > > > > If you are worried about changes to the pp structure, we can keep using
> > > > > > "private" as before and use "struct list_head lru" to store the DMA address.
> > > > >
> > > > > This looks even worse, as it uses fields belonging to the different
> > > > > structures in the union.
> > > >
> > > > I mean that we would not use the fields of the pp structure inside the page,
> > > > if we are worried about changes to the pp structure.
> > > >
> > > > I mean using "private" and "lru", which are in the same structure.
> > > >
> > > > I think this is a good way.
> > > >
> > > > Thanks.
> > >
> > > See this:
> > >
> > > https://lore.kernel.org/netdev/20210411114307.5087f958@carbon/
> >
> >
> > I think that is because the page pool shares the page with
> > the skbs.  I'm not entirely sure.
> >
> > In our case, virtio-net fully owns the page. After the page is referenced by skb,
> > virtio-net no longer references the page. I don't think there is any problem
> > here.
>
> Well, in the rx path, though the page is allocated by the virtio-net,
> unlike the page pool, those pages are not freed by virtio-net. So it
> may leave things in the page structure which is problematic. I don't
> think we can introduce a virtio-net specific hook for kfree_skb() in
> this case. That's why I think leveraging the page pool is better.
>
> For reusing page pool. Maybe we can reuse __pp_mapping_pad for
> virtio-net specific use cases like chaining, and clear it in
> page_pool_clear_pp_info(). And we need to make sure we don't break
> things like TCP RX zerocopy since mapping is aliased with
> __pp_mapping_pad at a first glance.
>
> >
> > The key point is that whoever owns the page can use the page's private space (20/40 bytes).
> >
> > Is that right?
>
> I'm not saying we can't investigate in this direction. But it needs
> more comments from mm guys and we need to evaluate the price we pay
> for that.
>
> The motivation is to drop the fallback code when pre mapping is not
> supported to improve the maintainability of the code and ease the
> AF_XDP support for virtio-net. But it turns out to be not easy.
>
> Considering the rx fallback code we need to maintain is not too huge,
> maybe we can leave it as is and, for example, forbid AF_XDP in big mode.

I see.

Thanks.


>
> Thanks
>
> >
> > Thanks.
> >
> >
> > >
> > > Thanks
> > >
> >
>
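
For reference, the address-splitting step discussed in the quoted patch
(page_chain_set_dma() storing a DMA address in two unsigned-long-sized
page fields when dma_addr_t is wider than unsigned long) can be sketched
in plain userspace C. This is only an illustration under stated
assumptions, not the kernel code: struct fake_page, dma_addr64_t,
chain_set_dma(), chain_get_dma() and chain_dma_selfcheck() are
hypothetical names, and the getter is an assumed counterpart that does
not appear in the quoted hunk.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the two struct page fields used by the patch. */
struct fake_page {
	unsigned long dma_addr;   /* low part, or the whole address */
	unsigned long pp_magic;   /* high part, only used on the split path */
};

/* Assumption: model dma_addr_t as a fixed 64-bit type. */
typedef uint64_t dma_addr64_t;

/* Store the address; split it only if one unsigned long cannot hold it. */
static void chain_set_dma(struct fake_page *p, dma_addr64_t addr)
{
	if (sizeof(dma_addr64_t) > sizeof(unsigned long)) {
		p->dma_addr = (unsigned long)(addr & 0xffffffffu); /* lower 32 bits */
		p->pp_magic = (unsigned long)(addr >> 32);         /* upper 32 bits */
	} else {
		p->dma_addr = (unsigned long)addr;
	}
}

/* Assumed inverse of chain_set_dma(): rebuild the full address. */
static dma_addr64_t chain_get_dma(const struct fake_page *p)
{
	if (sizeof(dma_addr64_t) > sizeof(unsigned long))
		return ((dma_addr64_t)p->pp_magic << 32) |
		       (dma_addr64_t)p->dma_addr;
	return (dma_addr64_t)p->dma_addr;
}

/* Round-trip check: the value read back must equal the value stored. */
static void chain_dma_selfcheck(void)
{
	struct fake_page p = { 0, 0 };

	chain_set_dma(&p, 0x123456789abcdef0ULL);
	assert(chain_get_dma(&p) == 0x123456789abcdef0ULL);
}
```

The sketch also shows why the review comment about pp_magic matters: on
the split path the upper half of the address overwrites a field that
other code may inspect as a signature, so the field would have to be
cleared again before the page leaves the driver's ownership.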