[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <CAHApi-mEwbAcYzHHTR0QMSA8EqByvmwzLCBkMw2_6hxqkiCBmQ@mail.gmail.com>
Date: Wed, 29 Mar 2023 18:03:35 +0200
From: Kal Cutter Conley <kal.conley@...tris.com>
To: Magnus Karlsson <magnus.karlsson@...il.com>
Cc: Björn Töpel <bjorn@...nel.org>,
Magnus Karlsson <magnus.karlsson@...el.com>,
Maciej Fijalkowski <maciej.fijalkowski@...el.com>,
Jonathan Lemon <jonathan.lemon@...il.com>,
"David S. Miller" <davem@...emloft.net>,
Eric Dumazet <edumazet@...gle.com>,
Jakub Kicinski <kuba@...nel.org>,
Paolo Abeni <pabeni@...hat.com>,
Alexei Starovoitov <ast@...nel.org>,
Daniel Borkmann <daniel@...earbox.net>,
Jesper Dangaard Brouer <hawk@...nel.org>,
John Fastabend <john.fastabend@...il.com>,
netdev@...r.kernel.org, bpf@...r.kernel.org,
linux-kernel@...r.kernel.org,
Toke Høiland-Jørgensen <toke@...nel.org>
Subject: Re: [PATCH bpf-next 1/3] xsk: Support UMEM chunk_size > PAGE_SIZE
> > Add core AF_XDP support for chunk sizes larger than PAGE_SIZE. This
> > enables sending/receiving jumbo ethernet frames up to the theoretical
> > maxiumum of 64 KiB. For chunk sizes > PAGE_SIZE, the UMEM is required
> > to consist of HugeTLB VMAs (and be hugepage aligned). Initially, only
> > XDP_COPY mode is usuable pending future driver work.
>
> nit: useable
Fixed in v2.
>
> > For consistency, check for HugeTLB pages during UMEM registration. This
> > implies that hugepages are required for XDP_COPY mode despite DMA not
> > being used. This restriction is desirable since it ensures user software
> > can take advantage of future driver support.
> >
> > Even in HugeTLB mode, continue to do page accounting using order-0
> > (4 KiB) pages. This minimizes the size of this change and reduces the
> > risk of impacting driver code. Taking full advantage of hugepages for
> > accounting should improve XDP performance in the general case.
>
> Thank you Kal for working on this. Interesting stuff.
>
> First some general comments and questions on the patch set:
>
> * Please document this new feature in Documentation/networking/af_xdp.rst
Fixed in v2.
> * Have you verified the SKB path for Rx? Tx was exercised by running l2fwd.
This patchset allows sending/receiving 9000 MTU packets with xdpsock
(slightly modified). The benchmark numbers show the results for rxdrop
(-r).
> * Have you checked that an XDP program can access the full >4K packet?
> The xdp_buff has no problem with this as the buffer is consecutive,
> but just wondering if there is some other check or limit in there?
> Jesper and Toke will likely know, so roping them in.
Yes, the full packet can be accessed from a SEC("xdp") BPF program
(only tested in SKB mode).
> * Would be interesting to know your thoughts about taking this to
> zero-copy mode too. It would be good if you could support all modes
> from the get go, instead of partial support for some unknown amount of
> time and then zero-copy support. Partial support makes using the
> feature more cumbersome when an app is deployed on various systems.
> The continuity checking code you have at the end of the patch is a
> step in the direction of zero-copy support, it seems.
I think this patchset is enough to support zero-copy as long as the
driver allows it. Currently, no drivers will work out of the box AFAIK
since they all validate the chunk_size or the MTU size. I would
absolutely love for drivers to support this. Hopefully this patchset
is enough inspiration? :-) Do you think it's absolutely necessary to
have driver ZC support ready to land this?
> * What happens if I try to run this in zero-copy mode with a chunk_size > 4K?
AFAIK drivers check for this and throw an error. Maybe there are some
drivers that don't check this properly?
> * There are some compilation errors to fix from the kernel test robot
Fixed in v2.
>
> require_hugetlb would be a clearer name
Fixed in v2. Renamed to `need_hugetlb`.
>
> next_mapping? n as a name is not very descriptive.
Fixed in v2. Renamed to `stride`.
>
> > u32 i;
> >
> > - for (i = 0; i < dma_map->dma_pages_cnt - 1; i++) {
> > - if (dma_map->dma_pages[i] + PAGE_SIZE == dma_map->dma_pages[i + 1])
> > + for (i = 0; i + n < dma_map->dma_pages_cnt; i++) {
> > + if (dma_map->dma_pages[i] + page_size == dma_map->dma_pages[i + n])
> > dma_map->dma_pages[i] |= XSK_NEXT_PG_CONTIG_MASK;
> > else
> > dma_map->dma_pages[i] &= ~XSK_NEXT_PG_CONTIG_MASK;
> > }
> > + for (; i < dma_map->dma_pages_cnt; i++)
> > + dma_map->dma_pages[i] &= ~XSK_NEXT_PG_CONTIG_MASK;
>
> Is this not too conservative? If your umem consists of two huge pages
> mappings but they are not mapped consecutively in physical memory, you
> are going to mark all the chunks as non-consecutive. Would it not be
> better to just look chunk_size ahead of you instead of page_size
> above? The only thing you care about is that the chunk you are in is
> in consecutive physical memory, and that is strictly only true for
> zero-copy mode. So this seems to be in preparation for zero-copy mode.
>
It is slightly too conservative. I have updated the logic a bit in v2.
If the packet doesn't cross a page boundary, then this array is not
read anyway.
Thanks!
Kal
Powered by blists - more mailing lists