Message-ID: <659d513f5c0f6_161283294f5@willemb.c.googlers.com.notmuch>
Date: Tue, 09 Jan 2024 08:59:27 -0500
From: Willem de Bruijn <willemdebruijn.kernel@...il.com>
To: Alexander Lobakin <aleksander.lobakin@...el.com>, 
 Willem de Bruijn <willemdebruijn.kernel@...il.com>
Cc: "David S. Miller" <davem@...emloft.net>, 
 Eric Dumazet <edumazet@...gle.com>, 
 Jakub Kicinski <kuba@...nel.org>, 
 Paolo Abeni <pabeni@...hat.com>, 
 Maciej Fijalkowski <maciej.fijalkowski@...el.com>, 
 Michal Kubiak <michal.kubiak@...el.com>, 
 Larysa Zaremba <larysa.zaremba@...el.com>, 
 Alexei Starovoitov <ast@...nel.org>, 
 Daniel Borkmann <daniel@...earbox.net>, 
 intel-wired-lan@...ts.osuosl.org, 
 netdev@...r.kernel.org, 
 linux-kernel@...r.kernel.org
Subject: Re: [PATCH RFC net-next 05/34] idpf: convert header split mode to
 libie + napi_build_skb()

Alexander Lobakin wrote:
> From: Willem De Bruijn <willemdebruijn.kernel@...il.com>
> Date: Wed, 27 Dec 2023 10:30:48 -0500
> 
> > Alexander Lobakin wrote:
> >> Currently, idpf uses the following model for the header buffers:
> >>
> >> * buffers are allocated via dma_alloc_coherent();
> >> * when receiving, napi_alloc_skb() is called and then the header is
> >>   copied to the newly allocated linear part.
> >>
> >> This is far from optimal, as the DMA coherent zone is slow on many
> >> systems and the memcpy() negates the idea and benefits of header split.
> > 
> > Do you have data showing this?
> 
> Showing slow coherent DMA or memcpy()?
> Try MIPS for the first one.
> For the second -- try comparing performance on ice with the "legacy-rx"
> private flag disabled and enabled.
> 
> > 
> > The assumption for the current model is that the headers will be
> > touched shortly after, so the copy just primes the cache.
> 
> They won't be touched in many cases. E.g. XDP_DROP.
> Or headers can be long. memcpy(32) != memcpy(128).
> The current model allocates a new skb with a linear part, which is a
> real memory allocation. napi_build_skb() doesn't allocate anything
> except struct sk_buff, which is usually available in the NAPI percpu cache.
> If build_skb() weren't more efficient, it wouldn't have been introduced.
> The current model just assumes default socket traffic with ~40-byte
> headers and no XDP etc.
> 
> > 
> > The single coherently allocated region for all headers reduces
> > IOTLB pressure.
> 
> page_pool pages are mapped once at allocation.
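
For reference, a sketch of what "mapped once" means here, assuming a
page_pool created with DMA mapping delegated to the pool; the sizes and the
helper name are illustrative:

#include <linux/dma-direction.h>
#include <linux/numa.h>
#include <net/page_pool/types.h>

static struct page_pool *hdr_pool_create(struct device *dev,
					 struct napi_struct *napi)
{
	struct page_pool_params pp = {
		.flags		= PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
		.order		= 0,
		.pool_size	= 256,			/* illustrative ring size */
		.nid		= NUMA_NO_NODE,
		.dev		= dev,
		.napi		= napi,
		.dma_dir	= DMA_FROM_DEVICE,
		.max_len	= PAGE_SIZE,
		.offset		= 0,
	};

	/* With PP_FLAG_DMA_MAP, dma_map_page() runs once per page when the
	 * pool allocates it; recycled pages keep their existing mapping, so
	 * there is no per-packet map/unmap and no extra IOTLB churn.
	 */
	return page_pool_create(&pp);
}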
> 
> > 
> > It is possible that the alternative model is faster. But that is not
> > trivially obvious.
> > 
> > I think patches like this can stand on their own. Probably best to
> > leave them out of the dependency series to enable XDP and AF_XDP.
> 
> You can't do XDP on a DMA coherent zone. To do this memcpy(), you need
> to allocate a new skb with a linear part, which is usually done after
> XDP; otherwise it's too much overhead with little-to-no benefit
> compared to generic skb XDP.
> The current idpf code is simply not compatible with the XDP code in
> this series; it's pointless to do double work.
> 
> Disabling header split when XDP is enabled (the alternative option)
> means disabling TCP zerocopy and worse performance in general; I don't
> consider that an option.

My concern is that optimizations for XDP might degrade the TCP/IP common
path. XDP_DROP, and even XDP as a whole, is a niche feature by comparison.

The current driver behavior was not the first approach for IDPF; it was
arrived at after extensive performance debugging. An earlier iteration
used separate header buffers. Switching to a single coherently allocated
buffer region significantly increased throughput and narrowed the gap
between header-split and non-header-split mode.

I follow your argument, and the heuristics are reasonable. My request
is only that this decision be based on real data for this driver on
modern platforms. We cannot regress TCP/IP hot-path performance.



