[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <6633c62c7a59f887eb15d3309c9cc408a74fb46f.camel@redhat.com>
Date: Fri, 23 Sep 2022 18:19:07 +0200
From: Paolo Abeni <pabeni@...hat.com>
To: Alexander Duyck <alexander.duyck@...il.com>
Cc: netdev@...r.kernel.org, Eric Dumazet <edumazet@...gle.com>,
"David S. Miller" <davem@...emloft.net>,
Jakub Kicinski <kuba@...nel.org>
Subject: Re: [PATCH net-next v2] net: skb: introduce and use a single page
frag cache
On Fri, 2022-09-23 at 08:22 -0700, Alexander Duyck wrote:
> On Fri, Sep 23, 2022 at 12:41 AM Paolo Abeni <pabeni@...hat.com> wrote:
> >
> > Hello,
> >
> > On Thu, 2022-09-22 at 14:17 -0700, Alexander Duyck wrote:
> > [...]
> > > My suggestion earlier was to just make the 1k cache a page_frag_cache.
> > > It will allow you to reuse the same structure members and a single
> > > pointer to track them. Waste would be minimal since the only real
> > > difference between the structures is about 8B for the structure, and
> > > odds are the napi_alloc_cache allocation is being rounded up anyway.
> > >
> > > > unsigned int skb_count;
> > > > void *skb_cache[NAPI_SKB_CACHE_SIZE];
> > > > };
> > > > @@ -143,6 +202,23 @@ struct napi_alloc_cache {
> > > > static DEFINE_PER_CPU(struct page_frag_cache, netdev_alloc_cache);
> > > > static DEFINE_PER_CPU(struct napi_alloc_cache, napi_alloc_cache);
> > > >
> > > > +/* Double check that napi_get_frags() allocates skbs with
> > > > + * skb->head being backed by slab, not a page fragment.
> > > > + * This is to make sure bug fixed in 3226b158e67c
> > > > + * ("net: avoid 32 x truesize under-estimation for tiny skbs")
> > > > + * does not accidentally come back.
> > > > + */
> > > > +void napi_get_frags_check(struct napi_struct *napi)
> > > > +{
> > > > + struct sk_buff *skb;
> > > > +
> > > > + local_bh_disable();
> > > > + skb = napi_get_frags(napi);
> > > > + WARN_ON_ONCE(!NAPI_HAS_SMALL_PAGE_FRAG && skb && skb->head_frag);
> > > > + napi_free_frags(napi);
> > > > + local_bh_enable();
> > > > +}
> > > > +
> > > > void *__napi_alloc_frag_align(unsigned int fragsz, unsigned int align_mask)
> > > > {
> > > > struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
> > > > @@ -561,6 +637,7 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
> > > > {
> > > > struct napi_alloc_cache *nc;
> > > > struct sk_buff *skb;
> > > > + bool pfmemalloc;
> > >
> > > Rather than adding this I think you would be better off adding a
> > > struct page_frag_cache pointer. I will reference it here as "pfc".
> > >
> > > > void *data;
> > > >
> > > > DEBUG_NET_WARN_ON_ONCE(!in_softirq());
> > > > @@ -568,8 +645,10 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
> > > >
> > > > /* If requested length is either too small or too big,
> > > > * we use kmalloc() for skb->head allocation.
> > > > + * When the small frag allocator is available, prefer it over kmalloc
> > > > + * for small fragments
> > > > */
> > > > - if (len <= SKB_WITH_OVERHEAD(1024) ||
> > > > + if ((!NAPI_HAS_SMALL_PAGE_FRAG && len <= SKB_WITH_OVERHEAD(1024)) ||
> > > > len > SKB_WITH_OVERHEAD(PAGE_SIZE) ||
> > > > (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) {
> > > > skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX | SKB_ALLOC_NAPI,
> > > > @@ -580,13 +659,30 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
> > > > }
> > > >
> > > > nc = this_cpu_ptr(&napi_alloc_cache);
> > > > - len += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> > > > - len = SKB_DATA_ALIGN(len);
> > > >
> > > > if (sk_memalloc_socks())
> > > > gfp_mask |= __GFP_MEMALLOC;
> > > >
> > > > - data = page_frag_alloc(&nc->page, len, gfp_mask);
> > > > + if (NAPI_HAS_SMALL_PAGE_FRAG && len <= SKB_WITH_OVERHEAD(1024)) {
> > >
> > > Then here you would add a line that would be:
> > > pfc = &nc->page_small;
> > >
> > > > + /* we are artificially inflating the allocation size, but
> > > > + * that is not as bad as it may look like, as:
> > > > + * - 'len' less then GRO_MAX_HEAD makes little sense
> > > > + * - larger 'len' values lead to fragment size above 512 bytes
> > > > + * as per NAPI_HAS_SMALL_PAGE_FRAG definition
> > > > + * - kmalloc would use the kmalloc-1k slab for such values
> > > > + */
> > > > + len = SZ_1K;
> > > > +
> > > > + data = page_frag_alloc_1k(&nc->page_small, gfp_mask);
> > > > + pfmemalloc = nc->page_small.pfmemalloc;
> > >
> > > Instead of setting pfmemalloc you could just update the line below. In
> > > addition you would just be passing pfc as the parameter.
> > >
> > > > + } else {
> > >
> > > Likewise here you would have the line:
> > > pfc = &nc->page;
> >
> > Probaly I was not clear in my previois email, sorry: before posting
> > this version I tried locally exactly all the above, and the generated
> > code with gcc 11.3.1 is a little bigger (a couple of instructions) than
> > what this version produces (with gcc 11.3.1-2). It has the same number
> > of conditionals and a slightly larger napi_alloc_cache.
> >
> > Additionally the suggested alternative needs more pre-processor
> > conditionals to handle the !NAPI_HAS_SMALL_PAGE_FRAG case - avoiding
> > adding a 2nd, unused in that case, page_frag_cache.
> >
> > [...]
>
> Why would that be? You should still be using the pointer to the
> page_frag_cache in the standard case. Like I was saying what you are
> doing is essentially replacing the use of napi_alloc_cache with the
> page_frag_cache, so for example with the existing setup all references
> to "nc->page" would become "pfc->" so there shouldn't be any extra
> unused variables in such a case since it would be used for both the
> frag allocation and the pfmemalloc check.
The problem is that regardless of the NAPI_HAS_SMALL_PAGE_FRAG value,
under the branch:
if (NAPI_HAS_SMALL_PAGE_FRAG && len <= SKB_WITH_OVERHEAD(1024)) {
gcc sees nc->page_small so we need page_small to exists and be 0 bytes
wide for !NAPI_HAS_SMALL_PAGE_FRAG.
> One alternate way that occured to me to handle this would be to look
> at possibly having napi_alloc_cache contain an array of
> page_frag_cache structures. With that approach you could just size the
> array and have it stick with a size of 1 in the case that small
> doesn't exist, and support a size of 2 if it does. You could define
> them via an enum so that the max would vary depending on if you add a
> small frag cache or not. With that you could just bump the pointer
> with a ++ so it goes from large to small and you wouldn't have any
> warnings about items not existing in your structures, and the code
> with the ++ could be kept from being called in the
> !NAPI_HAS_SMALL_PAGE_FRAG case.
Yes, the above works nicely from a 'no additional preprocessor
conditional perspective', and the generate code for __napi_alloc_skb()
is 3 bytes shorted then my variant - which boils down to nothing due to
alignment - but the most critical path (small allocations) requires
more instructions and the diff is IMHO less readable, touching all the
other nc->page users.
Allocations >1K should be a less relevant code path, as e.g. there is
still a ~ 32x truesize underestimation there...
> > > > @@ -596,7 +692,7 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
> > > > return NULL;
> > > > }
> > > >
> > > > - if (nc->page.pfmemalloc)
> > > > + if (pfmemalloc)
> > >
> > > Instead of passing pfmemalloc you could just check pfc->pfmemalloc.
> > > Alternatively I wonder if it wouldn't be faster to just set the value
> > > directly based on frag_cache->pfmemalloc and avoid the conditional
> > > heck entirely.
> >
> > Note that:
> >
> > skb->pfmemalloc = pfmemalloc;
> >
> > avoids a branch but requires more instructions than the current code
> > (verified with gcc). The gain looks doubtful?!? Additionally we have
> > statement alike:
> >
> > if (<pfmemalloc>)
> > skb->pfmemalloc = 1;
> >
> > in other functions in skbuff.c - still fast-path - and would be better
> > updating all the places together for consistency - if that is really
> > considered an improvement. IMHO it should at least land in a different
> > patch.
>
> We cannot really use "skb->pfmemalloc = pfmemalloc" because one is a
> bitfield and the other is a boolean value. I suspect the complexity
> would be greatly reduced if we converted the pfmemalloc to a bitfield
> similar to skb->pfmemalloc. It isn't important though. I was mostly
> just speculating on possible future optimizations.
>
> > I'll post a v3 with your updated email address, but I think the current
> > code is the better option.
>
> That's fine. Like I mentioned I am mostly just trying to think things
> out and identify any possible gaps we may have missed. I will keep an
> eye out for the next version.
Yep, let me go aheat with this...
Thanks,
Paolo
Powered by blists - more mailing lists