netdev - Re: [PATCH net-next v2] net: skb: introduce and use a single page frag cache

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAKgT0Ud+U4NO+adYeUdegVbbS2EMaqTg2B-a0Z8Q2n9H7MaePg@mail.gmail.com>
Date:   Fri, 23 Sep 2022 08:22:43 -0700
From:   Alexander Duyck <alexander.duyck@...il.com>
To:     Paolo Abeni <pabeni@...hat.com>
Cc:     netdev@...r.kernel.org, Eric Dumazet <edumazet@...gle.com>,
        "David S. Miller" <davem@...emloft.net>,
        Jakub Kicinski <kuba@...nel.org>
Subject: Re: [PATCH net-next v2] net: skb: introduce and use a single page
 frag cache

On Fri, Sep 23, 2022 at 12:41 AM Paolo Abeni <pabeni@...hat.com> wrote:
>
> Hello,
>
> On Thu, 2022-09-22 at 14:17 -0700, Alexander Duyck wrote:
> [...]
> > My suggestion earlier was to just make the 1k cache a page_frag_cache.
> > It will allow you to reuse the same structure members and a single
> > pointer to track them. Waste would be minimal since the only real
> > difference between the structures is about 8B for the structure, and
> > odds are the napi_alloc_cache allocation is being rounded up anyway.
> >
> > >         unsigned int skb_count;
> > >         void *skb_cache[NAPI_SKB_CACHE_SIZE];
> > >  };
> > > @@ -143,6 +202,23 @@ struct napi_alloc_cache {
> > >  static DEFINE_PER_CPU(struct page_frag_cache, netdev_alloc_cache);
> > >  static DEFINE_PER_CPU(struct napi_alloc_cache, napi_alloc_cache);
> > >
> > > +/* Double check that napi_get_frags() allocates skbs with
> > > + * skb->head being backed by slab, not a page fragment.
> > > + * This is to make sure bug fixed in 3226b158e67c
> > > + * ("net: avoid 32 x truesize under-estimation for tiny skbs")
> > > + * does not accidentally come back.
> > > + */
> > > +void napi_get_frags_check(struct napi_struct *napi)
> > > +{
> > > +       struct sk_buff *skb;
> > > +
> > > +       local_bh_disable();
> > > +       skb = napi_get_frags(napi);
> > > +       WARN_ON_ONCE(!NAPI_HAS_SMALL_PAGE_FRAG && skb && skb->head_frag);
> > > +       napi_free_frags(napi);
> > > +       local_bh_enable();
> > > +}
> > > +
> > >  void *__napi_alloc_frag_align(unsigned int fragsz, unsigned int align_mask)
> > >  {
> > >         struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
> > > @@ -561,6 +637,7 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
> > >  {
> > >         struct napi_alloc_cache *nc;
> > >         struct sk_buff *skb;
> > > +       bool pfmemalloc;
> >
> > Rather than adding this I think you would be better off adding a
> > struct page_frag_cache pointer. I will reference it here as "pfc".
> >
> > >         void *data;
> > >
> > >         DEBUG_NET_WARN_ON_ONCE(!in_softirq());
> > > @@ -568,8 +645,10 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
> > >
> > >         /* If requested length is either too small or too big,
> > >          * we use kmalloc() for skb->head allocation.
> > > +        * When the small frag allocator is available, prefer it over kmalloc
> > > +        * for small fragments
> > >          */
> > > -       if (len <= SKB_WITH_OVERHEAD(1024) ||
> > > +       if ((!NAPI_HAS_SMALL_PAGE_FRAG && len <= SKB_WITH_OVERHEAD(1024)) ||
> > >             len > SKB_WITH_OVERHEAD(PAGE_SIZE) ||
> > >             (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) {
> > >                 skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX | SKB_ALLOC_NAPI,
> > > @@ -580,13 +659,30 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
> > >         }
> > >
> > >         nc = this_cpu_ptr(&napi_alloc_cache);
> > > -       len += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> > > -       len = SKB_DATA_ALIGN(len);
> > >
> > >         if (sk_memalloc_socks())
> > >                 gfp_mask |= __GFP_MEMALLOC;
> > >
> > > -       data = page_frag_alloc(&nc->page, len, gfp_mask);
> > > +       if (NAPI_HAS_SMALL_PAGE_FRAG && len <= SKB_WITH_OVERHEAD(1024)) {
> >
> > Then here you would add a line that would be:
> > pfc = &nc->page_small;
> >
> > > +               /* we are artificially inflating the allocation size, but
> > > +                * that is not as bad as it may look like, as:
> > > +                * - 'len' less then GRO_MAX_HEAD makes little sense
> > > +                * - larger 'len' values lead to fragment size above 512 bytes
> > > +                *   as per NAPI_HAS_SMALL_PAGE_FRAG definition
> > > +                * - kmalloc would use the kmalloc-1k slab for such values
> > > +                */
> > > +               len = SZ_1K;
> > > +
> > > +               data = page_frag_alloc_1k(&nc->page_small, gfp_mask);
> > > +               pfmemalloc = nc->page_small.pfmemalloc;
> >
> > Instead of setting pfmemalloc you could just update the line below. In
> > addition you would just be passing pfc as the parameter.
> >
> > > +       } else {
> >
> > Likewise here you would have the line:
> > pfc = &nc->page;
>
> Probaly I was not clear in my previois email, sorry: before posting
> this version I tried locally exactly all the above, and the generated
> code with gcc 11.3.1 is a little bigger (a couple of instructions) than
> what this version produces (with gcc 11.3.1-2). It has the same number
> of conditionals and a slightly larger napi_alloc_cache.
>
> Additionally the suggested alternative needs more pre-processor
> conditionals to handle the !NAPI_HAS_SMALL_PAGE_FRAG case - avoiding
> adding a 2nd, unused in that case, page_frag_cache.
>
> [...]

Why would that be? You should still be using the pointer to the
page_frag_cache in the standard case. Like I was saying what you are
doing is essentially replacing the use of napi_alloc_cache with the
page_frag_cache, so for example with the existing setup all references
to "nc->page" would become "pfc->" so there shouldn't be any extra
unused variables in such a case since it would be used for both the
frag allocation and the pfmemalloc check.

One alternate way that occured to me to handle this would be to look
at possibly having napi_alloc_cache contain an array of
page_frag_cache structures. With that approach you could just size the
array and have it stick with a size of 1 in the case that small
doesn't exist, and support a size of 2 if it does. You could define
them via an enum so that the max would vary depending on if you add a
small frag cache or not. With that you could just bump the pointer
with a ++ so it goes from large to small and you wouldn't have any
warnings about items not existing in your structures, and the code
with the ++ could be kept from being called in the
!NAPI_HAS_SMALL_PAGE_FRAG case.

> >
> > > @@ -596,7 +692,7 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
> > >                 return NULL;
> > >         }
> > >
> > > -       if (nc->page.pfmemalloc)
> > > +       if (pfmemalloc)
> >
> > Instead of passing pfmemalloc you could just check pfc->pfmemalloc.
> > Alternatively I wonder if it wouldn't be faster to just set the value
> > directly based on frag_cache->pfmemalloc and avoid the conditional
> > heck entirely.
>
> Note that:
>
>         skb->pfmemalloc = pfmemalloc;
>
> avoids a branch but requires more instructions than the current code
> (verified with gcc). The gain looks doubtful?!? Additionally we have
> statement alike:
>
>         if (<pfmemalloc>)
>                 skb->pfmemalloc = 1;
>
> in other functions in skbuff.c - still fast-path - and would be better
> updating all the places together for consistency - if that is really
> considered an improvement. IMHO it should at least land in a different
> patch.

We cannot really use "skb->pfmemalloc = pfmemalloc" because one is a
bitfield and the other is a boolean value. I suspect the complexity
would be greatly reduced if we converted the pfmemalloc to a bitfield
similar to skb->pfmemalloc. It isn't important though. I was mostly
just speculating on possible future optimizations.

> I'll post a v3 with your updated email address, but I think the current
> code is the better option.

That's fine. Like I mentioned I am mostly just trying to think things
out and identify any possible gaps we may have missed. I will keep an
eye out for the next version.