netdev - Re: [PATCH net-next v4] net: skb: introduce and use a single page frag cache

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CANn89iKXcagXP=QkYtrsk5jFXiyvhNL=bdXJZLc5T3e3=+jSiA@mail.gmail.com>
Date:   Mon, 3 Oct 2022 15:30:57 -0700
From:   Eric Dumazet <edumazet@...gle.com>
To:     Paolo Abeni <pabeni@...hat.com>
Cc:     patchwork-bot+netdevbpf@...nel.org,
        netdev <netdev@...r.kernel.org>,
        David Miller <davem@...emloft.net>,
        Jakub Kicinski <kuba@...nel.org>,
        Alexander Duyck <alexanderduyck@...com>
Subject: Re: [PATCH net-next v4] net: skb: introduce and use a single page
 frag cache

On Mon, Oct 3, 2022 at 2:40 PM Paolo Abeni <pabeni@...hat.com> wrote:
>
> On Sun, 2022-10-02 at 16:55 +0200, Paolo Abeni wrote:
> > On Fri, 2022-09-30 at 10:45 -0700, Eric Dumazet wrote:
> > > On Fri, Sep 30, 2022 at 10:30 AM Paolo Abeni <pabeni@...hat.com> wrote:
> > > > On Fri, 2022-09-30 at 09:43 -0700, Eric Dumazet wrote:
> > > > > Paolo, this patch adds a regression for TCP RPC workloads (aka TCP_RR)
> > > > >
> > > > > Before the patch, cpus servicing NIC interrupts were allocating
> > > > > SLAB/SLUB objects for incoming packets,
> > > > > but they were also freeing skbs from TCP rtx queues when ACK packets
> > > > > were processed. SLAB/SLUB caches
> > > > > were efficient (hit ratio close to 100%)
> > > >
> > > > Thank you for the report. Is that reproducible with netperf TCP_RR and
> > > > CONFIG_DEBUG_SLAB, I guess? Do I need specific request/response sizes?
> > >
> > > No CONFIG_DEBUG_SLAB, simply standard SLAB, and tcp_rr tests on an AMD
> > > host with 256 cpus...
> >
> > The most similar host I can easily grab is a 2 numa nodes AMD with 16
> > cores/32 threads each.
> >
> > I tried tcp_rr with different number of flows in (1-2048) range with
> > both slub (I tried first that because is the default allocator in my
> > builds) and slab but so far I can't reproduce it: results differences
> > between pre-patch and post-patch kernel are within noise and numastat
> > never show misses.
> >
> > I'm likely missing some incredient to the recipe. I'll try next to pin
> > the netperf/netserver processes on a numa node different from the NIC's
> > one and to increase the number of concurrent flows.
>
> I'm still stuck trying to reproduce the issue. I tried pinning and
> increasing the flows numbers, but I could not find a scenario with a
> clear regression. I see some contention, but it's related to the timer
> wheel spinlock, and independent from my patch.

I was using neper ( https://github.com/google/neper ), with 100
threads and 4000 flows.

I suspect contention is hard to see unless you use a host with at
least 128 cores.
Also you need a lot of NIC receive queues (or use RPS to spread
incoming packets to many cores)


>
> Which kind of delta should I observe? Could you please share any
> additional setup hints?
>
> > I'm also wondering, after commit 68822bdf76f10 ("net: generalize skb
> > freeing deferral to per-cpu lists") the CPUs servicing the NIC
> > interrupts both allocate and (defer) free the memory for the incoming
> > packets,  so they should not have to access remote caches ?!? Or at
> > least isn't the allocator behavior always asymmetric, with rx alloc, rx
> > free, tx free on the same core and tx alloc possibly on a different
> > one?
> >
> > >
> > > We could first try in tcp_stream_alloc_skb()
> >
> > I'll try to test something alike the following - after reproducing the
> > issue.
> > ---
> > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > index f15d5b62539b..d5e9be98e8bd 100644
> > --- a/include/linux/skbuff.h
> > +++ b/include/linux/skbuff.h
> > @@ -1308,6 +1308,30 @@ static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
> >       return __alloc_skb(size, priority, SKB_ALLOC_FCLONE, NUMA_NO_NODE);
> >  }
> >
> > +#if PAGE_SIZE == SZ_4K
> > +
> > +#define NAPI_HAS_SMALL_PAGE_FRAG     1
> > +
> > +struct sk_buff *__alloc_skb_fclone_frag(unsigned int size, gfp_t priority);
> > +
> > +static inline struct sk_buff *alloc_skb_fclone_frag(unsigned int size, gfp_t priority)
> > +{
> > +     if (size <= SKB_WITH_OVERHEAD(1024))
> > +             return __alloc_skb_fclone_frag(size, priority);
> > +
> > +     return alloc_skb_fclone(size, priority);
> > +}
> > +
> > +#else
> > +#define NAPI_HAS_SMALL_PAGE_FRAG     0
> > +
> > +static inline struct sk_buff *alloc_skb_fclone_frag(unsigned int size, gfp_t priority)
> > +{
> > +     return alloc_skb_fclone(size, priority);
> > +}
> > +
> > +#endif
> > +
> >  struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src);
> >  void skb_headers_offset_update(struct sk_buff *skb, int off);
> >  int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask);
> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index 81d63f95e865..0c63653c9951 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -134,9 +134,8 @@ static void skb_under_panic(struct sk_buff *skb, unsigned int sz, void *addr)
> >  #define NAPI_SKB_CACHE_BULK  16
> >  #define NAPI_SKB_CACHE_HALF  (NAPI_SKB_CACHE_SIZE / 2)
> >
> > -#if PAGE_SIZE == SZ_4K
> > +#if NAPI_HAS_SMALL_PAGE_FRAG
> >
> > -#define NAPI_HAS_SMALL_PAGE_FRAG     1
> >  #define NAPI_SMALL_PAGE_PFMEMALLOC(nc)       ((nc).pfmemalloc)
> >
> >  /* specialized page frag allocator using a single order 0 page
> > @@ -173,12 +172,12 @@ static void *page_frag_alloc_1k(struct page_frag_1k *nc, gfp_t gfp)
> >       nc->offset = offset;
> >       return nc->va + offset;
> >  }
> > +
> >  #else
> >
> >  /* the small page is actually unused in this build; add dummy helpers
> >   * to please the compiler and avoid later preprocessor's conditionals
> >   */
> > -#define NAPI_HAS_SMALL_PAGE_FRAG     0
> >  #define NAPI_SMALL_PAGE_PFMEMALLOC(nc)       false
> >
> >  struct page_frag_1k {
> > @@ -543,6 +542,52 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
> >  }
> >  EXPORT_SYMBOL(__alloc_skb);
> >
> > +#if NAPI_HAS_SMALL_PAGE_FRAG
> > +/* optimized skb fast-clone allocation variant, using the small
> > + * page frag cache
> > + */
> > +struct sk_buff *__alloc_skb_fclone_frag(unsigned int size, gfp_t gfp_mask)
> > +{
> > +     struct sk_buff_fclones *fclones;
> > +     struct napi_alloc_cache *nc;
> > +     struct sk_buff *skb;
> > +     bool pfmemalloc;
> > +     void *data;
> > +
> > +     /* Get the HEAD */
> > +     skb = kmem_cache_alloc_node(skbuff_fclone_cache, gfp_mask & ~GFP_DMA, NUMA_NO_NODE);
> > +     if (unlikely(!skb))
> > +             return NULL;
> > +     prefetchw(skb);
> > +
> > +     /* We can access the napi cache only in BH context */
> > +     nc = this_cpu_ptr(&napi_alloc_cache);
> > +     local_bh_disable();
>
> For the record, the above is wrong, this_cpu_ptr must be after the
> local_bh_disable() call.

Can you send me (privately maybe) an updated patch ?
Thanks.