netdev - Re: [PATCH net] net: avoid 32 x truesize under-estimation for tiny skbs


Message-ID: <0ad4ba2bc157a2d1fa8a898056bea431fc244122.camel@redhat.com>
Date:   Thu, 08 Sep 2022 16:26:13 +0200
From:   Paolo Abeni <pabeni@...hat.com>
To:     Eric Dumazet <edumazet@...gle.com>
Cc:     Eric Dumazet <eric.dumazet@...il.com>,
        "David S . Miller" <davem@...emloft.net>,
        Jakub Kicinski <kuba@...nel.org>,
        netdev <netdev@...r.kernel.org>,
        Alexander Duyck <alexanderduyck@...com>,
        "Michael S . Tsirkin" <mst@...hat.com>,
        Greg Thelen <gthelen@...gle.com>
Subject: Re: [PATCH net] net: avoid 32 x truesize under-estimation for tiny
 skbs

On Thu, 2022-09-08 at 05:20 -0700, Eric Dumazet wrote:
> On Thu, Sep 8, 2022 at 3:48 AM Paolo Abeni <pabeni@...hat.com> wrote:
> > On Wed, 2022-09-07 at 13:40 -0700, Eric Dumazet wrote:
> > > On 9/7/22 13:19, Paolo Abeni wrote:
> > > > reviving an old thread...
> > > > On Wed, 2021-01-13 at 08:18 -0800, Eric Dumazet wrote:
> > > > > While using page fragments instead of a kmalloc backed skb->head might give
> > > > > a small performance improvement in some cases, there is a huge risk of
> > > > > under estimating memory usage.
> > > > [...]
> > > > 
> > > > > Note that we might in the future use the sk_buff napi cache,
> > > > > instead of going through a more expensive __alloc_skb()
> > > > > 
> > > > > Another idea would be to use separate page sizes depending
> > > > > on the allocated length (to never have more than 4 frags per page)
> > > > I'm investigating a couple of performance regressions pointing to this
> > > > change and I'd like to have a try to the 2nd suggestion above.
> > > > 
> > > > If I read correctly, it means:
> > > > - extend the page_frag_cache alloc API to allow forcing max order==0
> > > > - add a 2nd page_frag_cache into napi_alloc_cache (say page_order0 or
> > > > page_small)
> > > > - in __napi_alloc_skb(), when len <= SKB_WITH_OVERHEAD(1024), use the
> > > > page_small cache with order 0 allocation.
> > > > (all the above constrained to host with 4K pages)
> > > > 
> > > > I'm not quite sure about the "never have more than 4 frags per page"
> > > > part.
> > > > 
> > > > What outlined above will allow for 10 min size frags in page_order0, as
> > > > (SKB_DATA_ALIGN(0) + SKB_DATA_ALIGN(struct skb_shared_info) == 384. I'm
> > > > not sure that anything will allocate such small frags.
> > > > With a more reasonable GRO_MAX_HEAD, there will be 6 frags per page.
> > > 
> > > Well, some arches have PAGE_SIZE=65536 :/
> > 
> > Yes, the idea is to implement all the above only for arches with
> > PAGE_SIZE==4K. Would that be reasonable?
> 
> Well, we also have changed MAX_SKB_FRAGS from 17 to 45 for BIG TCP.
> 
> And locally we have
> 
> #define GRO_MAX_HEAD 192

So the default allocation size for napi_get_frags() is ~960 bytes in the
Google kernel, right? That looks like it should fit the above quite
nicely, with 4 frags per page.
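
The frags-per-page figures in this thread can be checked with a small
sketch. It assumes 4K pages and fragments carved back to back with no
extra padding; frags_per_page() and SKETCH_PAGE_SIZE are hypothetical
names used only for illustration, not kernel symbols.

```c
#include <assert.h>

/* Assumption: 4K pages, as in the "host with 4K pages" constraint. */
#define SKETCH_PAGE_SIZE 4096u

/*
 * Number of fragments of frag_size bytes that fit in one page when
 * they are carved back to back. Hypothetical helper, not a kernel
 * function.
 */
static unsigned int frags_per_page(unsigned int page_size,
				   unsigned int frag_size)
{
	assert(frag_size > 0 && frag_size <= page_size);
	return page_size / frag_size;
}
```

Under these assumptions, frags_per_page(SKETCH_PAGE_SIZE, 960) gives 4,
and the 384-byte minimum quoted above gives 10.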

The vanilla kernel may hit a larger number of fragments per page, though
very likely not as many as the theoretical maximum mentioned in my
previous email (as noted by Alex).

If excessive truesize underestimation is still problematic in that case
(with an order-0 4K page), __napi_alloc_skb() could be patched to round
smaller sizes up to some reasonable minimum.
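
A rough sketch of that idea follows. The 1024-byte floor is an assumed
value chosen only to illustrate the "4 frags per page" target; the
thread does not name a concrete minimum, and clamp_frag_size() and
worst_case_pin_ratio() are hypothetical helpers, not kernel code. The
ratio assumes the worst case, where one live fragment keeps the whole
page allocated.

```c
#include <assert.h>

/* Hypothetical floor; the thread does not name a concrete value. */
#define SKETCH_MIN_FRAG 1024u

/* Round a requested fragment size up to the floor. */
static unsigned int clamp_frag_size(unsigned int len, unsigned int min_len)
{
	return len < min_len ? min_len : len;
}

/*
 * Worst-case ratio between the memory pinned by a page and the size
 * accounted for a single fragment, assuming one live fragment keeps
 * the whole page allocated.
 */
static unsigned int worst_case_pin_ratio(unsigned int page_size,
					 unsigned int frag_size)
{
	assert(frag_size > 0 && frag_size <= page_size);
	return page_size / frag_size;
}
```

With a 4096-byte page, a 192-byte request clamped this way yields a
ratio of 4 instead of 21. For comparison, a 32768-byte page with
1024-byte fragments yields 32, presumably the factor named in the
subject line.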

I have likely missed some point in your reply. Luckily LPC is
coming :)

> Reference:
> 
> commit fd9ea57f4e9514f9d0f0dec505eefd99a8faa148
> Author: Eric Dumazet <edumazet@...gle.com>
> Date:   Wed Jun 8 09:04:38 2022 -0700
> 
>     net: add napi_get_frags_check() helper

I guess that check should be revisited in light of all the above.

Thanks,

Paolo