Message-ID: <dcffcf6fde8272975e44124f55fba3936833360e.camel@gmail.com>
Date:   Wed, 07 Sep 2022 14:36:07 -0700
From:   Alexander H Duyck <alexander.duyck@...il.com>
To:     Paolo Abeni <pabeni@...hat.com>,
        Eric Dumazet <eric.dumazet@...il.com>,
        "David S . Miller" <davem@...emloft.net>,
        Jakub Kicinski <kuba@...nel.org>
Cc:     netdev <netdev@...r.kernel.org>,
        Eric Dumazet <edumazet@...gle.com>,
        Alexander Duyck <alexanderduyck@...com>,
        "Michael S . Tsirkin" <mst@...hat.com>,
        Greg Thelen <gthelen@...gle.com>
Subject: Re: [PATCH net] net: avoid 32 x truesize under-estimation for tiny
 skbs

On Wed, 2022-09-07 at 22:19 +0200, Paolo Abeni wrote:
> Hello,
> 
> reviving an old thread...
> On Wed, 2021-01-13 at 08:18 -0800, Eric Dumazet wrote:
> > While using page fragments instead of a kmalloc-backed skb->head might give
> > a small performance improvement in some cases, there is a huge risk of
> > underestimating memory usage.
> 
> [...]
> 
> > Note that we might in the future use the sk_buff napi cache,
> > instead of going through a more expensive __alloc_skb()
> > 
> > Another idea would be to use separate page sizes depending
> > on the allocated length (to never have more than 4 frags per page)
> 
> I'm investigating a couple of performance regressions pointing to this
> change, and I'd like to give the 2nd suggestion above a try.
> 
> If I read correctly, it means:
> - extend the page_frag_cache alloc API to allow forcing max order==0
> - add a 2nd page_frag_cache into napi_alloc_cache (say page_order0 or
> page_small)
> - in __napi_alloc_skb(), when len <= SKB_WITH_OVERHEAD(1024), use the
> page_small cache with order 0 allocation.
> (all the above constrained to hosts with 4K pages)
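
For reference, my reading of that outline maps onto something like the
rough, completely untested sketch below. The page_small field and the
page_frag_alloc_order0() helper don't exist today and are just made-up
names; only the existing page_frag_cache / page_frag_alloc() machinery
is real:

struct napi_alloc_cache {
        struct page_frag_cache page;            /* existing, can be order-3 */
        struct page_frag_cache page_small;      /* new: only ever order-0 */
        /* ... existing skb cache members ... */
};

/* in __napi_alloc_skb(), with nc = this_cpu_ptr(&napi_alloc_cache): */
        if (len <= SKB_WITH_OVERHEAD(1024)) {
                len = SKB_DATA_ALIGN(NET_SKB_PAD + len) +
                      SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
                /* hypothetical order-0-only variant of page_frag_alloc() */
                data = page_frag_alloc_order0(&nc->page_small, len, gfp_mask);
        } else {
                /* existing path via nc->page, possibly order-3 refills */
        }
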
> 
> I'm not quite sure about the "never have more than 4 frags per page"
> part.
> 
> What is outlined above will allow for 10 min-size frags in page_order0, as
> SKB_DATA_ALIGN(0) + SKB_DATA_ALIGN(struct skb_shared_info) == 384. I'm
> not sure that anything will actually allocate such small frags.
> With a more reasonable GRO_MAX_HEAD, there will be 6 frags per page. 

That doesn't account for any headroom though. Most of the time you have
to reserve some headroom so that, if this buffer ends up getting routed
off somewhere to be tunneled, there is room to prepend to the header. I
think the default ends up being NET_SKB_PAD, though many NICs use larger
values. So adding any data on top of that will push you up to a minimum
of 512 bytes per skb for the first 64B of header data.

With that said, it would probably put you in the range of 8 or fewer
skbs per page, assuming at least 1 byte of data:
  512 =	SKB_DATA_ALIGN(NET_SKB_PAD + 1) +
	SKB_DATA_ALIGN(struct skb_shared_info)
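
Purely as a back-of-the-envelope illustration (small_skb_footprint() is
not an existing helper, and the exact numbers depend on NET_SKB_PAD and
sizeof(struct skb_shared_info) for the arch/config):

/* per-skb footprint of one small fragment: headroom + data + shinfo */
static inline unsigned int small_skb_footprint(unsigned int headroom,
                                               unsigned int len)
{
        return SKB_DATA_ALIGN(headroom + len) +
               SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
}

/* slots per page ~= PAGE_SIZE / small_skb_footprint(NET_SKB_PAD, 1);
 * with the ~512B figure above that is 4096 / 512 = 8 skbs per 4K page. */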

> The maximum truesize underestimation in both cases will be lower than
> what we can get with the current code in the worst case (almost 32x
> AFAICS). 
> 
> Is the above scheme safe enough, or should the requested size be
> artificially inflated to fit at most 4 allocations per page_order0?
> Am I missing something else? Apart from omitting a good deal of testing
> in the above list ;)

If we are working with an order 0 page we may just want to split it up
into fixed 1K fragments and not bother with a variable pagecnt_bias.
Doing that, we would likely simplify this quite a bit and avoid having
to do as much page count manipulation, which could get expensive if we
are not getting many uses out of the page. An added advantage is that
we can get rid of the pagecnt_bias entirely and just work based on the
page offset.
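
Something along these lines is what I have in mind (completely untested,
and the napi_small_cache / napi_small_alloc names are made up): one
order-0 page gets pre-split into PAGE_SIZE / 1024 slots, each slot
carries its own page reference, and the cache only has to track an
offset, no pagecnt_bias:

#define SMALL_FRAG_SZ   1024

struct napi_small_cache {
        struct page     *page;
        unsigned int    offset;         /* next free 1K slot in the page */
};

static void *napi_small_alloc(struct napi_small_cache *nc, gfp_t gfp)
{
        void *frag;

        if (unlikely(!nc->page)) {
                nc->page = alloc_page(gfp);
                if (!nc->page)
                        return NULL;
                /* one reference per 1K slot; alloc_page() gave us one */
                page_ref_add(nc->page, PAGE_SIZE / SMALL_FRAG_SZ - 1);
                nc->offset = 0;
        }

        frag = page_address(nc->page) + nc->offset;
        nc->offset += SMALL_FRAG_SZ;
        if (nc->offset >= PAGE_SIZE)
                nc->page = NULL;        /* all slots (and refs) handed out */

        return frag;
}

Each skb then owns exactly one page reference for its slot and drops it
through the normal put_page() path when the head is freed, so there is
no deferred refcount fix-up to do when the cache moves on to a new page.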

As such, I am not sure the page frag cache would really be that good of
a fit here. It carries quite a bit of overhead in maintaining the
pagecnt_bias, which assumes the page is longer lived so that the ratio
of refcount updates to pagecnt_bias updates works out in our favor.
