Message-ID: <3087f631-6342-49de-a5ae-ceb24b3a51b4@gmail.com>
Date: Mon, 18 Mar 2024 23:12:04 +0000
From: Pavel Begunkov <asml.silence@...il.com>
To: Eric Dumazet <edumazet@...gle.com>
Cc: netdev@...r.kernel.org, davem@...emloft.net, dsahern@...nel.org,
pabeni@...hat.com, kuba@...nel.org
Subject: Re: [PATCH net-next v2] net: cache for same cpu
skb_attempt_defer_free
On 3/18/24 14:33, Eric Dumazet wrote:
> On Mon, Mar 18, 2024 at 12:43 PM Pavel Begunkov <asml.silence@...il.com> wrote:
>>
>> On 3/18/24 10:11, Eric Dumazet wrote:
>>> On Mon, Mar 18, 2024 at 1:46 AM Pavel Begunkov <asml.silence@...il.com> wrote:
>>>>
>>>> Optimise skb_attempt_defer_free() when run by the same CPU the skb was
>>>> allocated on. Instead of __kfree_skb() -> kmem_cache_free() we can
>>>> disable softirqs and put the buffer into cpu local caches.
>>>>
>>>> CPU bound TCP ping pong style benchmarking (i.e. netbench) showed a 1%
>>>> throughput increase (392.2 -> 396.4 Krps). Cross checking with profiles,
>>>> the total CPU share of skb_attempt_defer_free() dropped by 0.6%. Note,
>>>> I'd expect the win to double with rx-only benchmarks, as the optimisation
>>>> is for the receive path, but this test spends >55% of CPU doing writes.
>>>>
>>>> Signed-off-by: Pavel Begunkov <asml.silence@...il.com>
>>>> ---
>>>>
>>>> v2: pass @napi_safe=true by using __napi_kfree_skb()
>>>>
>>>> net/core/skbuff.c | 15 ++++++++++++++-
>>>> 1 file changed, 14 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>>>> index b99127712e67..35d37ae70a3d 100644
>>>> --- a/net/core/skbuff.c
>>>> +++ b/net/core/skbuff.c
>>>> @@ -6995,6 +6995,19 @@ void __skb_ext_put(struct skb_ext *ext)
>>>> EXPORT_SYMBOL(__skb_ext_put);
>>>> #endif /* CONFIG_SKB_EXTENSIONS */
>>>>
>>>> +static void kfree_skb_napi_cache(struct sk_buff *skb)
>>>> +{
>>>> +        /* if SKB is a clone, don't handle this case */
>>>> +        if (skb->fclone != SKB_FCLONE_UNAVAILABLE) {
>>>> +                __kfree_skb(skb);
>>>> +                return;
>>>> +        }
>>>> +
>>>> +        local_bh_disable();
>>>> +        __napi_kfree_skb(skb, SKB_DROP_REASON_NOT_SPECIFIED);
>>>> +        local_bh_enable();
>>>> +}
>>>> +
>>>> /**
>>>> * skb_attempt_defer_free - queue skb for remote freeing
>>>> * @skb: buffer
>>>> @@ -7013,7 +7026,7 @@ void skb_attempt_defer_free(struct sk_buff *skb)
>>>>          if (WARN_ON_ONCE(cpu >= nr_cpu_ids) ||
>>>>              !cpu_online(cpu) ||
>>>>              cpu == raw_smp_processor_id()) {
>>>> -nodefer:        __kfree_skb(skb);
>>>> +nodefer:        kfree_skb_napi_cache(skb);
>>>>                  return;
>>>>          }
>>>>
>>>> --
>>>> 2.44.0
>>>>
>>>
>>> 1) net-next is currently closed.
>>
>> Ok
>>
>>> 2) No NUMA awareness. SLUB does not guarantee the sk_buff was on the
>>> correct node.
>>
>> Let me see if I read you right. You're saying that SLUB can
>> allocate an skb from a different node, so skb->alloc_cpu might
>> not be indicative of the node, and thus we might locally cache
>> an skb from a foreign NUMA node?
>>
>> If that's the case, I don't see how it differs from the
>> cpu != raw_smp_processor_id() path, which transfers the skb
>> to another CPU and still puts it into that CPU's local cache
>> in softirq.
>
> The big win for skb_attempt_defer_free() is for the many pages that are attached
> to TCP incoming GRO packets.
>
> A typical GRO packet can have 45 page fragments.

That's great, and probably the dominating optimisation for GRO,
but it's a path all TCP reads go through, whether the skbs are
large GRO'ed ones or not. Even with a single frag, the patch at
least enables the skb cache and gives the frag a chance to hit
the page pool's lockless cache.
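
For reference, the helper the patch calls boils down to roughly
the following (abbreviated sketch from memory, details may
differ slightly from the tree):

void __napi_kfree_skb(struct sk_buff *skb, enum skb_drop_reason reason)
{
        /* @napi_safe=true: frags get a shot at the page pool's
         * lockless (direct) recycling path
         */
        skb_release_all(skb, reason, true);
        /* the sk_buff itself goes into the per-cpu napi cache
         * instead of kmem_cache_free()
         */
        napi_skb_cache_put(skb);
}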
> Pages are not recycled by a NIC if the NUMA node does not match.

Good to know; looking it up, it seems that's only true for pages
going through the page pool's ptr_ring, not for pages recycled
directly.
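
To spell out what I mean, the node check sits on the
refill-from-ring path. Condensed into a hypothetical helper
(names and structure simplified, not the exact page_pool.c code):

static struct page *pp_refill_one(struct page_pool *pool)
{
        int pref_nid = pool->p.nid == NUMA_NO_NODE ? numa_mem_id()
                                                   : pool->p.nid;
        struct page *page;

        while ((page = ptr_ring_consume_bh(&pool->ring))) {
                if (likely(page_to_nid(page) == pref_nid))
                        return page;    /* node matches, reuse it */
                /* foreign node, release back to the page allocator */
                page_pool_return_page(pool, page);
        }
        return NULL;
}

The direct recycle path into the lockless cache doesn't look at
the node at all, as far as I can tell.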
> If the win was only for sk_buff itself, I think we should have asked
> SLUB maintainers
> why SLUB needs another cache to optimize SLUB fast cache !
Well, SLUB is just slow for hot paths; not awfully slow, but
enough to be noticeable and cost 1-2% per allocation per request.
There are caches in io_uring because it was slow, there are
caches in the block layer, and there are skb and other caches in
net, all for the same reason I assume. It's not like I'm adding
a new kind of cache.

I remember someone poking into similar questions, but I haven't
seen any drastic SLUB performance changes.

Eric, I'm honestly getting lost in the arguments and don't
really see what you're ultimately against. Summing up the topics:

1) Is there a NUMA awareness problem in the patch that I missed?
If so, can you elaborate? I'll try to address it; a possible
guard is sketched below, after the questions. And is it a
regression compared to the current state, or something that
would merely be nice to have?

2) Do you trust the numbers for the test described? I.e. the
rx server has multiple connections, each doing small-packet
(<100B) ping-pong style traffic; it's pinned to a CPU, and all
completions are ensured to run on that CPU.

3) Do you agree that it's a valid use case? Or are you shrugging
it off as something that nobody cares about?

4) Or maybe you're against the design?
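
Re 1), in case the concern is locally caching a foreign-node
skb, here is an untested sketch of a possible guard on top of
the patch; the node test is my guess at what you're after:

static void kfree_skb_napi_cache(struct sk_buff *skb)
{
        /* don't cache clones, nor skbs whose struct sits on a
         * remote node (hypothetical NUMA guard)
         */
        if (skb->fclone != SKB_FCLONE_UNAVAILABLE ||
            page_to_nid(virt_to_page(skb)) != numa_node_id()) {
                __kfree_skb(skb);
                return;
        }

        local_bh_disable();
        __napi_kfree_skb(skb, SKB_DROP_REASON_NOT_SPECIFIED);
        local_bh_enable();
}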
>>> 3) Given that many skbs (like TCP ACK) are freed using __kfree_skb(), I wonder
>>> why trying to cache the sk_buff in this particular path is needed.
>>>
>>> Why not change __kfree_skb() instead ?
>>
>> IIRC kfree_skb() can be called from any context, including with
>> IRQs off; it's convenient to have a function that just does the
>> job without too much extra care. Theoretically it could take a
>> separate path internally based on irqs_disabled(), but that
>> would be ugly.
>
> Well, adding one case here, another here, and another here is also ugly.
And FWIW, irqs_disabled() is not great for hot paths. Not to the
extent of local_irq_disable(), but IIRC it still disturbs the
pipeline a bit.
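
Just to illustrate why I called it ugly, the __kfree_skb()
variant would look something like this (hypothetical sketch, not
a proposal):

void __kfree_skb(struct sk_buff *skb)
{
        /* context check on every free, on top of the fclone check */
        if (!irqs_disabled() && !in_hardirq() &&
            skb->fclone == SKB_FCLONE_UNAVAILABLE) {
                local_bh_disable();
                __napi_kfree_skb(skb, SKB_DROP_REASON_NOT_SPECIFIED);
                local_bh_enable();
                return;
        }

        skb_release_all(skb, SKB_DROP_REASON_NOT_SPECIFIED, false);
        kfree_skbmem(skb);
}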
--
Pavel Begunkov