Message-ID: <AABBC143-F665-44F8-8C5F-79805429A53E@nutanix.com>
Date: Wed, 3 Dec 2025 04:34:39 +0000
From: Jon Kohler <jon@...anix.com>
To: Jason Wang <jasowang@...hat.com>
CC: Jesper Dangaard Brouer <hawk@...nel.org>,
	"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
	Willem de Bruijn <willemdebruijn.kernel@...il.com>,
	Andrew Lunn <andrew+netdev@...n.ch>,
	"David S. Miller" <davem@...emloft.net>,
	Eric Dumazet <edumazet@...gle.com>,
	Jakub Kicinski <kuba@...nel.org>,
	Paolo Abeni <pabeni@...hat.com>,
	Alexei Starovoitov <ast@...nel.org>,
	Daniel Borkmann <daniel@...earbox.net>,
	John Fastabend <john.fastabend@...il.com>,
	Stanislav Fomichev <sdf@...ichev.me>,
	open list <linux-kernel@...r.kernel.org>,
	"open list:XDP (eXpress Data Path):Keyword:(?:b|_)xdp(?:b|_)" <bpf@...r.kernel.org>,
	Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
	Alexander Lobakin <aleksander.lobakin@...el.com>
Subject: Re: [PATCH net-next v2 5/9] tun: use bulk NAPI cache allocation in
	tun_xdp_one
> On Dec 2, 2025, at 11:10 PM, Jason Wang <jasowang@...hat.com> wrote:
>
> On Wed, Dec 3, 2025 at 1:46 AM Jon Kohler <jon@...anix.com> wrote:
>>
>>
>>
>>> On Dec 2, 2025, at 12:32 PM, Jesper Dangaard Brouer <hawk@...nel.org> wrote:
>>>
>>>
>>>
>>> On 02/12/2025 17.49, Jon Kohler wrote:
>>>>> On Nov 27, 2025, at 10:02 PM, Jason Wang <jasowang@...hat.com> wrote:
>>>>>
>>>>> On Wed, Nov 26, 2025 at 3:19 AM Jon Kohler <jon@...anix.com> wrote:
>>>>>>
>>>>>> Optimize TUN_MSG_PTR batch processing by allocating sk_buff structures
>>>>>> in bulk from the per-CPU NAPI cache using napi_skb_cache_get_bulk.
>>>>>> This reduces allocation overhead and improves efficiency, especially
>>>>>> when IFF_NAPI is enabled and GRO is feeding entries back to the cache.
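For reference, the helper being used here lives in net/core/skbuff.c. If
I'm reading it right, its contract is roughly the below (a sketch, not a
verbatim copy of the kernel source):

	/* Fill skbs[] with up to n sk_buff pointers taken from the per-CPU
	 * NAPI cache, topping the cache up from the slab allocator as
	 * needed; returns how many pointers were actually obtained, which
	 * may be fewer than n (0 on complete failure).
	 */
	u32 napi_skb_cache_get_bulk(void **skbs, u32 n);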
>>>>>
>>>>> Does this mean we should only enable this when NAPI is used?
>>>> No, it does not mean that at all, but I can see how that would be confusing.
>>>> I can clean up the commit msg on the next go-around.
>>>>>>
>>>>>> If bulk allocation cannot fully satisfy the batch, gracefully drop only
>>>>>> the uncovered portion, allowing the rest of the batch to proceed, which
>>>>>> is what already happens in the previous case where build_skb() would
>>>>>> fail and return -ENOMEM.
>>>>>>
>>>>>> Signed-off-by: Jon Kohler <jon@...anix.com>
>>>>>
>>>>> Do we have any benchmark result for this?
>>>> Yes, it is in the cover letter:
>>>> https://patchwork.kernel.org/project/netdevbpf/cover/20251125200041.1565663-1-jon@nutanix.com/
>
> OK, but it only covers UDP. I think we want to see how it performs for
> TCP, as well as latency. Btw, is the test with IFF_NAPI or not?
This test was without IFF_NAPI, but I could get the NAPI numbers too.
More on that below.
>
>>>>>> ---
>>>>>> drivers/net/tun.c | 30 ++++++++++++++++++++++++------
>>>>>> 1 file changed, 24 insertions(+), 6 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>>>>>> index 97f130bc5fed..64f944cce517 100644
>>>>>> --- a/drivers/net/tun.c
>>>>>> +++ b/drivers/net/tun.c
>>> [...]
>>>>>> @@ -2454,6 +2455,7 @@ static int tun_xdp_one(struct tun_struct *tun,
>>>>>> ret = tun_xdp_act(tun, xdp_prog, xdp, act);
>>>>>> if (ret < 0) {
>>>>>> /* tun_xdp_act already handles drop statistics */
>>>>>> + kfree_skb_reason(skb, SKB_DROP_REASON_XDP);
>>>>>
>>>>> This should belong to previous patches?
>>>> Well, not really, as we did not even have an SKB to free at this point
>>>> in the previous code
>>>>>
>>>>>> put_page(virt_to_head_page(xdp->data));
>>>
>>> Calling put_page() directly here also looks dubious.
>>>
>>>>>> return ret;
>>>>>> }
>>>>>> @@ -2463,6 +2465,7 @@ static int tun_xdp_one(struct tun_struct *tun,
>>>>>> *flush = true;
>>>>>> fallthrough;
>>>>>> case XDP_TX:
>>>>>> + napi_consume_skb(skb, 1);
>>>>>> return 0;
>>>>>> case XDP_PASS:
>>>>>> break;
>>>>>> @@ -2475,13 +2478,15 @@ static int tun_xdp_one(struct tun_struct *tun,
>>>>>> tpage->page = page;
>>>>>> tpage->count = 1;
>>>>>> }
>>>>>> + napi_consume_skb(skb, 1);
>>>>>
>>>>> I wonder if this would have any side effects since tun_xdp_one() is
>>>>> not called by a NAPI.
>>>> As far as I can tell, the "napi" in napi_consume_skb is really just an
>>>> artifact of how it was named and how it was traditionally used.
>>>> Here it is simply a napi_consume_skb call within a bh disable/enable
>>>> section, which should meet the requirements for how that interface
>>>> should be used (again, AFAICT).
>>>
>>> Yikes - this sounds super ugly. Just wrapping napi_consume_skb() in a bh
>>> disable/enable section and then assuming you get the same protection as
>>> NAPI is really dubious.
>>>
>>> Cc Sebastian as he is trying to clean up these kinds of use cases, to make
>>> kernel preemption work.
>>>
>>>
>>>>>
>>>>>> return 0;
>>>>>> }
>>>>>> }
>>>>>>
>>>>>> build:
>>>>>> - skb = build_skb(xdp->data_hard_start, buflen);
>>>>>> + skb = build_skb_around(skb, xdp->data_hard_start, buflen);
>>>>>> if (!skb) {
>>>>>> + kfree_skb_reason(skb, SKB_DROP_REASON_NOMEM);
>>>> Though to your point, I don't think this actually does anything, given
>>>> that if the skb was somehow nuked as part of build_skb_around, there
>>>> would not be an skb to free. Doesn't hurt though, from a self-documenting
>>>> code perspective?
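For context, as I read net/core/skbuff.c, build_skb_around() only returns
NULL when the skb passed in was already NULL, so the error branch above can
only ever see a NULL pointer (and kfree_skb_reason(NULL, ...) is a no-op).
Paraphrasing the relevant shape:

	/* net/core/skbuff.c, paraphrased: bail out on a NULL skb, otherwise
	 * initialize the skb around the given buffer and return that same skb.
	 */
	struct sk_buff *build_skb_around(struct sk_buff *skb,
					 void *data, unsigned int frag_size)
	{
		if (unlikely(!skb))
			return NULL;
		/* ... set up skb head/data around 'data' ... */
		return skb;
	}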
>>>>>> dev_core_stats_rx_dropped_inc(tun->dev);
>>>>>> return -ENOMEM;
>>>>>> }
>>>>>> @@ -2566,9 +2571,11 @@ static int tun_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
>>>>>> if (m->msg_controllen == sizeof(struct tun_msg_ctl) &&
>>>>>> ctl && ctl->type == TUN_MSG_PTR) {
>>>>>> struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
>>>>>> + int flush = 0, queued = 0, num_skbs = 0;
>>>>>> struct tun_page tpage;
>>>>>> int n = ctl->num;
>>>>>> - int flush = 0, queued = 0;
>>>>>> + /* Max size of VHOST_NET_BATCH */
>>>>>> + void *skbs[64];
>>>>>
>>>>> I think we need some tweaks
>>>>>
>>>>> 1) TUN is decoupled from vhost, so it should have its own value (a
>>>>> macro is better)
>>>> Sure, I can make another constant that does a similar thing
>>>>> 2) Provide a way to fail, or handle the case when there are more than 64
>>>> What if we simply assert that the maximum here is 64, which I think
>>>> is what it actually is in practice?
>
> I still prefer a fallback.
Ack, will chew on that for the next one. That said, let's settle the larger
elephant in the room, which is the NAPI stuff below, as none of this
goes anywhere without resolving that first.
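For the record, the rough shape of the fallback I have in mind (names like
TUN_SKB_BATCH are placeholders, nothing here is in this series yet) is to
walk ctl->num in chunks no larger than the on-stack array instead of
asserting it can never exceed 64:

	/* Hypothetical fallback sketch: never ask the bulk helper for more
	 * than the on-stack array holds, and walk the batch in chunks. Any
	 * portion the cache cannot cover gets dropped, same as the partial
	 * allocation case described in the commit message.
	 */
	#define TUN_SKB_BATCH 64	/* placeholder name */
	void *skbs[TUN_SKB_BATCH];
	int done, chunk, num_skbs;

	for (done = 0; done < n; done += chunk) {
		chunk = min(n - done, TUN_SKB_BATCH);
		num_skbs = napi_skb_cache_get_bulk(skbs, chunk);

		/* hand xdp[done .. done + num_skbs - 1] to tun_xdp_one()
		 * with skbs[0 .. num_skbs - 1]; drop the remainder of this
		 * chunk if num_skbs < chunk
		 */
	}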
>
>>>>>
>>>>>>
>>>>>> memset(&tpage, 0, sizeof(tpage));
>>>>>>
>>>>>> @@ -2576,13 +2583,24 @@ static int tun_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
>>>>>> rcu_read_lock();
>>>>>> bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
>>>>>>
>>>>>> - for (i = 0; i < n; i++) {
>>>>>> + num_skbs = napi_skb_cache_get_bulk(skbs, n);
>>>>>
>>>>> Its document said:
>>>>>
>>>>> """
>>>>> * Must be called *only* from the BH context.
>>>>> """
>>>> We're in a bh_disable section here; is that not good enough?
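To make the calling context concrete, the TUN_MSG_PTR path in tun_sendmsg()
looks roughly like this (heavily trimmed; the bulk get and the extra
tun_xdp_one() argument come from this series, and the exact argument order
is elided):

	/* tun_sendmsg(), TUN_MSG_PTR path -- heavily trimmed sketch */
	local_bh_disable();
	rcu_read_lock();
	bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);

	/* bulk get from the per-CPU NAPI cache (added by this series) */
	num_skbs = napi_skb_cache_get_bulk(skbs, n);

	for (i = 0; i < n; i++) {
		xdp = &((struct xdp_buff *)ctl->ptr)[i];
		/* per-packet work; napi_consume_skb() can run in here for
		 * the XDP_REDIRECT/XDP_TX and page-coalescing paths
		 */
		ret = tun_xdp_one(tun, tfile, xdp, &flush, &tpage /* , skb */);
	}

	bpf_net_ctx_clear(bpf_net_ctx);
	rcu_read_unlock();
	local_bh_enable();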
>>>
>>> Again this feels very ugly and prone to painting ourselves into a
>>> corner, assuming BH-disabled sections have the same protection as NAPI.
>>>
>>> (The napi_skb_cache_get/put functions are operating on per-CPU arrays
>>> without any locking.)
>>
>> Happy to take suggestions on an alternative approach.
>>
>> Thoughts:
>> 1. Instead of having IFF_NAPI be an opt-in thing, clean up tun so it
>> is *always* NAPI’d 100% of the time?
>
> IFF_NAPI will have some overhead, and it was introduced basically for
> testing, if I'm not wrong.
IIRC it was originally introduced for testing, but under some circumstances
it can be wildly faster; see commit fb3f903769e805221eb19209b3d9128d398038a1
("tun: support NAPI for packets received from batched XDP buffs").

You may be thinking of IFF_NAPI_FRAGS, which seems very much "test only"
at this point.

Anyhow, assuming you are thinking of IFF_NAPI itself:
- Are the overheads you have in mind completely structural/unavoidable?
- Or is that something that would be worth looking into?
As a side note, one thing I did play with that is silly fast is using
IFF_NAPI with threaded NAPI. Under certain scenarios (high throughput that
is normally copy bound), the gains were nutty (~75%+), so the point is
there may be some very interesting juice to squeeze going down that path.
Coming back to the main path here, the whole reason I'm going down this
patchset is to try to pick up optimizations that are available in other
general-purpose drivers, which are all NAPI-ized. We're at the point now
where tun is getting left behind on things like this because it does not
run NAPI full time.

Said another way, I think it would be an advantage to NAPI-ize tun and
make it more like regular ol' network drivers, so that the generic core
work being done will benefit tun by default.
>> Outside of people who have
>> wired this up in their apps manually, on the virtualization side
>> there is currently no support from QEMU/Libvirt to enable IFF_NAPI.
>> Might be a nice simplification/cleanup to just “do it” full time?
>> Then we can play all these sorts of games under the protection of
>> NAPI?
>
> A full benchmark needs to be run for this to see how it performs.
Do you have a suggested test/suite of tests you’d prefer me to run
so that I can make sure I’m gathering the data that you’d like to
see?
>> 2. Some other non-dubious way of protecting this, without refactoring
>> for either conditional NAPI (yuck?) or full-time NAPI? This would
>> be nice, happy to take tips!
>> 3. ...?
>>
>