Message-ID: <755110eb-9dea-4df6-b207-21bc06491498@bytedance.com>
Date: Mon, 14 Jul 2025 17:26:19 -0700
From: Zijian Zhang <zijianzhang@...edance.com>
To: Jakub Sitnicki <jakub@...udflare.com>,
Cong Wang <xiyou.wangcong@...il.com>
Cc: netdev@...r.kernel.org, bpf@...r.kernel.org, john.fastabend@...il.com,
zhoufeng.zf@...edance.com, Amery Hung <amery.hung@...edance.com>,
Cong Wang <cong.wang@...edance.com>
Subject: Re: [Patch bpf-next v4 4/4] tcp_bpf: improve ingress redirection
performance with message corking
On 7/8/25 1:51 AM, Jakub Sitnicki wrote:
> On Thu, Jul 03, 2025 at 09:20 PM -07, Cong Wang wrote:
>> On Thu, Jul 03, 2025 at 01:32:08PM +0200, Jakub Sitnicki wrote:
>>> I'm all for reaping the benefits of batching, but I'm not thrilled about
>>> having a backlog worker on the path. The one we have on the sk_skb path
>>> has been a bottleneck:
>>
>> It depends on what you compare with. If you compare it with vanilla
>> TCP_BPF, we did see a 5% latency increase. If you compare it with
>> regular TCP, it is still much better. Our goal is to make Cilium's
>> sockops-enable competitive with regular TCP, hence we compare it with
>> regular TCP.
>>
>> I hope this makes sense to you. Sorry if this was not clear in our cover
>> letter.
>
> Latency-wise I think we should be comparing sk_msg send-to-local against
> UDS rather than full-stack TCP.
>
> There is quite a bit of guessing on my side as to what you're looking
> for because the cover letter doesn't say much about the use case.
>
Let me add more details about the use case.

Assume user-space code uses TCP to connect to a peer which may be
local or remote. We are trying to use sockmap to transparently
accelerate TCP connections where both the sender and the receiver are
on the same machine. User-space code does not need to be modified:
local connections are accelerated, while remote connections behave as
before.

Because of this transparency requirement, UDS is not an option. UDS
requires user-space code changes, and it implies that users know they
are talking to a local peer.

We assumed that since we bypass the Linux network stack, better
throughput, latency and CPU usage would be observed. However, that is
not the case: throughput is worse when the message size is small (<64k).

This is similar to Cilium's "sockops-enable" config, which was
deprecated mostly because of performance. That config uses sockmap to
manage TCP connections between pods on the same machine:
https://github.com/cilium/cilium/blob/v1.11.4/bpf/sockops/bpf_sockops.c
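
For reference, the mechanism looks roughly like the minimal sketch
below, similar in spirit to the Cilium program linked above. This is
only an illustration, not our exact code: the map layout and names are
assumptions, and port byte-order details (which changed across kernel
versions) are glossed over.

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

struct sock_key {
	__u32 sip;
	__u32 dip;
	__u32 sport;
	__u32 dport;
};

struct {
	__uint(type, BPF_MAP_TYPE_SOCKHASH);
	__uint(max_entries, 65536);
	__type(key, struct sock_key);
	__type(value, __u64);
} sock_map SEC(".maps");

/* Insert every established TCP socket into the sockhash, keyed by
 * its 4-tuple. Both ends of a local connection end up in the map. */
SEC("sockops")
int bpf_sockops_cb(struct bpf_sock_ops *skops)
{
	struct sock_key key = {
		.sip   = skops->local_ip4,
		.dip   = skops->remote_ip4,
		.sport = skops->local_port,
		.dport = bpf_ntohl(skops->remote_port),
	};

	switch (skops->op) {
	case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
	case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
		bpf_sock_hash_update(skops, &sock_map, &key, BPF_ANY);
		break;
	}
	return 0;
}

/* On every sendmsg, look up the peer socket by the reversed 4-tuple. */
SEC("sk_msg")
int bpf_redir_local(struct sk_msg_md *msg)
{
	struct sock_key peer = {
		.sip   = msg->remote_ip4,
		.dip   = msg->local_ip4,
		.sport = bpf_ntohl(msg->remote_port),
		.dport = msg->local_port,
	};

	/* If the peer is local (present in the map), redirect the data
	 * straight to its ingress queue, bypassing the TCP stack. If
	 * the lookup fails (remote peer), ignore the error and return
	 * SK_PASS so the message takes the regular TCP path. */
	bpf_msg_redirect_hash(msg, &sock_map, &peer, BPF_F_INGRESS);
	return SK_PASS;
}

char _license[] SEC("license") = "GPL";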
> For instance, do you control the sender? Why not do big writes on the
> sender side if raw throughput is what you care about?
>
As described above, we assume user space uses TCP, and we cannot change
the user space code.
>>> 1) There's no backpressure propagation so you can have a backlog
>>> build-up. One thing to check is what happens if the receiver closes its
>>> window.
>>
>> Right, I am sure there is still a lot we can further optimize. The
>> only question is how much we need for now. How about optimizing it
>> one step at a time? :)
>
> This introduces quite a bit of complexity from the start. I'd like to
> at least explore whether it can be done in a simpler fashion before
> committing to it.
>
> You point at wake-ups as being the throughput killer. As an alternative,
> can we wake up the receiver conditionally? That is, only if the receiver
> has made progress on the queue since the last notification. This could
> also be a form of wakeup moderation.
>
Wake-ups are indeed one of the throughput killers, and I agree they can
be mitigated by waking up the receiver conditionally.
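
As a rough illustration of that idea (a sketch only: the fields
ingress_copied and last_wake_copied are hypothetical, not existing
skmsg fields, and initialization of the very first wakeup is glossed
over):

/* Wake the receiver only if it has consumed data since the last
 * notification; otherwise it is still working through the queue
 * and another wakeup would be redundant. */
static void sk_psock_data_ready_moderated(struct sock *rcv,
					  struct sk_psock *psock)
{
	u64 copied = READ_ONCE(psock->ingress_copied);	/* hypothetical */

	if (copied == psock->last_wake_copied)
		return;

	psock->last_wake_copied = copied;		/* hypothetical */
	rcv->sk_data_ready(rcv);
}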
That said, IIRC the sock lock is the other _main_ throughput killer.
In tcp_bpf_sendmsg, in the context of the sender process, we need to
lock_sock(sender) -> release_sock(sender) -> lock_sock(recv)
-> release_sock(recv) -> lock_sock(sender) -> release_sock(sender).
This makes the sender somewhat dependent on the receiver: while the
receiver is working, the sender is blocked.
sender                        receiver
tcp_bpf_sendmsg
                              tcp_bpf_recvmsg (working)
tcp_bpf_sendmsg (blocked)
We introduce the kworker here mainly to solve this sock lock issue: we
want senders to acquire only the sender's sock lock, and receivers to
acquire only the receiver's sock lock. Only the kworker, as a middle
man, needs to hold both the sender and the receiver sock locks to
transfer data from the sender to the receiver. As a result,
tcp_bpf_sendmsg and tcp_bpf_recvmsg can be independent of each other,
as the sketch after the diagram below illustrates.
sender                        receiver
tcp_bpf_sendmsg
                              tcp_bpf_recvmsg (working)
tcp_bpf_sendmsg
tcp_bpf_sendmsg
...
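
In code, the split would look roughly like the sketch below. To be
clear, this illustrates the locking scheme, not the patch itself: the
field and function names (backlog, backlog_work, redir_psock,
tcp_bpf_send_cork, tcp_bpf_push_backlog) are made up for the example.

/* Sender context: cork the message onto the sender psock's backlog
 * while holding only the sender's sock lock, then kick the worker. */
static int tcp_bpf_send_cork(struct sock *sk, struct sk_psock *psock,
			     struct sk_msg *msg)
{
	lock_sock(sk);
	list_add_tail(&msg->list, &psock->backlog);	/* hypothetical field */
	release_sock(sk);

	/* Wake the middle man; the sender never takes the receiver's lock. */
	schedule_work(&psock->backlog_work);		/* hypothetical field */
	return 0;
}

/* Worker context: the only place where both sock locks are taken.
 * Corked messages are moved from the sender's backlog to the
 * receiver's ingress queue in one batch. */
static void tcp_bpf_push_backlog(struct work_struct *work)
{
	struct sk_psock *psock = container_of(work, struct sk_psock,
					      backlog_work);
	struct sock *snd = psock->sk;
	struct sk_psock *rpsock = psock->redir_psock;	/* hypothetical field */
	struct sock *rcv = rpsock->sk;
	struct sk_msg *msg, *tmp;
	LIST_HEAD(batch);

	lock_sock(snd);
	list_splice_init(&psock->backlog, &batch);
	release_sock(snd);

	lock_sock(rcv);
	list_for_each_entry_safe(msg, tmp, &batch, list) {
		list_del(&msg->list);
		sk_psock_queue_msg(rpsock, msg);
	}
	rcv->sk_data_ready(rcv);
	release_sock(rcv);
}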
>>> 2) There's a scheduling latency. That's why the performance of splicing
>>> sockets with sockmap (ingress-to-egress) looks bleak [1].
>>
>> Same for regular TCP: we have to wake up the receiver/worker. But I
>> may have misunderstood this point?
>
> What I meant is that, in the pessimistic case, to deliver a message we
> now have to go through two wakeups:
>
> sender -wakeup-> kworker -wakeup-> receiver
>
>>> So I have to dig deeper...
>>>
>>> Have you considered and/or evaluated any alternative designs? For
>>> instance, what stops us from having an auto-corking / coalescing
>>> strategy on the sender side?
>>
>> Auto-corking _may_ not be as easy as it is in TCP, since essentially
>> we have no protocol here, just a pure socket layer.
>
> You're right. We don't have a flush signal for auto-corking on the
> sender side with sk_msg's.
>
> What about what I mentioned above - can we moderate the wakeups based on
> receiver making progress? Does that sound feasible to you?
>
> Thanks,
> -jkbs