Message-ID: <87cyabhotr.fsf@cloudflare.com>
Date: Mon, 07 Jul 2025 19:51:12 +0200
From: Jakub Sitnicki <jakub@...udflare.com>
To: Cong Wang <xiyou.wangcong@...il.com>
Cc: Zijian Zhang <zijianzhang@...edance.com>,  netdev@...r.kernel.org,
  bpf@...r.kernel.org,  john.fastabend@...il.com,
  zhoufeng.zf@...edance.com,  Amery Hung <amery.hung@...edance.com>,  Cong
 Wang <cong.wang@...edance.com>
Subject: Re: [Patch bpf-next v4 4/4] tcp_bpf: improve ingress redirection
 performance with message corking

On Thu, Jul 03, 2025 at 09:20 PM -07, Cong Wang wrote:
> On Thu, Jul 03, 2025 at 01:32:08PM +0200, Jakub Sitnicki wrote:
>> I'm all for reaping the benefits of batching, but I'm not thrilled about
>> having a backlog worker on the path. The one we have on the sk_skb path
>> has been a bottleneck:
>
> It depends on what you compare with. If you compare it with vanilla
> TCP_BPF, we did see a 5% latency increase. If you compare it with
> regular TCP, it is still much better. Our goal is to make Cilium's
> sockops-enable competitive with regular TCP, hence we compare it with
> regular TCP.
>
> I hope this makes sense to you. Sorry if this was not clear in our cover
> letter.

Latency-wise I think we should be comparing sk_msg send-to-local against
UDS rather than full-stack TCP.

There is quite a bit of guessing on my side as to what you're looking
for because the cover letter doesn't say much about the use case.

For instance, do you control the sender?  Why not do big writes on the
sender side if raw throughput is what you care about?
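
For example (just a sketch; fd, have_pending_payload() and
next_pending_iovec() are made-up helpers), coalescing the pending
payloads into one sendmsg() instead of N small send() calls:

	struct iovec iov[16];
	struct msghdr msg = {};
	int n = 0;

	/* Gather up to 16 queued payloads into a single syscall. */
	while (n < 16 && have_pending_payload())
		iov[n++] = next_pending_iovec();

	msg.msg_iov = iov;
	msg.msg_iovlen = n;
	sendmsg(fd, &msg, 0);	/* one write instead of n */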

>> 1) There's no backpressure propagation so you can have a backlog
>> build-up. One thing to check is what happens if the receiver closes its
>> window.
>
> Right, I am sure there is still a lot we can further optimize. The
> only question is how much we need for now. How about optimizing it one
> step at a time? :)

This introduces quite a bit of complexity from the start. I'd like to at
least explore whether it can be done in a simpler fashion before
committing to it.

You point at wake-ups as being the throughput killer. As an alternative,
can we wake up the receiver conditionally? That is, only if the receiver
has made progress on the queue since the last notification. This could
also serve as a form of wakeup moderation.
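
Roughly along these lines (just a sketch; the struct and the seq
counters are made up, they don't exist in sockmap today):

static bool should_wake_receiver(struct psock_sketch *p)
{
	/* copied_seq: bytes consumed by the receiver so far;
	 * last_wake_seq: its value when we last signalled it
	 * (initialized so that the first enqueue always wakes). */
	u64 copied = READ_ONCE(p->copied_seq);

	if (copied == p->last_wake_seq)
		return false;	/* no progress since the last wakeup;
				 * the previous notification is still
				 * pending, don't pile up another one */

	p->last_wake_seq = copied;
	return true;
}

The receiver side would need a final re-check before sleeping to avoid
a lost wakeup, but that's the usual pattern.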

>> 2) There's a scheduling latency. That's why the performance of splicing
>> sockets with sockmap (ingress-to-egress) looks bleak [1].
>
> Same for regular TCP, we have to wake up the receiver/worker. But maybe
> I misunderstand this point?

What I meant is that, in the pessimistic case, to deliver a message we
now have to go through two wakeups:

sender -wakeup-> kworker -wakeup-> receiver

>> So I have to dig deeper...
>> 
>> Have you considered and/or evaluated any alternative designs? For
>> instance, what stops us from having an auto-corking / coalescing
>> strategy on the sender side?
>
> Auto-corking _may_ not be as easy as in TCP, since essentially we have
> no protocol here, just a pure socket layer.

You're right. We don't have a flush signal for auto-corking on the
sender side with sk_msg's.
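
TCP can get away with autocorking because it has exactly such a signal:
tcp_should_autocork() in net/ipv4/tcp.c holds back a small write roughly
when (paraphrasing from memory, not the verbatim source):

	/* Small write, autocorking enabled, and earlier data is still
	 * sitting in the qdisc/device queues, so this write will get
	 * pushed out together with that data later anyway. */
	skb->len < size_goal &&
	sock_net(sk)->ipv4.sysctl_tcp_autocorking &&
	refcount_read(&sk->sk_wmem_alloc) > skb->truesize

With sk_msg there is no equivalent "data still in flight below us"
signal to piggyback on.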

What about what I mentioned above: can we moderate the wakeups based on
the receiver making progress? Does that sound feasible to you?

Thanks,
-jkbs
