netdev - Re: [PATCH bpf-next v8 10/12] bpf: make TCP tx timestamp bpf extension work

Message-ID: <b158a837-d46c-4ae0-8130-7aa288422182@linux.dev>
Date: Wed, 5 Feb 2025 22:12:29 -0800
From: Martin KaFai Lau <martin.lau@...ux.dev>
To: Jason Xing <kerneljasonxing@...il.com>
Cc: Willem de Bruijn <willemdebruijn.kernel@...il.com>,
 Jakub Kicinski <kuba@...nel.org>, davem@...emloft.net, edumazet@...gle.com,
 pabeni@...hat.com, dsahern@...nel.org, willemb@...gle.com, ast@...nel.org,
 daniel@...earbox.net, andrii@...nel.org, eddyz87@...il.com, song@...nel.org,
 yonghong.song@...ux.dev, john.fastabend@...il.com, kpsingh@...nel.org,
 sdf@...ichev.me, haoluo@...gle.com, jolsa@...nel.org, horms@...nel.org,
 bpf@...r.kernel.org, netdev@...r.kernel.org
Subject: Re: [PATCH bpf-next v8 10/12] bpf: make TCP tx timestamp bpf
 extension work

On 2/5/25 7:41 PM, Jason Xing wrote:
> On Thu, Feb 6, 2025 at 11:25 AM Willem de Bruijn
> <willemdebruijn.kernel@...il.com> wrote:
>>
>>>>> I think we can split the whole idea into two parts: for now, because
>>>>> the current series implements the same function as SO_TIMESTAMPING
>>>>> does, I will implement the selective sample feature in the series.
>>>>> Once we finish tracing all the skbs someday, we will add the
>>>>> corresponding selective sample feature.
>>>>
>>>> Are you saying that you will include selective sampling now or want to
>>>> postpone it?
>>>
>>> A few months ago, I planned to do it after this series. Since you all
>>> ask, it's not complex to have it included in this series :)
>>>
>>> Selective sampling has two kinds of meaning like I mentioned above, so
>>> in the next re-spin I will implement the cmsg feature for bpf
>>> extension in this series.
>>
>> Great thanks.
> 
> I have to rephrase a bit in case Martin visits here soon: I will
> compare two approaches 1) reply value, 2) bpf kfunc and then see which
> way is better.

I have already explained in detail why 1), the reply value from the bpf prog,
won't work. Please go back to that reply, which has the context.

> 
>>
>>> I'm doing the test right now. And leave
>>> another selective sampling small feature until the feature of tracing
>>> all the skbs is implemented if possible.
>>
>> Can you elaborate on this other feature?
> 
> Do you recall that one day I asked your opinion privately about whether we
> can trace _all the skbs_ (not the last skb from each sendmsg) to have
> a better insight of kernel behaviour? I can also see a couple of
> latency issues in the kernel. If it is approved, then corresponding
> selective sampling should be supported. It's what I was trying to
> describe.
> 
> The advantage of relying on the timestamping feature is that we can
> isolate normal flows and monitored flow so that normal flows wouldn't
> be affected because of enabling the monitoring feature, compared to so
> many open source monitoring applications I've dug into. They usually
> directly hook the hot path like __tcp_transmit_skb() or
> dev_queue_xmit, which will surely influence the normal flows and cause
> performance degradation to some extent. I noticed that after
> conducting some tests a few months ago. The principle behind the bpf
> fentry is to replace some instructions at the very beginning of the
> hooked function, so every time even normal flows entering the
> monitored function will get affected.

I sort of guessed this while stuck in traffic... :/

I was not asking to be able to be "selective on all skbs of a large msg". That
is a separate topic. If we really want to support this case in the future (tbh,
I am not convinced), that is all the more reason the default behavior should be
"off" now, for consistency.

The comment was on the existing tcp_tx_timestamp(). First focus on allowing
selective tracking of the skb that the current tcp_tx_timestamp() also tracks,
because it is the best understood use case. This will allow the bpf prog to
select which tcp_sendmsg call it should track/sample. Perhaps the bpf prog will
limit tracking to X packets and then stop. Perhaps the bpf prog will allocate
only X sample slots in the bpf_sk_storage to track packets. There are many
reasons a bpf prog may want to sample and then stop tracking at some point,
even with the current tcp_tx_timestamp().
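
To make the "track X and then stop" idea concrete, here is a rough userspace
model in plain C. It is only a sketch, not bpf code and not part of the series:
MAX_SAMPLES, struct sample_slot, struct sock_sampler and sampler_select() are
all hypothetical names, and the fixed array merely stands in for sample slots
that a real bpf prog might keep in bpf_sk_storage.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical budget "X": how many sendmsg calls to sample per socket. */
#define MAX_SAMPLES 4

/* One sample slot; a stand-in for a per-packet tracking record. */
struct sample_slot {
    uint32_t sendmsg_id;   /* which sendmsg call was selected */
    uint64_t start_ns;     /* timestamp taken when it was selected */
};

/* Per-socket state; a stand-in for a bpf_sk_storage value (assumption). */
struct sock_sampler {
    uint32_t used;                          /* slots consumed so far */
    struct sample_slot slots[MAX_SAMPLES];
};

/*
 * Decide whether to track this sendmsg call. Tracking is "off" by default:
 * a call is tracked only if it is selected here and a slot is still free.
 */
static bool sampler_select(struct sock_sampler *s, uint32_t sendmsg_id,
                           uint64_t now_ns)
{
    if (s->used >= MAX_SAMPLES)
        return false;   /* budget exhausted: stop tracking */

    s->slots[s->used].sendmsg_id = sendmsg_id;
    s->slots[s->used].start_ns = now_ns;
    s->used++;
    return true;
}
```

With a zero-initialized struct sock_sampler, the first MAX_SAMPLES calls to
sampler_select() return true and every later call returns false, so the
untracked sendmsg calls pay only for the one comparison.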