Message-ID: <CAL+tcoCUjxvE-DaQ8AMxMgjLnV+J1jpYMh7BCOow4AohW1FFSg@mail.gmail.com>
Date: Thu, 6 Feb 2025 14:56:43 +0800
From: Jason Xing <kerneljasonxing@...il.com>
To: Martin KaFai Lau <martin.lau@...ux.dev>
Cc: Willem de Bruijn <willemdebruijn.kernel@...il.com>, Jakub Kicinski <kuba@...nel.org>, 
	davem@...emloft.net, edumazet@...gle.com, pabeni@...hat.com, 
	dsahern@...nel.org, willemb@...gle.com, ast@...nel.org, daniel@...earbox.net, 
	andrii@...nel.org, eddyz87@...il.com, song@...nel.org, 
	yonghong.song@...ux.dev, john.fastabend@...il.com, kpsingh@...nel.org, 
	sdf@...ichev.me, haoluo@...gle.com, jolsa@...nel.org, horms@...nel.org, 
	bpf@...r.kernel.org, netdev@...r.kernel.org
Subject: Re: [PATCH bpf-next v8 10/12] bpf: make TCP tx timestamp bpf
 extension work

On Thu, Feb 6, 2025 at 2:12 PM Martin KaFai Lau <martin.lau@...ux.dev> wrote:
>
> On 2/5/25 7:41 PM, Jason Xing wrote:
> > On Thu, Feb 6, 2025 at 11:25 AM Willem de Bruijn
> > <willemdebruijn.kernel@...il.com> wrote:
> >>
> >>>>> I think we can split the whole idea into two parts: for now, since
> >>>>> the current series implements the same function as SO_TIMESTAMPING
> >>>>> does, I will implement the selective sampling feature in this series.
> >>>>> Once we eventually finish tracing all the skbs, we will add the
> >>>>> corresponding selective sampling support there as well.
> >>>>
> >>>> Are you saying that you will include selective sampling now or want to
> >>>> postpone it?
> >>>
> >>> A few months ago, I planned to do it after this series. Since you both
> >>> asked, it's not complex to include it in this series :)
> >>>
> >>> Selective sampling has two kinds of meaning, as I mentioned above, so
> >>> in the next re-spin I will implement the cmsg feature for the bpf
> >>> extension in this series.
> >>
> >> Great thanks.
> >
> > I have to rephrase a bit in case Martin visits here soon: I will
> > compare the two approaches, 1) reply value and 2) bpf kfunc, and then
> > see which is better.
>
> I have already explained in detail why the 1) reply value from the bpf prog
> won't work. Please go back to that reply, which has the context.

Yes, of course I saw this, but as I said, I need to implement it and dig
more into it on my own. One of my replies includes a little code snippet
regarding the reply-value approach. I didn't expect you to take that to
mean I would choose the reply value, so I rephrased it as above :)

>
> >
> >>
> >>> I'm doing the test right now. And I'll leave the other small
> >>> selective sampling feature until the tracing of all the skbs is
> >>> implemented, if possible.
> >>
> >> Can you elaborate on this other feature?
> >
> > Do you recall that one day I asked your opinion privately about whether
> > we could trace _all the skbs_ (not just the last skb from each sendmsg)
> > to get a better insight into kernel behaviour? I can also see a couple
> > of latency issues in the kernel. If it is approved, then the
> > corresponding selective sampling should be supported. That is what I
> > was trying to describe.
> >
> > The advantage of relying on the timestamping feature is that we can
> > isolate normal flows from the monitored flows, so that normal flows
> > aren't affected by enabling the monitoring feature, unlike the many
> > open source monitoring applications I've dug into. They usually hook
> > the hot path directly, like __tcp_transmit_skb() or dev_queue_xmit(),
> > which inevitably influences normal flows and causes some performance
> > degradation. I noticed that after conducting some tests a few months
> > ago. The principle behind bpf fentry is to replace some instructions
> > at the very beginning of the hooked function, so every flow, even a
> > normal one, entering the monitored function gets affected.
>
> I sort of guessed this while stalled in traffic... :/
>
> I was not asking to be able to "selectively sample all skbs of a large msg".
> This will be a separate topic. If we really wanted to support that case
> (tbh, I am not convinced) in the future, there is all the more reason the
> default behavior should be "off" now, for consistency.

Yep, another topic. At that time, I particularly noticed that Jakub
said "all skbs", which you agreed with, so I felt reluctant...

> The comment was on the existing tcp_tx_timestamp(). First focus on allowing
> selective tracking of the skb that the current tcp_tx_timestamp() also tracks,
> because it is the best-understood use case. This will allow the bpf prog to
> select which tcp_sendmsg call it should track/sample. Perhaps the bpf prog
> will limit tracking to X packets and then stop there. Perhaps the bpf prog
> will only allocate X sample slots in the bpf_sk_storage to track packets.
> There are many reasons a bpf prog may want to sample and stop tracking at
> some point, even with the current tcp_tx_timestamp().

Completely agreed; this is also what I did in my kernel module. Willem
once mentioned that Google also uses the sampling feature, IIRC. So I
will certainly complete it soon in this series. Thanks for the
information, BTW; I will quote it :)

---
More discussions/suggestions are welcome, since I've already proposed
_tracing all the skbs_. The idea behind this is to let this bpf
extension give us enough information about kernel (especially network
stack) behaviour.

The goals are:
1) much smaller side effects on normal flows when using the
so_timestamping feature, compared to the usual bpf*-based tracing methods.
2) a deep insight/understanding of where those latencies exactly
come from. I've encountered TSQ limits (please see tcp_write_xmit();
there are more controls there) with the virtio_net driver, so hacking
tcp_write_xmit() may be needed.
3) tracing only the last skb from each sendmsg might not be enough if
the latency arises in the kernel. If one of the skbs is missed, we will
never know what happened to it.

Based on the above, I'd like to make it an optional choice for users
who want to take a deep look inside the kernel.

Thanks,
Jason
