[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <671840a23227e_1420e529466@willemb.c.googlers.com.notmuch>
Date: Tue, 22 Oct 2024 20:17:38 -0400
From: Willem de Bruijn <willemdebruijn.kernel@...il.com>
To: Martin KaFai Lau <martin.lau@...ux.dev>,
Willem de Bruijn <willemdebruijn.kernel@...il.com>,
Jason Xing <kerneljasonxing@...il.com>
Cc: davem@...emloft.net,
edumazet@...gle.com,
kuba@...nel.org,
pabeni@...hat.com,
dsahern@...nel.org,
willemb@...gle.com,
ast@...nel.org,
daniel@...earbox.net,
andrii@...nel.org,
eddyz87@...il.com,
song@...nel.org,
yonghong.song@...ux.dev,
john.fastabend@...il.com,
kpsingh@...nel.org,
sdf@...ichev.me,
haoluo@...gle.com,
jolsa@...nel.org,
bpf@...r.kernel.org,
netdev@...r.kernel.org,
Jason Xing <kernelxing@...cent.com>
Subject: Re: [PATCH net-next v2 04/12] net-timestamp: add static key to
control the whole bpf extension
Martin KaFai Lau wrote:
> On 10/20/24 2:51 PM, Willem de Bruijn wrote:
> > Jason Xing wrote:
> >> From: Jason Xing <kernelxing@...cent.com>
> >>
> >> Willem suggested that we use a static key to control. The advantage
> >> is that we will not affect the existing applications at all if we
> >> don't load BPF program.
> >>
> >> In this patch, except the static key, I also add one logic that is
> >> used to test if the socket has enabled its tsflags in order to
> >> support bpf logic to allow both cases to happen at the same time.
> >> Or else, the skb carring related timestamp flag doesn't know which
> >> way of printing is desirable.
> >>
> >> One thing important is this patch allows print from both applications
> >> and bpf program at the same time. Now we have three kinds of print:
> >> 1) only BPF program prints
> >> 2) only application program prints
> >> 3) both can print without side effect
> >>
> >> Signed-off-by: Jason Xing <kernelxing@...cent.com>
> >
> > Getting back to this thread. It is long, instead of responding to
> > multiple messages, let me combine them in a single response.
> >
> >
> > * On future extensions:
> >
> > +1 that the UDP case, and datagrams more broadly, must have a clear
> > development path, before we can merge TCP.
> >
> > Similarly, hardware timestamps need not be supported from the start,
> > but must clearly be supportable.
> >
> >
> > * On queueing packets to userspace:
> >
> >>> the current behavior is to just queue to the sk_error_queue as long
> >>> as there is "SOF_TIMESTAMPING_TX_*" set in the skb's tx_flags and it
> >>> is regardless of the sk_tsflags. "
> >
> >> Totally correct. SOF_TIMESTAMPING_SOFTWARE is a report flag while
> >> SOF_TIMESTAMPING_TX_* are generation flags. Without former, users can
> >> read the skb from the errqueue but are not able to parse the
> >> timestamps
> >
> > Before queuing a packet to userspace on the error queue, the relevant
> > reporting flag is always tested. sock_recv_timestamp has:
> >
> > /*
> > * generate control messages if
> > * - receive time stamping in software requested
> > * - software time stamp available and wanted
> > * - hardware time stamps available and wanted
> > */
> > if (sock_flag(sk, SOCK_RCVTSTAMP) ||
> > (tsflags & SOF_TIMESTAMPING_RX_SOFTWARE) ||
> > (kt && tsflags & SOF_TIMESTAMPING_SOFTWARE) ||
> > (hwtstamps->hwtstamp &&
> > (tsflags & SOF_TIMESTAMPING_RAW_HARDWARE)))
> > __sock_recv_timestamp(msg, sk, skb);
> >
> > Otherwise applications could get error messages queued, and
> > epoll/poll/select would unexpectedly behave differently.
>
> I just tried the following diff to remove setsockopt from txtimestamp.c and run
> "./txtimestamp -6 -c 1 -C -N -L ::1". It is getting the skb from the error queue
> with only cmsg flag.
That it surprising and against the API intent as I understand it.
Let me reproduce and take a closer look.
> I did a printk in __skb_tstamp_tx to ensure the
> sk->sk_tsflags is empty also.
>
> diff --git i/tools/testing/selftests/net/txtimestamp.c
> w/tools/testing/selftests/net/txtimestamp.c
> index dae91eb97d69..5d9d2773b076 100644
> --- i/tools/testing/selftests/net/txtimestamp.c
> +++ w/tools/testing/selftests/net/txtimestamp.c
> @@ -319,6 +319,8 @@ static void __recv_errmsg_cmsg(struct msghdr *msg, int
> payload_len)
> for (cm = CMSG_FIRSTHDR(msg);
> cm && cm->cmsg_len;
> cm = CMSG_NXTHDR(msg, cm)) {
> + printf("cm->cmsg_level %d cm->cmsg_type %d\n",
> + cm->cmsg_level, cm->cmsg_type);
> if (cm->cmsg_level == SOL_SOCKET &&
> cm->cmsg_type == SCM_TIMESTAMPING) {
> tss = (void *) CMSG_DATA(cm);
> @@ -362,7 +364,7 @@ static void __recv_errmsg_cmsg(struct msghdr *msg, int
> payload_len)
> if (batch > 1) {
> fprintf(stderr, "batched %d timestamps\n", batch);
> } else if (!batch) {
> - fprintf(stderr, "Failed to report timestamps\n");
> + fprintf(stderr, "Failed to report timestamps. payload_len %d\n", payload_len);
> test_failed = true;
> }
> }
> @@ -578,9 +580,12 @@ static void do_test(int family, unsigned int report_opt)
> if (cfg_loop_nodata)
> sock_opt |= SOF_TIMESTAMPING_OPT_TSONLY;
>
> + (void)sock_opt;
> +/*
> if (setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING,
> (char *) &sock_opt, sizeof(sock_opt)))
> error(1, 0, "setsockopt timestamping");
> +*/
>
> for (i = 0; i < cfg_num_pkts; i++) {
> memset(&msg, 0, sizeof(msg));
> >
> >> SOF_TIMESTAMPING_SOFTWARE is only used in traditional SO_TIMESTAMPING
> >> features including cmsg mode. But it will not be used in bpf mode.
> >
> > For simplicity, the two uses of the API are best kept identical. If
> > there is a technical reason why BPF has to diverge from established
> > behavior, this needs to be explicitly called out in the commit
> > message.
>
> SOF_TIMESTAMPING_OPT_TSONLY will not be supported. The orig_skb can always be
> passed directly to the bpf if needed without extra cost. The same probably goes
> for SOF_TIMESTAMPING_OPT_PKTINFO. SOF_TIMESTAMPING_SOFTWARE does not seem to be
> useful either. I think only a subset of SOF_* will be supported, probably only
> the TX_* and RX_* ones.
>
> >
> > Also, if you want to extend the API for BPF in the future, good to
> > call this out now and ideally extensions will apply to both, to
> > maintain a uniform API.
> >
> >
> > * On extra measurement points, at sendmsg or tcp_write_xmit:
> >
> > The first is interesting. For application timestamping, this was
> > never needed, as the application can just call clock_gettime before
> > sendmsg.
> >
> > In general, additional measurement points are not only useful if the
> > interval between is not constant. So far, we have seen no need for
> > any additional points.
> >
> >
> > * On skb state:
> >
> >>> For now, is there thing we can explore to share in the skb_shared_info?
> >
> > skb_shinfo space is at a premium. I don't think we can justify two
> > extra fields just for this use case.
> >
> >> My initial thought is just to reuse these fields in skb. It can work
> >> without interfering one another.
> >
> > I'm skeptical that two methods can work at the same time. If they are
> > started at different times, their sk_tskey will be different, for one.
>
> For the skb's tx_flags, Jason seems to be able to figure out by only using the
> new sk_tsflags_bpf. In the worst case, it seems there is still one bit left in
> tx_flags.
>
> I am also not very positive on the skb's tskey for now.
>
> Willem, I recalled I had tried to reuse the tx_flags and hwtstamp when keeping
> the delivery time in skb->tstamp for a skb redirecting from egress to ingress. I
> think that approach was stalled because the tx_flags could be changed by the
> netdevice like "skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS". How about the
> skb_shinfo(skb)->hwtstamps? At least for the TX path, it should not be changed
> until the netdevice calling skb_tstamp_tx() to report the hwtstamp? or the clone
> in the tcp stack will still break things if the hwtstamps is reused for other
> purpose?
True. I think on Tx hwtstamps is only used on the path from the driver
tx completion handler to when it calls skb_tstamp_tx.
It does not even really have to be an skb field. The first driver
cscope happens to point me to indeed just allocates it on the stack:
tsnep_tx_poll.
> >
> > There may be workarounds. Maybe BPF can store its state in some BPF
> > specific field, indeed. Or perhaps it can store per-sk shadow state
> > that resolves the conflict. For instance, the offset between sk_tskey
> > and bpf_tskey.
>
> I have also been proposing to explore other way for the key since bpf has direct
> access to the skb (also the sk, bpf prog can store data in the sk).
>
> The bpf prog can learn what is the seq_no of the egress-ing skb. When the ack
> comes back, it can also learn the ack seq no. Does it help? It will be harder to
> use because it probably needs to store this info in the bpf map (or in the bpf
> sk storage). However, if it needs to learn the timestamp at the
> tcp_sendmsg/tcp_transmit_skb/tcp_write_xmit, this timestamp has to be stored
> somewhere also. Either in a bpf map or in a bpf sk storage.
>
> SEC("cgroup/setsockopt") prog can also enforce the user space setsockopt. e.g.
> it can add SOF_TIMESTAMPING_OPT_ID_TCP when user space only use
> SOF_TIMESTAMPING_OPT_ID.
Powered by blists - more mailing lists