netdev - Re: [PATCH net-next v3 02/14] net-timestamp: allow two features to work parallelly

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAL+tcoDLOhu4tGkUY7ofqh9hQj5PmLC5UJd55RSGS-MdWRUmgg@mail.gmail.com>
Date: Tue, 5 Nov 2024 11:13:28 +0800
From: Jason Xing <kerneljasonxing@...il.com>
To: Martin KaFai Lau <martin.lau@...ux.dev>
Cc: Willem de Bruijn <willemdebruijn.kernel@...il.com>, willemb@...gle.com, 
	davem@...emloft.net, edumazet@...gle.com, kuba@...nel.org, pabeni@...hat.com, 
	dsahern@...nel.org, ast@...nel.org, daniel@...earbox.net, andrii@...nel.org, 
	eddyz87@...il.com, song@...nel.org, yonghong.song@...ux.dev, 
	john.fastabend@...il.com, kpsingh@...nel.org, sdf@...ichev.me, 
	haoluo@...gle.com, jolsa@...nel.org, shuah@...nel.org, ykolal@...com, 
	bpf@...r.kernel.org, netdev@...r.kernel.org, 
	Jason Xing <kernelxing@...cent.com>
Subject: Re: [PATCH net-next v3 02/14] net-timestamp: allow two features to
 work parallelly

On Tue, Nov 5, 2024 at 9:51 AM Martin KaFai Lau <martin.lau@...ux.dev> wrote:
>
> On 11/1/24 12:47 AM, Jason Xing wrote:
>
> >> If it prefers to stay with bpf_setsockopt (which is fine), it needs a bpf
> >> specific optname like the current TCP_BPF_SOCK_OPS_CB_FLAGS which currently sets
> >> the tp->bpf_sock_ops_cb_flags. May be a new SK_BPF_CB_FLAGS optname for setting
> >> the sk->sk_bpf_cb_flags, like bpf_setsockopt(skops_ctx, SOL_SOCKET,
> >
> >> SK_BPF_CB_FLAGS, &val, sizeof(val)) and handle it in the sol_socket_sockopt()
> >> alone without calling into sk_{set,get}sockopt. Add a new enum for the optval
> >> for the sk_bpf_cb_flags:
> >>
> >> enum {
> >>          SK_BPF_CB_TX_TIMESTAMPING = (1 << 0),
> >>          SK_BPF_CB_RX_TIEMSTAMPING = (1 << 1),
> >> };
> >
> > Then it will involve more strange modification in sol_socket_sockopt()
> > to retrieve the opt value like what I did in V2 (see
> > https://lore.kernel.org/all/20241012040651.95616-3-kerneljasonxing@gmail.com/).
> > It's the reason why I did set and get operation in
> > sk_{set,get}sockopt() in this series to keep align with other flags.
> > Handling it in sk_{set,get}sockopt() is not a bad idea and easy to
> > implement, I feel.
>
> This will look very different now. It is handling bpf specific
> optname and accessing the bpf specific field in sk->sk_bpf_cb_flags.
>
> I really don't see why it needs to spill over to sk_{set,get}sockopt()
> to handle sk->sk_bpf_cb_flags.
>
> I have quickly typed out a small part of discussion so far.
> It is likely buggy and not compiler tested. Pieces are still missing.
> The bpf_tstamp_ack will need a few more changes in the
> tcp_{input,output}.c. May be merging with the tstamp_ack to become
> 2 bits will be cleaner, not sure.
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 39f1d16f3628..0b4913315854 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -488,6 +488,7 @@ enum {
>
>         /* generate software time stamp when entering packet scheduling */
>         SKBTX_SCHED_TSTAMP = 1 << 6,
> +       SKBTX_BPF = 1 << 7,
>   };
>
>   #define SKBTX_ANY_SW_TSTAMP   (SKBTX_SW_TSTAMP    | \
> diff --git a/include/net/sock.h b/include/net/sock.h
> index f29c14448938..4ec27c524f49 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -234,6 +234,20 @@ struct sock_common {
>   struct bpf_local_storage;
>   struct sk_filter;
>
> +enum {
> +       SK_BPF_CB_TX_TIMESTAMPING = BIT(0),
> +       SK_BPF_CB_RX_TIEMSTAMPING = BIT(1),
> +       SK_BPF_CB_MASK          = BIT(2) - 1,
> +};
> +
> +#ifdef CONFIG_BPF_SYSCALL
> +#define SK_BPF_CB_FLAG_TEST(SK, FLAG) ((SK)->sk_bpf_cb_flags & (FLAG))
> +void bpf_skops_tx_timestamping(struct sock *sk, struct sk_buff *skb);
> +#else
> +#define SK_BPF_CB_FLAG_TEST(SK, FLAG)
> +static inline void bpf_skops_timestamping(struct sock *sk, struct sk_buff *skb) {}
> +#endif

Until now, I know that I misunderstood what you meant in the previous
thread. I thought you were suggesting we need to use bpf_setsockopt.
Sorry.

I completely agree with this approach! Thanks!

> +
>   /**
>     *   struct sock - network layer representation of sockets
>     *   @__sk_common: shared layout with inet_timewait_sock
> @@ -444,7 +458,10 @@ struct sock {
>         socket_lock_t           sk_lock;
>         u32                     sk_reserved_mem;
>         int                     sk_forward_alloc;
> -       u32                     sk_tsflags;
> +       u16                     sk_tsflags;
> +#ifdef CONFIG_BPF_SYSCALL
> +       u16                     sk_bpf_cb_flags;

We cannot use u16 for sk_tsflags because the
SOF_TIMESTAMPING_OPT_RX_FILTER uses the 17th bit already. I will
handle it.

> +#endif
>         __cacheline_group_end(sock_write_rxtx);
>
>         __cacheline_group_begin(sock_write_tx);
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index d1948d357dad..224b697bae9d 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -961,7 +961,8 @@ struct tcp_skb_cb {
>         __u8            txstamp_ack:1,  /* Record TX timestamp for ack? */
>                         eor:1,          /* Is skb MSG_EOR marked? */
>                         has_rxtstamp:1, /* SKB has a RX timestamp       */
> -                       unused:5;
> +                       bpf_txstamp_ack:1,
> +                       unused:4;
>         __u32           ack_seq;        /* Sequence number ACK'd        */
>         union {
>                 struct {
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index f28b6527e815..2ff7ff0ebdab 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -7014,6 +7014,7 @@ enum {
>                                          * by the kernel or the
>                                          * earlier bpf-progs.
>                                          */
> +       BPF_SOCK_OPS_TX_TIMESTAMPING_CB,
>   };
>
>   /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
> @@ -7080,6 +7081,7 @@ enum {
>         TCP_BPF_SYN_IP          = 1006, /* Copy the IP[46] and TCP header */
>         TCP_BPF_SYN_MAC         = 1007, /* Copy the MAC, IP[46], and TCP header */
>         TCP_BPF_SOCK_OPS_CB_FLAGS = 1008, /* Get or Set TCP sock ops flags */
> +       SK_BPF_CB_FLAGS         = 1009,
>   };
>
>   enum {
> diff --git a/net/core/filter.c b/net/core/filter.c
> index e31ee8be2de0..81a36e50047b 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -5206,6 +5206,19 @@ static const struct bpf_func_proto bpf_get_socket_uid_proto = {
>         .arg1_type      = ARG_PTR_TO_CTX,
>   };
>
> +static int sk_bpf_cb_flags(struct sock *sk, int sk_bpf_cb_flags, bool getopt)
> +{
> +       if (getopt)
> +               return -EINVAL;
> +
> +       if (sk_bpf_cb_flags & ~SK_BPF_CB_MASK)
> +               return -EINVAL;
> +
> +       sk->sk_bpf_cb_flags = sk->sk_bpf_cb_flags;
> +
> +       return 0;
> +}
> +
>   static int sol_socket_sockopt(struct sock *sk, int optname,
>                               char *optval, int *optlen,
>                               bool getopt)
> @@ -5222,6 +5235,7 @@ static int sol_socket_sockopt(struct sock *sk, int optname,
>         case SO_MAX_PACING_RATE:
>         case SO_BINDTOIFINDEX:
>         case SO_TXREHASH:
> +       case SK_BPF_CB_FLAGS:
>                 if (*optlen != sizeof(int))
>                         return -EINVAL;
>                 break;
> @@ -5231,6 +5245,9 @@ static int sol_socket_sockopt(struct sock *sk, int optname,
>                 return -EINVAL;
>         }
>
> +       if (optname == SK_BPF_CB_FLAGS)
> +               return sk_bpf_cb_flags(sk, *(int *)optval, getopt);
> +
>         if (getopt) {
>                 if (optname == SO_BINDTODEVICE)
>                         return -EINVAL;
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 039be95c40cf..d0406639cee9 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -137,6 +137,7 @@
>   #include <linux/sock_diag.h>
>
>   #include <linux/filter.h>
> +#include <linux/bpf-cgroup.h>
>   #include <net/sock_reuseport.h>
>   #include <net/bpf_sk_storage.h>
>
> @@ -946,6 +947,20 @@ int sock_set_timestamping(struct sock *sk, int optname,
>         return 0;
>   }
>
> +#ifdef CONFIG_BPF_SYSCALL
> +void bpf_skops_tx_timestamping(struct sock *sk, struct sk_buff *skb)
> +{
> +       struct bpf_sock_ops_kern sock_ops;
> +
> +       memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp));
> +       sock_ops.op = BPF_SOCK_OPS_TX_TIMESTAMPING_CB;
> +       sock_ops.is_fullsock = 1;
> +       sock_ops.sk = sk;
> +       sock_ops.skb = skb;
> +       __cgroup_bpf_run_filter_sock_ops(sk, &sock_ops, CGROUP_SOCK_OPS);
> +}
> +#endif
> +
>   void sock_set_keepalive(struct sock *sk)
>   {
>         lock_sock(sk);
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 4f77bd862e95..1e7f2d5fd792 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -491,6 +491,15 @@ static void tcp_tx_timestamp(struct sock *sk, u16 tsflags)
>                 if (tsflags & SOF_TIMESTAMPING_TX_RECORD_MASK)
>                         shinfo->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1;
>         }
> +
> +       if (cgroup_bpf_enabled(CGROUP_SOCK_OPS) &&
> +           SK_BPF_CB_FLAG_TEST(sk, SK_BPF_CB_TX_TIMESTAMPING))
> +               /* The bpf prog can do:
> +                * shinfo->tx_flags |= SKBTX_BPF,
> +                * tcb->bpf_txstamp_ack = 1,
> +                * shinfo->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1 (if tskey not set)
> +                */
> +               bpf_skops_tx_timestamping(sk, skb);
>   }
>
>
> >
> > Overall the suggestion looks good to me. I can give it a try :)
> >
> > I'm thinking of another approach to using bpf_sock_ops_cb_flags_set()
> > instead of bpf_setsockopt() when sockops like
> > BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB is triggered. I can modify the
> > bpf_sock_ops_cb_flags_set like this:
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index 58761263176c..001140067c1a 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -5770,14 +5770,25 @@ BPF_CALL_2(bpf_sock_ops_cb_flags_set, struct
> > bpf_sock_ops_kern *, bpf_sock,
> >             int, argval)
> >   {
> >          struct sock *sk = bpf_sock->sk;
> > -       int val = argval & BPF_SOCK_OPS_ALL_CB_FLAGS;
> > +       int val = argval;
> >
> > -       if (!IS_ENABLED(CONFIG_INET) || !sk_fullsock(sk))
> > +       if (!IS_ENABLED(CONFIG_INET))
> >                  return -EINVAL;
> >
> > -       tcp_sk(sk)->bpf_sock_ops_cb_flags = val;
> > +       if (sk_is_tcp(sk)) {
> > +               val = argval & BPF_SOCK_OPS_ALL_CB_FLAGS;
> > +               if (!sk_fullsock(sk))
> > +                       return -EINVAL;
> > +
> > +               tcp_sk(sk)->bpf_sock_ops_cb_flags = val;
> > +
> > +               val = argval & (~BPF_SOCK_OPS_ALL_CB_FLAGS);
> > +       } else {
> > +               sk->bpf_sock_ops_cb_flags = val;
>
> Why separate tcp vs non-tcp case? The tcp_sk(sk)->bpf_sock_ops_cb_flags
> is running out of bits anyway for tcp specific callback.
>
> just keep the SK_BPF_CB_{TX,RX}_TIEMSTAMPING in sk->sk_bpf_cb_flags
> for all tcp/udp/raw/...

Agreed!

>
> > +               val = argval &
> > (~(SK_BPF_CB_TX_TIEMSTAMPING|SK_BPF_CB_RX_TIEMSTAMPING));
>
> imo, we also don't need to return val to tell the caller what
> is not supported in the running kernel. The BPF CO-RE can
> handle this also, so less reason to keep extending the
> bpf_sock_ops_cb_flags_set API for non tcp.
>
> >>>> For datagrams (UDP as well as RAW and many non IP protocols), an
> >>>> alternative still needs to be found.
> >>
> >> In udp/raw/..., I don't know how likely is the user space having "cork->tx_flags
> >> & SKBTX_ANY_TSTAMP" set but has neither "READ_ONCE(sk->sk_tsflags) &
> >> SOF_TIMESTAMPING_OPT_ID" nor "cork->flags & IPCORK_TS_OPT_ID" set. If it is
> >> unlikely, may be we can just disallow bpf prog from directly setting
> >> skb_shinfo(skb)->tskey for this particular skb.
> >>
> >> For all other cases, in __ip[6]_append_data, directly call a bpf prog and also
> >> pass the kernel decided tskey to the bpf prog.
> >
> > I'm a bit confused here. IIUC, we need to support the tskey like what
> > we did in this series to handle non TCP cases?
>
> Like tcp, I don't think it really needs to use the sk->sk_tskey to mark the
> ID of a skb for the non tcp cases also. will comment on another thread.

Fine, I will let go of the tskey logic in v4.

Thanks,
Jason