Message-ID: <CAL+tcoCHsJ9KQf5w6TLHmQy9DrkhPHChRPQb=+9L_WKTTd8FQA@mail.gmail.com>
Date: Wed, 19 Feb 2025 14:29:47 +0800
From: Jason Xing <kerneljasonxing@...il.com>
To: Willem de Bruijn <willemdebruijn.kernel@...il.com>
Cc: Martin KaFai Lau <martin.lau@...ux.dev>, davem@...emloft.net, edumazet@...gle.com,
kuba@...nel.org, pabeni@...hat.com, dsahern@...nel.org, willemb@...gle.com,
ast@...nel.org, daniel@...earbox.net, andrii@...nel.org, eddyz87@...il.com,
song@...nel.org, yonghong.song@...ux.dev, john.fastabend@...il.com,
kpsingh@...nel.org, sdf@...ichev.me, haoluo@...gle.com, jolsa@...nel.org,
shuah@...nel.org, ykolal@...com, bpf@...r.kernel.org, netdev@...r.kernel.org
Subject: Re: [PATCH bpf-next v12 01/12] bpf: add networking timestamping
support to bpf_get/setsockopt()
On Wed, Feb 19, 2025 at 10:32 AM Willem de Bruijn
<willemdebruijn.kernel@...il.com> wrote:
>
> Jason Xing wrote:
> > On Wed, Feb 19, 2025 at 5:55 AM Martin KaFai Lau <martin.lau@...ux.dev> wrote:
> > >
> > > On 2/18/25 6:22 AM, Willem de Bruijn wrote:
> > > > Jason Xing wrote:
> > > >> The new SK_BPF_CB_FLAGS and new SK_BPF_CB_TX_TIMESTAMPING are
> > > >> added to bpf_get/setsockopt. The later patches will implement the
> > > >> BPF networking timestamping. The BPF program will use
> > > >> bpf_setsockopt(SK_BPF_CB_FLAGS, SK_BPF_CB_TX_TIMESTAMPING) to
> > > >> enable the BPF networking timestamping on a socket.
> > > >>
> > > >> Signed-off-by: Jason Xing <kerneljasonxing@...il.com>
> > > >> ---
> > > >> include/net/sock.h | 3 +++
> > > >> include/uapi/linux/bpf.h | 8 ++++++++
> > > >> net/core/filter.c | 23 +++++++++++++++++++++++
> > > >> tools/include/uapi/linux/bpf.h | 1 +
> > > >> 4 files changed, 35 insertions(+)
> > > >>
> > > >> diff --git a/include/net/sock.h b/include/net/sock.h
> > > >> index 8036b3b79cd8..7916982343c6 100644
> > > >> --- a/include/net/sock.h
> > > >> +++ b/include/net/sock.h
> > > >> @@ -303,6 +303,7 @@ struct sk_filter;
> > > >> * @sk_stamp: time stamp of last packet received
> > > >> * @sk_stamp_seq: lock for accessing sk_stamp on 32 bit architectures only
> > > >> * @sk_tsflags: SO_TIMESTAMPING flags
> > > >> + * @sk_bpf_cb_flags: used in bpf_setsockopt()
> > > >> * @sk_use_task_frag: allow sk_page_frag() to use current->task_frag.
> > > >> * Sockets that can be used under memory reclaim should
> > > >> * set this to false.
> > > >> @@ -445,6 +446,8 @@ struct sock {
> > > >> u32 sk_reserved_mem;
> > > >> int sk_forward_alloc;
> > > >> u32 sk_tsflags;
> > > >> +#define SK_BPF_CB_FLAG_TEST(SK, FLAG) ((SK)->sk_bpf_cb_flags & (FLAG))
> > > >> + u32 sk_bpf_cb_flags;
> > > >> __cacheline_group_end(sock_write_rxtx);
> > > >
> > > > So far only one bit is defined. Does this have to be a 32-bit field in
> > > > every socket?
> > >
> > > iirc, I think there were multiple callback (cb) flags/bits in the earlier
> > > revisions, but it had been simplified to one bit in the later revisions.
> > >
> > > It's an internal implementation detail. We can reuse some free bits from another
> > > variable for now. Probably get a bit from sk_tsflags? SOCKCM_FLAG_TS_OPT_ID uses
> > > BIT(31). Maybe a new SK_TS_FLAG_BPF_TX that uses BIT(30)? I don't have a strong
> > > preference on the name.
> > >
> > > When the BPF program calls bpf_setsockopt(SK_BPF_CB_FLAGS,
> > > SK_BPF_CB_TX_TIMESTAMPING), the kernel will set/test the BIT(30) of sk_tsflags.
> > >
> > > We can wait until there are more socket-level cb flags in the future (e.g., more
> > > SK_BPF_CB_XXX will be needed) before adding a dedicated int field in the sock.
> >
> > Sorry, I still preferred the way we've discussed already:
>
> Adding fields to structs in the hot path is a tragedy of the commons.
>
> Every developer focuses on their specific workload and pet feature,
> while imposing a cost on everyone else.
>
> We have a duty to be frugal and mitigate this cost where possible.
> Especially for a feature that is likely to be used sparingly.
Point taken; I have agreed with this all along. Don't get me wrong: what I
insisted on was keeping sk_bpf_cb_flags itself, not the size of the field.
Please see details below.
>
> > 1) Introducing a new field sk_bpf_cb_flags extends the use for bpf
> > timestamping, more than SK_BPF_CB_TX_TIMESTAMPING one flag. I think
> > SK_BPF_CB_RX_TIMESTAMPING is also needed in the next move. And more
> > subfeatures (like bpf extension for OPT_ID) will use it. It gives us a
> > separate way to do more things based on this bpf timestamping.
> > 2) sk_bpf_cb_flags provides a way to let the socket-level use new
> > features for bpf while now we only have a tcp_sock-level, namely,
> > bpf_sock_ops_cb_flags. It's obviously good for others.
> >
> > It's the first move to open the gate for socket-level usage for BPF,
>
> Can you give a short list of bits that you could see being used, to
> get an idea of the count. In my mind this is a very short list, not
> worth reserving 32 bits for. But you might have more developed plans.
RX is one flag that we're going to add. Another new bit might be used
for the bpf extension of OPT_ID to handle the UDP lockless case, or for
tracing every skb (which needs more formal discussion) - I'm not that
sure yet.
>
> > just like how TCP_BPF_SOCK_OPS_CB_FLAGS works in sol_tcp_sockopt(). So
> > I hope we will not abandon this good approach :(
> >
> > Now I wonder if I should use the u8 sk_bpf_cb_flags in V13 or just
> > keep it as-is? Either way is fine with me :) bpf_sock_ops_cb_flags
> > uses u8 as an example, thus I think we prefer the former?
>
> If it fits in a u8 and that in practice also results in less memory
> and cache pressure (i.e., does not just add a 24b hole), then it is a
> net improvement.
Probably I didn't state it clearly. I agree with you on saving memory :)
In the previous response, I was trying to say that we should keep the
sk_bpf_cb_flags field but shrink it to a u8. I admit u32 is too large,
as you noticed.
Would the following diff on top of this series be acceptable to you?
And is this a proper place to put the u8 sk_bpf_cb_flags in struct
sock?
diff --git a/include/net/sock.h b/include/net/sock.h
index 6f4d54faba92..e85d6fb3a2ba 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -447,7 +447,7 @@ struct sock {
 	int			sk_forward_alloc;
 	u32			sk_tsflags;
 #define SK_BPF_CB_FLAG_TEST(SK, FLAG) ((SK)->sk_bpf_cb_flags & (FLAG))
-	u32			sk_bpf_cb_flags;
+	u8			sk_bpf_cb_flags;
 	__cacheline_group_end(sock_write_rxtx);

 	__cacheline_group_begin(sock_write_tx);
The following output is the result of running 'pahole --hex -C sock vmlinux'.
Before this series:

    u32                sk_tsflags;                                /* 0x168  0x4 */
    __u8               __cacheline_group_end__sock_write_rxtx[0]; /* 0x16c    0 */
    __u8               __cacheline_group_begin__sock_write_tx[0]; /* 0x16c    0 */
    int                sk_write_pending;                          /* 0x16c  0x4 */
    atomic_t           sk_omem_alloc;                             /* 0x170  0x4 */
    int                sk_sndbuf;                                 /* 0x174  0x4 */
    int                sk_wmem_queued;                            /* 0x178  0x4 */
    refcount_t         sk_wmem_alloc;                             /* 0x17c  0x4 */
    /* --- cacheline 6 boundary (384 bytes) --- */
    long unsigned int  sk_tsq_flags;                              /* 0x180  0x8 */
    ...
    /* sum members: 773, holes: 1, sum holes: 1 */
After this diff patch:

    u32                sk_tsflags;                                /* 0x168  0x4 */
    u8                 sk_bpf_cb_flags;                           /* 0x16c  0x1 */
    __u8               __cacheline_group_end__sock_write_rxtx[0]; /* 0x16d    0 */
    __u8               __cacheline_group_begin__sock_write_tx[0]; /* 0x16d    0 */
    /* XXX 3 bytes hole, try to pack */
    int                sk_write_pending;                          /* 0x170  0x4 */
    atomic_t           sk_omem_alloc;                             /* 0x174  0x4 */
    int                sk_sndbuf;                                 /* 0x178  0x4 */
    int                sk_wmem_queued;                            /* 0x17c  0x4 */
    /* --- cacheline 6 boundary (384 bytes) --- */
    refcount_t         sk_wmem_alloc;                             /* 0x180  0x4 */
    /* XXX 4 bytes hole, try to pack */
    long unsigned int  sk_tsq_flags;                              /* 0x188  0x8 */
    ...
    /* sum members: 774, holes: 3, sum holes: 8 */
With this series plus the u8 change applied, the total hole size grows
from 1 byte to 8 bytes (7 extra). I still think this is the proper
position, because the new sk_bpf_cb_flags will be used in both the tx
and rx paths just like sk_tsflags, in line with the grouping rules
introduced by commit [1].
I'd like to hear your opinion here. Hopefully we can get this series
applied after the next re-spin :)
Thanks in advance.
[1]
commit 5d4cc87414c5d11345c4b11d61377d351b5c28a2
Author: Eric Dumazet <edumazet@...gle.com>
Date: Fri Feb 16 16:20:06 2024 +0000
net: reorganize "struct sock" fields
Last major reorg happened in commit 9115e8cd2a0c ("net: reorganize
struct sock for better data locality")
Since then, many changes have been done.
Before SO_PEEK_OFF support is added to TCP, we need
to move sk_peek_off to a better location.
It is time to make another pass, and add six groups,
without explicit alignment.
- sock_write_rx (following sk_refcnt) read-write fields in rx path.
- sock_read_rx read-mostly fields in rx path.
- sock_read_rxtx read-mostly fields in both rx and tx paths.
- sock_write_rxtx read-write fields in both rx and tx paths.
- sock_write_tx read-write fields in tx paths.
- sock_read_tx read-mostly fields in tx paths.
Thanks,
Jason