netdev - Re: [RFC PATCH 0/2] udp: avoid false sharing on sk

Message-ID: <CANn89iLaDEjuDAE-Bupi4iDjt4wa90NA8bRjH8_0qWOQpHJ98Q@mail.gmail.com>
Date: Mon, 10 Feb 2025 17:37:24 +0100
From: Eric Dumazet <edumazet@...gle.com>
To: Paolo Abeni <pabeni@...hat.com>
Cc: Willem de Bruijn <willemdebruijn.kernel@...il.com>, netdev@...r.kernel.org, 
	Kuniyuki Iwashima <kuniyu@...zon.com>, "David S. Miller" <davem@...emloft.net>, 
	Jakub Kicinski <kuba@...nel.org>, Simon Horman <horms@...nel.org>, Neal Cardwell <ncardwell@...gle.com>, 
	David Ahern <dsahern@...nel.org>
Subject: Re: [RFC PATCH 0/2] udp: avoid false sharing on sk_tsflags

On Mon, Feb 10, 2025 at 5:16 PM Paolo Abeni <pabeni@...hat.com> wrote:
>
> On 2/10/25 4:13 PM, Eric Dumazet wrote:
> > On Mon, Feb 10, 2025 at 5:00 AM Willem de Bruijn
> > <willemdebruijn.kernel@...il.com> wrote:
> >>
> >> Paolo Abeni wrote:
> >>> While benchmarking the recently shared page frag revert, I observed a
> >>> lot of cache misses in the UDP RX path due to false sharing between the
> >>> sk_tsflags and the sk_forward_alloc sk fields.
> >>>
> >>> Here comes a solution attempt for such a problem, inspired by commit
> >>> f796feabb9f5 ("udp: add local "peek offset enabled" flag").
> >>>
> >>> The first patch adds a new proto op allowing protocol specific operation
> >>> on tsflags updates, and the 2nd one leverages such operation to cache
> >>> the problematic field in a cache friendly manner.
> >>>
> >>> The need for a new operation is possibly suboptimal, hence the RFC tag,
> >>> but I could not find other good solutions. I considered:
> >>> - moving the sk_tsflags just before 'sk_policy', in the 'sock_read_rxtx'
> >>>   group. It arguably belongs to such group, but the change would create
> >>>   a couple of holes, increasing the 'struct sock' size and would have
> >>>   side effects on other protocols
> >>> - moving the sk_tsflags just before 'sk_stamp'; similar to the above,
> >>>   would possibly reduce the side effects, as most of 'struct sock'
> >>>   layout will be unchanged. Could increase the number of cache lines
> >>>   accessed in the TX path.
> >>>
> >>> I opted for the present solution as it should minimize the side effects
> >>> to other protocols.
> >>
> >> The code looks solid at a high level to me.
> >>
> >> But if the issue can be addressed by just moving a field, that is
> >> quite appealing. So I have not reviewed closely yet.
> >>
> >
> > sk_tsflags has not been put in an optimal group, I would indeed move it,
> > even if this creates one hole.
> >
> > Holes tend to be used quite fast anyway with new fields.
> >
> > Perhaps sock_read_tx group would be the best location,
> > because tcp_recv_timestamp() is not called in the fast path.
>
> Just to wrap my head around the above reasoning: for UDP such a change could
> possibly increase the number of `struct sock` cache lines accessed in the
> RX path (the `sock_write_tx` group should not be touched otherwise) but
> that will not matter much, because we expect a low number of UDP sockets
> in the system, right?

Are you referring to UDP applications needing timestamps?

Because sk_tsflags is mostly used in TX.

We have not seen this issue because commit 97dc7cd92ac67f6e05 ("ptp: Support
late timestamp determination") was not in our kernels at that time.

Perhaps we could change netdev_get_tstamp() so that we read sk->sk_tsflags
only when really needed?

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 5429581f22995bff639e6962a317adbd0ce30cff..848b70fb116421bf02159a53524a0700b87e851a 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -5103,18 +5103,6 @@ static inline void netdev_rx_csum_fault(struct net_device *dev,
 void net_enable_timestamp(void);
 void net_disable_timestamp(void);

-static inline ktime_t netdev_get_tstamp(struct net_device *dev,
-                                       const struct skb_shared_hwtstamps *hwtstamps,
-                                       bool cycles)
-{
-       const struct net_device_ops *ops = dev->netdev_ops;
-
-       if (ops->ndo_get_tstamp)
-               return ops->ndo_get_tstamp(dev, hwtstamps, cycles);
-
-       return hwtstamps->hwtstamp;
-}
-
 #ifndef CONFIG_PREEMPT_RT
 static inline void netdev_xmit_set_more(bool more)
 {
diff --git a/net/socket.c b/net/socket.c
index 262a28b59c7f0f760fd29e207f270e65150abec8..6dc52c72fccd22f25c6e90d68de491863dc23689 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -799,9 +799,22 @@ static bool skb_is_swtx_tstamp(const struct sk_buff *skb, int false_tstamp)
        return skb->tstamp && !false_tstamp && skb_is_err_queue(skb);
 }

+static ktime_t netdev_get_tstamp(struct net_device *dev,
+                                const struct skb_shared_hwtstamps *hwtstamps,
+                                struct sock *sk)
+{
+       const struct net_device_ops *ops = dev->netdev_ops;
+
+       if (ops->ndo_get_tstamp) {
+               bool cycles = READ_ONCE(sk->sk_tsflags) & SOF_TIMESTAMPING_BIND_PHC;
+
+               return ops->ndo_get_tstamp(dev, hwtstamps, cycles);
+       }
+       return hwtstamps->hwtstamp;
+}
+
 static ktime_t get_timestamp(struct sock *sk, struct sk_buff *skb, int *if_index)
 {
-       bool cycles = READ_ONCE(sk->sk_tsflags) & SOF_TIMESTAMPING_BIND_PHC;
        struct skb_shared_hwtstamps *shhwtstamps = skb_hwtstamps(skb);
        struct net_device *orig_dev;
        ktime_t hwtstamp;
@@ -810,7 +823,7 @@ static ktime_t get_timestamp(struct sock *sk, struct sk_buff *skb, int *if_index
        orig_dev = dev_get_by_napi_id(skb_napi_id(skb));
        if (orig_dev) {
                *if_index = orig_dev->ifindex;
-               hwtstamp = netdev_get_tstamp(orig_dev, shhwtstamps, cycles);
+               hwtstamp = netdev_get_tstamp(orig_dev, shhwtstamps, sk);
        } else {
                hwtstamp = shhwtstamps->hwtstamp;
        }
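
The idea in the diff above (load sk->sk_tsflags only when the device
actually provides ndo_get_tstamp) can be modelled in userspace roughly as
follows. This is only a sketch under stated assumptions: every model_*
name, the MODEL_BIND_PHC value and the read counter are hypothetical
stand-ins invented for illustration, not kernel definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a timestamping flag bit; value is arbitrary. */
#define MODEL_BIND_PHC (1u << 15)

struct model_hwtstamps {
	int64_t hwtstamp;
};

struct model_sock {
	uint32_t tsflags;
	unsigned int tsflags_reads;	/* instrumentation: counts loads of tsflags */
};

struct model_dev;
typedef int64_t (*model_get_tstamp_t)(struct model_dev *dev,
				      const struct model_hwtstamps *hw,
				      bool cycles);

struct model_dev {
	model_get_tstamp_t get_tstamp;	/* NULL when the driver has no such op */
};

/* Every load of the flags goes through here so the model can count them. */
static uint32_t model_read_tsflags(struct model_sock *sk)
{
	sk->tsflags_reads++;
	return sk->tsflags;
}

/* Mirrors the shape of the proposed helper: flags are read lazily. */
static int64_t model_get_tstamp(struct model_dev *dev,
				const struct model_hwtstamps *hw,
				struct model_sock *sk)
{
	assert(dev && hw && sk);

	if (dev->get_tstamp) {
		bool cycles = model_read_tsflags(sk) & MODEL_BIND_PHC;

		return dev->get_tstamp(dev, hw, cycles);
	}
	/* Common case: the socket flags are never touched. */
	return hw->hwtstamp;
}

/* Hypothetical driver op: returns a different value in "cycles" mode. */
static int64_t model_driver_get_tstamp(struct model_dev *dev,
				       const struct model_hwtstamps *hw,
				       bool cycles)
{
	(void)dev;
	return cycles ? hw->hwtstamp + 1000 : hw->hwtstamp;
}
```

Calling model_get_tstamp() on a device whose get_tstamp is NULL leaves
tsflags_reads at zero, which is the property the change is after: the
RX path stops pulling in the cache line holding the flags unless a
driver op needs them.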


>
> Side note: FWIW I think we will have 2 holes, 4 bytes each, one after
> `sk_forward_alloc` and another one after `sk_mark`.
>
> I missed that explicit alignment of the `tcp_sock_write_tx` group; that
> will prevent the overall growth of `struct tcp_sock`, and will avoid bad
> side effects while changing the struct layout.
>
> I expect the change you propose would perform like the RFC patches, but
> I'll try to do an explicit test later (and report the results here).
>
> Thanks,
>
> Paolo
>
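
For readers following the layout discussion, the trade-off between false
sharing and holes can be illustrated with a small userspace sketch. The
two structs below are hypothetical and much simpler than the real
struct sock; the 64-byte line size and the assumption that each struct
starts on a cache-line boundary are also mine.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_CACHELINE 64	/* assumed cache-line size in bytes */

/* Hypothetical layout: a hot written field next to a read-mostly one. */
struct model_shared {
	int forward_alloc;	/* written on every received packet */
	unsigned int tsflags;	/* read-mostly */
};

/* Same fields, each forced onto its own line; this creates padding holes. */
struct model_split {
	_Alignas(MODEL_CACHELINE) int forward_alloc;
	_Alignas(MODEL_CACHELINE) unsigned int tsflags;
};

/* True when two field offsets fall in the same cache line, assuming the
 * enclosing struct itself starts on a cache-line boundary. */
static bool same_cacheline(size_t off_a, size_t off_b)
{
	return off_a / MODEL_CACHELINE == off_b / MODEL_CACHELINE;
}
```

Feeding same_cacheline() the offsetof() values of the two fields reports
sharing for model_shared and none for model_split, at the cost of
model_split growing to two full cache lines. On a real kernel build,
pahole shows the actual offsets and holes.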