Message-ID: <CANn89iLaDEjuDAE-Bupi4iDjt4wa90NA8bRjH8_0qWOQpHJ98Q@mail.gmail.com>
Date: Mon, 10 Feb 2025 17:37:24 +0100
From: Eric Dumazet <edumazet@...gle.com>
To: Paolo Abeni <pabeni@...hat.com>
Cc: Willem de Bruijn <willemdebruijn.kernel@...il.com>, netdev@...r.kernel.org,
Kuniyuki Iwashima <kuniyu@...zon.com>, "David S. Miller" <davem@...emloft.net>,
Jakub Kicinski <kuba@...nel.org>, Simon Horman <horms@...nel.org>, Neal Cardwell <ncardwell@...gle.com>,
David Ahern <dsahern@...nel.org>
Subject: Re: [RFC PATCH 0/2] udp: avoid false sharing on sk_tsflags
On Mon, Feb 10, 2025 at 5:16 PM Paolo Abeni <pabeni@...hat.com> wrote:
>
> On 2/10/25 4:13 PM, Eric Dumazet wrote:
> > On Mon, Feb 10, 2025 at 5:00 AM Willem de Bruijn
> > <willemdebruijn.kernel@...il.com> wrote:
> >>
> >> Paolo Abeni wrote:
> >>> While benchmarking the recently shared page frag revert, I observed a
> >>> lot of cache misses in the UDP RX path due to false sharing between the
> >>> sk_tsflags and the sk_forward_alloc sk fields.
> >>>
> >>> Here comes a solution attempt for such a problem, inspired by commit
> >>> f796feabb9f5 ("udp: add local "peek offset enabled" flag").
> >>>
> >>> The first patch adds a new proto op allowing protocol specific operation
> >>> on tsflags updates, and the 2nd one leverages such operation to cache
> >>> the problematic field in a cache friendly manner.
> >>>
> >>> The need for a new operation is possibly suboptimal, hence the RFC tag,
> >>> but I could not find other good solutions. I considered:
> >>> - moving the sk_tsflags just before 'sk_policy', in the 'sock_read_rxtx'
> >>>   group. It arguably belongs to such a group, but the change would create
> >>>   a couple of holes, increasing the 'struct sock' size, and would have
> >>>   side effects on other protocols
> >>> - moving the sk_tsflags just before 'sk_stamp'; similar to the above, but
> >>>   it would possibly reduce the side effects, as most of the 'struct sock'
> >>>   layout would be unchanged. Could increase the number of cachelines
> >>>   accessed in the TX path.
> >>>
> >>> I opted for the present solution as it should minimize the side effects
> >>> to other protocols.
> >>
> >> The code looks solid at a high level to me.
> >>
> >> But if the issue can be addressed by just moving a field, that is
> >> quite appealing. So I have not reviewed closely yet.
> >>
> >
> > sk_tsflags has not been put in an optimal group, I would indeed move it,
> > even if this creates one hole.
> >
> > Holes tend to be used quite fast anyway with new fields.
> >
> > Perhaps sock_read_tx group would be the best location,
> > because tcp_recv_timestamp() is not called in the fast path.
>
> Just to wrap my head around the above reasoning: for UDP such a change
> could possibly increase the number of `struct sock` cache lines accessed
> in the RX path (the `sock_write_tx` group should not be touched
> otherwise), but that will not matter much, because we expect a low
> number of UDP sockets in the system, right?
Are you referring to UDP applications needing timestamps?
Because sk_tsflags is almost always used in TX.

We have not seen this issue because commit 97dc7cd92ac67f6e05 ("ptp:
Support late timestamp determination") was not in our kernels at that
time.

Perhaps we could change netdev_get_tstamp() so that we read
sk->sk_tsflags only when really needed?
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 5429581f22995bff639e6962a317adbd0ce30cff..848b70fb116421bf02159a53524a0700b87e851a 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -5103,18 +5103,6 @@ static inline void netdev_rx_csum_fault(struct net_device *dev,
 void net_enable_timestamp(void);
 void net_disable_timestamp(void);
 
-static inline ktime_t netdev_get_tstamp(struct net_device *dev,
-					const struct skb_shared_hwtstamps *hwtstamps,
-					bool cycles)
-{
-	const struct net_device_ops *ops = dev->netdev_ops;
-
-	if (ops->ndo_get_tstamp)
-		return ops->ndo_get_tstamp(dev, hwtstamps, cycles);
-
-	return hwtstamps->hwtstamp;
-}
-
 #ifndef CONFIG_PREEMPT_RT
 static inline void netdev_xmit_set_more(bool more)
 {
diff --git a/net/socket.c b/net/socket.c
index 262a28b59c7f0f760fd29e207f270e65150abec8..6dc52c72fccd22f25c6e90d68de491863dc23689 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -799,9 +799,22 @@ static bool skb_is_swtx_tstamp(const struct sk_buff *skb, int false_tstamp)
 	return skb->tstamp && !false_tstamp && skb_is_err_queue(skb);
 }
 
+static ktime_t netdev_get_tstamp(struct net_device *dev,
+				 const struct skb_shared_hwtstamps *hwtstamps,
+				 struct sock *sk)
+{
+	const struct net_device_ops *ops = dev->netdev_ops;
+
+	if (ops->ndo_get_tstamp) {
+		bool cycles = READ_ONCE(sk->sk_tsflags) & SOF_TIMESTAMPING_BIND_PHC;
+
+		return ops->ndo_get_tstamp(dev, hwtstamps, cycles);
+	}
+	return hwtstamps->hwtstamp;
+}
+
 static ktime_t get_timestamp(struct sock *sk, struct sk_buff *skb,
 			     int *if_index)
 {
-	bool cycles = READ_ONCE(sk->sk_tsflags) & SOF_TIMESTAMPING_BIND_PHC;
 	struct skb_shared_hwtstamps *shhwtstamps = skb_hwtstamps(skb);
 	struct net_device *orig_dev;
 	ktime_t hwtstamp;
@@ -810,7 +823,7 @@ static ktime_t get_timestamp(struct sock *sk, struct sk_buff *skb, int *if_index)
 	orig_dev = dev_get_by_napi_id(skb_napi_id(skb));
 	if (orig_dev) {
 		*if_index = orig_dev->ifindex;
-		hwtstamp = netdev_get_tstamp(orig_dev, shhwtstamps, cycles);
+		hwtstamp = netdev_get_tstamp(orig_dev, shhwtstamps, sk);
 	} else {
 		hwtstamp = shhwtstamps->hwtstamp;
 	}
>
> Side note: FWIW I think we will have 2 holes, 4 bytes each, one after
> `sk_forward_alloc` and another one after `sk_mark`.
>
> I missed the explicit alignment of the `tcp_sock_write_tx` group; that
> will prevent the overall growth of `struct tcp_sock`, and will avoid bad
> side effects while changing the struct layout.
>
> I expect the change you propose would perform on par with the RFC
> patches, but I'll try to run an explicit test later (and report the
> results here).
>
> Thanks,
>
> Paolo
>