[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <a00cb120e5224a20931dcba10987cc80@AcuMS.aculab.com>
Date: Thu, 30 Nov 2023 09:04:36 +0000
From: David Laight <David.Laight@...LAB.COM>
To: 'Jakub Sitnicki' <jakub@...udflare.com>
CC: "netdev@...r.kernel.org" <netdev@...r.kernel.org>, 'Jakub Kicinski'
<kuba@...nel.org>, "David S. Miller" <davem@...emloft.net>, Stephen Hemminger
<stephen@...workplumber.org>, Eric Dumazet <edumazet@...gle.com>, "'David
Ahern'" <dsahern@...nel.org>, Paolo Abeni <pabeni@...hat.com>
Subject: RE: [PATCH net-next] ipv4: Use READ/WRITE_ONCE() for IP
local_port_range
From: Jakub Sitnicki
> Sent: 29 November 2023 20:17
>
> Hey David,
>
> On Wed, Nov 29, 2023 at 07:26 PM GMT, David Laight wrote:
> > Commit 227b60f5102cd added a seqlock to ensure that the low and high
> > port numbers were always updated together.
> > This is overkill because the two 16bit port numbers can be held in
> > a u32 and read/written in a single instruction.
> >
> > More recently 91d0b78c5177f added support for finer per-socket limits.
> > The user-supplied value is 'high << 16 | low' but they are held
> > separately and the socket options protected by the socket lock.
> >
> > Use a u32 containing 'high << 16 | low' for both the 'net' and 'sk'
> > fields and use READ_ONCE()/WRITE_ONCE() to ensure both values are
> > always updated together.
> >
> > Change (the now trival) inet_get_local_port_range() to a static inline
> > to optimise the calling code.
> > (In particular avoiding returning integers by reference.)
> >
> > Signed-off-by: David Laight <david.laight@...lab.com>
> > ---
>
> Regarding the per-socket changes - we don't expect contention on sock
> lock between inet_stream_connect / __inet_bind, where we grab it and
> eventually call inet_sk_get_local_port_range, and sockopt handlers, do
> we?
>
> The motivation is not super clear for me for that of the changes.
The locking in the getsockopt() code is actually quite horrid.
Look at the conditionals for the rntl lock.
It is conditionally acquired based on a function that sets a flag,
but released based on the exit path from the switch statement.
But there are only two options that need the rtnl lock and the socket
lock, and two trivial ones (including this one) that need the socket
lock.
So the code can be simplified by moving the locking into the case branches.
With only 2 such cases the overhead will be minimal (compared the to
setsockopt() case where a lot of options need locking.
This is all part of a big patchset I'm trying to write that converts
all the setsockopt code to take an 'unsigned int optlen' parameter
and return the length to pass back to the caller.
So the copy_to_user() of the updated length is done by the syscall
stub rather than inside every separate function (and sometimes in
multiple places in a function).
After all, if the copy fails EFAULT the application is broken.
It really doesn't matter if any side effects have happened.
If you get a fault reading from a pipe the data is lost.
>
> > include/net/inet_sock.h | 5 +----
> > include/net/ip.h | 7 ++++++-
> > include/net/netns/ipv4.h | 3 +--
> > net/ipv4/af_inet.c | 4 +---
> > net/ipv4/inet_connection_sock.c | 29 ++++++++++------------------
> > net/ipv4/ip_sockglue.c | 34 ++++++++++++++++-----------------
> > net/ipv4/sysctl_net_ipv4.c | 12 ++++--------
> > 7 files changed, 40 insertions(+), 54 deletions(-)
> >
> > diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
> > index 74db6d97cae1..ebf71410aa2b 100644
> > --- a/include/net/inet_sock.h
> > +++ b/include/net/inet_sock.h
> > @@ -234,10 +234,7 @@ struct inet_sock {
> > int uc_index;
> > int mc_index;
> > __be32 mc_addr;
> > - struct {
> > - __u16 lo;
> > - __u16 hi;
> > - } local_port_range;
> > + u32 local_port_range;
>
> Nit: This field would benefit from a similar comment as you have added to
> local_ports.range ("/* high << 16 | low */"), now that it is no longer
> obvious how to interpret the contents.
>
> >
> > struct ip_mc_socklist __rcu *mc_list;
> > struct inet_cork_full cork;
>
> [...]
>
> > diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> > index 394a498c2823..1a45d41f8b39 100644
> > --- a/net/ipv4/inet_connection_sock.c
> > +++ b/net/ipv4/inet_connection_sock.c
> > @@ -117,34 +117,25 @@ bool inet_rcv_saddr_any(const struct sock *sk)
>
> [...]
>
> > void inet_sk_get_local_port_range(const struct sock *sk, int *low, int *high)
> > {
> > const struct inet_sock *inet = inet_sk(sk);
> > const struct net *net = sock_net(sk);
> > int lo, hi, sk_lo, sk_hi;
> > + u32 sk_range;
> >
> > inet_get_local_port_range(net, &lo, &hi);
> >
> > - sk_lo = inet->local_port_range.lo;
> > - sk_hi = inet->local_port_range.hi;
> > + sk_range = READ_ONCE(inet->local_port_range);
> > + if (unlikely(sk_range)) {
> > + sk_lo = sk_range & 0xffff;
> > + sk_hi = sk_range >> 16;
> >
> > - if (unlikely(lo <= sk_lo && sk_lo <= hi))
> > - lo = sk_lo;
> > - if (unlikely(lo <= sk_hi && sk_hi <= hi))
> > - hi = sk_hi;
> > + if (unlikely(lo <= sk_lo && sk_lo <= hi))
> > + lo = sk_lo;
> > + if (unlikely(lo <= sk_hi && sk_hi <= hi))
> > + hi = sk_hi;
> > + }
>
> Actually when we know that sk_range is set, the above two branches
> become likely. It will be usually so that the set per-socket port range
> narrows down the per-netns port range.
True, I'd left them alone, but would also flip the first conditional.
I can edit the patch :-)
David
>
> These checks exist only in case the per-netns port range has been
> reconfigured after per-socket port range has been set. The per-netns one
> always takes precedence.
>
> >
> > *low = lo;
> > *high = hi;
>
> [...]
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Powered by blists - more mailing lists