Message-ID: <CANn89iKzwxgzX7-TAqjN5np8fksVM=qq1A0rFRdNKWjYJYWLaA@mail.gmail.com>
Date: Fri, 5 Apr 2024 11:11:20 +0200
From: Eric Dumazet <edumazet@...gle.com>
To: Pavel Begunkov <asml.silence@...il.com>
Cc: netdev@...r.kernel.org, davem@...emloft.net, dsahern@...nel.org, 
	pabeni@...hat.com, kuba@...nel.org
Subject: Re: [PATCH net-next v2] net: enable SOCK_NOSPACE for UDP

On Fri, Apr 5, 2024 at 1:37 AM Pavel Begunkov <asml.silence@...il.com> wrote:
>
> wake_up_poll() and variants can be expensive even if they don't actually
> wake anything up, as they involve disabling irqs, taking a spinlock and
> walking the poll list, which is fraught with cache bounces. That can
> happen when someone waits for POLLOUT, or even POLLIN since the
> waitqueue is shared, even though we should be able to skip these
> false-positive calls when the tx queue is not full.
>
> Add support for SOCK_NOSPACE for UDP sockets. The udp_poll() change is
> straightforward and follows tcp_poll() and others. For sock_wfree(),
> however, it is done as an optional feature gated by the
> SOCK_NOSPACE_SUPPORTED flag, because the feature requires support from
> the corresponding poll handler, and many users of sock_wfree() might
> not be prepared for it.
>
> Note that this optimises the sock_wfree() path but not
> sock_def_write_space(). That is fine because the latter only leads to
> more false-positive wake-ups, which is tolerable and not performance
> critical.
>
> It yields a +5% throughput improvement in a CPU-bound, tx-only
> io_uring based benchmark and 0.5-3% in more realistic workloads.
>
> Signed-off-by: Pavel Begunkov <asml.silence@...il.com>
> ---
>  include/net/sock.h |  1 +
>  net/core/sock.c    |  5 +++++
>  net/ipv4/udp.c     | 15 ++++++++++++++-
>  3 files changed, 20 insertions(+), 1 deletion(-)
>
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 2253eefe2848..027a398471c4 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -944,6 +944,7 @@ enum sock_flags {
>         SOCK_XDP, /* XDP is attached */
>         SOCK_TSTAMP_NEW, /* Indicates 64 bit timestamps always */
>         SOCK_RCVMARK, /* Receive SO_MARK  ancillary data with packet */
> +       SOCK_NOSPACE_SUPPORTED, /* socket supports the SOCK_NOSPACE flag */
>  };
>
>  #define SK_FLAGS_TIMESTAMP ((1UL << SOCK_TIMESTAMP) | (1UL << SOCK_TIMESTAMPING_RX_SOFTWARE))
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 5ed411231fc7..e4f486e9296a 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -3393,6 +3393,11 @@ static void sock_def_write_space_wfree(struct sock *sk)
>
>                 /* rely on refcount_sub from sock_wfree() */
>                 smp_mb__after_atomic();
> +
> +               if (sock_flag(sk, SOCK_NOSPACE_SUPPORTED) &&
> +                   !test_bit(SOCK_NOSPACE, &sk->sk_socket->flags))
> +                       return;
> +
>                 if (wq && waitqueue_active(&wq->wait))
>                         wake_up_interruptible_sync_poll(&wq->wait, EPOLLOUT |
>                                                 EPOLLWRNORM | EPOLLWRBAND);
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 11460d751e73..309fa96e9020 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -342,6 +342,7 @@ int udp_lib_get_port(struct sock *sk, unsigned short snum,
>                 hslot2->count++;
>                 spin_unlock(&hslot2->lock);
>         }
> +       sock_set_flag(sk, SOCK_NOSPACE_SUPPORTED);
>         sock_set_flag(sk, SOCK_RCU_FREE);
>         error = 0;
>  fail_unlock:
> @@ -2885,8 +2886,20 @@ __poll_t udp_poll(struct file *file, struct socket *sock, poll_table *wait)
>         /* psock ingress_msg queue should not contain any bad checksum frames */
>         if (sk_is_readable(sk))
>                 mask |= EPOLLIN | EPOLLRDNORM;
> -       return mask;
>
> +       if (!sock_writeable(sk)) {

I think there is a race here that you could avoid by using

if (!(mask & (EPOLLOUT | EPOLLWRNORM | EPOLLWRBAND))) {

(We already called datagram_poll() at the beginning of udp_poll(). If it
sampled the socket as not writeable but the socket has become writeable
by the time of this fresh sock_writeable() check, we would skip the
branch, return a mask without EPOLLOUT, and never set SOCK_NOSPACE, so
the wakeup could be lost.)

> +               set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
> +
> +               /* Order with the wspace read so either we observe it
> +                * writeable or udp_sock_wfree() would find SOCK_NOSPACE and
> +                * wake us up.
> +                */
> +               smp_mb__after_atomic();
> +
> +               if (sock_writeable(sk))
> +                       mask |= EPOLLOUT | EPOLLWRNORM | EPOLLWRBAND;
> +       }
> +       return mask;
>  }
>  EXPORT_SYMBOL(udp_poll);
>
> --
> 2.44.0
>
