lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <CANn89i+C_mQmTFsqKb3geRADET2ELWeZ=0QHdvuq+v+PKtW0AQ@mail.gmail.com>
Date: Fri, 16 Feb 2024 10:21:54 +0100
From: Eric Dumazet <edumazet@...gle.com>
To: Paolo Abeni <pabeni@...hat.com>
Cc: Jon Maloy <jmaloy@...hat.com>, kuba@...nel.org, passt-dev@...st.top, 
	sbrivio@...hat.com, lvivier@...hat.com, dgibson@...hat.com, 
	netdev@...r.kernel.org, davem@...emloft.net
Subject: Re: [PATCH v3] tcp: add support for SO_PEEK_OFF

On Fri, Feb 16, 2024 at 10:14 AM Paolo Abeni <pabeni@...hat.com> wrote:
>
> On Thu, 2024-02-15 at 17:24 -0500, Jon Maloy wrote:
> >
> > On 2024-02-15 12:46, Eric Dumazet wrote:
> > > On Thu, Feb 15, 2024 at 6:41 PM Paolo Abeni <pabeni@...hat.com> wrote:
> > > > Note: please send text-only email to netdev.
> > > >
> > > > On Thu, 2024-02-15 at 10:11 -0500, Jon Maloy wrote:
> > > > > I wonder if the following could be acceptable:
> > > > >
> > > > >   if (flags & MSG_PEEK)
> > > > >          sk_peek_offset_fwd(sk, used);
> > > > >   else if (peek_offset > 0)
> > > > >         sk_peek_offset_bwd(sk, used);
> > > > >
> > > > >   peek_offset is already present in the data cache, and if it has the value
> > > > >   zero it means either that that sk->sk_peek_off is unused (-1) or actually is zero.
> > > > >   Either way, no rewind is needed in that case.
> > > > I agree the above should avoid touching cold cachelines in the
> > > > fastpath, and looks functionally correct to me.
> > > >
> > > > The last word is up to Eric :)
> > > >
> > > An actual patch seems needed.
> > >
> > > In the current form, local variable peek_offset is 0 when !MSG_PEEK.
> > >
> > > So the "else if (peek_offset > 0)" would always be false.
> > >
> > Yes, of course. This wouldn't work unless we read sk->sk_peek_off at the
> > beginning of the function.
> > I will look at the other suggestions.
>
> I *think* that moving sk_peek_off this way:
>
> ---
> diff --git a/include/net/sock.h b/include/net/sock.h
> index a9d99a9c583f..576a6a6abb03 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -413,7 +413,7 @@ struct sock {
>         unsigned int            sk_napi_id;
>  #endif
>         int                     sk_rcvbuf;
> -       int                     sk_disconnects;
> +       int                     sk_peek_off;
>
>         struct sk_filter __rcu  *sk_filter;
>         union {
> @@ -439,7 +439,7 @@ struct sock {
>                 struct rb_root  tcp_rtx_queue;
>         };
>         struct sk_buff_head     sk_write_queue;
> -       __s32                   sk_peek_off;
> +       int                     sk_disconnects;
>         int                     sk_write_pending;
>         __u32                   sk_dst_pending_confirm;
>         u32                     sk_pacing_status; /* see enum sk_pacing */
> ---
>
> should avoid problematic accesses,
>
> The relevant cachelines layout is as follow:
>
>                         /* --- cacheline 4 boundary (256 bytes) --- */
>                 struct sk_buff *   tail;                 /*   256     8 */
>         } sk_backlog;                                    /*   240    24 */
>         int                        sk_forward_alloc;     /*   264     4 */
>         u32                        sk_reserved_mem;      /*   268     4 */
>         unsigned int               sk_ll_usec;           /*   272     4 */
>         unsigned int               sk_napi_id;           /*   276     4 */
>         int                        sk_rcvbuf;            /*   280     4 */
>         int                        sk_disconnects;       /*   284     4 */
>                                 // will become sk_peek_off
>         struct sk_filter *         sk_filter;            /*   288     8 */
>         union {
>                 struct socket_wq * sk_wq;                /*   296     8 */
>                 struct socket_wq * sk_wq_raw;            /*   296     8 */
>         };                                               /*   296     8 */
>         struct xfrm_policy *       sk_policy[2];         /*   304    16 */
>         /* --- cacheline 5 boundary (320 bytes) --- */
>
>         //  ...
>
>         /* --- cacheline 6 boundary (384 bytes) --- */
>         __s32                      sk_peek_off;          /*   384     4 */
>                                 // will become sk_diconnects
>         int                        sk_write_pending;     /*   388     4 */
>         __u32                      sk_dst_pending_confirm; /*   392     4 */
>         u32                        sk_pacing_status;     /*   396     4 */
>         long int                   sk_sndtimeo;          /*   400     8 */
>         struct timer_list          sk_timer;             /*   408    40 */
>
>         /* XXX last struct has 4 bytes of padding */
>
>         /* --- cacheline 7 boundary (448 bytes) --- */
>
> sk_peek_off will be in the same cachline of sk_forward_alloc /
> sk_reserved_mem / backlog tail, that are already touched by the
> tcp_recvmsg_locked() main loop.
>
> WDYT?

I was about to send a similar change, also moving sk_rcvtimeo, and
adding __cacheline_group_begin()/__cacheline_group_end
annotations.

I can finish this today.

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ