netdev - Re: [PATCH v3] tcp: add support for SO_PEEK

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <9cb12376da3f6cd316320b29f294cc84eaba6cfa.camel@redhat.com>
Date: Fri, 16 Feb 2024 10:14:38 +0100
From: Paolo Abeni <pabeni@...hat.com>
To: Jon Maloy <jmaloy@...hat.com>, Eric Dumazet <edumazet@...gle.com>
Cc: kuba@...nel.org, passt-dev@...st.top, sbrivio@...hat.com,
 lvivier@...hat.com,  dgibson@...hat.com, netdev@...r.kernel.org,
 davem@...emloft.net
Subject: Re: [PATCH v3] tcp: add support for SO_PEEK_OFF

On Thu, 2024-02-15 at 17:24 -0500, Jon Maloy wrote:
> 
> On 2024-02-15 12:46, Eric Dumazet wrote:
> > On Thu, Feb 15, 2024 at 6:41 PM Paolo Abeni <pabeni@...hat.com> wrote:
> > > Note: please send text-only email to netdev.
> > > 
> > > On Thu, 2024-02-15 at 10:11 -0500, Jon Maloy wrote:
> > > > I wonder if the following could be acceptable:
> > > > 
> > > >   if (flags & MSG_PEEK)
> > > >          sk_peek_offset_fwd(sk, used);
> > > >   else if (peek_offset > 0)
> > > >         sk_peek_offset_bwd(sk, used);
> > > > 
> > > >   peek_offset is already present in the data cache, and if it has the value
> > > >   zero it means either that that sk->sk_peek_off is unused (-1) or actually is zero.
> > > >   Either way, no rewind is needed in that case.
> > > I agree the above should avoid touching cold cachelines in the
> > > fastpath, and looks functionally correct to me.
> > > 
> > > The last word is up to Eric :)
> > > 
> > An actual patch seems needed.
> > 
> > In the current form, local variable peek_offset is 0 when !MSG_PEEK.
> > 
> > So the "else if (peek_offset > 0)" would always be false.
> > 
> Yes, of course. This wouldn't work unless we read sk->sk_peek_off at the 
> beginning of the function.
> I will look at the other suggestions.

I *think* that moving sk_peek_off this way:

---
diff --git a/include/net/sock.h b/include/net/sock.h
index a9d99a9c583f..576a6a6abb03 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -413,7 +413,7 @@ struct sock {
 	unsigned int		sk_napi_id;
 #endif
 	int			sk_rcvbuf;
-	int			sk_disconnects;
+	int			sk_peek_off;
 
 	struct sk_filter __rcu	*sk_filter;
 	union {
@@ -439,7 +439,7 @@ struct sock {
 		struct rb_root	tcp_rtx_queue;
 	};
 	struct sk_buff_head	sk_write_queue;
-	__s32			sk_peek_off;
+	int			sk_disconnects;
 	int			sk_write_pending;
 	__u32			sk_dst_pending_confirm;
 	u32			sk_pacing_status; /* see enum sk_pacing */
---

should avoid problematic accesses,

The relevant cachelines layout is as follow:

	                /* --- cacheline 4 boundary (256 bytes) --- */
                struct sk_buff *   tail;                 /*   256     8 */
        } sk_backlog;                                    /*   240    24 */
        int                        sk_forward_alloc;     /*   264     4 */
        u32                        sk_reserved_mem;      /*   268     4 */
        unsigned int               sk_ll_usec;           /*   272     4 */
        unsigned int               sk_napi_id;           /*   276     4 */
        int                        sk_rcvbuf;            /*   280     4 */
        int                        sk_disconnects;       /*   284     4 */
				// will become sk_peek_off
        struct sk_filter *         sk_filter;            /*   288     8 */
        union {
                struct socket_wq * sk_wq;                /*   296     8 */
                struct socket_wq * sk_wq_raw;            /*   296     8 */
        };                                               /*   296     8 */
        struct xfrm_policy *       sk_policy[2];         /*   304    16 */
        /* --- cacheline 5 boundary (320 bytes) --- */

	//  ...
	
        /* --- cacheline 6 boundary (384 bytes) --- */
        __s32                      sk_peek_off;          /*   384     4 */
				// will become sk_diconnects
        int                        sk_write_pending;     /*   388     4 */
        __u32                      sk_dst_pending_confirm; /*   392     4 */
        u32                        sk_pacing_status;     /*   396     4 */
        long int                   sk_sndtimeo;          /*   400     8 */
        struct timer_list          sk_timer;             /*   408    40 */

        /* XXX last struct has 4 bytes of padding */

        /* --- cacheline 7 boundary (448 bytes) --- */

sk_peek_off will be in the same cachline of sk_forward_alloc /
sk_reserved_mem / backlog tail, that are already touched by the
tcp_recvmsg_locked() main loop.

WDYT?

thanks!

Paolo