Message-ID: <87ikddijrd.fsf@cloudflare.com>
Date: Wed, 07 Jan 2026 15:23:50 +0100
From: Jakub Sitnicki <jakub@...udflare.com>
To: Jiayuan Chen <jiayuan.chen@...ux.dev>
Cc: bpf@...r.kernel.org,  John Fastabend <john.fastabend@...il.com>,  "David
 S. Miller" <davem@...emloft.net>,  Eric Dumazet <edumazet@...gle.com>,
  Jakub Kicinski <kuba@...nel.org>,  Paolo Abeni <pabeni@...hat.com>,
  Simon Horman <horms@...nel.org>,  Neal Cardwell <ncardwell@...gle.com>,
  Kuniyuki Iwashima <kuniyu@...gle.com>,  David Ahern <dsahern@...nel.org>,
  Alexei Starovoitov <ast@...nel.org>,  Daniel Borkmann
 <daniel@...earbox.net>,  Andrii Nakryiko <andrii@...nel.org>,  Martin
 KaFai Lau <martin.lau@...ux.dev>,  Eduard Zingerman <eddyz87@...il.com>,
  Song Liu <song@...nel.org>,  Yonghong Song <yonghong.song@...ux.dev>,  KP
 Singh <kpsingh@...nel.org>,  Stanislav Fomichev <sdf@...ichev.me>,  Hao
 Luo <haoluo@...gle.com>,  Jiri Olsa <jolsa@...nel.org>,  Shuah Khan
 <shuah@...nel.org>,  Stefano Garzarella <sgarzare@...hat.com>,  Michal
 Luczaj <mhal@...x.co>,  Cong Wang <cong.wang@...edance.com>,
  netdev@...r.kernel.org,  linux-kernel@...r.kernel.org,
  linux-kselftest@...r.kernel.org
Subject: Re: [PATCH bpf-next v5 2/3] bpf, sockmap: Fix FIONREAD for sockmap

On Tue, Jan 06, 2026 at 01:14 PM +08, Jiayuan Chen wrote:
> A socket using sockmap has its own independent receive queue: ingress_msg.
> This queue may contain data from its own protocol stack or from other
> sockets.
>
> Therefore, for sockmap, relying solely on copied_seq and rcv_nxt to
> calculate FIONREAD is not enough.
>
> This patch adds a new ingress_size field in the psock structure to record
> the data length in ingress_msg. Additionally, we implement new ioctl
> interfaces for TCP and UDP to intercept FIONREAD operations. While Unix
> and VSOCK also support sockmap and have similar FIONREAD calculation
> issues, fixing them would require more extensive changes
> (please let me know if modifications are needed). I believe it's not
> appropriate to include those changes under this fix patch.

Nit: These last two lines don't really belong in the commit message.
Side notes for reviewers can be added after the "---" marker.
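Like so (illustrative placement; git am ignores everything between the
"---" marker and the diff):

        Signed-off-by: Jiayuan Chen <jiayuan.chen@...ux.dev>
        ---
        Note for reviewers: Unix and VSOCK have a similar issue, but
        fixing them requires more extensive changes.

         include/linux/skmsg.h | 67 +++++++++++++++++++++++++++++++++++++++++--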

> Previous work by John Fastabend made some progress toward FIONREAD support:
> commit e5c6de5fa025 ("bpf, sockmap: Incorrectly handling copied_seq")
> Although the current patch builds on that previous work, it is
> acceptable for our Fixes tag to point to the same commit.
>
>                                                      FD1:read()
>                                                      --  FD1->copied_seq++
>                                                          |  [read data]
>                                                          |
>                                 [enqueue data]           v
>                   [sockmap]     -> ingress to self ->  ingress_msg queue
> FD1 native stack  ------>                                 ^
> -- FD1->rcv_nxt++               -> redirect to other      | [enqueue data]
>                                        |                  |
>                                        |             ingress to FD1
>                                        v                  ^
>                                       ...                 |  [sockmap]
>                                                      FD2 native stack
>
> Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()")
> Signed-off-by: Jiayuan Chen <jiayuan.chen@...ux.dev>
> ---
>  include/linux/skmsg.h | 67 +++++++++++++++++++++++++++++++++++++++++--
>  net/core/skmsg.c      |  3 ++
>  net/ipv4/tcp_bpf.c    | 21 ++++++++++++++
>  net/ipv4/udp_bpf.c    | 25 +++++++++++++---
>  4 files changed, 110 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
> index 0323a2b6cf5e..1fa03953043f 100644
> --- a/include/linux/skmsg.h
> +++ b/include/linux/skmsg.h
> @@ -97,6 +97,7 @@ struct sk_psock {
>  	struct sk_buff_head		ingress_skb;
>  	struct list_head		ingress_msg;
>  	spinlock_t			ingress_lock;
> +	ssize_t				ingress_size;

The name is not great because we already have `ingress_bytes`.
I suggest renaming it and adding a doc string. Also, we don't expect the
count ever to be negative, so why ssize_t, when we store all the other
byte counts there as u32?

        /** @msg_tot_len: Total bytes queued in ingress_msg list. */
        u32                             msg_tot_len;

>  	unsigned long			state;
>  	struct list_head		link;
>  	spinlock_t			link_lock;
> @@ -321,6 +322,27 @@ static inline void sock_drop(struct sock *sk, struct sk_buff *skb)
>  	kfree_skb(skb);
>  }
>  
> +static inline ssize_t sk_psock_get_msg_size_nolock(struct sk_psock *psock)
> +{
> +	/* Used by ioctl to read ingress_size only; lock-free for performance */
> +	return READ_ONCE(psock->ingress_size);
> +}
> +
> +static inline void sk_psock_inc_msg_size_locked(struct sk_psock *psock, ssize_t diff)
> +{
> +	/* Use WRITE_ONCE to ensure correct read in sk_psock_get_msg_size_nolock().
> +	 * ingress_lock should be held to prevent concurrent updates to ingress_size
> +	 */
> +	WRITE_ONCE(psock->ingress_size, psock->ingress_size + diff);
> +}
> +
> +static inline void sk_psock_inc_msg_size(struct sk_psock *psock, ssize_t diff)

Not sure about this function name. "inc" usually means increment by one.
Was it modeled after some existing interface?

If not, I'd rename it to sk_psock_msg_len_add(..., int delta),
following the naming convention of sk_forward_alloc_add(),
skb_frag_size_add(), skb_len_add(), etc.
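
Something along these lines (untested sketch, assuming the msg_tot_len
rename suggested above):

        static inline void sk_psock_msg_len_add(struct sk_psock *psock, int delta)
        {
                spin_lock_bh(&psock->ingress_lock);
                /* Pairs with the lockless READ_ONCE() reader in the ioctl path. */
                WRITE_ONCE(psock->msg_tot_len, psock->msg_tot_len + delta);
                spin_unlock_bh(&psock->ingress_lock);
        }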

> +{
> +	spin_lock_bh(&psock->ingress_lock);
> +	sk_psock_inc_msg_size_locked(psock, diff);
> +	spin_unlock_bh(&psock->ingress_lock);
> +}
> +
>  static inline bool sk_psock_queue_msg(struct sk_psock *psock,
>  				      struct sk_msg *msg)
>  {
> @@ -329,6 +351,7 @@ static inline bool sk_psock_queue_msg(struct sk_psock *psock,
>  	spin_lock_bh(&psock->ingress_lock);
>  	if (sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED)) {
>  		list_add_tail(&msg->list, &psock->ingress_msg);
> +		sk_psock_inc_msg_size_locked(psock, msg->sg.size);
>  		ret = true;
>  	} else {
>  		sk_msg_free(psock->sk, msg);
> @@ -345,18 +368,25 @@ static inline struct sk_msg *sk_psock_dequeue_msg(struct sk_psock *psock)
>  
>  	spin_lock_bh(&psock->ingress_lock);
>  	msg = list_first_entry_or_null(&psock->ingress_msg, struct sk_msg, list);
> -	if (msg)
> +	if (msg) {
>  		list_del(&msg->list);
> +		sk_psock_inc_msg_size_locked(psock, -msg->sg.size);
> +	}
>  	spin_unlock_bh(&psock->ingress_lock);
>  	return msg;
>  }
>  
> +static inline struct sk_msg *sk_psock_peek_msg_locked(struct sk_psock *psock)
> +{
> +	return list_first_entry_or_null(&psock->ingress_msg, struct sk_msg, list);
> +}
> +
>  static inline struct sk_msg *sk_psock_peek_msg(struct sk_psock *psock)
>  {
>  	struct sk_msg *msg;
>  
>  	spin_lock_bh(&psock->ingress_lock);
> -	msg = list_first_entry_or_null(&psock->ingress_msg, struct sk_msg, list);
> +	msg = sk_psock_peek_msg_locked(psock);
>  	spin_unlock_bh(&psock->ingress_lock);
>  	return msg;
>  }
> @@ -523,6 +553,39 @@ static inline bool sk_psock_strp_enabled(struct sk_psock *psock)
>  	return !!psock->saved_data_ready;
>  }
>  
> +/* for tcp only, sk is locked */
> +static inline ssize_t sk_psock_msg_inq(struct sock *sk)
> +{
> +	struct sk_psock *psock;
> +	ssize_t inq = 0;
> +
> +	psock = sk_psock_get(sk);
> +	if (likely(psock)) {
> +		inq = sk_psock_get_msg_size_nolock(psock);
> +		sk_psock_put(sk, psock);
> +	}
> +	return inq;
> +}
> +
> +/* for udp only, sk is not locked */
> +static inline ssize_t sk_msg_first_length(struct sock *sk)

s/_length/_len/

> +{
> +	struct sk_psock *psock;
> +	struct sk_msg *msg;
> +	ssize_t inq = 0;
> +
> +	psock = sk_psock_get(sk);
> +	if (likely(psock)) {
> +		spin_lock_bh(&psock->ingress_lock);
> +		msg = sk_psock_peek_msg_locked(psock);
> +		if (msg)
> +			inq = msg->sg.size;
> +		spin_unlock_bh(&psock->ingress_lock);
> +		sk_psock_put(sk, psock);
> +	}
> +	return inq;
> +}
> +
>  #if IS_ENABLED(CONFIG_NET_SOCK_MSG)
>  
>  #define BPF_F_STRPARSER	(1UL << 1)
> diff --git a/net/core/skmsg.c b/net/core/skmsg.c
> index d73e03f7713a..c959d52a62b2 100644
> --- a/net/core/skmsg.c
> +++ b/net/core/skmsg.c
> @@ -455,6 +455,7 @@ int __sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg
>  					atomic_sub(copy, &sk->sk_rmem_alloc);
>  				}
>  				msg_rx->sg.size -= copy;
> +				sk_psock_inc_msg_size(psock, -copy);
>  
>  				if (!sge->length) {
>  					sk_msg_iter_var_next(i);
> @@ -819,9 +820,11 @@ static void __sk_psock_purge_ingress_msg(struct sk_psock *psock)
>  		list_del(&msg->list);
>  		if (!msg->skb)
>  			atomic_sub(msg->sg.size, &psock->sk->sk_rmem_alloc);
> +		sk_psock_inc_msg_size(psock, -((ssize_t)msg->sg.size));

The cast won't be needed once you switch the parameter type to `int`.
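That way the call site (with the sk_psock_msg_len_add() rename from
above) reduces to:

        sk_psock_msg_len_add(psock, -msg->sg.size);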

>  		sk_msg_free(psock->sk, msg);
>  		kfree(msg);
>  	}
> +	WARN_ON_ONCE(psock->ingress_size);
>  }
>  
>  static void __sk_psock_zap_ingress(struct sk_psock *psock)
> diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
> index 6332fc36ffe6..a9c758868f13 100644
> --- a/net/ipv4/tcp_bpf.c
> +++ b/net/ipv4/tcp_bpf.c
> @@ -10,6 +10,7 @@
>  
>  #include <net/inet_common.h>
>  #include <net/tls.h>
> +#include <asm/ioctls.h>
>  
>  void tcp_eat_skb(struct sock *sk, struct sk_buff *skb)
>  {
> @@ -332,6 +333,25 @@ static int tcp_bpf_recvmsg_parser(struct sock *sk,
>  	return copied;
>  }
>  
> +static int tcp_bpf_ioctl(struct sock *sk, int cmd, int *karg)
> +{
> +	bool slow;
> +
> +	/* we only care about FIONREAD */

Nit: This comment seems redundant. The expression is obvious.

> +	if (cmd != SIOCINQ)
> +		return tcp_ioctl(sk, cmd, karg);
> +
> +	/* works similar as tcp_ioctl */
> +	if (sk->sk_state == TCP_LISTEN)
> +		return -EINVAL;
> +
> +	slow = lock_sock_fast(sk);
> +	*karg = sk_psock_msg_inq(sk);
> +	unlock_sock_fast(sk, slow);
> +
> +	return 0;
> +}
> +
>  static int tcp_bpf_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
>  			   int flags, int *addr_len)
>  {
> @@ -610,6 +630,7 @@ static void tcp_bpf_rebuild_protos(struct proto prot[TCP_BPF_NUM_CFGS],
>  	prot[TCP_BPF_BASE].close		= sock_map_close;
>  	prot[TCP_BPF_BASE].recvmsg		= tcp_bpf_recvmsg;
>  	prot[TCP_BPF_BASE].sock_is_readable	= sk_msg_is_readable;
> +	prot[TCP_BPF_BASE].ioctl		= tcp_bpf_ioctl;
>  
>  	prot[TCP_BPF_TX]			= prot[TCP_BPF_BASE];
>  	prot[TCP_BPF_TX].sendmsg		= tcp_bpf_sendmsg;
> diff --git a/net/ipv4/udp_bpf.c b/net/ipv4/udp_bpf.c
> index 0735d820e413..cc1156aef14d 100644
> --- a/net/ipv4/udp_bpf.c
> +++ b/net/ipv4/udp_bpf.c
> @@ -5,6 +5,7 @@
>  #include <net/sock.h>
>  #include <net/udp.h>
>  #include <net/inet_common.h>
> +#include <asm/ioctls.h>
>  
>  #include "udp_impl.h"
>  
> @@ -111,12 +112,28 @@ enum {
>  static DEFINE_SPINLOCK(udpv6_prot_lock);
>  static struct proto udp_bpf_prots[UDP_BPF_NUM_PROTS];
>  
> +static int udp_bpf_ioctl(struct sock *sk, int cmd, int *karg)
> +{
> +	/* we only care about FIONREAD */
> +	if (cmd != SIOCINQ)
> +		return udp_ioctl(sk, cmd, karg);
> +
> +	/* works similar as udp_ioctl.
> +	 * man udp(7): "FIONREAD (SIOCINQ): Returns the size of the next
> +	 * pending datagram in the integer in bytes, or 0 when no datagram
> +	 * is pending."
> +	 */

Not sure we need to quote man pages here.

> +	*karg = sk_msg_first_length(sk);
> +	return 0;
> +}
> +
>  static void udp_bpf_rebuild_protos(struct proto *prot, const struct proto *base)
>  {
> -	*prot        = *base;
> -	prot->close  = sock_map_close;
> -	prot->recvmsg = udp_bpf_recvmsg;
> -	prot->sock_is_readable = sk_msg_is_readable;
> +	*prot			= *base;
> +	prot->close		= sock_map_close;
> +	prot->recvmsg		= udp_bpf_recvmsg;
> +	prot->sock_is_readable	= sk_msg_is_readable;
> +	prot->ioctl		= udp_bpf_ioctl;
>  }
>  
>  static void udp_bpf_check_v6_needs_rebuild(struct proto *ops)
