netdev - Re: [PATCH net v3 0/3] mptcp: Fix conflicts between MPTCP and sockmap

Message-ID: <bc5831bb-cfa3-4327-b129-30ca5d17b45e@kernel.org>
Date: Tue, 28 Oct 2025 18:26:05 +0100
From: Matthieu Baerts <matttbe@...nel.org>
To: Jiayuan Chen <jiayuan.chen@...ux.dev>, mptcp@...ts.linux.dev
Cc: John Fastabend <john.fastabend@...il.com>,
 Jakub Sitnicki <jakub@...udflare.com>, Eric Dumazet <edumazet@...gle.com>,
 Kuniyuki Iwashima <kuniyu@...gle.com>, Paolo Abeni <pabeni@...hat.com>,
 Willem de Bruijn <willemb@...gle.com>, "David S. Miller"
 <davem@...emloft.net>, Jakub Kicinski <kuba@...nel.org>,
 Simon Horman <horms@...nel.org>, Mat Martineau <martineau@...nel.org>,
 Geliang Tang <geliang@...nel.org>, Alexei Starovoitov <ast@...nel.org>,
 Daniel Borkmann <daniel@...earbox.net>, Andrii Nakryiko <andrii@...nel.org>,
 Martin KaFai Lau <martin.lau@...ux.dev>, Eduard Zingerman
 <eddyz87@...il.com>, Song Liu <song@...nel.org>,
 Yonghong Song <yonghong.song@...ux.dev>, KP Singh <kpsingh@...nel.org>,
 Stanislav Fomichev <sdf@...ichev.me>, Hao Luo <haoluo@...gle.com>,
 Jiri Olsa <jolsa@...nel.org>, Shuah Khan <shuah@...nel.org>,
 Florian Westphal <fw@...len.de>, linux-kernel@...r.kernel.org,
 netdev@...r.kernel.org, bpf@...r.kernel.org, linux-kselftest@...r.kernel.org
Subject: Re: [PATCH net v3 0/3] mptcp: Fix conflicts between MPTCP and sockmap

Hi Jiayuan,

Thank you for your reply!

On 24/10/2025 06:13, Jiayuan Chen wrote:
> 2025/10/23 22:10, "Matthieu Baerts" <matttbe@...nel.org> wrote:
> 
> 
>>>  MPTCP creates subflows for data transmission between two endpoints.
>>>  However, BPF can use sockops to perform additional operations when TCP
>>>  completes the three-way handshake. The issue arose because we used sockmap
>>>  in sockops, which replaces sk->sk_prot and some handlers.
>>>
>> Do you know at what stage the sk->sk_prot is modified with sockmap? When
>> switching to TCP_ESTABLISHED?
>> Is it before or after having set "tcp_sk(sk)->is_mptcp = 0" (in
>> subflow_ulp_fallback(), coming from subflow_syn_recv_sock() I suppose)?
> 
> 
> Yes, there are two call points. One is after executing subflow_syn_recv_sock():
> tcp_init_transfer(sk, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB, skb);
> 
> So at this point, is_mptcp = 0. The other call point is when userspace calls
> the BPF interface and passes in an fd. If that fd is not a subflow but a parent sk
> with its own mptcp_prot, we will also reject it.

OK, thank you for the explanations! I think your commit message in patch
1/3 should then explain the conditions to have mptcp_fallback_tcp_ops()
being called with a different sk_prot. In short: MPTCP listening socket,
TCP request without MPTCP, sk_prot reset to TCP (subflow_syn_recv_sock)
when SYN RECV, then reset by sockmap when ESTABLISHED, then accept part
and sk_prot is not the expected one.
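
To make that sequence concrete, here is a minimal userspace sketch of the
sk_prot transitions. It is not kernel code: the struct layouts, the model_*
helpers, and tcp_bpf_prot (standing in for whatever proto sockmap installs)
are simplified assumptions for illustration only.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins for the kernel objects; only what the model needs. */
struct proto { const char *name; };

static struct proto tcp_prot          = { "TCP" };
static struct proto tcp_prot_override = { "TCP (MPTCP subflow override)" };
static struct proto tcp_bpf_prot      = { "TCP (sockmap)" }; /* hypothetical name */

struct sock {
	struct proto *sk_prot;
	bool is_mptcp;
};

/* Child created from an MPTCP listener: starts with the subflow override. */
void model_subflow_child_init(struct sock *sk)
{
	sk->sk_prot = &tcp_prot_override;
	sk->is_mptcp = true;
}

/* SYN RECV, request without MPTCP: fall back and undo the override. */
void model_syn_recv_fallback(struct sock *sk)
{
	assert(sk->sk_prot == &tcp_prot_override);
	sk->is_mptcp = false;
	sk->sk_prot = &tcp_prot;
}

/* ESTABLISHED: a sockops program adds the socket to a sockmap. */
void model_sockmap_update(struct sock *sk)
{
	sk->sk_prot = &tcp_bpf_prot;
}

/* accept(): a check that only expects plain TCP; true means "would warn". */
bool model_accept_would_warn(const struct sock *sk)
{
	return sk->sk_prot != &tcp_prot;
}

/* Walk through the whole sequence, with or without the sockmap step. */
bool model_run_sequence(bool with_sockmap)
{
	struct sock sk;

	model_subflow_child_init(&sk);
	model_syn_recv_fallback(&sk);
	if (with_sockmap)
		model_sockmap_update(&sk);
	return model_accept_would_warn(&sk);
}
```

In this model, model_run_sequence(false) reports no warning, while
model_run_sequence(true) does, because the accept step sees a sk_prot that
is neither the override nor plain TCP.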

> You can refer to my provided selftest, which covers these scenarios.
> 
>> If MPTCP is still being used (sk_is_tcp(sk) && sk_is_mptcp(sk)), I guess
>> sockmap should never touch the in-kernel TCP subflows: they will likely
>> only carry a part of the data. Instead, sockmap should act on the MPTCP
>> sockets, not the in-kernel TCP subflows.
> 
> Yes, I agree.
> 
> For full functionality, we need to retrieve the parent socket from MPTCP
> and integrate it with sockmap, rather than simply rejecting.

We should be careful when adding such exceptions. I will add more
details below.

> The current implementation rejects MPTCP because I previously attempted to
> add sockmap support for MPTCP, but it required implementing many interfaces
> and would take considerable time.
> 
> So for now, I'm proposing this as a fix to resolve the immediate issue.
> Subsequently, we can continue working on fully integrating MPTCP with sockmap.

It makes sense to start with the fix for stable, then the implementation
later. I think the implementation should not be that complex: it is just
that it has to be done at MPTCP level, not TCP. sockmap supports
different protocols, and it doesn't seem to be TCP-specific, so that
should be feasible.

>> There is one particular case to take into consideration: an MPTCP
>> connection can fall back to "plain" TCP before being used by the
>> userspace. Typically, that's when an MPTCP listening socket receives a
>> "plain" TCP request (without MPTCP): a "plain" TCP socket will then be
>> created, and exposed to the userspace. In this case, sk_is_mptcp(sk)
>> will return false. I guess that's the case you are trying to handle,
>> right? (It might help BPF reviewers to mention that in the commit
>> message(s).)
> 
> Yes, this is primarily the case we're addressing. I will add this description
> to the commit message.

Thanks!

>> I would then say that sk->sk_prot->psock_update_sk_prot should not point
>> to tcp_bpf_update_proto() when MPTCP is being used (or this callback
>> should take the MPTCP case into account, but I guess no). In case of
>> fallback before the accept() stage, the socket can then be used as a
>> "plain" TCP one. I guess when tcp_bpf_update_proto() will be called,
>> sk_prot is pointing to tcp(v6)_prot, not the MPTCP subflow override one,
>> right?
> 
> Yes, when tcp_bpf_update_proto is called the sk_prot is pointing to tcp(v6)_prot.
> subflow_syn_recv_sock
>  mptcp_subflow_drop_ctx
>   subflow_ulp_fallback
>    mptcp_subflow_ops_undo_override -> reset sk_prot to original one

I see, it would be good to add that in the commit message as well.

> So [patch 2/3] aims to prevent psock_update_sk_prot from being executed on subflows.
> 
> Actually, replacing the subflow's callbacks is also incorrect, as you mentioned earlier,
> because subflows only carry part of the data. By checking for subflows early and skipping
> subsequent steps, we avoid incorrect logic.
> 
> Furthermore, there's another risk: if an IPv6 request comes in and we perform the replacement,
> MPTCP will roll it back to inet_stream_ops. I haven't delved too deeply into the potential
> impact, but I noticed that inet6_release has many V6-specific cleanup procedures not present
> in inet_release.

That's why we have the WARN_ON_ONCE(): this sk_prot was not expected, a
fix in the code is required if another value is accepted.
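
As an illustration of why a sk_prot comparison is fragile here, the userspace
sketch below contrasts picking the fallback ops by comparing sk_prot against
the known TCP protos with picking them from sk_family, as discussed further
down in this thread. All types, the model_* helpers, and tcp6_bpf_prot are
simplified assumptions, not the actual net/mptcp/protocol.c code.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/socket.h> /* AF_INET, AF_INET6 */

struct proto { const char *name; };
struct proto_ops { int family; };

static struct proto tcp_prot      = { "TCP" };
static struct proto tcpv6_prot    = { "TCPv6" };
static struct proto tcp6_bpf_prot = { "TCPv6 (sockmap)" }; /* hypothetical name */

static const struct proto_ops inet_stream_ops  = { AF_INET };
static const struct proto_ops inet6_stream_ops = { AF_INET6 };

struct sock {
	const struct proto *sk_prot;
	int sk_family;
};

/* Set when a selection hits an unexpected socket; models the warning firing. */
bool model_warned;

/* Selection keyed on sk_prot: breaks once something else replaced sk_prot. */
const struct proto_ops *model_ops_by_prot(const struct sock *sk)
{
	if (sk->sk_prot == &tcpv6_prot)
		return &inet6_stream_ops;
	if (sk->sk_prot != &tcp_prot)
		model_warned = true;
	return &inet_stream_ops;
}

/* Selection keyed on sk_family: unaffected by a replaced sk_prot. */
const struct proto_ops *model_ops_by_family(const struct sock *sk)
{
	if (sk->sk_family == AF_INET6)
		return &inet6_stream_ops;
	if (sk->sk_family != AF_INET)
		model_warned = true;
	return &inet_stream_ops;
}

void model_example(void)
{
	/* IPv6 fallback socket whose sk_prot was replaced by sockmap. */
	struct sock sk = { &tcp6_bpf_prot, AF_INET6 };

	model_warned = false;
	assert(model_ops_by_prot(&sk) == &inet_stream_ops); /* wrong family */
	assert(model_warned);

	model_warned = false;
	assert(model_ops_by_family(&sk) == &inet6_stream_ops);
	assert(!model_warned);
}
```

The point of the sketch is only that the first variant both warns and hands
an IPv6 socket the IPv4 ops, which matches the rollback concern raised above.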

>>> Since subflows
>>>  also have their own specialized handlers, this creates a conflict and leads
>>>  to traffic failure. Therefore, we need to reject operations targeting
>>>  subflows.
>>>
>> Would it not work to set sk_prot->psock_update_sk_prot to NULL for the
>> v4 and v6 subflows (in mptcp_subflow_init()) for the moment while
>> sockmap is not supported with MPTCP? This might save you some checks in
>> sock_map.c, no?
> 
> This seems like a reliable alternative I hadn't considered initially.
> 
> However, adding the check on the BPF side serves another purpose: to explicitly
> warn users that sockmap and MPTCP are incompatible.
> 
> The latest Golang version enables MPTCP for servers by default, and if the client
> doesn't support MPTCP, the connection falls back to TCP logic. We want to print a clear message
> informing users who have upgraded to the latest Golang and are using sockmap.
> 
> Perhaps we could add a function like sk_is_mptcp_subflow() in the MPTCP side?
> The implementation would simply be sk_is_tcp(sk) && sk_is_mptcp(sk).
> 
> Implementing this check logic on the BPF side might become invalid if MPTCP internals
> change later; placing it in the MPTCP side might be a better choice.

I can understand that adding an error message can be helpful, but I
don't think we should add MPTCP-specific checks in sockmap for the moment.

>>> This patchset simply prevents the combination of subflows and sockmap
>>>  without changing any functionality.
>>>
>> In your case, you have an MPTCP listening socket, but you receive a TCP
>> request, right? The "sockmap update" is done when switching to
>> TCP_ESTABLISHED, when !sk_is_mptcp(sk), but that's before
>> mptcp_stream_accept(). That's why sk->sk_prot has been modified, but it
>> is fine to look at sk_family, and return inet(6)_stream_ops, right?
> 
> I believe so. Since MPTCP is fundamentally based on TCP, using sk_family to
> determine which ops to fall back to should be sufficient.
> 
> However, strictly speaking, this [patch 1/3] might not even be necessary if we
> prevent the sk_prot replacement for subflows at the sockmap layer.
> 
>> A more important question: what will typically happen in your case if
>> you receive an MPTCP request and sockmap is then not supported? Will the
>> connection be rejected or stay in a strange state because the userspace
>> will not expect that? In these cases, would it not be better to disallow
>> sockmap usage while the MPTCP support is not available? The userspace
>> would then get an error from the beginning that the protocol is not
>> supported, and should then not create an MPTCP socket in this case for
>> the moment, no?
>>
>> I can understand that the switch from TCP to MPTCP was probably done
>> globally, and this transition should be as seamless as possible, but it
>> should not cause a regression with MPTCP requests. An alternative could
>> be to force a fallback to TCP when sockmap is used, even when an MPTCP
>> request is received, but not sure if it is practical to do, and might be
>> strange from the user point of view.
> 
> Actually, I understand this not as an MPTCP regression, but as a sockmap
> regression.
> 
> Let me explain how users typically use sockmap:
> 
> Users typically create multiple sockets on a host and program using BPF+sockmap
> to enable fast data redirection. This involves intercepting data sent or received
> by one socket and redirecting it to the send or receive queue of another socket.
> 
> This requires explicit user programming. The goal is that when multiple microservices
> on one host need to communicate, they can bypass most of the network stack and avoid
> data copies between user and kernel space.
> 
> However, when an MPTCP request occurs, this redirection flow fails.

This part bothers me a bit. Does it mean that when the userspace creates
a TCP listening socket (IPPROTO_TCP), MPTCP requests will be accepted,
but MPTCP will not be used; but when an MPTCP socket is used instead,
MPTCP requests will be rejected?

If yes, it might be clearer not to allow sockmap on connections created
from MPTCP sockets. But when looking at sockmap and what's happening
when a TCP socket is created following a "plain TCP" request, we would
need specific MPTCP code to catch that in sockmap...

> Since the sockmap workflow typically occurs after the three-way handshake, rolling
> back at that point might be too late, and undoing the logic for MPTCP would be very
> complex.
> 
> Regardless, the reality is that MPTCP and sockmap are already conflicting, and this
> has been the case for some time. So I think our first step is to catch specific
> behavior on the BPF side and print a message
> "sockmap/sockhash: MPTCP sockets are not supported\n", informing users to either
> stop using sockmap or not use MPTCP.
> 
> As for the logic to check for subflows, I think implementing it in subflow.c would be
> beneficial, as this logic would likely be useful later if we want to
> support MPTCP + sockmap.

Probably yes.

> Furthermore, this commit also addresses the issue of incorrectly selecting
> inet_stream_ops due to the subflow prot replacement, as mentioned above.

(indeed, but this seems to happen only when sk_prot has been replaced by
sockmap :) )

>>> A complete integration of MPTCP and sockmap would require more effort, for
>>>  example, we would need to retrieve the parent socket from subflows in
>>>  sockmap and implement handlers like read_skb.
>>>  
>>>  If maintainers don't object, we can further improve this in subsequent
>>>  work.
>>>
>> That would be great to add MPTCP support in sockmap! As mentioned above,
>> this should be done on the MPTCP socket. I guess the TCP "in-kernel"
>> subflows should not be modified.
> 
> 
> I think we should first fix the issue by having sockmap reject operations on subflows.
> Subsequently, we can work on fully integrating sockmap with MPTCP as a feature
> (which would require implementing some handlers).

OK for me!

>>> [1] truncated warning:
>>>  [ 18.234652] ------------[ cut here ]------------
>>>  [ 18.234664] WARNING: CPU: 1 PID: 388 at net/mptcp/protocol.c:68 mptcp_stream_accept+0x34c/0x380
>>>  [ 18.234726] Modules linked in:
>>>  [ 18.234755] RIP: 0010:mptcp_stream_accept+0x34c/0x380
>>>  [ 18.234762] RSP: 0018:ffffc90000cf3cf8 EFLAGS: 00010202
> [...]
>>>
>> Please next time use the ./scripts/decode_stacktrace.sh if possible.
>> (and strip the timestamps if it is not giving useful info)
>> Just to be sure: is it the warning you get on top of net or net-next? Or
>> an older version? (Always useful to mention the base)
> 
> Thank you, Matthieu. I will pay attention to this.
> 
> 
>>>
>>> ---
>>>  v2: https://lore.kernel.org/bpf/20251020060503.325369-1-jiayuan.chen@linux.dev/T/#t
>>>  Some advice suggested by Jakub Sitnicki
>>>  
>>>  v1: https://lore.kernel.org/mptcp/a0a2b87119a06c5ffaa51427a0964a05534fe6f1@linux.dev/T/#t
>>>  Some advice from Matthieu Baerts.
>>>
>> (It usually helps reviewers to add more details in the notes/changelog
>> for the individual patch)
> 
> Thank you, Matthieu. I will provide more detailed descriptions in the future.

Thanks!

So for the v4, patch 2/3 would be replaced by one setting ...

  tcp_prot_override.psock_update_sk_prot = NULL;
  (...)
  tcpv6_prot_override.psock_update_sk_prot = NULL;

... in mptcp_subflow_init(). (+ more details for patch 1/3).
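
A rough userspace sketch of the intended effect, assuming the sockmap side
refuses any socket whose sk_prot has no psock_update_sk_prot callback. The
model_* functions, the struct layouts, and the -EINVAL return value are
illustrative assumptions rather than the real sock_map.c or subflow.c code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct sock;

struct proto {
	const char *name;
	int (*psock_update_sk_prot)(struct sock *sk);
};

struct sock { struct proto *sk_prot; };

/* Stand-in for the TCP callback that lets sockmap replace sk_prot. */
static int model_tcp_bpf_update_proto(struct sock *sk)
{
	(void)sk;
	return 0;
}

static struct proto tcp_prot = { "TCP", model_tcp_bpf_update_proto };
static struct proto tcp_prot_override;

/* Models the init step: copy the TCP proto, then drop the callback. */
void model_mptcp_subflow_init(void)
{
	tcp_prot_override = tcp_prot;
	tcp_prot_override.name = "TCP (MPTCP subflow override)";
	tcp_prot_override.psock_update_sk_prot = NULL;
}

/* Models the sockmap update path (assumed to reject a missing callback). */
int model_sock_map_update(struct sock *sk)
{
	if (!sk->sk_prot->psock_update_sk_prot)
		return -EINVAL;
	return sk->sk_prot->psock_update_sk_prot(sk);
}

void model_example(void)
{
	struct sock subflow = { &tcp_prot_override };
	struct sock plain = { &tcp_prot };

	model_mptcp_subflow_init();
	assert(model_sock_map_update(&subflow) == -EINVAL); /* subflow refused */
	assert(model_sock_map_update(&plain) == 0);         /* plain TCP still OK */
}
```

Note that in this model a socket that already fell back to plain TCP uses
tcp_prot again, so it is still accepted by the sockmap path; that case is
what the extra details for patch 1/3 are about.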

From there, we can discuss with other maintainers what to do with the
MPTCP listening socket + sockmap case. And in parallel, we can also
discuss MPTCP support with sockmap. WDYT?

Cheers,
Matt
-- 
Sponsored by the NGI0 Core fund.

pw-bot: cr