lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: 
 <PH7PR02MB1003916E9855CB7A11E1B1D6BCB0F2@PH7PR02MB10039.namprd02.prod.outlook.com>
Date: Wed, 17 Apr 2024 23:03:00 +0000
From: "Sean Tranchetti (QUIC)" <quic_stranche@...cinc.com>
To: "jmaloy@...hat.com" <jmaloy@...hat.com>,
        "ying.xue@...driver.com"
	<ying.xue@...driver.com>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>
CC: "Subash Abhinov Kasiviswanathan (QUIC)" <quic_subashab@...cinc.com>
Subject: Permanent TIPC Broadcast Failure When a Member Joins

Hi everyone,

We've been trying to track down the cause of a strange TIPC failure over the last few weeks and have gotten stuck. Hoping that opening this up to a wider audience might help a bit.

The scenario is as follows. We have two programs, a sever and a client, that use TIPC group communication to exchange messages.
	- Server opens a TIPC socket and creates the group via setsockopt()
	- Once every second, the server performs a group broadcast via send()
	- Client opens a TIPC socket and joins the group via setsockopt()
	- Client then waits for messages to arrive via recv()

What we see happening is that as soon as the client joins the TIPC group, the server's broadcast will fail from then on (either blocking on the send() or returning EAGAIN/EWOULDBLOCK if set to non-blocking).

After adding several debug messages into the kernel code, we can see that the server-side send is failing because tipc_group_bc_cong() is always returning true, instructing the broadcast code to wait as the group is congested. If we compare logs taken on the same machine where these programs work (more on that later, though), we see that tipc_group_bc_cong() always returns false because of the following condition:
	if (list_empty(&grp->small_win))
		return false;

After much tracing, it seems that the tipc_member structure referring to the client socket is never removed from grp->small_win in the server-side copy of the tipc_group in the failure case. Again, comparing this to a successful run, it seems that the expected flow for this to be removed is the following 3 exchanges:

(1) Server-side group handling TIPC_PUBLISH of client socket
[   73.311097][ T2514] tipc: tipc_sk_filter_rcv(): processing RCV skb 0xffffff8034248400
[   73.311106][ T2514] tipc: tipc_sk_filter_rcv(): handling tipc protocol message. sk 0xffffff802611df00
[   73.311112][ T2514] tipc: tipc_group_member_evt(): hanlding group memebr event for group type 0x46228388 ptr 0xffffff80341ecb40, port 0x91e678d5, and instance 0xab40005
[   73.331894][ T2514] tipc: tipc_group_member_evt(): no member with port 0x91e678d5, instance 0xab40005 yet for group 0xffffff80341ecb40. creating 0xffffff8035de8480 and posting GRP_JOIN
[   73.331902][ T2514] tipc: tipc_group_update_member(): adding member instance 0xab40005 ptr 0xffffff8035de8480  to group type 0x46228388 ptr 0xffffff80341ecb40 small_win
[   73.331911][ T2514] tipc: tipc_group_proto_xmit(): handling protocol tx for group type 0x46228388 ptr 0xffffff80341ecb40, port 0x7e8dc0bf
[   73.356847][ T2514] tipc: tipc_group_proto_xmit(): member instance 0xab40005 adv adjustment for GRP_JOIN. ADV_IDLE: false, ADV_ACTIVE: false
[   73.356853][ T2514] tipc: tipc_group_proto_xmit(): queuing XMIT skb 0xffffff8034248000

(2) Client-side group handling GRP_JOIN of server socket:
[   73.715974][ T2740] tipc: tipc_sk_filter_rcv(): processing RCV skb 0xffffff8034248000
[   73.724095][ T2740] tipc: tipc_sk_filter_rcv(): handling tipc protocol message. sk 0xffffff8034d42800
[   73.733617][ T2740] tipc: tipc_group_proto_rcv(): handling group message for group type 0x46228388 ptr 0xffffff80301bc780 and port 0x7e8dc0bf
[   73.746700][ T2740] tipc: tipc_group_proto_rcv(): GRP_JOIN event for group type 0x46228388, port 0x7e8dc0bf, instance 0x9ff0009, ptr 0xffffff80301bc300
[   73.760652][ T2740] tipc: tipc_group_proto_rcv(): setting member instance 0x9ff0009 to state MBR_JOINED
[   73.770335][ T2740] tipc: tipc_group_open(): removing member instance 0x9ff0009 ptr 0xffffff80301bc300 from grp->small_win
[   73.781704][ T2740] tipc: tipc_group_update_member(): adding member instance 0x9ff0009 ptr 0xffffff80301bc300  to group type 0x46228388 ptr 0xffffff80301bc780 small_win
[   73.797186][ T2740] tipc: tipc_group_proto_xmit(): handling protocol tx for group type 0x46228388 ptr 0xffffff80301bc780, port 0x91e678d5
[   73.809906][ T2740] tipc: tipc_group_proto_xmit(): member instance 0x9ff0009 in MBR_JOINED/PENDING state for adv adjustment
[   73.821371][ T2740] tipc: tipc_group_proto_xmit(): member instance 0x9ff0009 adv adjustment for GRP_ADV. ADV_IDLE: true, ADV_ACTIVE: false
[   73.834173][ T2740] tipc: tipc_group_proto_xmit(): queuing XMIT skb 0xffffff803142b000

(3) Server-side group handling GRP_ADV of client socket:
[   73.851957][ T2740] tipc: tipc_sk_filter_rcv(): processing RCV skb 0xffffff803142b000
[   73.860038][ T2740] tipc: tipc_sk_filter_rcv(): handling tipc protocol message. sk 0xffffff802611df00
[   73.869545][ T2740] tipc: tipc_group_proto_rcv(): handling group message for group type 0x46228388 ptr 0xffffff80341ecb40 and port 0x91e678d5
[   73.882607][ T2740] tipc: tipc_group_proto_rcv(): GRP_ADV for member instance 0xab40005
[   73.890862][ T2740] tipc: tipc_group_open(): removing member instance 0xab40005 ptr 0xffffff8035de8480 from grp->small_win


In the failure case, we see no such flow. Notably, the final GRP_ADV from the client to the sever is never seen. The only concrete exchange from the flow above that exists in the log is the first part.
(1) Server-side group handling TIPC_PUBLISH of client socket
    kworker/u8:5-6798    [000] ...1.  1172.207476: tipc_sk_filter_rcv: tipc_sk_filter_rcv(): processing RCV skb 0xffffff801ec34000
    kworker/u8:5-6798    [000] ...1.  1172.207481: tipc_sk_filter_rcv: tipc_sk_filter_rcv(): handling tipc protocol message. sk 0xffffff80053d2d00
    kworker/u8:5-6798    [000] ...1.  1172.207484: tipc_group_member_evt: tipc_group_member_evt(): hanlding group memebr event for group type 0x46228388 ptr 0xffffff800449cc00, port 0x54ae5349, and instance 0x200b0005
    kworker/u8:5-6798    [000] ...1.  1172.207489: tipc_group_member_evt: tipc_group_member_evt(): no member with port 0x54ae5349, instance 0x200b0005 yet for group 0xffffff800449cc00. creating 0xffffff801f5b4540 and posting GRP_JOIN
    kworker/u8:5-6798    [000] ...1.  1172.207491: tipc_group_update_member: tipc_group_update_member(): adding member instance 0x200b0005 ptr 0xffffff801f5b4540  to group type 0x46228388 ptr 0xffffff800449cc00 small_win
    kworker/u8:5-6798    [000] ...1.  1172.207494: tipc_group_proto_xmit: tipc_group_proto_xmit(): handling protocol tx for group type 0x46228388 ptr 0xffffff800449cc00, port 0xae7b27a3
    kworker/u8:5-6798    [000] ...1.  1172.207496: tipc_group_proto_xmit: tipc_group_proto_xmit(): member instance 0x200b0005 adv adjustment for GRP_JOIN. ADV_IDLE: false, ADV_ACTIVE: false
    kworker/u8:5-6798    [000] ...1.  1172.207498: tipc_group_proto_xmit: tipc_group_proto_xmit(): queuing XMIT skb 0xffffff801ec34600


After that, it seems to get a bit murky, with SKBs popping up and disappearing more frequently. The best I've been able to conclude so far is that this is possibly some timing problem: logging all the debug messages with pr_err() or something similar will consistently make the flow work, while using trace_printk() (or no logging at all) will consistently fail. Otherwise, potentially some weird context switching messing with the TIPC input and xmit queues causing messages not to be seen when they should be.

I've attached the logs for both cases, along with some notes about what pointers are what in each case.

Thanks,
Sean

View attachment "tipc_broadcast_success.txt" of type "text/plain" (54093 bytes)

View attachment "tipc_broadcast_failure.txt" of type "text/plain" (25508 bytes)

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ