Message-ID: <2245992.1720827286@famine>
Date: Fri, 12 Jul 2024 16:34:46 -0700
From: Jay Vosburgh <jv@...sburgh.net>
To: Nick Child <nnac123@...ux.ibm.com>
cc: netdev@...r.kernel.org
Subject: Re: [RFC PATCH net-next 0/1] bonding: Return TX congested if no
 active slave

Nick Child <nnac123@...ux.ibm.com> wrote:

>Hello. Posting this as an RFC because I understand that changing return
>codes can have unexpected results, and I am not well versed enough to
>claim with certainty that this won't affect something elsewhere.
>
>We are seeing a rare TCP connection timeout after only ~7.5 seconds of
>inactivity. This is mainly due to the ibmvnic driver hogging the RTNL
>lock for too long (~2 seconds per ibmvnic device). We are working on
>getting the driver off the RTNL lock but figured the core of the issue
>should also be considered.
>
>Because most of this is new ground to me, I listed the background
>knowledge that is needed to identify the issue. Feel free to skip this
>part:
>
>1. During a zero window probe, the socket attempts to get an updated
>window from the peer 15 times (the default of tcp_retries2). Looking at
>tcp_send_probe0(), the timer between probes is either doubled or 0.5
>seconds. The timer is doubled if the skb transmit returns
>NET_XMIT_SUCCESS or NET_XMIT_CN (see net_xmit_eval()). Note that the
>timer is set to a static 0.5 seconds if NET_XMIT_DROP is returned. This
>means the socket can hit ETIMEDOUT after only 7.5 seconds (15 * 0.5s).
>The return code is typically the return code of __dev_queue_xmit().

	I'm not sure that your description of the behavior of
tcp_send_probe0() matches the code.

	It looks to me like the timer doubles for "if (err <= 0)",
meaning a negative value or NET_XMIT_SUCCESS.  The 0.5 second timeout
value (TCP_RESOURCE_PROBE_INTERVAL) is set in the "else" after "if (err
<= 0)", so NET_XMIT_DROP and NET_XMIT_CN both qualify and would
presumably result in the 0.5 second timeout.
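
	For reference, the branch in question, paraphrased and abridged
from tcp_send_probe0() in net/ipv4/tcp_output.c (a sketch, not the
verbatim source):

	err = tcp_write_wakeup(sk, LINUX_MIB_TCPWINPROBE);
	...
	if (err <= 0) {
		/* sent (0) or hard error (< 0): exponential backoff,
		 * capped by the tcp_retries2 sysctl
		 */
		if (icsk->icsk_backoff < tcp_retries2)
			icsk->icsk_backoff++;
		timeout = tcp_probe0_when(sk, TCP_RTO_MAX);
	} else {
		/* err > 0, i.e. NET_XMIT_DROP or NET_XMIT_CN: not sent
		 * due to local congestion, retry at a fixed interval
		 */
		timeout = TCP_RESOURCE_PROBE_INTERVAL;	/* HZ / 2 */
	}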

	However, since tcp_write_wakeup() -> tcp_transmit_skb() ->
__tcp_transmit_skb() will convert NET_XMIT_CN to 0 (which is
NET_XMIT_SUCCESS) via net_xmit_eval(), I'm not sure that it's possible
for err to equal NET_XMIT_CN here.
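
	(Abridged from include/linux/netdevice.h; these constants also
decode the "rc = 1" and "rc = 2" values in the trace below:)

	#define NET_XMIT_SUCCESS	0x00
	#define NET_XMIT_DROP		0x01	/* skb dropped */
	#define NET_XMIT_CN		0x02	/* congestion notification */

	#define net_xmit_eval(e)	((e) == NET_XMIT_CN ? 0 : (e))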

	I'll note that the 0.5 second timeout logic had a relatively
recent change in c1d5674f8313 ("tcp: less aggressive window probing on
local congestion").  From reading the log, the intent seems to be to
bound the maximum probe interval to 0.5 seconds in low-RTT environments.

>2. In __dev_queue_xmit(), the skb is enqueued to the qdisc if the
>enqueue function is defined. In this circumstance, the qdisc enqueue
>function's return code propagates up the stack. On the other hand, if
>the qdisc enqueue function is NULL, then the driver's xmit function is
>called directly via dev_hard_start_xmit(). In this circumstance, the
>driver's xmit return code propagates up the stack.
>
>3. The bonding device uses IFF_NO_QUEUE, which sets its qdisc to
>noqueue_qdisc_ops. noqueue_qdisc_ops has a NULL enqueue
>function. Therefore, when the bonding device is carrier UP,
>bond_start_xmit is called directly. In this function, depending on the
>bonding mode, a slave device is assigned to the skb and
>__dev_queue_xmit() is called again. This time the slave's qdisc enqueue
>function (which is almost always defined) is called.

	Does your analysis or behavior change if the bond itself does
have a qdisc?  IFF_NO_QUEUE does not install one by default, but users
are free to add one.
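
	For reference, the enqueue-vs-direct split from points 2 and 3
above, heavily abridged from __dev_queue_xmit() in net/core/dev.c (a
paraphrase, not the verbatim code):

	q = rcu_dereference_bh(txq->qdisc);
	if (q->enqueue) {
		/* qdisc present (the usual case for a slave device, or
		 * noop_qdisc after dev_deactivate()): the enqueue return
		 * code propagates up the stack
		 */
		rc = __dev_xmit_skb(skb, q, dev, txq);
	} else {
		/* no enqueue function (noqueue, as with IFF_NO_QUEUE on
		 * a bond): call the device's xmit path directly and
		 * propagate its return code instead
		 */
		skb = dev_hard_start_xmit(skb, dev, txq, &rc);
	}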

>4. When a device calls netif_carrier_off(), it schedules
>dev_deactivate(), which grabs the rtnl lock and sets the qdisc to
>noop_qdisc. The enqueue function of noop_qdisc is defined but simply
>returns NET_XMIT_CN.
>
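	(For reference, from net/sched/sch_generic.c: noop_qdisc's
enqueue drops the skb but reports congestion rather than a drop:)

	static int noop_enqueue(struct sk_buff *skb, struct Qdisc *qdisc,
				struct sk_buff **to_free)
	{
		__qdisc_drop(skb, to_free);
		return NET_XMIT_CN;
	}
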
>5. The miimon of the bonding device periodically checks the carrier
>status of its slaves. If it detects that all of its slaves are down,
>then it sets curr_active_slave to NULL and calls
>netif_carrier_off() on itself.
>
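	(Point 5 is the bond_mii_monitor() -> bond_select_active_slave()
-> bond_set_carrier() path visible in the stack trace below; an abridged
paraphrase from drivers/net/bonding/bond_main.c:)

	/* abridged: with no slave link up, bond_set_carrier() does */
	if (netif_carrier_ok(bond->dev)) {
		netif_carrier_off(bond->dev);
		return true;
	}
	return false;
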
>6. In the bonding device's xmit function, if it does not have any
>active slaves, it returns NET_XMIT_DROP.
>
>Given these observations, assume a bonding device's slaves all suddenly
>call netif_carrier_off(). Also assume that a TCP connection is in a zero
>window probe. There is a window for a behavioral issue here:

>1. If the bonding device does not notice that its slaves are down
>(maybe its miimon interval is too large, or the miimon commit could not
>grab rtnl), then the slave's enqueue function is invoked. This will
>either return NET_XMIT_SUCCESS or NET_XMIT_CN. The probe timer is
>doubled.

	As I mentioned above, I'm not sure this accurately describes the
behavior in tcp_send_probe0().

	-J

>2. If the bonding device notices the slaves are down, it sets the
>active slave to NULL and calls netif_carrier_off() on itself.
>dev_deactivate() is scheduled:
>
>    a. If dev_deactivate() executes, the bonding enqueue function
>(which was NULL) is now noop_enqueue and returns NET_XMIT_CN. The probe
>timer is doubled.
>
>    b. If dev_deactivate() cannot execute for some time (say, because
>it is waiting on rtnl), then bond_start_xmit() is called. It sees that
>it does not have an active slave and returns NET_XMIT_DROP. The probe
>timer is set to 0.5 seconds.
>
>I believe that when the active slave is NULL, it is safe to say that
>the bonding device calls netif_carrier_off() on itself. But there is a
>window where active_slave == NULL and qdisc.enqueue !=
>noop_enqueue. During this window the return code is different. Consider
>the following timeline:
>
>  slaves go down
>             |
>             |   returns NET_XMIT_CN
>             |
>     bond calls carrier_off
>             |
>             |   returns NET_XMIT_DROP
>             |
>       dev_deactivate on bond
>             |
>             |   returns NET_XMIT_CN
>             |
>
>Because the bonding xmit function should really only be a route to a
>slave device, I propose that it return NET_XMIT_CN in these scenarios
>instead of NET_XMIT_DROP. Note that the bond will still return
>NET_XMIT_DROP when it has NO slaves. Since miimon calls
>netif_carrier_off() when it doesn't have any active slaves, I believe
>this change will only affect the time between the bonding device's call
>to netif_carrier_off() and the execution of dev_deactivate().
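>
>The patch body isn't quoted in this cover letter, so the following is
>only a hypothetical sketch of the proposed semantics (the helper name
>bond_tx_drop_rc() and its placement are assumptions; only the "CN while
>slaves exist, DROP otherwise" rule comes from the paragraph above):
>
>	/* hypothetical: while slaves exist but none is active (the
>	 * carrier-off to dev_deactivate() window), report congestion;
>	 * keep the hard drop only when the bond has no slaves at all
>	 */
>	static inline int bond_tx_drop_rc(struct bonding *bond)
>	{
>		if (bond_has_slaves(bond))
>			return NET_XMIT_CN;
>		return NET_XMIT_DROP;
>	}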
>
>I was able to see this issue with bpftrace:
>  10:28:27:283929 - dev_deactivate eth252
>  10:28:27:312805 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 0+1
>  10:28:27:760016 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 1+1
>  10:28:28:147410 - dev_deactivate eth251
>  10:28:28:348387 - dev_queue_xmit rc = 2
>  10:28:28:348394 - dev_queue_xmit rc = 2
>  10:28:28:670013 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 2+1
>  10:28:28:670066 - dev_queue_xmit rc = 2
>  10:28:28:670070 - dev_queue_xmit rc = 2
>  10:28:30:440025 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 3+1
>  10:28:30:440084 - dev_queue_xmit rc = 2
>  10:28:30:440088 - dev_queue_xmit rc = 2
>  10:28:33:505982 - netif_carrier_off bond1
>        netif_carrier_off+112
>        bond_set_carrier+296
>        bond_select_active_slave+296
>        bond_mii_monitor+1900
>        process_one_work+760
>        worker_thread+136
>        kthread+456
>        ret_from_kernel_thread+92
>  10:28:33:790050 - dev_queue_xmit rc = 1
>  10:28:33:870015 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 4+1
>  10:28:33:870061 - dev_queue_xmit rc = 1
>  10:28:34:300269 - dev_queue_xmit rc = 1
>  10:28:34:380025 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 5+1
>  10:28:34:380072 - dev_queue_xmit rc = 1
>  10:28:34:432446 - dev_queue_xmit rc = 1
>  10:28:34:810059 - dev_queue_xmit rc = 1
>  10:28:34:900014 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 6+1
>  10:28:34:900059 - dev_queue_xmit rc = 1
>  10:28:35:000050 - dev_queue_xmit rc = 1
>  10:28:35:330054 - dev_queue_xmit rc = 1
>  10:28:35:410020 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 7+1
>  10:28:35:410070 - dev_queue_xmit rc = 1
>  10:28:35:630865 - dev_queue_xmit rc = 1
>  10:28:35:850072 - dev_queue_xmit rc = 1
>  10:28:35:920025 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 8+1
>  10:28:35:920069 - dev_queue_xmit rc = 1
>  10:28:35:940348 - dev_queue_xmit rc = 1
>  10:28:36:140055 - dev_queue_xmit rc = 1
>  10:28:36:370050 - dev_queue_xmit rc = 1
>  10:28:36:430029 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 9+1
>  10:28:36:430089 - dev_queue_xmit rc = 1
>  10:28:36:460052 - dev_queue_xmit rc = 1
>  10:28:36:650049 - dev_queue_xmit rc = 1
>  10:28:36:880059 - dev_queue_xmit rc = 1
>  10:28:36:940024 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 10+1
>  10:28:36:940074 - dev_queue_xmit rc = 1
>  10:28:36:980044 - dev_queue_xmit rc = 1
>  10:28:37:160331 - dev_queue_xmit rc = 1
>  10:28:37:390060 - dev_queue_xmit rc = 1
>  10:28:37:450024 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 11+1
>  10:28:37:450082 - dev_queue_xmit rc = 1
>  10:28:37:480045 - dev_queue_xmit rc = 1
>  10:28:37:730281 - dev_queue_xmit rc = 1
>  10:28:37:900051 - dev_queue_xmit rc = 1
>  10:28:37:970019 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 12+1
>  10:28:37:970062 - dev_queue_xmit rc = 1
>  10:28:38:000045 - dev_queue_xmit rc = 1
>  10:28:38:240089 - dev_queue_xmit rc = 1
>  10:28:38:420053 - dev_queue_xmit rc = 1
>  10:28:38:490012 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 13+1
>  10:28:38:490076 - dev_queue_xmit rc = 1
>  10:28:38:520048 - dev_queue_xmit rc = 1
>  10:28:38:750069 - dev_queue_xmit rc = 1
>  10:28:39:000029 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 14+1
>  10:28:39:000093 - dev_queue_xmit rc = 1
>  10:28:39:030052 - dev_queue_xmit rc = 1
>  10:28:39:260046 - dev_queue_xmit rc = 1
>  10:28:39:450050 - dev_queue_xmit rc = 1
>  10:28:39:520044 - SK_ERR(110) port=8682 - snd_wnd 0
>        sk_error_report+12
>        tcp_write_err+64
>        tcp_write_timer_handler+564
>        tcp_write_timer+424
>        call_timer_fn+88
>        run_timer_softirq+1896
>        __do_softirq+348
>        do_softirq_own_stack+64
>        irq_exit+392
>        timer_interrupt+380
>        decrementer_common_virt+528
>  10:28:47:813297 - dev_deactivate bond1
>
>Again, I know the easier solution is to remove rtnl users to reduce the
>time before dev_deactivate (we are working on that as well).
>
>Nick Child (1):
>  bonding: Return TX congested if no active slave
>
> include/net/bonding.h | 7 +++++++
> 1 file changed, 7 insertions(+)
>
>-- 
>2.43.0

---
	-Jay Vosburgh, jv@...sburgh.net
