Message-ID: <20240712192405.505553-1-nnac123@linux.ibm.com>
Date: Fri, 12 Jul 2024 14:24:04 -0500
From: Nick Child <nnac123@...ux.ibm.com>
To: netdev@...r.kernel.org
Cc: Nick Child <nnac123@...ux.ibm.com>
Subject: [RFC PATCH net-next 0/1] bonding: Return TX congested if no active slave
Hello. I am posting this as an RFC because I understand that changing return codes can have unexpected results, and I am not well versed enough to claim with certainty that this won't affect something elsewhere.
We are seeing a rare TCP connection timeout after only ~7.5 seconds of inactivity. This is mainly due to the ibmvnic driver hogging the RTNL lock for too long (~2 seconds per ibmvnic device). We are working on getting the driver off the RTNL lock but figured the core of the issue should also be considered.
Because most of this is new ground to me, I listed the background knowledge that is needed to identify the issue. Feel free to skip this part:
1. During a zero window probe, the socket attempts to get an updated window from the peer up to 15 times (the default of tcp_retries2). Looking at tcp_send_probe0(), the timer between probes is either doubled or set to 0.5 seconds. The timer is doubled if the skb transmit returns NET_XMIT_SUCCESS or NET_XMIT_CN (see net_xmit_eval()). Note that the timer is set to a static 0.5 seconds if NET_XMIT_DROP is returned. This means the socket can hit ETIMEDOUT after only 7.5 seconds. The return code is typically the return code of __dev_queue_xmit() (see the sketch after this list).
2. In __dev_queue_xmit(), the skb is enqueued to the qdisc if the enqueue function is defined. In this case, the qdisc enqueue function's return code propagates up the stack. On the other hand, if the qdisc enqueue function is NULL, then the driver's xmit function is called directly via dev_hard_start_xmit(). In this case, the driver's xmit return code propagates up the stack.
3. The bonding device uses IFF_NO_QUEUE, which sets its qdisc to noqueue_qdisc_ops. noqueue_qdisc_ops has a NULL enqueue function. Therefore, when the bonding device is carrier UP, bond_start_xmit() is called directly. In this function, depending on the bonding mode, a slave device is assigned to the skb and __dev_queue_xmit() is called again. This time the slave's qdisc enqueue function (which is almost always defined) is called.
4. When a device calls netif_carrier_off(), it schedules dev_deactivate(), which grabs the rtnl lock and sets the qdisc to noop_qdisc. The enqueue function of noop_qdisc is defined but simply returns NET_XMIT_CN.
5. The miimon of the bonding device periodically checks the carrier status of its slaves. If it detects that all of its slaves are down, it sets curr_active_slave to NULL and calls netif_carrier_off() on itself.
6. In the bonding device's xmit function, if it does not have an active slave, it returns NET_XMIT_DROP.
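To double-check the arithmetic in point 1, here is a small userspace model of the timer choice. It is only a sketch: the 200ms initial RTO, the 120 second cap and the plain doubling are assumptions for illustration, not taken from the kernel source.

/*
 * Userspace model of the probe timer arithmetic in point 1.
 * Assumptions (mine, not copied from kernel source): 15 probes
 * (default tcp_retries2), a fixed 0.5s re-probe when the transmit
 * returns NET_XMIT_DROP, and plain doubling from an assumed 200ms
 * initial RTO capped at 120s otherwise.
 */
#include <stdio.h>

#define NET_XMIT_SUCCESS 0x00
#define NET_XMIT_DROP    0x01
#define NET_XMIT_CN      0x02

/* mirrors net_xmit_eval(): NET_XMIT_CN is treated as success */
static int net_xmit_eval(int e)
{
	return e == NET_XMIT_CN ? 0 : e;
}

static double seconds_until_timeout(int xmit_rc)
{
	double rto = 0.2, total = 0.0;
	int probe;

	for (probe = 0; probe < 15; probe++) {
		if (net_xmit_eval(xmit_rc) == 0) {
			/* probe counted as sent: exponential backoff */
			total += rto;
			rto = (rto * 2 > 120.0) ? 120.0 : rto * 2;
		} else {
			/* probe dropped locally: fixed 0.5s retry */
			total += 0.5;
		}
	}
	return total;
}

int main(void)
{
	printf("NET_XMIT_DROP: ~%.1fs until ETIMEDOUT\n",
	       seconds_until_timeout(NET_XMIT_DROP));
	printf("NET_XMIT_CN:   ~%.1fs until ETIMEDOUT\n",
	       seconds_until_timeout(NET_XMIT_CN));
	return 0;
}

With these assumptions the DROP case gives ~7.5 seconds total, while the CN/SUCCESS case gives on the order of minutes, which is the difference we are seeing.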
Given these observations, assume that a bonding device's slaves all suddenly call netif_carrier_off(), and that a TCP connection is in a zero window probe. There is a window for a behavioral issue here:
1. If the bonding device has not noticed that its slaves are down (perhaps its miimon interval is too large, or the miimon commit could not grab rtnl), then the slave's enqueue function is invoked. This will return either NET_XMIT_SUCCESS or NET_XMIT_CN. The probe timer is doubled.
2. If the bonding device notices the slaves are down, it sets the active slave to NULL and calls netif_carrier_off() on itself. dev_deactivate() is scheduled:
a. If dev_deactivate() executes, the bonding enqueue function (which was NULL) is now noop_enqueue and returns NET_XMIT_CN. The probe timer is doubled.
b. If dev_deactivate() cannot execute for some time (say, because it is waiting on rtnl), then bond_start_xmit() is called. It sees that it does not have an active slave and returns NET_XMIT_DROP. The probe timer is set to 0.5 seconds.
I believe it is safe to say that whenever the active slave is set to NULL, the bonding device also calls netif_carrier_off() on itself. But there is a window of time where active_slave == NULL and qdisc.enqueue != noop_enqueue, and during that window the return code is different. Consider the following timeline (a small model of these three states follows the diagram):
slaves go down
|
| returns NET_XMIT_CN
|
bond calls carrier_off
|
| returns NET_XMIT_DROP
|
dev_deactivate on bond
|
| returns NET_XMIT_CN
|
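To make those three states concrete, here is a toy userspace program that walks through them and prints the return code each one produces. Everything in it (structs and helpers alike) is invented for illustration and is not kernel code; rc = 2 is NET_XMIT_CN and rc = 1 is NET_XMIT_DROP, matching the trace further down.

/*
 * Toy model of the three windows in the timeline above. It only
 * mirrors the dispatch described in points 2-6: a device either has
 * an enqueue hook (its qdisc) or its xmit routine is called directly.
 */
#include <stdio.h>
#include <stdbool.h>

#define NET_XMIT_SUCCESS 0x00
#define NET_XMIT_DROP    0x01
#define NET_XMIT_CN      0x02

struct toy_dev {
	int (*enqueue)(struct toy_dev *dev);	/* NULL => call xmit directly */
	bool has_active_slave;			/* only meaningful for the bond */
};

/* noop_qdisc-like enqueue, installed once dev_deactivate() has run */
static int noop_enqueue(struct toy_dev *dev)
{
	(void)dev;
	return NET_XMIT_CN;
}

/* the slave's qdisc; already deactivated in this scenario, so CN */
static int slave_enqueue(struct toy_dev *dev)
{
	(void)dev;
	return NET_XMIT_CN;
}

/* bond_start_xmit()-like behaviour: no active slave => drop */
static int bond_xmit(struct toy_dev *bond, struct toy_dev *slave)
{
	if (!bond->has_active_slave)
		return NET_XMIT_DROP;
	/* otherwise re-enter the queueing layer for the chosen slave */
	return slave->enqueue(slave);
}

/* __dev_queue_xmit()-like dispatch for the bond device */
static int queue_xmit(struct toy_dev *bond, struct toy_dev *slave)
{
	if (bond->enqueue)
		return bond->enqueue(bond);
	return bond_xmit(bond, slave);
}

int main(void)
{
	struct toy_dev slave = { .enqueue = slave_enqueue };
	struct toy_dev bond  = { .enqueue = NULL, .has_active_slave = true };

	/* window 1: slaves down, bond has not noticed yet */
	printf("slaves down, bond unaware:    rc = %d\n", queue_xmit(&bond, &slave));

	/* window 2: miimon cleared the active slave, dev_deactivate() pending */
	bond.has_active_slave = false;
	printf("carrier off, not deactivated: rc = %d  <-- the 0.5s probe case\n",
	       queue_xmit(&bond, &slave));

	/* window 3: dev_deactivate() finally ran on the bond */
	bond.enqueue = noop_enqueue;
	printf("after dev_deactivate:         rc = %d\n", queue_xmit(&bond, &slave));
	return 0;
}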
Because the bonding xmit function should really only be a route to a slave device, I propose that it return NET_XMIT_CN in these scenarios instead of NET_XMIT_DROP. Note that the bond will still return NET_XMIT_DROP when it has NO slaves. Since miimon calls netif_carrier_off() when it doesn't have any active slaves, I believe this change will only affect the time between the bonding device's call to netif_carrier_off() and the execution of dev_deactivate().
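Roughly, the change I have in mind looks like the helper below. This is only a sketch to make the idea concrete, not the patch itself (that follows in the next mail), and the helper name is made up:

/* Sketch only (name is made up): drop-or-congested helper for the
 * bond's xmit path. While the bond still has enslaved devices but no
 * usable active slave, report congestion like noop_qdisc would, so the
 * probe timer keeps backing off until dev_deactivate() runs. With no
 * slaves at all, keep reporting a real drop.
 */
static inline netdev_tx_t bond_tx_drop_or_cn(struct bonding *bond,
					     struct sk_buff *skb)
{
	dev_core_stats_tx_dropped_inc(bond->dev);
	dev_kfree_skb_any(skb);
	if (!bond_has_slaves(bond))
		return NET_XMIT_DROP;
	return NET_XMIT_CN;
}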
I was able to see this issue with bpftrace:
10:28:27:283929 - dev_deactivate eth252
10:28:27:312805 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 0+1
10:28:27:760016 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 1+1
10:28:28:147410 - dev_deactivate eth251
10:28:28:348387 - dev_queue_xmit rc = 2
10:28:28:348394 - dev_queue_xmit rc = 2
10:28:28:670013 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 2+1
10:28:28:670066 - dev_queue_xmit rc = 2
10:28:28:670070 - dev_queue_xmit rc = 2
10:28:30:440025 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 3+1
10:28:30:440084 - dev_queue_xmit rc = 2
10:28:30:440088 - dev_queue_xmit rc = 2
10:28:33:505982 - netif_carrier_off bond1
netif_carrier_off+112
bond_set_carrier+296
bond_select_active_slave+296
bond_mii_monitor+1900
process_one_work+760
worker_thread+136
kthread+456
ret_from_kernel_thread+92
10:28:33:790050 - dev_queue_xmit rc = 1
10:28:33:870015 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 4+1
10:28:33:870061 - dev_queue_xmit rc = 1
10:28:34:300269 - dev_queue_xmit rc = 1
10:28:34:380025 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 5+1
10:28:34:380072 - dev_queue_xmit rc = 1
10:28:34:432446 - dev_queue_xmit rc = 1
10:28:34:810059 - dev_queue_xmit rc = 1
10:28:34:900014 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 6+1
10:28:34:900059 - dev_queue_xmit rc = 1
10:28:35:000050 - dev_queue_xmit rc = 1
10:28:35:330054 - dev_queue_xmit rc = 1
10:28:35:410020 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 7+1
10:28:35:410070 - dev_queue_xmit rc = 1
10:28:35:630865 - dev_queue_xmit rc = 1
10:28:35:850072 - dev_queue_xmit rc = 1
10:28:35:920025 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 8+1
10:28:35:920069 - dev_queue_xmit rc = 1
10:28:35:940348 - dev_queue_xmit rc = 1
10:28:36:140055 - dev_queue_xmit rc = 1
10:28:36:370050 - dev_queue_xmit rc = 1
10:28:36:430029 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 9+1
10:28:36:430089 - dev_queue_xmit rc = 1
10:28:36:460052 - dev_queue_xmit rc = 1
10:28:36:650049 - dev_queue_xmit rc = 1
10:28:36:880059 - dev_queue_xmit rc = 1
10:28:36:940024 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 10+1
10:28:36:940074 - dev_queue_xmit rc = 1
10:28:36:980044 - dev_queue_xmit rc = 1
10:28:37:160331 - dev_queue_xmit rc = 1
10:28:37:390060 - dev_queue_xmit rc = 1
10:28:37:450024 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 11+1
10:28:37:450082 - dev_queue_xmit rc = 1
10:28:37:480045 - dev_queue_xmit rc = 1
10:28:37:730281 - dev_queue_xmit rc = 1
10:28:37:900051 - dev_queue_xmit rc = 1
10:28:37:970019 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 12+1
10:28:37:970062 - dev_queue_xmit rc = 1
10:28:38:000045 - dev_queue_xmit rc = 1
10:28:38:240089 - dev_queue_xmit rc = 1
10:28:38:420053 - dev_queue_xmit rc = 1
10:28:38:490012 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 13+1
10:28:38:490076 - dev_queue_xmit rc = 1
10:28:38:520048 - dev_queue_xmit rc = 1
10:28:38:750069 - dev_queue_xmit rc = 1
10:28:39:000029 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 14+1
10:28:39:000093 - dev_queue_xmit rc = 1
10:28:39:030052 - dev_queue_xmit rc = 1
10:28:39:260046 - dev_queue_xmit rc = 1
10:28:39:450050 - dev_queue_xmit rc = 1
10:28:39:520044 - SK_ERR(110) port=8682 - snd_wnd 0
sk_error_report+12
tcp_write_err+64
tcp_write_timer_handler+564
tcp_write_timer+424
call_timer_fn+88
run_timer_softirq+1896
__do_softirq+348
do_softirq_own_stack+64
irq_exit+392
timer_interrupt+380
decrementer_common_virt+528
10:28:47:813297 - dev_deactivate bond1
Again, I know the easier solution is to remove rtnl users to reduce the time before dev_deactivate (we are working on that as well).
Nick Child (1):
bonding: Return TX congested if no active slave
include/net/bonding.h | 7 +++++++
1 file changed, 7 insertions(+)
--
2.43.0