lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <20231102172503.3413318-1-alexey.pakhunov@spacex.com>
Date: Thu, 2 Nov 2023 10:25:01 -0700
From: <alexey.pakhunov@...cex.com>
To: <mchan@...adcom.com>
CC: <vincent.wong2@...cex.com>, <netdev@...r.kernel.org>,
        <linux-kernel@...r.kernel.org>, <siva.kallam@...adcom.com>,
        <prashant@...adcom.com>, Alex Pakhunov <alexey.pakhunov@...cex.com>
Subject: [PATCH v2 0/2] tg3: Fix the TX ring stall

From: Alex Pakhunov <alexey.pakhunov@...cex.com>

This patch fixes a problem with the tg3 driver we encountered on several
machines having Broadcom 5719 NIC. The problem showed up as a 10-20 seconds
interruption in network traffic and these dmegs message followed by the NIC
registers dump:

=== dmesg ===
NETDEV WATCHDOG: eth0 (tg3): transmit queue 0 timed out
...
RIP: 0010:dev_watchdog+0x21e/0x230
...
tg3 0000:02:00.2 eth0: transmit timed out, resetting
=== ===

The issue was observed with "4.15.0-52-lowlatency #56~16.04.1-Ubuntu" and
"4.15.0-161-lowlatency #169~16.04.1-Ubuntu" kernels.

Based on the state of the TX queue at the time of the reset and analysis of
dev_watchdog() it appeared that the NIC has not been notified about packets
accumulated in the TX ring for TG3_TX_TIMEOUT seconds and was reset:

=== dmesg ===
tg3 0000:02:00.2 eth0: 0: Host status block [00000001:000000a0:(0000:06d8:0000):(0000:01a0)]
tg3 0000:02:00.2 eth0: 0: NAPI info [000000a0:000000a0:(0188:01a0:01ff):0000:(06f2:0000:0000:0000)]
=== ===

tnapi->hw_status->idx[0].tx_consumer is the same as tnapi->tx_cons (0x1a0)
meaning that the driver has processed all TX descriptions released by
the NIC. tnapi->tx_prod (0x188) is ahead of 0x1a0 meaning that there are
more descriptors in the TX ring ready to be sent but the NIC does not know
about that yet.

Further analysis showed that tg3_start_xmit() can stop the TX queue and
not tell the NIC about already enqueued packets. The specific sequence
is:

1. tg3_start_xmit() is called at least once and queues packet(s) without
   updating tnapi->prodmbox (netdev_xmit_more() returns true)
2. tg3_start_xmit() is called with an SKB which causes tg3_tso_bug() to be
   called.
3. tg3_tso_bug() determines that the SKB is too large [L7860], ...

        if (unlikely(tg3_tx_avail(tnapi) <= frag_cnt_est)) {

   ... stops the queue [L7861], and returns NETDEV_TX_BUSY [L7870]:

        netif_tx_stop_queue(txq);
        ...
        if (tg3_tx_avail(tnapi) <= frag_cnt_est)
                return NETDEV_TX_BUSY;

4. Since all tg3_tso_bug() call sites directly return, the code updating
   tnapi->prodmbox [L8138] is skipped.

5. The queue is stuck now. tg3_start_xmit() is not called while the queue
   is stopped. The NIC is not processing new packets because
   tnapi->prodmbox wasn't updated. tg3_tx() is not called by
   tg3_poll_work() because the all TX descriptions that could be freed has
   been freed [L7159]:

        /* run TX completion thread */
        if (tnapi->hw_status->idx[0].tx_consumer != tnapi->tx_cons) {
                tg3_tx(tnapi);

6. Eventually, dev_watchdog() fires resetting the queue.

As far as I can tell this sequence is still possible in HEAD of master.

I could not reproduce this stall by generating traffic to match conditions
required for tg3_tso_bug() to be called. Based on the driver's code
the SKB must be a TSO or GSO skb; it should contain a VLAN tag or extra TCP
header options; and it should be queued at exactly the right time.
I believe that the last part is what makes reproducing it harder.

However I was able to reproduce the stall by mimicing the behavior of
tg3_tso_bug() in tg3_start_xmit(). I added the following lines to
tg3_start_xmit() before "would_hit_hwbug = 0;" [L8046]:

        if (...) {
                netif_tx_stop_queue(txq);
                return NETDEV_TX_BUSY;
        }

        would_hit_hwbug = 0;

The condition is not super relevant. It was used to control when the stall
is induced, so that the network is not completely broken dueing testing.
This approach reproduced the issue rather reliably.

The proposed fix makes sure that the tnapi->prodmbox update happens
regardless of the reason tg3_start_xmit() returned. It essentially moves
the code updating tnapi->prodmbox from tg3_start_xmit() (which is renamed
to __tg3_start_xmit()) to a new wrapper. This makes sure all retun paths
are covered.

I tested this fix with the code inducing the TX stall from above. The fix
eliminated stalls completely.

An aternative approch, jumping to the code updating tnapi->prodmbox after
returning from tg3_tso_bug(), was considered. It yields a patch of almost
the same size. There are four branches in tg3_start_xmit() that would
need the goto: three tg3_tso_bug() call sites and the early return in
the very beginning of tg3_start_xmit(). This seemed like a more fragile
approach too since anyone modifying the function would need to be careful
to preserve the invatiant of leaving it through a particular branch.

Alex Pakhunov (2):
  tg3: Increment tx_dropped in tg3_tso_bug()
  tg3: Fix the TX ring stall

 drivers/net/ethernet/broadcom/tg3.c | 57 +++++++++++++++++++++++------
 1 file changed, 45 insertions(+), 12 deletions(-)


base-commit: ffc253263a1375a65fa6c9f62a893e9767fbebfa
-- 
2.39.3


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ