netdev - RFC: possible Cadence GEM race condition in NAPI polling

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [day] [month] [year] [list]
Message-ID: <7330510cf1d9be810f7c168b79ce0693d5839e6d.camel@calian.com>
Date:   Fri, 25 Feb 2022 22:32:03 +0000
From:   Robert Hancock <robert.hancock@...ian.com>
To:     "netdev@...r.kernel.org" <netdev@...r.kernel.org>
CC:     "nicolas.ferre@...rochip.com" <nicolas.ferre@...rochip.com>,
        "davem@...emloft.net" <davem@...emloft.net>,
        "claudiu.beznea@...rochip.com" <claudiu.beznea@...rochip.com>,
        "kuba@...nel.org" <kuba@...nel.org>
Subject: RFC: possible Cadence GEM race condition in NAPI polling

Hi all,

We've been debugging an issue we've been seeing on the Xilinx MPSoC platform
with the GEM Ethernet core. The problem we're seeing is that in rare cases,
receive packet processing is delayed - likely until some other packet shows up
and causes that packet to actually get processed. This has caused timeouts to
occur in a point-to-point, UDP-based request/response protocol where there are
no automatic retransmissions.

One suspect for this sort of thing is the "rotting packet" problem where an RX
packet notification is lost during the transition from NAPI poll-based
processing to re-enabling interrupts. Looking at the code in macb_poll in
drivers/net/ethernet/cadence/macb_main.c:

	status = macb_readl(bp, RSR);
	macb_writel(bp, RSR, status);

	netdev_vdbg(bp->dev, "poll: status = %08lx, budget = %d\n",
		    (unsigned long)status, budget);

	work_done = bp->macbgem_ops.mog_rx(queue, napi, budget);
	if (work_done < budget) {
		napi_complete_done(napi, work_done);

		/* Packets received while interrupts were disabled */
		status = macb_readl(bp, RSR);
		if (status) {
			if (bp->caps & MACB_CAPS_ISR_CLEAR_ON_WRITE)
				queue_writel(queue, ISR, MACB_BIT(RCOMP));
			napi_reschedule(napi);
		} else {
			queue_writel(queue, IER, bp->rx_intr_mask);
		}
	}

The check of RSR after napi_complete_done is apparently required because bits
being set in RSR while the corresponding interrupt is disabled don't cause the
interrupt to get asserted when the interrupt gets re-enabled (i.e. the
propagation of RSR to ISR is edge-triggered, not level-triggered). However, it
appears there may also be a race window between reading RSR and enabling
interrupts. If a packet is received between a read of RSR which returns 0 and
the write to IER to enable the interrupt, the same problem can occur where RSR
is set prior to enabling the interrupt in IER, causing no interrupt to be
raised.

Somewhat curiously, instead of stalling all RX traffic from then on, as one
might expect, it seems another subsequent packet arrival must cause an
interrupt to be raised even though the RCOMP bit in RSR was already 1. This un-
stalls the receive process and causes the previously ignored packet(s) to be
processed. This also seems to have made the bug subtle enough that it has gone
unnoticed thus far.

The following patch seems to resolve the issue:

 	work_done = bp->macbgem_ops.mog_rx(queue, napi, budget);
 	if (work_done < budget) {
+		queue_writel(queue, IER, bp->rx_intr_mask);
 		napi_complete_done(napi, work_done);
 
 		/* Packets received while interrupts were disabled */
 		status = macb_readl(bp, RSR);
 		if (status) {
+			queue_writel(queue, IDR, bp->rx_intr_mask);
 			if (bp->caps & MACB_CAPS_ISR_CLEAR_ON_WRITE)
 				queue_writel(queue, ISR, MACB_BIT(RCOMP));
 			napi_reschedule(napi);
-		} else {
-			queue_writel(queue, IER, bp->rx_intr_mask);
 		}

However, the downside is it always re-enables interrupts even when it ends up
disabling them again soon afterwards. Another possibility is a "double-check"
approach like this:

		/* Packets received while interrupts were disabled */
		status = macb_readl(bp, RSR);
		if (status) {
			if (bp->caps & MACB_CAPS_ISR_CLEAR_ON_WRITE)
				queue_writel(queue, ISR, MACB_BIT(RCOMP));
			napi_reschedule(napi);
                } else {
                        queue_writel(queue, IER, bp->rx_intr_mask);
+                       status = macb_readl(bp, RSR);
+                       if (status) {
+                               queue_writel(queue, IDR, bp->rx_intr_mask);
+                               if (bp->caps & MACB_CAPS_ISR_CLEAR_ON_WRITE)
+                                       queue_writel(queue, ISR,
MACB_BIT(RCOMP));
+                               napi_schedule(napi);
+                       }
                }

That way the enable/disable only occurs in the cases where the race is
triggered and not every time, but is more complex. This second approach hasn't
been extensively tested yet, but it seems like it should also work.

Both approaches can potentially have a race with the interrupt handler (i.e. if
an interrupt is raised due to a new packet right after IER is written to enable
interrupts again, both the ISR and this code can end up doing the same thing).
However, as far as I can tell that seems harmless.

Looking for some thoughts on which approach would be better in this case? Or
something else we haven't thought of?

-- 
Robert Hancock
Senior Hardware Designer, Calian Advanced Technologies
www.calian.com