netdev - Re: [PATCH net] Revert "r8169: don't try to disable interrupts if NAPI is, scheduled already"

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <84d8b645-d756-4a07-9062-cece62fcfd50@gmail.com>
Date: Thu, 16 May 2024 22:10:23 +0100
From: Ken Milmore <ken.milmore@...il.com>
To: "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
 Heiner Kallweit <hkallweit1@...il.com>
Cc: Realtek linux nic maintainers <nic_swsd@...ltek.com>
Subject: Re: [PATCH net] Revert "r8169: don't try to disable interrupts if
 NAPI is, scheduled already"

On 15/05/2024 07:18, Heiner Kallweit wrote:
> This reverts commit 7274c4147afbf46f45b8501edbdad6da8cd013b9.
> 
> Ken reported that RTL8125b can lock up if gro_flush_timeout has the
> default value of 20000 and napi_defer_hard_irqs is set to 0.
> In this scenario device interrupts aren't disabled, what seems to
> trigger some silicon bug under heavy load. I was able to reproduce this
> behavior on RTL8168h. Fix this by reverting 7274c4147afb.
> 
> Fixes: 7274c4147afb ("r8169: don't try to disable interrupts if NAPI is scheduled already")
> Cc: stable@...r.kernel.org
> Reported-by: Ken Milmore <ken.milmore@...il.com>
> Signed-off-by: Heiner Kallweit <hkallweit1@...il.com>
> ---
>  drivers/net/ethernet/realtek/r8169_main.c | 6 ++----
>  1 file changed, 2 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
> index 0fc5fe564ae5..69606c8081a3 100644
> --- a/drivers/net/ethernet/realtek/r8169_main.c
> +++ b/drivers/net/ethernet/realtek/r8169_main.c
> @@ -4655,10 +4655,8 @@ static irqreturn_t rtl8169_interrupt(int irq, void *dev_instance)
>  		rtl_schedule_task(tp, RTL_FLAG_TASK_RESET_PENDING);
>  	}
>  
> -	if (napi_schedule_prep(&tp->napi)) {
> -		rtl_irq_disable(tp);
> -		__napi_schedule(&tp->napi);
> -	}
> +	rtl_irq_disable(tp);
> +	napi_schedule(&tp->napi);
>  out:
>  	rtl_ack_events(tp, status);
>  

FYI, by now I am reasonably well convinced that the behaviour we've been seeing
is not in fact a silicon bug, but rather a very specific behaviour regarding how
these devices raise MSI interrupts.

This is largely due to the investigations by David Dillow described exhaustively
in the 2009 netdev thread linked below. I wish I had spotted this much sooner!
This information has been corroborated by my own testing on the RTL8125b:
https://lore.kernel.org/netdev/1242001754.4093.12.camel@obelisk.thedillows.org/T/

To summarise precisely what I think the behaviour is:

********
An interrupt is generated *only* when the device registers undergo a transition
from (status & mask) == 0 to (status & mask) != 0.
********

If the above holds, then calling rtl_irq_disable() will immediately force the
condition (status & mask) == 0, so we are ready to raise another interrupt when
interrupts are subsequently enabled again. 

To try and verify this, I tried the code below, which locks up the network
traffic immediately, regardless of the setting of napi_defer_hard_irqs:

diff --git linux-source-6.1~/drivers/net/ethernet/realtek/r8169_main.c linux-source-6.1/drivers/net/ethernet/realtek/r8169_main.c
index 2ce4bff..add5bdd 100644
--- linux-source-6.1~/drivers/net/ethernet/realtek/r8169_main.c
+++ linux-source-6.1/drivers/net/ethernet/realtek/r8169_main.c
@@ -4607,10 +4607,13 @@ static irqreturn_t rtl8169_interrupt(int irq, void *dev_instance)
 		rtl_schedule_task(tp, RTL_FLAG_TASK_RESET_PENDING);
 	}

-	rtl_irq_disable(tp);
 	napi_schedule(&tp->napi);
 out:
+	u32 status2 = rtl_get_events(tp);
 	rtl_ack_events(tp, status);
+	if(status2 & ~status)
+		printk_ratelimited("rtl8169_interrupt: status=%x status2=%x\n",
+				   status, status2);

 	return IRQ_HANDLED;
 }

Here's some typical dmesg output:

[11315.581136] rtl8169_interrupt: status=1 status2=85
[11324.142176] r8169 0000:07:00.0 eth0: ASPM disabled on Tx timeout
[11324.151765] rtl8169_interrupt: status=4 status2=84

We can see that when a new interrupt is flagged in the interval between reading
the status register and writing to it, we may never achieve the condition
(status & mask) == 0.

So, if we read 0x01 (RxOK) from the status register, we will then write
0x01 back to acknowledge the interrupt. But in the meantime, 0x04 (TxOK) has
been flagged, as well as 0x80 (TxDescUnavail), so the register now contains
0x85. We acknowledge by writing back 0x01, so the status register should now
contain 0x84. If interrupts are unmasked throughout, then (status & mask) != 0
throughout, so no interrupt will be raised for the missed TxOK event, unless
something else should occur to set one of the other status bits.

To test this hypothesis, I tried the code below, which never disables
interrupts but instead clears out the status register on every interrupt:

index 2ce4bff..dbda9ef 100644
--- linux-source-6.1~/drivers/net/ethernet/realtek/r8169_main.c
+++ linux-source-6.1/drivers/net/ethernet/realtek/r8169_main.c
@@ -4610,7 +4610,7 @@ static irqreturn_t rtl8169_interrupt(int irq, void *dev_instance)
 	rtl_irq_disable(tp);
 	napi_schedule(&tp->napi);
 out:
-	rtl_ack_events(tp, status);
+	rtl_ack_events(tp, tp->irq_mask);

 	return IRQ_HANDLED;
 }

This passed my iperf3 test perfectly! It is likely to cause other problems
though: Specifically it opens the possibility that we will miss a SYSErr,
LinkChg or RxFIFOOver interrupt. Hence the rationale for achieving the required
(status & mask) == 0 condition by clearing the mask register instead.

I hope this information may prove useful in the future.