Message-Id: <20200530040157.31038-1-john.stultz@linaro.org>
Date:   Sat, 30 May 2020 04:01:57 +0000
From:   John Stultz <john.stultz@...aro.org>
To:     lkml <linux-kernel@...r.kernel.org>
Cc:     John Stultz <john.stultz@...aro.org>,
        Guenter Roeck <linux@...ck-us.net>,
        Heikki Krogerus <heikki.krogerus@...ux.intel.com>,
        Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
        YongQin Liu <yongqin.liu@...aro.org>, linux-usb@...r.kernel.org
Subject: [RFC][PATCH] usb: typec: tcpci_rt1711h: Try to avoid screaming irq causing boot hangs

I've recently (since 5.7-rc1) started noticing very rare hangs
pretty early in bootup on my HiKey960 board.

They have been particularly difficult to debug, as the system
doesn't seem to respond at all to sysrq commands. However, the
system is alive, as I'll occasionally see firmware loading timeout
errors after a while. Enabling debug options like initcall_debug
and lockdep wasn't informative, as they tended to make the problem
hide.

I finally tried to dig into this a bit more today, and noticed
that the last dmesg output before the hang was usually:
  "random: crng init done"

So I dumped the stack at that point, and saw it was being called
from the pl061 gpio irq, and that the hang always occurred when the
crng init finished on cpu 0. Instrumenting that further, I could see
that when the issue triggered, we were getting a continuous stream
of irqs.

Chasing further, I found the screaming irq was for the rt1711h,
and narrowed it down to the !chip->tcpci check, which immediately
returns IRQ_HANDLED without reading or clearing the alert register,
so nothing stops the irq from triggering again immediately afterwards.

This patch slightly reworks the logic, so that if we hit the irq
before chip->tcpci has been assigned, we still read and write back
(i.e. clear) the alert register, but just skip calling tcpci_irq().

With this change, I haven't managed to trip over the problem
(though I haven't been testing for super long; I did confirm that
I hit the error case and that it didn't hang the system).

I still have some concern that I don't know why this cropped
up in 5.7-rc, as there haven't been any changes to the driver
since 5.4 (or before). It may just be that the initialization
timing has changed due to something else, and that has just
exposed this issue? I'm not sure, and that's not super reassuring.

Anyway, I'd love to hear your thoughts if this looks like a sane
fix or not.

Cc: Guenter Roeck <linux@...ck-us.net>
Cc: Heikki Krogerus <heikki.krogerus@...ux.intel.com>
Cc: Greg Kroah-Hartman <gregkh@...uxfoundation.org>
Cc: YongQin Liu <yongqin.liu@...aro.org>
Cc: linux-usb@...r.kernel.org
Signed-off-by: John Stultz <john.stultz@...aro.org>
---
 drivers/usb/typec/tcpm/tcpci_rt1711h.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/usb/typec/tcpm/tcpci_rt1711h.c b/drivers/usb/typec/tcpm/tcpci_rt1711h.c
index 017389021b96..530fd2c111ad 100644
--- a/drivers/usb/typec/tcpm/tcpci_rt1711h.c
+++ b/drivers/usb/typec/tcpm/tcpci_rt1711h.c
@@ -159,9 +159,6 @@ static irqreturn_t rt1711h_irq(int irq, void *dev_id)
 	u8 status;
 	struct rt1711h_chip *chip = dev_id;
 
-	if (!chip->tcpci)
-		return IRQ_HANDLED;
-
 	ret = rt1711h_read16(chip, TCPC_ALERT, &alert);
 	if (ret < 0)
 		goto out;
@@ -176,6 +173,9 @@ static irqreturn_t rt1711h_irq(int irq, void *dev_id)
 	}
 
 out:
+	if (!chip->tcpci)
+		return IRQ_HANDLED;
+
 	return tcpci_irq(chip->tcpci);
 }
 
-- 
2.17.1
