lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20201005154529.GB628746@ulmo>
Date:   Mon, 5 Oct 2020 17:45:29 +0200
From:   Thierry Reding <thierry.reding@...il.com>
To:     Marc Zyngier <maz@...nel.org>
Cc:     linux-tegra@...r.kernel.org, linux-kernel@...r.kernel.org,
        linux-arm-kernel@...ts.infradead.org,
        Jonathan Hunter <jonathanh@...dia.com>,
        Dmitry Osipenko <digetx@...il.com>,
        Sowjanya Komatineni <skomatineni@...dia.com>,
        Venkat Reddy Talla <vreddytalla@...dia.com>,
        Thomas Gleixner <tglx@...utronix.de>, kernel-team@...roid.com
Subject: Re: [PATCH 0/3] soc/tegra: Prevent the PMC driver from corrupting
 interrupt routing

On Mon, Oct 05, 2020 at 02:06:37PM +0100, Marc Zyngier wrote:
> On 2020-10-05 12:22, Thierry Reding wrote:
> > On Mon, Oct 05, 2020 at 12:14:40PM +0100, Marc Zyngier wrote:
> > > Jon recently reported that one of the Tegra systems (Jetson TX2, aka
> > > tegra186) stopped booting with the introduction of the "IPI as IRQs"
> > > series. After a few weeks of head scratching and complete puzzlement,
> > > I obtained a board and started looking at what was happening.
> > > 
> > > The interrupt hierarchy looks like this:
> > > 
> > > 	[DEVICE] -A-> [PMC] -B-> [GIC]
> > > 
> > > which seems simple enough. However, not all the devices attached to
> > > the PMC follow this hierarchy, and in some cases, the 'B' link isn't
> > > present in the HW. In other cases, neither 'A' nor 'B' are present.
> > > And yet the PMC driver creates such linkages using random hwirq values
> > > for the non-existent links, potentially overriding existing mappings
> > > in the process. "What could possibly go wrong?"
> > 
> > Yes, that would've been my fault. It seemed like the right thing to do
> > at the time, but the way you describe it makes it obvious that it was
> > not. I can't say I understand why this would've worked prior to the
> > rework that made this surface, though.
> 
> Because until these IPI patches, the range 0-7 never ever appeared
> as actual hwirqs in the GIC domain. SGIs were handled in the GIC
> code, behind the core kernel's back. As soon as we start using
> an actual domain mapping for hwirq 0, the PMC driver starts messing
> with it.

Okay, that makes sense.

> > > It turns out that for the 'B' link, the PMC driver uses hwirq 0, which
> > > is SGI0 for the GIC, and used as the rescheduling IPI. Obviously, this
> > > doesn't go very well, nor very far, as the IPI gets routed to random
> > > drivers. Also, as the handling flow has been overridden, this
> > > interrupt never gets deactivated and can't fire anymore. Yes, this is
> > > bad.
> > > 
> > > The 'A' link is less problematic, but the hwirq value is still out of
> > > the irqdomain range, and gets remapped every time a new 'A'-less
> > > driver comes up.
> > > 
> > > Instead, let's trim the unused hierarchy levels as needed. This
> > > requires some checks in the upper levels of the hierarchy as we now
> > > have optional levels, but this looks a lot saner than what we
> > > currently have. With this, tegra186 is back booting on -next.
> > > 
> > > I haven't tested any wake-up stuff, nor any other nvidia system (this
> > > is the only one I have). If people agree to these changes, I can take
> > > them via the irqchip tree so that they make it into the next merge
> > > window.
> > 
> > Yeah, it sounds like this needs to go in ideally before the rework that
> > caused this to surface in order to preserve bisectibility. But if it
> > goes in afterwards that's probably fine as well.
> 
> It's easy to take it as part of the same pull request as the IPI
> stuff. Not fully bisectable for this platform, but close enough.
> I may even be able to merge this in *before* the IPI patches
> (I'd need to rebuild irq/irqchip-next, but that won't alter the commit
> ids of the individual patches as they are on separate branches).

I noticed there's a small conflict with another patch that I've queued
up and that I was thinking of sending in for v5.10. This adds a callback
for .irq_retrigger. I don't exactly recall why I needed this, but I've
been carrying that patch locally for a while now. I don't think it's
critical, but there must've been an issue that it fixed. Unfortunately I
seem to have been in a hurry because the commit message is a bit terse.

I can probably just leave that out for now to not further complicated
things, in which case it should be fine to take your patches here into
v5.10 along with the IPI changes. As I already mentioned I'm a bit
unhappy about the needless move of the irq_chip to a global variable,
but I can send out an updated version with those changed back, which is
basically what I have on my local test branch because I had to manually
apply that patch anyway due to the conflict I mentioned above, so
keeping the embedded irq_chip was actually simpler and reduces the diff
a little bit.

> > Let Jon and myself do a bit of testing with this to verify that the wake
> > up paths are still working.
> 
> Sure. Let me know what you find.

The results are in and it's a bit of a mixed bag. I was able to confirm
that Tegra194 also boots again after this series and I'm also able to
resume from sleep using either rtcwake or the power-key as wakeup
source, so the wake-events mechanism is still functional after the
series. I do see a bit of breakage on resume, but none of that seems
related to your patches and is likely something that crept in while we
were looking into the current issue.

Jon had started a job in our test farm in parallel and that came back
with a failing suspend/resume test on Tegra186 (Jetson TX2), but that
seems to have been a pre-existing issue. This was already in linux-next
around next-20200910 and Jon had been investigating it when the boot
failures due to the IPI changes started happening. So I then hooked up
my Jetson TX2 and verified locally that I can properly suspend/resume
using either rtcwake or the power-key as wakeup source, just like I
previously did on Tegra194 (Jetson AGX Xavier). Tegra186 seems to be a
little more unstable because it didn't boot every time for me, but that
is probably not related to this.

I also did a bit of testing on Tegra210 and that seems to also still
work. We cannot test suspend/resume and therefore wake-events there
out-of-the-box because of some long-standing issue with USB refusing
to suspend, but I was able to at least trigger wake-up using the
RTC by first unloading the XUSB driver. It doesn't resume to a fully
functional system, but if I do see it resume, then the wake-event
code must have worked.

So, I'm tempted to say:

Tested-by: Thierry Reding <treding@...dia.com>

on the series and then we can pick up the other pieces once everything
is back to at least booting.

Thanks again for tracking this down, and apologies for breaking it in
the first place. Let me know if you want me to send the bike-shedded
version of your patches.

Thierry

Download attachment "signature.asc" of type "application/pgp-signature" (834 bytes)

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ