Message-ID: <20240322160640.GF5634@willie-the-truck>
Date: Fri, 22 Mar 2024 16:06:40 +0000
From: Will Deacon <will@...nel.org>
To: Tyler Hicks <code@...icks.com>
Cc: Robin Murphy <robin.murphy@....com>, Jason Gunthorpe <jgg@...pe.ca>,
Jerry Snitselaar <jsnitsel@...hat.com>,
linux-arm-kernel@...ts.infradead.org, iommu@...ts.linux.dev,
linux-kernel@...r.kernel.org, Dexuan Cui <decui@...rosoft.com>,
Easwar Hariharan <eahariha@...ux.microsoft.com>
Subject: Re: Why is the ARM SMMU v1/v2 put into bypass mode on kexec?
On Tue, Mar 19, 2024 at 02:14:26PM -0500, Tyler Hicks wrote:
> On 2024-03-19 15:47:56, Will Deacon wrote:
> > On Tue, Mar 19, 2024 at 12:57:52PM +0000, Robin Murphy wrote:
> > > Beyond properly quiescing and resetting the system back to a boot-time
> > > state, the outgoing kernel in a kexec can only really do things which affect
> > > itself. Sure, we *could* configure the SMMU to block all traffic and disable
> > > the interrupt to avoid getting stuck in a storm of faults on the way out,
> > > but what does that mean for the incoming kexec payload? That it can have the
> > > pleasure of discovering the SMMU, innocently enabling the interrupt and
> > > getting stuck in an unexpected storm of faults. Or perhaps just resetting
> > > the SMMU into a disabled state and thus still unwittingly allowing its
> > > memory to be corrupted by the previous kernel not supporting kexec properly.
> >
> > Right, it's hard to win if DMA-active devices weren't quiesced properly
> > by the outgoing kernel. Either the SMMU was left in abort (leading to the
> > problems you list above) or the SMMU is left in bypass (leading to possible
> > data corruption). Which is better?
>
> My thoughts are that a loud and obvious failure (via unidentified stream
> fault messages and/or a possible interrupt storm preventing the new
> kernel from booting) is favorable to silent and subtle data corruption
> of the target kernel.

Looking at the SMMUv3 spec, the architecture does actually allow hardware
to reset into an aborting state:

[GBPA.ABORT]
| Note: An implementation can reset this field to 1, in order to
| implement a default deny policy on reset.

so perhaps it's not that unreasonable. I just dread the flood of emails
I'll get because the SMMU driver is noisy due to missing ->shutdown()
callbacks elsewhere :/

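For reference, a rough userspace sketch of what "leave the SMMU aborting" could look like for SMMUv3, where GBPA governs incoming traffic while the SMMU is disabled. The bit positions match what the Linux SMMUv3 driver uses for SMMU_GBPA; everything else (fake_gbpa, gbpa_read(), gbpa_write(), gbpa_wait_idle(), gbpa_set_abort()) is a hypothetical stand-in for the real MMIO accessors, not the driver's actual code:

```c
#include <stdbool.h>
#include <stdint.h>

/* SMMU_GBPA bit positions, as used by the Linux SMMUv3 driver. */
#define GBPA_UPDATE (UINT32_C(1) << 31)
#define GBPA_ABORT  (UINT32_C(1) << 20)

/*
 * Hypothetical stand-in for the memory-mapped register. Real code would
 * go through readl()/writel() on the SMMU's MMIO mapping.
 */
static uint32_t fake_gbpa;

static uint32_t gbpa_read(void)
{
	return fake_gbpa;
}

static void gbpa_write(uint32_t val)
{
	/* Model of the hardware: latch the fields, then clear UPDATE. */
	fake_gbpa = val & ~GBPA_UPDATE;
}

/* Poll until no update is in flight; give up after max_polls reads. */
static bool gbpa_wait_idle(unsigned int max_polls)
{
	while (max_polls--) {
		if (!(gbpa_read() & GBPA_UPDATE))
			return true;
	}
	return false;
}

/*
 * Set or clear GBPA.ABORT: wait for any pending update, write the new
 * value with UPDATE set, then wait for UPDATE to clear again.
 */
static bool gbpa_set_abort(bool abort)
{
	uint32_t val;

	if (!gbpa_wait_idle(1000))
		return false;

	val = gbpa_read();
	if (abort)
		val |= GBPA_ABORT;
	else
		val &= ~GBPA_ABORT;

	gbpa_write(val | GBPA_UPDATE);
	return gbpa_wait_idle(1000);
}
```

Calling gbpa_set_abort(true) on the way out would correspond to the "default deny" policy the spec note describes; gbpa_set_abort(false) corresponds to bypass.
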
> > The best solution is obviously to implement those missing ->shutdown()
> > callbacks.
>
> Completely agree here but it can be difficult to even identify that a
> missing ->shutdown hook is the root cause without code changes to put
> the SMMU into abort mode and sleep for a bit in the SMMU's ->shutdown
> hook.

Perhaps that's the thing to tackle first, then? If we make it easier for
folks to diagnose and fix the missing ->shutdown() callbacks, then going
into abort is much more reasonable.

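As a strawman for the kind of diagnostic that might help, here is a toy userspace model that flags bound, DMA-capable devices whose driver has no shutdown callback. None of this is kernel API: struct toy_driver, struct toy_device, the dma_capable flag, the sample table and report_missing_shutdown() are all made up for illustration, and a real version would have to walk the driver core's device list instead:

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, simplified stand-ins for driver and device objects. */
struct toy_driver {
	const char *name;
	void (*shutdown)(void);
};

struct toy_device {
	const char *name;
	const struct toy_driver *drv;	/* NULL if no driver is bound */
	int dma_capable;		/* assumed flag: device can master DMA */
};

static void toy_noop_shutdown(void)
{
}

/*
 * Print one line per bound, DMA-capable device whose driver lacks a
 * shutdown callback, and return how many were found.
 */
static size_t report_missing_shutdown(const struct toy_device *devs,
				      size_t n, FILE *out)
{
	size_t i, missing = 0;

	for (i = 0; i < n; i++) {
		if (!devs[i].dma_capable || !devs[i].drv)
			continue;
		if (devs[i].drv->shutdown)
			continue;
		fprintf(out, "%s: driver %s has no shutdown callback\n",
			devs[i].name, devs[i].drv->name);
		missing++;
	}
	return missing;
}

/* Sample data: one well-behaved driver, one that never quiesces DMA. */
static const struct toy_driver toy_good_drv = { "toy-good", toy_noop_shutdown };
static const struct toy_driver toy_bad_drv = { "toy-bad", NULL };

static const struct toy_device toy_devices[] = {
	{ "dev0", &toy_good_drv, 1 },
	{ "dev1", &toy_bad_drv, 1 },
	{ "dev2", NULL, 1 },
};
```

Running report_missing_shutdown(toy_devices, 3, stderr) reports only dev1, which is the sort of pointer at the culprit that would make an aborting SMMU easier to live with.
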
Will