linux-kernel - Re: Why is the ARM SMMU v1/v2 put into bypass mode on kexec?

Message-ID: <20240322160640.GF5634@willie-the-truck>
Date: Fri, 22 Mar 2024 16:06:40 +0000
From: Will Deacon <will@...nel.org>
To: Tyler Hicks <code@...icks.com>
Cc: Robin Murphy <robin.murphy@....com>, Jason Gunthorpe <jgg@...pe.ca>,
	Jerry Snitselaar <jsnitsel@...hat.com>,
	linux-arm-kernel@...ts.infradead.org, iommu@...ts.linux.dev,
	linux-kernel@...r.kernel.org, Dexuan Cui <decui@...rosoft.com>,
	Easwar Hariharan <eahariha@...ux.microsoft.com>
Subject: Re: Why is the ARM SMMU v1/v2 put into bypass mode on kexec?

On Tue, Mar 19, 2024 at 02:14:26PM -0500, Tyler Hicks wrote:
> On 2024-03-19 15:47:56, Will Deacon wrote:
> > On Tue, Mar 19, 2024 at 12:57:52PM +0000, Robin Murphy wrote:
> > > Beyond properly quiescing and resetting the system back to a boot-time
> > > state, the outgoing kernel in a kexec can only really do things which affect
> > > itself. Sure, we *could* configure the SMMU to block all traffic and disable
> > > the interrupt to avoid getting stuck in a storm of faults on the way out,
> > > but what does that mean for the incoming kexec payload? That it can have the
> > > pleasure of discovering the SMMU, innocently enabling the interrupt and
> > > getting stuck in an unexpected storm of faults. Or perhaps just resetting
> > > the SMMU into a disabled state and thus still unwittingly allowing its
> > > memory to be corrupted by the previous kernel not supporting kexec properly.
> > 
> > Right, it's hard to win if DMA-active devices weren't quiesced properly
> > by the outgoing kernel. Either the SMMU was left in abort (leading to the
> > problems you list above) or the SMMU is left in bypass (leading to possible
> > data corruption). Which is better?
> 
> My thoughts are that a loud and obvious failure (via unidentified stream
> fault messages and/or a possible interrupt storm preventing the new
> kernel from booting) is favorable to silent and subtle data corruption
> of the target kernel.

Looking at the SMMUv3 spec, the architecture does actually allow hardware
to reset into an aborting state:

[GBPA.ABORT]
  | Note: An implementation can reset this field to 1, in order to
  | implement a default deny policy on reset.

so perhaps it's not that unreasonable. I just dread the flood of emails
I'll get because the SMMU driver is noisy due to missing ->shutdown()
callbacks elsewhere :/
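
For illustration only, a minimal user-space sketch of what "going into abort"
via GBPA looks like. The bit positions (UPDATE at bit 31, ABORT at bit 20) and
the write-with-UPDATE-then-poll sequence follow the SMMUv3 GBPA register
description; everything named fake_smmu* is a hypothetical stand-in for the
real MMIO register and hardware, and this is not the driver's actual code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions as in the SMMUv3 GBPA register layout. */
#define GBPA_UPDATE	(UINT32_C(1) << 31)
#define GBPA_ABORT	(UINT32_C(1) << 20)

/* Hypothetical stand-in for the memory-mapped GBPA register. */
struct fake_smmu {
	uint32_t gbpa;
};

/* Hypothetical hardware model: completes a pending update by clearing UPDATE. */
static void fake_smmu_hw_step(struct fake_smmu *smmu)
{
	smmu->gbpa &= ~GBPA_UPDATE;
}

/*
 * Set or clear GBPA.ABORT. Returns false if a previous update is still
 * pending or if the (modelled) hardware does not complete within max_polls.
 */
static bool fake_smmu_set_abort(struct fake_smmu *smmu, bool abort,
				unsigned int max_polls)
{
	uint32_t val;

	assert(smmu);

	/* Software must not write while an earlier update is in flight. */
	if (smmu->gbpa & GBPA_UPDATE)
		return false;

	val = smmu->gbpa;
	if (abort)
		val |= GBPA_ABORT;	/* incoming traffic is terminated */
	else
		val &= ~GBPA_ABORT;	/* incoming traffic bypasses the SMMU */

	/* Write the new attributes together with UPDATE, then poll. */
	smmu->gbpa = val | GBPA_UPDATE;

	while (max_polls--) {
		fake_smmu_hw_step(smmu);
		if (!(smmu->gbpa & GBPA_UPDATE))
			return true;
	}
	return false;
}
```

In this model, fake_smmu_set_abort(&smmu, true, 4) corresponds to leaving the
SMMU in abort on the way out, and passing false corresponds to bypass.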

> > The best solution is obviously to implement those missing ->shutdown()
> > callbacks.
> 
> Completely agree here but it can be difficult to even identify that a
> missing ->shutdown hook is the root cause without code changes to put
> the SMMU into abort mode and sleep for a bit in the SMMU's ->shutdown
> hook.

Perhaps that's the thing to tackle first, then? If we make it easier for
folks to diagnose and fix the missing ->shutdown() callbacks, then going
into abort is much more reasonable.
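
One possible shape for such a diagnostic, sketched as a user-space model
rather than kernel code: walk the bound devices and name every DMA-capable
one whose driver has no ->shutdown() callback. All model_* types, the
dma_capable flag and report_missing_shutdown() are hypothetical and only
loosely mirror the kernel's driver model.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

struct model_device;

/* Hypothetical, simplified driver: only the callback of interest. */
struct model_driver {
	const char *name;
	void (*shutdown)(struct model_device *dev);	/* may be NULL */
};

/* Hypothetical, simplified device. */
struct model_device {
	const char *name;
	const struct model_driver *drv;	/* NULL if no driver is bound */
	int dma_capable;		/* assumed flag: device can master DMA */
};

/* Example callback for drivers that do quiesce their device. */
static void model_noop_shutdown(struct model_device *dev)
{
	(void)dev;
}

/*
 * Print and count the DMA-capable devices whose bound driver cannot
 * quiesce them at shutdown/kexec time.
 */
static size_t report_missing_shutdown(const struct model_device *devs,
				      size_t n)
{
	size_t missing = 0;

	assert(devs || n == 0);

	for (size_t i = 0; i < n; i++) {
		const struct model_device *d = &devs[i];

		if (!d->dma_capable || !d->drv || d->drv->shutdown)
			continue;

		fprintf(stderr, "%s: driver %s has no ->shutdown()\n",
			d->name, d->drv->name);
		missing++;
	}
	return missing;
}
```

A report like this, emitted before the SMMU is put into abort, would point at
the offending driver directly instead of leaving only the resulting faults to
go on.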

Will