Message-ID: <20220420100801.GA7235@willie-the-truck>
Date: Wed, 20 Apr 2022 11:08:04 +0100
From: Will Deacon <will@...nel.org>
To: Shanker R Donthineni <sdonthineni@...dia.com>
Cc: Marc Zyngier <maz@...nel.org>,
Catalin Marinas <catalin.marinas@....com>,
Mark Rutland <mark.rutland@....com>,
linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org,
Ard Biesheuvel <ardb@...nel.org>,
Vikram Sethi <vsethi@...dia.com>,
Thierry Reding <treding@...dia.com>,
Anshuman Khandual <anshuman.khandual@....com>
Subject: Re: [PATCH] arm64: head: Fix cache inconsistency of the identity-mapped region
Hi Shanker,
On Mon, Apr 18, 2022 at 07:53:20AM -0500, Shanker R Donthineni wrote:
> On 4/18/22 4:16 AM, Marc Zyngier wrote:
> > External email: Use caution opening links or attachments
Ok.
> > On Fri, 15 Apr 2022 18:05:03 +0100,
> > Shanker Donthineni <sdonthineni@...dia.com> wrote:
> >> The secondary cores boot is stuck due to data abort while executing the
> >> instruction 'ldr x8, =__secondary_switched'. The RELA value of this
> >> instruction was updated by a primary boot core from __relocate_kernel()
> >> but those memory updates are not visible to CPUs after calling
> >> switch_to_vhe() causing problem.
> >>
> >> The cacheable/shareable attributes of the identity-mapped regions are
> >> different while CPU executing in EL1 (MMU enabled) and for a short period
> >> of time in hyp-stub (EL2-MMU disabled). As per the ARM-ARM specification
> >> (DDI0487G_b), this is not allowed.
> >>
> >> G5.10.3 Cache maintenance requirement:
> >> "If the change affects the cacheability attributes of the area of memory,
> >> including any change between Write-Through and Write-Back attributes,
> >> software must ensure that any cached copies of affected locations are
> >> removed from the caches, typically by cleaning and invalidating the
> >> locations from the levels of cache that might hold copies of the locations
> >> affected by the attribute change."
> >>
> >> Clean+invalidate the identity-mapped region till PoC before switching to
> >> VHE world to fix the cache inconsistency.
> >>
> >> Problem analysis with disassembly (vmlinux):
> >> 1) Both __primary_switch() and enter_vhe() are part of the identity region
> >> 2) RELA entries and enter_vhe() are sharing the same cache line ffff800010970480
> >> 3) Memory ffff800010970484-ffff800010970498 is updated with EL1-MMU enabled
> >> 4) CPU fetches instructions of enter_vhe() with EL2-MMU disabled
> >> - Non-coherent access causing the cache line ffff800010970480 drop
> > Non-coherent? You mean non-cacheable, right? At this stage, we only
> > have a single CPU, so I'm not sure coherency is the problem here. When
> > you say 'drop', is that an eviction? Replaced by what? By the previous
> > version of the cache line, containing the stale value?
> Yes, non-cacheable. The cache line corresponding to function enter_vhe() was
> marked with shareable & WB-cache as a result of write to RELA, the same cache
> line is being fetched with non-shareable & non-cacheable. The eviction is
> not pushed to PoC and got dropped because of cache-line attributes mismatch.
I'm really struggling to understand this. Why is the instruction fetch
non-shareable? I'm trying to align your observations with the rules about
mismatched aliases in the architecture and I'm yet to satisfy myself that
the CPU is allowed to drop a dirty line on the floor in response to an
unexpected hit.
My mental model (which seems to align with Marc's) is something like:
1. The boot CPU fetches the line via a cacheable mapping and dirties it
in its L1 as part of applying the relocations.
2. The boot CPU then enters EL2 with the MMU off and fetches the same
line on the I-side. AFAICT, the architecture says this is done with
outer-shareable, non-cacheable attributes.
3. !!! Somehow the instruction fetch results in the _dirty_ line from (1)
being discarded !!!
4. A secondary CPU later on fetches the line via a cacheable mapping and
explodes because the relocation hasn't been applied.
Is that what you're seeing? If so, we really need more insight into what
is going on at step (3) because it feels like it could have a much more
significant impact than the issue we're trying to fix here. How is the line
dropped? Is it due to back invalidation from a shared cache? Is it due to
IDC snooping? Does the line actually stick around on the D-side, but somehow
the I-side is shared between CPUs?
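For anyone following along, the four steps above can be written down as a
toy model. This is only a sketch under loud assumptions: a 64-byte cache
line, a single write-back D-cache standing in for the whole coherent domain,
and one value per line. The fetch address and the drop_dirty_on_hit switch
are hypothetical; the switch simply encodes the behaviour being questioned at
step (3) and says nothing about what any real CPU does.

```python
from dataclasses import dataclass

LINE_SIZE = 64  # assumed cache line size in bytes


def line_base(addr: int) -> int:
    """Return the base address of the cache line containing addr."""
    return addr & ~(LINE_SIZE - 1)


@dataclass
class ToySystem:
    """One value per cache line: memory at the PoC plus a write-back D-cache."""
    memory: dict  # line base -> value visible at the PoC
    dcache: dict  # line base -> (value, dirty)

    def cacheable_write(self, addr, value):
        # Write-back behaviour: only the cache is updated, line becomes dirty.
        self.dcache[line_base(addr)] = (value, True)

    def clean_invalidate_to_poc(self, addr):
        # Model of clean+invalidate by VA to PoC: write back if dirty, then drop.
        base = line_base(addr)
        if base in self.dcache:
            value, dirty = self.dcache.pop(base)
            if dirty:
                self.memory[base] = value

    def noncacheable_fetch(self, addr, drop_dirty_on_hit):
        # A non-cacheable access is served from memory, not from the cache.
        base = line_base(addr)
        if drop_dirty_on_hit and base in self.dcache:
            # Hypothetical step (3): the line is discarded without write-back.
            del self.dcache[base]
        return self.memory[base]

    def cacheable_read(self, addr):
        # Hit returns the cached value; miss allocates a clean copy from memory.
        base = line_base(addr)
        if base in self.dcache:
            return self.dcache[base][0]
        value = self.memory[base]
        self.dcache[base] = (value, False)
        return value


def secondary_sees_relocation(drop_dirty_on_hit, maintain_before_el2):
    rela_addr = 0xFFFF800010970484   # first RELA-updated address from the report
    fetch_addr = 0xFFFF8000109704A0  # hypothetical enter_vhe() address, same line
    system = ToySystem(memory={line_base(rela_addr): "unrelocated"}, dcache={})

    system.cacheable_write(rela_addr, "relocated")          # step 1
    if maintain_before_el2:
        system.clean_invalidate_to_poc(rela_addr)           # proposed fix
    system.noncacheable_fetch(fetch_addr, drop_dirty_on_hit)  # steps 2 and 3
    return system.cacheable_read(rela_addr) == "relocated"  # step 4


if __name__ == "__main__":
    for drop in (False, True):
        for maintain in (False, True):
            print(drop, maintain, secondary_sees_relocation(drop, maintain))
```

In this model the secondary only misses the relocation when the dirty line is
dropped and no maintenance to the PoC was done first, which is why step (3)
is the part that needs explaining.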
Many questions...
Will