[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <6b08ece9-64c0-9de9-9876-ed2dceee9bb0@nvidia.com>
Date: Mon, 18 Apr 2022 07:53:20 -0500
From: Shanker R Donthineni <sdonthineni@...dia.com>
To: Marc Zyngier <maz@...nel.org>
Cc: Catalin Marinas <catalin.marinas@....com>,
Will Deacon <will@...nel.org>,
Mark Rutland <mark.rutland@....com>,
linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org,
Ard Biesheuvel <ardb@...nel.org>,
Vikram Sethi <vsethi@...dia.com>,
Thierry Reding <treding@...dia.com>,
Anshuman Khandual <anshuman.khandual@....com>
Subject: Re: [PATCH] arm64: head: Fix cache inconsistency of the
identity-mapped region
Thanks Marc for your comments.
On 4/18/22 4:16 AM, Marc Zyngier wrote:
> External email: Use caution opening links or attachments
>
>
> Hi Shanker,
>
> On Fri, 15 Apr 2022 18:05:03 +0100,
> Shanker Donthineni <sdonthineni@...dia.com> wrote:
>> The secondary cores boot is stuck due to data abort while executing the
>> instruction 'ldr x8, =__secondary_switched'. The RELA value of this
>> instruction was updated by a primary boot core from __relocate_kernel()
>> but those memory updates are not visible to CPUs after calling
>> switch_to_vhe() causing problem.
>>
>> The cacheable/shareable attributes of the identity-mapped regions are
>> different while CPU executing in EL1 (MMU enabled) and for a short period
>> of time in hyp-stub (EL2-MMU disabled). As per the ARM-ARM specification
>> (DDI0487G_b), this is not allowed.
>>
>> G5.10.3 Cache maintenance requirement:
>> "If the change affects the cacheability attributes of the area of memory,
>> including any change between Write-Through and Write-Back attributes,
>> software must ensure that any cached copies of affected locations are
>> removed from the caches, typically by cleaning and invalidating the
>> locations from the levels of cache that might hold copies of the locations
>> affected by the attribute change."
>>
>> Clean+invalidate the identity-mapped region till PoC before switching to
>> VHE world to fix the cache inconsistency.
>>
>> Problem analysis with disassembly (vmlinux):
>> 1) Both __primary_switch() and enter_vhe() are part of the identity region
>> 2) RELA entries and enter_vhe() are sharing the same cache line fff800010970480
>> 3) Memory ffff800010970484-ffff800010970498 is updated with EL1-MMU enabled
>> 4) CPU fetches intrsuctions of enter_vhe() with EL2-MMU disabled
>> - Non-coherent access causing the cache line fff800010970480 drop
> Non-coherent? You mean non-cacheable, right? At this stage, we only
> have a single CPU, so I'm not sure coherency is the problem here. When
> you say 'drop', is that an eviction? Replaced by what? By the previous
> version of the cache line, containing the stale value?
Yes,non-cacheable. The cache line corresponding to function enter_vhe() was
marked with shareable & WB-cache as a result of write to RELA, the same cache
line is being fetched with non-shareable & non-cacheable. The eviction is
not pushed to PoC and got dropped because of cache-line attributes mismatch.
> It is also unclear to me how the instruction fetches directly
> influence what happens *on the other CPUs*. Is this line kept at a
> level beyond the PoU? Are we talking of a system cache here? It would
> really help if you could describe your cache topology.
>
>> 5) Secondary core executes 'ldr x8, __secondary_switched'
>> - Getting data abort because of the incorrect value at ffff800010970488
> My interpretation of the above is as follows:
>
> - CPU0 performs the RELA update with the MMU on
>
> - A switch to EL2 with the MMU off results in the cache line sitting
> beyond the PoU and containing the RELA update to be replaced with
> the *stale* version (the fetch happening on the i-side).
>
> - CPU1 (with its MMU on) fetches the stale data from the cache
>
> Is this correct?
Yes, correct,
> What is unclear to me is why the eviction occurs. Yes, this is of
> course allowed, and the code is wrong for assuming any odd behaviour.
> But then,
The eviction is happening naturally don't know the time and the exact
line of code where eviction is happening.
> why only clean the idmap?
Seems only EL2-STUB code is accessing RELA of idmap with MMU disabled.
I did 2 other experiments, the issue was not reproducible.
1) CONFIG_RELOCATABLE=n
2) Change the line "ldr x8, =__secondary_switched" to
adr_l x9, _text
adr_l x8, __secondary_switched
sub x8, x8, x9
mov_q x9, KIMAGE_VADDR
add x8, x8, x9
> I have the feeling that by this
> standard, we should see this a lot more often. Or are we just lucky
> that we don't have any other examples of data and instructions sharing
> the same cache line and accessed with different cacheability
> attributes?
I think in all other places either MMU is enabled or instructions/data
pushed to the PoC.
> Thanks,
>
> M.
>
> --
> Without deviation from the norm, progress is not possible.
Powered by blists - more mailing lists