[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <87a53rk83s.wl-maz@kernel.org>
Date: Fri, 22 Aug 2025 10:40:07 +0100
From: Marc Zyngier <maz@...nel.org>
To: Wei-Lin Chang <r09922117@...e.ntu.edu.tw>
Cc: linux-arm-kernel@...ts.infradead.org,
kvmarm@...ts.linux.dev,
linux-kernel@...r.kernel.org,
Oliver Upton <oliver.upton@...ux.dev>,
Joey Gouly <joey.gouly@....com>,
Suzuki K Poulose <suzuki.poulose@....com>,
Zenghui Yu <yuzenghui@...wei.com>,
Catalin Marinas <catalin.marinas@....com>,
Will Deacon <will@...nel.org>
Subject: Re: [PATCH] KVM: arm64: nv: Allow shadow stage 2 read fault
Hey Wei-Lin,
On Fri, 22 Aug 2025 04:18:53 +0100,
Wei-Lin Chang <r09922117@...e.ntu.edu.tw> wrote:
>
> We don't expect to ever encounter a stage-2 read permission fault,
> because read permission is always granted when mapping stage-2. However,
> this isn't certainly the case when NV is involved. The key is the shadow
> mapping built for L2 must have the same or stricter permissions than
> those that L1 built for L2, hence opening the possibility of mappings
> without read permission.
>
> Let's continue handling the read fault if we're handling a shadow
> stage-2 fault, instead of erroring out.
>
> The following illustrates an example, vertical lines stand for either
> L0, L1, or L2 running, with left: L0, middle: L1, and right: L2.
> Horizontal arrows stand for traps and erets between them.
>
> '''
> L0 L1 L2
> L2 |
> writes to an L2 PA |
> |<----------------------------------------------------------------|
> |
> | L0
> | finds no mapping in
> | L1's s2pt for L2
> |
> | L0
> | injects data abort
> |------------------------------->|
> | L1
> | for whatever reason
> | treats this abort specially,
> | only gives it W permission
> |
> eret traps to L0 |
> |<-------------------------------|
> |
> | eret back to L2
> |---------------------------------------------------------------->|
> |
> L2 |
> writes to same |
> L2 PA (replay) |
> |<----------------------------------------------------------------|
> |
> | L0
> | finds mapping in s2pt that
> | L1 built for L2, create
> | shadow mapping for L2,
> | but only gives it W to match
> | what L1 had given
> |
> |
> | eret back to L2
> |---------------------------------------------------------------->|
> |
> L2 |
> writes to same |
> L2 PA (replay) |
> success |
> |<----------------------------------------------------------------|
> |
> |
> |------------------------------->| L1
> | for some reason, relaxes
> | L2's permission from W
> | to RW, AND, doesn't flush
> | TLB
> |
> eret traps to L0 |
> |<-------------------------------|
> |
> | eret back to L2
> |---------------------------------------------------------------->|
> |
> L2 |
> read to L2 PA |
> |<----------------------------------------------------------------|
> | stage-2 read fault
> |
> '''
>
> Signed-off-by: Wei-Lin Chang <r09922117@...e.ntu.edu.tw>
Excellent detective work! Bonus points for the ASCII art -- I'm not
sure how practical it will be to keep it in the final commit, but this
definitely helps understanding the issue.
> ---
>
> I am able to trigger this error with a modified L1 KVM, but I do realize
> this requires L1 to be very strange (or even just wrong) so I understand
> if we don't want to handle this kind of edge case. On the other hand,
> could there also be other ways to trigger this that I have not thought
> of?
>
> Another thing is that this change lets L1 get away with not flushing the
> TLB, but TLBs are ephemeral so it's fine in this aspect, however I'm not
> sure if there are other considerations.
Well, it depends whose TLBs we're talking about. The CPU TLBs are
definitely ephemeral. But the KVM-managed TLBs (aka the shadow S2) is
pretty static, and I am a bit worried to relax this.
What would happen on actual HW? L1 would take a S2 fault, because the
TLBs are out of sync with the S2 PTs. And yes, maybe the TLB pressure
woulds make it so that it works *sometimes*, but there wouldn't be any
guarantee of forward progress as long as EL2 doesn't issue a TLBI.
I reckon we should implement that particular behaviour whenever
possible, and handle permission faults early, possibly ahead of the
guest S2 walk (the way we handle VNCR_EL2 faults is IMO a good model).
This would imply taking the guest's S2 permission at face value, and
only drop W permission when the memslot is RO -- you'd then need to
keep track of the original W bit somewhere. And that's where things
become much harder, because KVM can decide to remap arbitrary ranges
of IPA space as RO, which implies we should track the W bit at all
times, most likely as one of the SW bits in the PTE.
We'll need exactly that if we ever want to implement the
Hardware-managed Dirty Bit, but I have the feeling we need an actual
design for this, and not a quick hack. Your approach is therefore the
correct one for the time being.
>
> ---
> arch/arm64/kvm/mmu.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 1c78864767c5c..41017ca579b19 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1508,8 +1508,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu);
> VM_BUG_ON(write_fault && exec_fault);
>
> - if (fault_is_perm && !write_fault && !exec_fault) {
> - kvm_err("Unexpected L2 read permission error\n");
> + if (fault_is_perm && !write_fault && !exec_fault && !nested) {
> + kvm_err("Unexpected S2 read permission error\n");
> return -EFAULT;
> }
>
Honestly, I'd rather kill this check altogether rather than adding
another condition to it -- I suggested it to Fuad a while ago. This is
the first time we ever see it firing, and I don't think it is bringing
much.
Thanks,
M.
--
Jazz isn't dead. It just smells funny.
Powered by blists - more mailing lists