linux-kernel - Re: [PATCH] KVM: arm64: nv: Allow shadow stage 2 read fault

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <87a53rk83s.wl-maz@kernel.org>
Date: Fri, 22 Aug 2025 10:40:07 +0100
From: Marc Zyngier <maz@...nel.org>
To: Wei-Lin Chang <r09922117@...e.ntu.edu.tw>
Cc: linux-arm-kernel@...ts.infradead.org,
	kvmarm@...ts.linux.dev,
	linux-kernel@...r.kernel.org,
	Oliver Upton <oliver.upton@...ux.dev>,
	Joey Gouly <joey.gouly@....com>,
	Suzuki K Poulose <suzuki.poulose@....com>,
	Zenghui Yu <yuzenghui@...wei.com>,
	Catalin Marinas <catalin.marinas@....com>,
	Will Deacon <will@...nel.org>
Subject: Re: [PATCH] KVM: arm64: nv: Allow shadow stage 2 read fault

Hey Wei-Lin,

On Fri, 22 Aug 2025 04:18:53 +0100,
Wei-Lin Chang <r09922117@...e.ntu.edu.tw> wrote:
> 
> We don't expect to ever encounter a stage-2 read permission fault,
> because read permission is always granted when mapping stage-2. However,
> this isn't certainly the case when NV is involved. The key is the shadow
> mapping built for L2 must have the same or stricter permissions than
> those that L1 built for L2, hence opening the possibility of mappings
> without read permission.
> 
> Let's continue handling the read fault if we're handling a shadow
> stage-2 fault, instead of erroring out.
> 
> The following illustrates an example, vertical lines stand for either
> L0, L1, or L2 running, with left: L0, middle: L1, and right: L2.
> Horizontal arrows stand for traps and erets between them.
> 
> '''
> L0                               L1                               L2
>                                                L2                 |
>                                                writes to an L2 PA |
> |<----------------------------------------------------------------|
> |
> | L0
> | finds no mapping in
> | L1's s2pt for L2
> |
> | L0
> | injects data abort
> |------------------------------->|
>                                  | L1
>                                  | for whatever reason
>                                  | treats this abort specially,
>                                  | only gives it W permission
>                                  |
>        eret traps to L0          |
> |<-------------------------------|
> |
> |      eret back to L2
> |---------------------------------------------------------------->|
>                                                                   |
>                                                    L2             |
>                                                    writes to same |
>                                                    L2 PA (replay) |
> |<----------------------------------------------------------------|
> |
> | L0
> | finds mapping in s2pt that
> | L1 built for L2, create
> | shadow mapping for L2,
> | but only gives it W to match
> | what L1 had given
> |
> |
> |      eret back to L2
> |---------------------------------------------------------------->|
>                                                                   |
>                                                    L2             |
>                                                    writes to same |
>                                                    L2 PA (replay) |
>                                                    success        |
> |<----------------------------------------------------------------|
> |
> |
> |------------------------------->| L1
>                                  | for some reason, relaxes
>                                  | L2's permission from W
>                                  | to RW, AND, doesn't flush
>                                  | TLB
>                                  |
>        eret traps to L0          |
> |<-------------------------------|
> |
> |      eret back to L2
> |---------------------------------------------------------------->|
>                                                                   |
>                                                     L2            |
>                                                     read to L2 PA |
> |<----------------------------------------------------------------|
> | stage-2 read fault
> |
> '''
> 
> Signed-off-by: Wei-Lin Chang <r09922117@...e.ntu.edu.tw>

Excellent detective work! Bonus points for the ASCII art -- I'm not
sure how practical it will be to keep it in the final commit, but this
definitely helps understanding the issue.

> ---
> 
> I am able to trigger this error with a modified L1 KVM, but I do realize
> this requires L1 to be very strange (or even just wrong) so I understand
> if we don't want to handle this kind of edge case. On the other hand,
> could there also be other ways to trigger this that I have not thought
> of?
> 
> Another thing is that this change lets L1 get away with not flushing the
> TLB, but TLBs are ephemeral so it's fine in this aspect, however I'm not
> sure if there are other considerations.

Well, it depends whose TLBs we're talking about. The CPU TLBs are
definitely ephemeral. But the KVM-managed TLBs (aka the shadow S2) is
pretty static, and I am a bit worried to relax this.

What would happen on actual HW? L1 would take a S2 fault, because the
TLBs are out of sync with the S2 PTs. And yes, maybe the TLB pressure
woulds make it so that it works *sometimes*, but there wouldn't be any
guarantee of forward progress as long as EL2 doesn't issue a TLBI.

I reckon we should implement that particular behaviour whenever
possible, and handle permission faults early, possibly ahead of the
guest S2 walk (the way we handle VNCR_EL2 faults is IMO a good model).

This would imply taking the guest's S2 permission at face value, and
only drop W permission when the memslot is RO -- you'd then need to
keep track of the original W bit somewhere. And that's where things
become much harder, because KVM can decide to remap arbitrary ranges
of IPA space as RO, which implies we should track the W bit at all
times, most likely as one of the SW bits in the PTE.

We'll need exactly that if we ever want to implement the
Hardware-managed Dirty Bit, but I have the feeling we need an actual
design for this, and not a quick hack. Your approach is therefore the
correct one for the time being.

> 
> ---
>  arch/arm64/kvm/mmu.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 1c78864767c5c..41017ca579b19 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1508,8 +1508,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  	exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu);
>  	VM_BUG_ON(write_fault && exec_fault);
>  
> -	if (fault_is_perm && !write_fault && !exec_fault) {
> -		kvm_err("Unexpected L2 read permission error\n");
> +	if (fault_is_perm && !write_fault && !exec_fault && !nested) {
> +		kvm_err("Unexpected S2 read permission error\n");
>  		return -EFAULT;
>  	}
>  

Honestly, I'd rather kill this check altogether rather than adding
another condition to it -- I suggested it to Fuad a while ago. This is
the first time we ever see it firing, and I don't think it is bringing
much.

Thanks,

	M.

-- 
Jazz isn't dead. It just smells funny.