linux-kernel - Re: [PATCH] KVM: arm64: nv: Allow shadow stage 2 read fault

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <djypsyratk63ovzv3flzb2tmunqtcoryzserwhsaaq5nuogsrx@u3uuwynnafbj>
Date: Tue, 26 Aug 2025 21:49:27 +0800
From: Wei-Lin Chang <r09922117@...e.ntu.edu.tw>
To: Marc Zyngier <maz@...nel.org>
Cc: linux-arm-kernel@...ts.infradead.org, kvmarm@...ts.linux.dev, 
	linux-kernel@...r.kernel.org, Oliver Upton <oliver.upton@...ux.dev>, 
	Joey Gouly <joey.gouly@....com>, Suzuki K Poulose <suzuki.poulose@....com>, 
	Zenghui Yu <yuzenghui@...wei.com>, Catalin Marinas <catalin.marinas@....com>, 
	Will Deacon <will@...nel.org>
Subject: Re: [PATCH] KVM: arm64: nv: Allow shadow stage 2 read fault

Hi Marc,

On Fri, Aug 22, 2025 at 10:40:07AM +0100, Marc Zyngier wrote:
> Hey Wei-Lin,
> 
> On Fri, 22 Aug 2025 04:18:53 +0100,
> Wei-Lin Chang <r09922117@...e.ntu.edu.tw> wrote:
> > 
> > We don't expect to ever encounter a stage-2 read permission fault,
> > because read permission is always granted when mapping stage-2. However,
> > this isn't certainly the case when NV is involved. The key is the shadow
> > mapping built for L2 must have the same or stricter permissions than
> > those that L1 built for L2, hence opening the possibility of mappings
> > without read permission.
> > 
> > Let's continue handling the read fault if we're handling a shadow
> > stage-2 fault, instead of erroring out.
> > 
> > The following illustrates an example, vertical lines stand for either
> > L0, L1, or L2 running, with left: L0, middle: L1, and right: L2.
> > Horizontal arrows stand for traps and erets between them.
> > 
> > '''
> > L0                               L1                               L2
> >                                                L2                 |
> >                                                writes to an L2 PA |
> > |<----------------------------------------------------------------|
> > |
> > | L0
> > | finds no mapping in
> > | L1's s2pt for L2
> > |
> > | L0
> > | injects data abort
> > |------------------------------->|
> >                                  | L1
> >                                  | for whatever reason
> >                                  | treats this abort specially,
> >                                  | only gives it W permission
> >                                  |
> >        eret traps to L0          |
> > |<-------------------------------|
> > |
> > |      eret back to L2
> > |---------------------------------------------------------------->|
> >                                                                   |
> >                                                    L2             |
> >                                                    writes to same |
> >                                                    L2 PA (replay) |
> > |<----------------------------------------------------------------|
> > |
> > | L0
> > | finds mapping in s2pt that
> > | L1 built for L2, create
> > | shadow mapping for L2,
> > | but only gives it W to match
> > | what L1 had given
> > |
> > |
> > |      eret back to L2
> > |---------------------------------------------------------------->|
> >                                                                   |
> >                                                    L2             |
> >                                                    writes to same |
> >                                                    L2 PA (replay) |
> >                                                    success        |
> > |<----------------------------------------------------------------|
> > |
> > |
> > |------------------------------->| L1
> >                                  | for some reason, relaxes
> >                                  | L2's permission from W
> >                                  | to RW, AND, doesn't flush
> >                                  | TLB
> >                                  |
> >        eret traps to L0          |
> > |<-------------------------------|
> > |
> > |      eret back to L2
> > |---------------------------------------------------------------->|
> >                                                                   |
> >                                                     L2            |
> >                                                     read to L2 PA |
> > |<----------------------------------------------------------------|
> > | stage-2 read fault
> > |
> > '''
> > 
> > Signed-off-by: Wei-Lin Chang <r09922117@...e.ntu.edu.tw>
> 
> Excellent detective work! Bonus points for the ASCII art -- I'm not
> sure how practical it will be to keep it in the final commit, but this
> definitely helps understanding the issue.

Thanks, I tried to make it clear for those that aren't familiar with NV,
including myself :P

> 
> > ---
> > 
> > I am able to trigger this error with a modified L1 KVM, but I do realize
> > this requires L1 to be very strange (or even just wrong) so I understand
> > if we don't want to handle this kind of edge case. On the other hand,
> > could there also be other ways to trigger this that I have not thought
> > of?
> > 
> > Another thing is that this change lets L1 get away with not flushing the
> > TLB, but TLBs are ephemeral so it's fine in this aspect, however I'm not
> > sure if there are other considerations.
> 
> Well, it depends whose TLBs we're talking about. The CPU TLBs are
> definitely ephemeral. But the KVM-managed TLBs (aka the shadow S2) is
> pretty static, and I am a bit worried to relax this.
> 
> What would happen on actual HW? L1 would take a S2 fault, because the
> TLBs are out of sync with the S2 PTs. And yes, maybe the TLB pressure
> woulds make it so that it works *sometimes*, but there wouldn't be any
> guarantee of forward progress as long as EL2 doesn't issue a TLBI.
> 
> I reckon we should implement that particular behaviour whenever
> possible, and handle permission faults early, possibly ahead of the
> guest S2 walk (the way we handle VNCR_EL2 faults is IMO a good model).

The parts above make complete sense.

> 
> This would imply taking the guest's S2 permission at face value, and
> only drop W permission when the memslot is RO -- you'd then need to
> keep track of the original W bit somewhere. And that's where things
> become much harder, because KVM can decide to remap arbitrary ranges
> of IPA space as RO, which implies we should track the W bit at all
> times, most likely as one of the SW bits in the PTE.

But sorry, I struggle to understand this paragraph after reading it many
times, probably my experience with the code isn't enough for me to make
the connection. Why are we talking about the W bit suddenly?
If you don't mind, can you reword what's discussed here?
I only very vaguely get that there will be 2 W bits, one from what L1 set,
and one from the L0 memslot, if I didn't completely miss the point..

> 
> We'll need exactly that if we ever want to implement the
> Hardware-managed Dirty Bit, but I have the feeling we need an actual
> design for this, and not a quick hack. Your approach is therefore the
> correct one for the time being.
> 
> > 
> > ---
> >  arch/arm64/kvm/mmu.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > index 1c78864767c5c..41017ca579b19 100644
> > --- a/arch/arm64/kvm/mmu.c
> > +++ b/arch/arm64/kvm/mmu.c
> > @@ -1508,8 +1508,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >  	exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu);
> >  	VM_BUG_ON(write_fault && exec_fault);
> >  
> > -	if (fault_is_perm && !write_fault && !exec_fault) {
> > -		kvm_err("Unexpected L2 read permission error\n");
> > +	if (fault_is_perm && !write_fault && !exec_fault && !nested) {
> > +		kvm_err("Unexpected S2 read permission error\n");
> >  		return -EFAULT;
> >  	}
> >  
> 
> Honestly, I'd rather kill this check altogether rather than adding
> another condition to it -- I suggested it to Fuad a while ago. This is
> the first time we ever see it firing, and I don't think it is bringing
> much.

No problem and agreed. Let me prepare another version.

Thanks,
Wei-Lin Chang

> 
> Thanks,
> 
> 	M.
> 
> -- 
> Jazz isn't dead. It just smells funny.