[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <djypsyratk63ovzv3flzb2tmunqtcoryzserwhsaaq5nuogsrx@u3uuwynnafbj>
Date: Tue, 26 Aug 2025 21:49:27 +0800
From: Wei-Lin Chang <r09922117@...e.ntu.edu.tw>
To: Marc Zyngier <maz@...nel.org>
Cc: linux-arm-kernel@...ts.infradead.org, kvmarm@...ts.linux.dev,
linux-kernel@...r.kernel.org, Oliver Upton <oliver.upton@...ux.dev>,
Joey Gouly <joey.gouly@....com>, Suzuki K Poulose <suzuki.poulose@....com>,
Zenghui Yu <yuzenghui@...wei.com>, Catalin Marinas <catalin.marinas@....com>,
Will Deacon <will@...nel.org>
Subject: Re: [PATCH] KVM: arm64: nv: Allow shadow stage 2 read fault
Hi Marc,
On Fri, Aug 22, 2025 at 10:40:07AM +0100, Marc Zyngier wrote:
> Hey Wei-Lin,
>
> On Fri, 22 Aug 2025 04:18:53 +0100,
> Wei-Lin Chang <r09922117@...e.ntu.edu.tw> wrote:
> >
> > We don't expect to ever encounter a stage-2 read permission fault,
> > because read permission is always granted when mapping stage-2. However,
> > this isn't certainly the case when NV is involved. The key is the shadow
> > mapping built for L2 must have the same or stricter permissions than
> > those that L1 built for L2, hence opening the possibility of mappings
> > without read permission.
> >
> > Let's continue handling the read fault if we're handling a shadow
> > stage-2 fault, instead of erroring out.
> >
> > The following illustrates an example, vertical lines stand for either
> > L0, L1, or L2 running, with left: L0, middle: L1, and right: L2.
> > Horizontal arrows stand for traps and erets between them.
> >
> > '''
> > L0 L1 L2
> > L2 |
> > writes to an L2 PA |
> > |<----------------------------------------------------------------|
> > |
> > | L0
> > | finds no mapping in
> > | L1's s2pt for L2
> > |
> > | L0
> > | injects data abort
> > |------------------------------->|
> > | L1
> > | for whatever reason
> > | treats this abort specially,
> > | only gives it W permission
> > |
> > eret traps to L0 |
> > |<-------------------------------|
> > |
> > | eret back to L2
> > |---------------------------------------------------------------->|
> > |
> > L2 |
> > writes to same |
> > L2 PA (replay) |
> > |<----------------------------------------------------------------|
> > |
> > | L0
> > | finds mapping in s2pt that
> > | L1 built for L2, create
> > | shadow mapping for L2,
> > | but only gives it W to match
> > | what L1 had given
> > |
> > |
> > | eret back to L2
> > |---------------------------------------------------------------->|
> > |
> > L2 |
> > writes to same |
> > L2 PA (replay) |
> > success |
> > |<----------------------------------------------------------------|
> > |
> > |
> > |------------------------------->| L1
> > | for some reason, relaxes
> > | L2's permission from W
> > | to RW, AND, doesn't flush
> > | TLB
> > |
> > eret traps to L0 |
> > |<-------------------------------|
> > |
> > | eret back to L2
> > |---------------------------------------------------------------->|
> > |
> > L2 |
> > read to L2 PA |
> > |<----------------------------------------------------------------|
> > | stage-2 read fault
> > |
> > '''
> >
> > Signed-off-by: Wei-Lin Chang <r09922117@...e.ntu.edu.tw>
>
> Excellent detective work! Bonus points for the ASCII art -- I'm not
> sure how practical it will be to keep it in the final commit, but this
> definitely helps understanding the issue.
Thanks, I tried to make it clear for those that aren't familiar with NV,
including myself :P
>
> > ---
> >
> > I am able to trigger this error with a modified L1 KVM, but I do realize
> > this requires L1 to be very strange (or even just wrong) so I understand
> > if we don't want to handle this kind of edge case. On the other hand,
> > could there also be other ways to trigger this that I have not thought
> > of?
> >
> > Another thing is that this change lets L1 get away with not flushing the
> > TLB, but TLBs are ephemeral so it's fine in this aspect, however I'm not
> > sure if there are other considerations.
>
> Well, it depends whose TLBs we're talking about. The CPU TLBs are
> definitely ephemeral. But the KVM-managed TLBs (aka the shadow S2) is
> pretty static, and I am a bit worried to relax this.
>
> What would happen on actual HW? L1 would take a S2 fault, because the
> TLBs are out of sync with the S2 PTs. And yes, maybe the TLB pressure
> woulds make it so that it works *sometimes*, but there wouldn't be any
> guarantee of forward progress as long as EL2 doesn't issue a TLBI.
>
> I reckon we should implement that particular behaviour whenever
> possible, and handle permission faults early, possibly ahead of the
> guest S2 walk (the way we handle VNCR_EL2 faults is IMO a good model).
The parts above make complete sense.
>
> This would imply taking the guest's S2 permission at face value, and
> only drop W permission when the memslot is RO -- you'd then need to
> keep track of the original W bit somewhere. And that's where things
> become much harder, because KVM can decide to remap arbitrary ranges
> of IPA space as RO, which implies we should track the W bit at all
> times, most likely as one of the SW bits in the PTE.
But sorry, I struggle to understand this paragraph after reading it many
times, probably my experience with the code isn't enough for me to make
the connection. Why are we talking about the W bit suddenly?
If you don't mind, can you reword what's discussed here?
I only very vaguely get that there will be 2 W bits, one from what L1 set,
and one from the L0 memslot, if I didn't completely miss the point..
>
> We'll need exactly that if we ever want to implement the
> Hardware-managed Dirty Bit, but I have the feeling we need an actual
> design for this, and not a quick hack. Your approach is therefore the
> correct one for the time being.
>
> >
> > ---
> > arch/arm64/kvm/mmu.c | 4 ++--
> > 1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > index 1c78864767c5c..41017ca579b19 100644
> > --- a/arch/arm64/kvm/mmu.c
> > +++ b/arch/arm64/kvm/mmu.c
> > @@ -1508,8 +1508,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> > exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu);
> > VM_BUG_ON(write_fault && exec_fault);
> >
> > - if (fault_is_perm && !write_fault && !exec_fault) {
> > - kvm_err("Unexpected L2 read permission error\n");
> > + if (fault_is_perm && !write_fault && !exec_fault && !nested) {
> > + kvm_err("Unexpected S2 read permission error\n");
> > return -EFAULT;
> > }
> >
>
> Honestly, I'd rather kill this check altogether rather than adding
> another condition to it -- I suggested it to Fuad a while ago. This is
> the first time we ever see it firing, and I don't think it is bringing
> much.
No problem and agreed. Let me prepare another version.
Thanks,
Wei-Lin Chang
>
> Thanks,
>
> M.
>
> --
> Jazz isn't dead. It just smells funny.
Powered by blists - more mailing lists