linux-kernel - Re: [PATCH v2] KVM: arm64: Limit stage2_apply

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <86o7uz7o8h.wl-maz@kernel.org>
Date:   Wed, 28 Sep 2022 07:05:50 -0400
From:   Marc Zyngier <maz@...nel.org>
To:     Oliver Upton <oliver.upton@...ux.dev>
Cc:     Catalin Marinas <catalin.marinas@....com>,
        Will Deacon <will@...nel.org>,
        James Morse <james.morse@....com>,
        Alexandru Elisei <alexandru.elisei@....com>,
        Suzuki K Poulose <suzuki.poulose@....com>,
        Ricardo Koller <ricarkol@...gle.com>,
        linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org,
        kvmarm@...ts.cs.columbia.edu, Quentin Perret <qperret@...gle.com>
Subject: Re: [PATCH v2] KVM: arm64: Limit stage2_apply_range() batch size to 1GB

On Tue, 27 Sep 2022 15:43:10 -0400,
Oliver Upton <oliver.upton@...ux.dev> wrote:
> 
> Hi Marc,
> 
> On Tue, Sep 27, 2022 at 07:34:22AM -0400, Marc Zyngier wrote:
> 
> [...]
> 
> > > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > > index c9a13e487187..5d05bb92e129 100644
> > > --- a/arch/arm64/kvm/mmu.c
> > > +++ b/arch/arm64/kvm/mmu.c
> > > @@ -31,6 +31,12 @@ static phys_addr_t hyp_idmap_vector;
> > >  
> > >  static unsigned long io_map_base;
> > >  
> > > +static inline phys_addr_t stage2_apply_range_next(phys_addr_t addr, phys_addr_t end)
> > 
> > Please drop the inline. I'm sure the compiler will perform its
> > magic.
> > 
> > Can I also bikeshed a bit about the name? This doesn't "apply"
> > anything, nor does it return the next range. It really computes the
> > end of the current one.
> > 
> > Something like stage2_range_addr_end() would at least be consistent
> > with the rest of the arm64 code (grep for _addr_end ...).
> 
> Bikeshed all you want :) But yeah, I like your suggestion.
> 
> > > +{
> > > +	phys_addr_t boundary = round_down(addr + SZ_1G, SZ_1G);
> > 
> > nit: the rest of the code is using ALIGN_DOWN(). Any reason why this
> > can't be used here?
> 
> Nope!
> 
> > > +
> > > +	return (boundary - 1 < end - 1) ? boundary : end;
> > > +}
> > >  
> > >  /*
> > >   * Release kvm_mmu_lock periodically if the memory region is large. Otherwise,
> > > @@ -52,7 +58,7 @@ static int stage2_apply_range(struct kvm *kvm, phys_addr_t addr,
> > >  		if (!pgt)
> > >  			return -EINVAL;
> > >  
> > > -		next = stage2_pgd_addr_end(kvm, addr, end);
> > > +		next = stage2_apply_range_next(addr, end);
> > >  		ret = fn(pgt, addr, next - addr);
> > >  		if (ret)
> > >  			break;
> > > 
> > 
> > The main problem I see with this is that some entries now get visited
> > multiple times if they cover more than a single 1GB entry (like a
> > 512GB level-0 entry with 4k pages and 48bit IPA) . As long as this
> > isn't destructive (CMOs, for example), this is probably OK. For
> > operations that are not idempotent (such as stage2_unmap_walker), this
> > is a significant change in behaviour.
> > 
> > My concern is that we have on one side a walker that is strictly
> > driven by the page-table sizes, and we now get an arbitrary value that
> > doesn't necessarily a multiple of block sizes. Yes, this works right
> > now because you can't create a block mapping larger than 1GB with any
> > of the supported page size.
> > 
> > But with 52bit VA/PA support, this changes: we can have 512GB (4k),
> > 64GB (16k) and 4TB (64k) block mappings at S2. We don't support this
> > yet at S2, but when this hits, we'll be in potential trouble.
> 
> Ah, I didn't fully capture the reasoning about the batch size. I had
> thought about batching by operating on at most 1 block of the largest
> supported granularity, but that felt like an inefficient walk restarting
> from root every 32M (for 16K paging).
> 
> OTOH, if/when we add support for larger blocks in S2 we will run into
> the same exact problem if we batch on the largest block size. If
> dirty logging caused the large blocks to be shattered down to leaf
> granularity then we will visit a crazy amount of PTEs before releasing
> the lock.
> 
> I guess what I'm getting at is we need to detect lock contention and the
> need to reschedule in the middle of the walk instead of at some
> predetermined boundary, though that ventures into the territory of 'TDP
> MMU' features...
> 
> So, seems to me we can crack this a few ways:
> 
>   1.Batch at the arbitrary 1GB since it works currently and produces a
>     more efficient walk for all page sizes. I can rework some of the
>     logic in kvm_level_supports_block_mapping() such that we can
>     BUILD_BUG_ON() if the largest block size exceeds 1GB. Kicks the can
>     down the road on a better implementation.
> 
>   2.Batch by the largest supported block mapping size. This will result
>     in less efficient walks for !4K page sizes and likely produce soft
>     lockups when we support even larger blocks. Nonetheless, the
>     implementation will remain correct regardless of block size.
> 
>   3.Test for lock contention and need_resched() in the middle of the
>     walk, rewalking from wherever we left off when scheduled again. TDP
>     MMU already does this, so it could be a wasted effort adding support
>     for it to the ARM MMU if we are to switch over at some point.
> 
> WDYT?

[+ Quentin, who is looking at this as well]

At this stage, I'm more keen on option (2). It solves your immediate
issue, and while it doesn't improve the humongous block mapping case,
it doesn't change the behaviour of the existing walkers (you are
guaranteed to visit a leaf only once per traversal).

Option (3) is more of a long term goal, and I'd rather we improve
things in now.

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.