Date:   Tue, 5 Sep 2023 10:06:14 +1000
From:   Gavin Shan <gshan@...hat.com>
To:     Oliver Upton <oliver.upton@...ux.dev>
Cc:     kvmarm@...ts.linux.dev, linux-kernel@...r.kernel.org,
        maz@...nel.org, james.morse@....com, suzuki.poulose@....com,
        yuzenghui@...wei.com, catalin.marinas@....com, will@...nel.org,
        qperret@...gle.com, ricarkol@...gle.com, tabba@...gle.com,
        bgardon@...gle.com, zhenyzha@...hat.com, yihyu@...hat.com,
        shan.gavin@...il.com
Subject: Re: [PATCH] KVM: arm64: Fix soft-lockup on relaxing PTE permission

Hi Oliver,

On 9/4/23 18:04, Oliver Upton wrote:
> On Mon, Sep 04, 2023 at 05:28:26PM +1000, Gavin Shan wrote:
>> We observed soft-lockup on the host in a specific scenario where
>> the host on Ampere's Altra Max CPU has 64KB base page size and the
>> guest has 4KB base page size, 64 vCPUs and 13GB memory. The guest's
>> memory is backed by 512MB huge pages via hugetlbfs. All the 64 vCPUs
>> are simultaneously trapped into the host due to permission page faults,
>> to request adding the execution permission to the corresponding PMD
>> entry, before the soft-lockup is raised on the host. On handling the
>> parallel requests, the instruction cache for the 512MB huge page is
>> invalidated by mm_ops->icache_inval_pou() in stage2_attr_walker() on
>> 64 hardware CPUs. Unfortunately, the instruction cache invalidation
>> on one CPU interferes with that on another CPU at the hardware level.
>> It takes 37 seconds for mm_ops->icache_inval_pou() to finish in the
>> worst case.
>>
>> So we can't scale out to handle the permission faults at will. They
>> need to be serialized to some extent with the help of an interval tree,
> 
> Parallel permission faults are not the cause of the soft lockups
> you observe. The real issue is the volume of invalidations that are
> happening under the hood.
> 
> Take a look at __invalidate_icache_guest_page() -- we invalidate the
> icache by VA regardless of the size of the range. 512M / 64B = 8388608
> invalidation operations. Yes, multiple threads doing these invalidations
> in parallel makes the issue more pronounced as they bottleneck at the
> Miscellaneous node in the interconnect, but we should really do
> something about our invalidation strategy instead.
> 
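(For reference, the range-based path amounts to one IC IVAU per cache line
over the whole range, roughly like the sketch below; this is a simplified C
rendering, not the kernel's actual invalidate_icache_by_line asm.)

     /*
      * Simplified sketch of range-based icache invalidation: one
      * IC IVAU per cache line, then DSB/ISB. For a 512MB range with
      * 64-byte lines that is 8388608 IC operations. @line must be a
      * power of two.
      */
     static void icache_inval_range_sketch(unsigned long start,
                                           unsigned long end,
                                           unsigned long line)
     {
             unsigned long va;

             for (va = start & ~(line - 1); va < end; va += line)
                     asm volatile("ic ivau, %0" : : "r" (va) : "memory");

             asm volatile("dsb ish\n\tisb" : : : "memory");
     }
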
> The approach you propose adds a fairly complex serialization mechanic
> _and_ unfairly penalizes systems that do not require explicit icache
> invalidation (i.e. FEAT_DIC).
> 
> I have a patch for the invalidation issue that I've been needing to
> send out for a while, could you please give this a go and see if it
> addresses the soft lockups you observe? If so, I can clean it up and
> send it as a patch. At minimum, MAX_TLBI_OPS needs to be renamed to hint
> at the common thread (DVM) between I$ and TLB invalidations.
> 

I generally agree with you that the issue is caused by a hardware limitation
where too much time is needed to invalidate the instruction cache for a 512MB
huge page, and the parallel invalidation on multiple CPUs makes it worse.
Actually, the issue was reported against our downstream kernel, after the
feature to support parallel page fault handling was included. That is to say,
we didn't see the issue (soft-lockup on the host) when the page fault handlers
were serialized.

Yeah, my patch is too complex and FEAT_DIC is disregarded. FEAT_DIC isn't
available on Ampere's Altra and Altra Max CPUs. FEAT_IDC is supported on these
two CPU models though.

Thanks a lot for your patch, which is much simpler. With it, I don't see the
soft-lockup issue any more. However, we potentially have an icache thrashing
issue, since all icache lines are invalidated by icache_inval_all_pou(). So I
think it's critical to choose a sensible threshold (MAX_TLBI_OPS). I measured
the time consumed by various operations on Ampere's Altra and Altra Max models,
as below, which may help in choosing that threshold.

   Operation                      Altra       Altra Max
   -----------------------------------------------------
   icache_inval_all_pou           153us            71us
   icache_inval_pou(64KB)          18us             8us
   icache_inval_pou(512MB)    1130744us        579132us
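
Back-of-the-envelope from these numbers: 64KB is 1024 lines, so the per-line
cost is about 18us / 1024 ≈ 18ns on Altra and ≈ 8ns on Altra Max, putting the
break-even point against icache_inval_all_pou() at roughly 153us / 18ns ≈ 8500
lines (~530KB) and 71us / 8ns ≈ 8900 lines (~550KB) respectively. If I read
arch/arm64/include/asm/tlbflush.h correctly, MAX_TLBI_OPS is PTRS_PER_PTE,
which with 64KB base pages is 8192 lines (512KB) and sits close to that
break-even point, while with 4KB base pages it's 512 lines (32KB), well below it.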

> --
> 
> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> index 96a80e8f6226..fd23644c9988 100644
> --- a/arch/arm64/include/asm/kvm_mmu.h
> +++ b/arch/arm64/include/asm/kvm_mmu.h
> @@ -117,6 +117,7 @@ alternative_cb_end
>   #include <asm/cache.h>
>   #include <asm/cacheflush.h>
>   #include <asm/mmu_context.h>
> +#include <asm/tlbflush.h>
>   #include <asm/kvm_emulate.h>
>   #include <asm/kvm_host.h>
>   
> @@ -224,15 +225,38 @@ static inline void __clean_dcache_guest_page(void *va, size_t size)
>   	kvm_flush_dcache_to_poc(va, size);
>   }
>   
> +static inline u32 __icache_line_size(void)
> +{
> +	u8 iminline;
> +	u64 ctr;
> +
> +	asm volatile(ALTERNATIVE_CB("movz %0, #0\n"
> +				    "movk %0, #0, lsl #16\n"
> +				    "movk %0, #0, lsl #32\n"
> +				    "movk %0, #0, lsl #48\n",
> +				    ARM64_ALWAYS_SYSTEM,
> +				    kvm_compute_final_ctr_el0)
> +		     : "=r" (ctr));
> +
> +	iminline = SYS_FIELD_GET(CTR_EL0, IminLine, ctr);
> +	return 4 << iminline;
> +}
> +
>   static inline void __invalidate_icache_guest_page(void *va, size_t size)
>   {
> +	size_t nr_lines = size / __icache_line_size();
> +
>   	if (icache_is_aliasing()) {
>   		/* any kind of VIPT cache */
>   		icache_inval_all_pou();
>   	} else if (read_sysreg(CurrentEL) != CurrentEL_EL1 ||
>   		   !icache_is_vpipt()) {
>   		/* PIPT or VPIPT at EL2 (see comment in __kvm_tlb_flush_vmid_ipa) */
> -		icache_inval_pou((unsigned long)va, (unsigned long)va + size);
> +		if (nr_lines > MAX_TLBI_OPS)
> +			icache_inval_all_pou();
> +		else
> +			icache_inval_pou((unsigned long)va,
> +					 (unsigned long)va + size);
>   	}
>   }
>   
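
For what it's worth, CTR_EL0 is readable from EL0 on Linux (the kernel traps
and emulates it where needed), so the line size computed by the helper above
can be sanity-checked from userspace with something like this (illustrative
only):

     #include <stdio.h>
     #include <stdint.h>

     int main(void)
     {
             uint64_t ctr;

             /* CTR_EL0 is readable from EL0 on Linux/arm64. */
             asm volatile("mrs %0, ctr_el0" : "=r" (ctr));

             /* IminLine (bits [3:0]) is log2 of the icache line size in words. */
             printf("icache line size: %u bytes\n",
                    4u << (unsigned int)(ctr & 0xf));
             return 0;
     }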

I'm not sure it's worthwhile to pull @iminline from CTR_EL0 since it's almost
always fixed at 64 bytes. @size is guaranteed to be PAGE_SIZE or PMD_SIZE
aligned. Maybe we can just do something aggressive like below, disregarding the
icache thrashing. In this way, the code is further simplified.

     if (size > PAGE_SIZE) {
         icache_inval_all_pou();
     } else {
         icache_inval_pou((unsigned long)va,
                          (unsigned long)va + size);
     }
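
With 64KB base pages on the host, this keeps range-based invalidation for
single 64KB pages only (about 18us per the table above) and falls back to
icache_inval_all_pou() for the 512MB PMD mappings; with 4KB base pages, 2MB
PMD mappings would take the all-pou path as well.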

---

I'm taking the chance to ask one question, which isn't related to the issue.
It seems we're handling icache/dcache coherence differently for stage-1 and
stage-2 page table entries. The question is: why don't we need to clean the
dcache for stage-2, as we do in the stage-1 case?

   // stage-1 page table              // stage-2 page table
   __set_pte_at                       invalidate_icache_guest_page
   __sync_icache_dcache               __invalidate_icache_guest_page
   sync_icache_aliases                icache_inval_pou
   caches_clean_inval_pou             invalidate_icache_by_line  // !ARM64_HAS_CACHE_DIC
   caches_clean_inval_pou_macro
     dcache_by_line_op          // !ARM64_HAS_CACHE_IDC
     invalidate_icache_by_line  // !ARM64_HAS_CACHE_DIC

Thanks,
Gavin
