Message-ID: <bfdafdc5-4abf-a387-0857-e8cb84e4b3d7@redhat.com>
Date: Wed, 6 Sep 2023 08:26:24 +1000
From: Gavin Shan <gshan@...hat.com>
To: Oliver Upton <oliver.upton@...ux.dev>
Cc: kvmarm@...ts.linux.dev, linux-kernel@...r.kernel.org,
maz@...nel.org, james.morse@....com, suzuki.poulose@....com,
yuzenghui@...wei.com, catalin.marinas@....com, will@...nel.org,
qperret@...gle.com, ricarkol@...gle.com, tabba@...gle.com,
bgardon@...gle.com, zhenyzha@...hat.com, yihyu@...hat.com,
shan.gavin@...il.com
Subject: Re: [PATCH] KVM: arm64: Fix soft-lockup on relaxing PTE permission
On 9/6/23 04:06, Oliver Upton wrote:
> On Tue, Sep 05, 2023 at 10:06:14AM +1000, Gavin Shan wrote:
>
> [...]
>
>>>   static inline void __invalidate_icache_guest_page(void *va, size_t size)
>>>   {
>>> +	size_t nr_lines = size / __icache_line_size();
>>> +
>>>   	if (icache_is_aliasing()) {
>>>   		/* any kind of VIPT cache */
>>>   		icache_inval_all_pou();
>>>   	} else if (read_sysreg(CurrentEL) != CurrentEL_EL1 ||
>>>   		   !icache_is_vpipt()) {
>>>   		/* PIPT or VPIPT at EL2 (see comment in __kvm_tlb_flush_vmid_ipa) */
>>> -		icache_inval_pou((unsigned long)va, (unsigned long)va + size);
>>> +		if (nr_lines > MAX_TLBI_OPS)
>>> +			icache_inval_all_pou();
>>> +		else
>>> +			icache_inval_pou((unsigned long)va,
>>> +					 (unsigned long)va + size);
>>>   	}
>>>   }
>>
>> I'm not sure it's worthwhile to pull @iminline from CTR_EL0, since it's
>> almost always fixed at 64 bytes.
>
> I firmly disagree. The architecture allows implementers to select a
> different minimum line size, and non-64b systems _do_ exist in the wild.
> Furthermore, some implementers have decided to glue together cores with
> mismatched line sizes too...
>
> Though we could avoid some headache by normalizing on 64b, the cold
> reality of the ecosystem requires that we go out of our way to
> accommodate ~any design choice allowed by the architecture.
>
It seems I didn't make it clear enough. My concern about reading ctr_el0
is that we would read it twice in the following path, though I doubt
anybody cares. Since it's a hot path, every bit of performance gain counts.
  invalidate_icache_guest_page
    __invalidate_icache_guest_page       // first read of ctr_el0, with your changes
      icache_inval_pou(va, va + size)
        invalidate_icache_by_line
          icache_line_size               // second read of ctr_el0
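
If the double read ever mattered, the fix would be to read ctr_el0 once
and derive everything from the cached value. Purely an illustrative
sketch (the helper name is mine, not proposed code), relying on
CTR_EL0.IminLine (bits [3:0]) being log2 of the line size in 4-byte
words:

	/*
	 * Sketch only: a single read of ctr_el0 feeds both the line
	 * size and the invalidation loop. CTR_EL0.IminLine encodes
	 * log2 of the icache line size in words, i.e. the minimum
	 * line is (4 << IminLine) bytes.
	 */
	static void sketch_icache_inval_range(unsigned long start, size_t size)
	{
		unsigned long ctr = read_sysreg(ctr_el0);
		size_t line = 4UL << (ctr & 0xf);	/* CTR_EL0.IminLine */
		unsigned long addr;

		for (addr = start & ~(line - 1); addr < start + size; addr += line)
			asm volatile("ic ivau, %0" : : "r" (addr) : "memory");

		dsb(ish);	/* wait for the IC operations to complete */
		isb();		/* resynchronize the instruction fetch */
	}

Not something I'm proposing, just showing where the two reads could be
folded into one.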
>> @size is guaranteed to be PAGE_SIZE or PMD_SIZE aligned. Maybe
>> we can just aggressively do something like below, disregarding the icache thrashing.
>> In this way, the code is further simplified.
>>
>> 	if (size > PAGE_SIZE) {
>> 		icache_inval_all_pou();
>> 	} else {
>> 		icache_inval_pou((unsigned long)va,
>> 				 (unsigned long)va + size);
>> 	}	// braces are still needed
>
> This could work too but we already have a kernel heuristic for limiting
> the amount of broadcast invalidations, which is MAX_TLBI_OPS. I don't
> want to introduce a second, KVM-specific hack to address the exact same
> thing.
>
Ok. I was confused at first glance, since the TLB isn't related to the icache.
I think it's fine to reuse MAX_TLBI_OPS here, but a comment may be needed.
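
For the record, the heuristic you're referring to lives in
__flush_tlb_range() (arch/arm64/include/asm/tlbflush.h). Paraphrasing
from memory, so the exact shape may differ between kernel versions:

	/*
	 * Paraphrase of __flush_tlb_range(): past MAX_TLBI_OPS
	 * (PTRS_PER_PTE) individual by-VA invalidations, flushing
	 * everything is assumed to be cheaper than iterating.
	 */
	if ((end - start) >> stride_shift > MAX_TLBI_OPS) {
		flush_tlb_mm(vma->vm_mm);
		return;
	}

Reusing the same cutoff for the icache loop seems reasonable, as long as
the comment says MAX_TLBI_OPS is borrowed as a generic "too many
per-line operations" bound rather than anything TLB-specific.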
Oliver, could you please send a formal patch for your changes?
>> I'm taking this chance to ask one question, which isn't related to the issue.
>> It seems we handle icache/dcache coherence differently for stage-1 and stage-2
>> page table entries. The question is why we don't need to clean the dcache for
>> stage-2, as we do in the stage-1 case?
>
> KVM always does its required dcache maintenance (if any) on the first
> translation abort to a given IPA. On systems w/o FEAT_DIC, we lazily
> grant execute permissions as an optimization to avoid unnecessary icache
> invalidations, which as you've seen tends to be a bit of a sore spot.
>
> Between the two faults, we've effectively guaranteed that any
> host-initiated writes to the PA are visible to the guest on both the I
> and D side. Any CMOs for making guest-initiated writes coherent after
> the translation fault are the sole responsibility of the guest.
>
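So, to summarize my understanding of the two faults (my own paraphrase,
not the actual KVM code):

	1st translation abort (read/write):
		clean the dcache if needed	-> host writes visible on the D-side
		map the page non-executable

	later permission fault on execution (without FEAT_DIC):
		invalidate the icache to PoU	-> no stale lines on the I-side
		relax the PTE to executable
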
Nice, thanks a lot for the explanation.
Thanks,
Gavin