linux-kernel - Re: [PATCH] arm64: tlb: Fix TLBI RANGE operand

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20134458-508b-4853-a962-c1b44009e8ed@redhat.com>
Date: Wed, 3 Apr 2024 21:37:30 +1000
From: Gavin Shan <gshan@...hat.com>
To: Marc Zyngier <maz@...nel.org>
Cc: linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org,
 catalin.marinas@....com, will@...nel.org, akpm@...ux-foundation.org,
 apopple@...dia.com, mark.rutland@....com, ryan.roberts@....com,
 rananta@...gle.com, yangyicong@...ilicon.com, v-songbaohua@...o.com,
 yezhenyu2@...wei.com, yihyu@...hat.com, shan.gavin@...il.com
Subject: Re: [PATCH] arm64: tlb: Fix TLBI RANGE operand

On 4/3/24 18:58, Marc Zyngier wrote:
> On Wed, 03 Apr 2024 07:49:29 +0100,
> Gavin Shan <gshan@...hat.com> wrote:
>>
>> KVM/arm64 relies on TLBI RANGE feature to flush TLBs when the dirty
>> bitmap is collected by VMM and the corresponding PTEs need to be
>> write-protected again. Unfortunately, the operand passed to the TLBI
>> RANGE instruction isn't correctly sorted out by commit d1d3aa98b1d4
>> ("arm64: tlb: Use the TLBI RANGE feature in arm64"). It leads to
>> crash on the destination VM after live migration because some of the
>> dirty pages are missed.
>>
>> For example, I have a VM where 8GB memory is assigned, starting from
>> 0x40000000 (1GB). Note that the host has 4KB as the base page size.
>> All TLBs for VM can be covered by one TLBI RANGE operation. However,
>> I receives 0xffff708000040000 as the operand, which is wrong and the
>> correct one should be 0x00007f8000040000. From the wrong operand, we
>> have 3 and 1 for SCALE (bits[45:44) and NUM (bits943:39], only 1GB
>> instead of 8GB memory is covered.
>>
>> Fix the macro __TLBI_RANGE_NUM() so that the correct NUM and TLBI
>> RANGE operand are provided.
>>
>> Fixes: d1d3aa98b1d4 ("arm64: tlb: Use the TLBI RANGE feature in arm64")
>> Cc: stable@...nel.org # v5.10+
>> Reported-by: Yihuang Yu <yihyu@...hat.com>
>> Signed-off-by: Gavin Shan <gshan@...hat.com>
>> ---
>>   arch/arm64/include/asm/tlbflush.h | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
>> index 3b0e8248e1a4..07c4fb4b82b4 100644
>> --- a/arch/arm64/include/asm/tlbflush.h
>> +++ b/arch/arm64/include/asm/tlbflush.h
>> @@ -166,7 +166,7 @@ static inline unsigned long get_trans_granule(void)
>>    */
>>   #define TLBI_RANGE_MASK			GENMASK_ULL(4, 0)
>>   #define __TLBI_RANGE_NUM(pages, scale)	\
>> -	((((pages) >> (5 * (scale) + 1)) & TLBI_RANGE_MASK) - 1)
>> +	((((pages) >> (5 * (scale) + 1)) - 1) & TLBI_RANGE_MASK)
>>   
>>   /*
>>    *	TLB Invalidation
> 
> This looks pretty wrong, by the very definition of the comment that's
> just above:
> 
> <quote>
> /*
>   * Generate 'num' values from -1 to 30 with -1 rejected by the
>   * __flush_tlb_range() loop below.
>   */
> </quote>
> 
> With your change, num can't ever be negative, and that breaks
> __flush_tlb_range_op():
> 
> <quote>
> 		num = __TLBI_RANGE_NUM(pages, scale);			\
> 		if (num >= 0) {						\
> 			addr = __TLBI_VADDR_RANGE(start >> shift, asid, \
> 						scale, num, tlb_level);	\
> 			__tlbi(r##op, addr);				\
> 			if (tlbi_user)					\
> 				__tlbi_user(r##op, addr);		\
> 			start += __TLBI_RANGE_PAGES(num, scale) << PAGE_SHIFT; \
> 			pages -= __TLBI_RANGE_PAGES(num, scale);	\
> 		}							\
> 		scale--;						\
> </quote>
> 
> We'll then shove whatever value we've found in the TLBI operation,
> leading to unknown results instead of properly adjusting the scale to
> issue a smaller invalidation.
> 

Marc, thanks for your review and comments.

Indeed, this patch is incomplete at least. I think we need __TLBI_RANGE_NUM()
to return [-1 31] instead of [-1 30], to be consistent with MAX_TLBI_RANGE_PAGES.
-1 will be rejected in the following loop. I'm not 100% sure if I did the correct
calculation though.

/*
  * Generate 'num' values in range [-1 31], but -1 will be rejected
  * by the __flush_tlb_range() loop below.
  */
#define __TLBI_RANGE_NUM(pages, scale)                                          \
         ({                                                                      \
                 int __next = (pages) & (1ULL << (5 * (scale) + 6));             \
                 int __mask = ((pages) >> (5 * (scale) + 1)) & TLBI_RANGE_MASK;  \
                 int __num = (((pages) >> (5 * (scale) + 1)) - 1) &              \
                             TLBI_RANGE_MASK;                                    \
                 (__next || __mask) ? __num : -1;                                \
         })

Alternatively, we can also limit the number of pages to be invalidated from
arch/arm64/kvm/hyp/pgtable.c::kvm_tlb_flush_vmid_range() because the maximal
capacity is (MAX_TLBI_RANGE_PAGES - 1) instead of MAX_TLBI_RANGE_PAGES, as
the comments for __flush_tlb_range_nosync() say.

-               inval_pages = min(pages, MAX_TLBI_RANGE_PAGES);
+               inval_pages = min(pages, MAX_TLBI_RANGE_PAGES - 1);


static inline void __flush_tlb_range_nosync(...)
{
         :
        /*
          * When not uses TLB range ops, we can handle up to
          * (MAX_DVM_OPS - 1) pages;
          * When uses TLB range ops, we can handle up to
          * (MAX_TLBI_RANGE_PAGES - 1) pages.
          */
         if ((!system_supports_tlb_range() &&
              (end - start) >= (MAX_DVM_OPS * stride)) ||
             pages >= MAX_TLBI_RANGE_PAGES) {
                 flush_tlb_mm(vma->vm_mm);
                 return;
         }
}

Please let me know which way is better.

> I think the problem is that you are triggering NUM=31 and SCALE=3,
> which the current code cannot handle as per the comment above
> __flush_tlb_range_op() (we can't do NUM=30 and SCALE=4, obviously).
> 

Yes, exactly.

> Can you try the untested patch below?
> 
> 
> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> index 3b0e8248e1a4..b71a1cece802 100644
> --- a/arch/arm64/include/asm/tlbflush.h
> +++ b/arch/arm64/include/asm/tlbflush.h
> @@ -379,10 +379,6 @@ static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
>    * 3. If there is 1 page remaining, flush it through non-range operations. Range
>    *    operations can only span an even number of pages. We save this for last to
>    *    ensure 64KB start alignment is maintained for the LPA2 case.
> - *
> - * Note that certain ranges can be represented by either num = 31 and
> - * scale or num = 0 and scale + 1. The loop below favours the latter
> - * since num is limited to 30 by the __TLBI_RANGE_NUM() macro.
>    */
>   #define __flush_tlb_range_op(op, start, pages, stride,			\
>   				asid, tlb_level, tlbi_user, lpa2)	\
> @@ -407,6 +403,7 @@ do {									\
>   									\
>   		num = __TLBI_RANGE_NUM(pages, scale);			\
>   		if (num >= 0) {						\
> +			num += 1;					\
>   			addr = __TLBI_VADDR_RANGE(start >> shift, asid, \
>   						scale, num, tlb_level);	\
>   			__tlbi(r##op, addr);				\
> 

Thanks, but I don't think it's going to work. The loop will be running infinitely
because the condition 'if (num >= 0)' can't be met when @pages is 0x200000 when
@scale is 3/2/1/0 until @scale becomes negative and positive again, but @scale
isn't in range [0 3]. I ported the chunk of code to user-space and I can see this
with added printf() messages.

Thanks,
Gavin