linux-kernel - [QUESTION FOR ARM64 TLB] performance issue and implementation difference of TLB flush

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [thread-next>] [day] [month] [year] [list]

Message-ID: <2eb026b8-9e13-2b60-9e14-06417b142ac9@bytedance.com>
Date:   Thu, 27 Apr 2023 11:26:50 +0800
From:   Gang Li <ligang.bdlg@...edance.com>
To:     Will Deacon <will@...nel.org>,
        Tomasz Nowicki <tomasz.nowicki@...aro.org>,
        Laura Abbott <lauraa@...eaurora.org>
Cc:     Catalin Marinas <catalin.marinas@....com>,
        Will Deacon <will@...nel.org>,
        Ard Biesheuvel <ardb@...nel.org>,
        Anshuman Khandual <anshuman.khandual@....com>,
        Mark Rutland <mark.rutland@....com>,
        Kefeng Wang <wangkefeng.wang@...wei.com>,
        Feiyang Chen <chenfeiyang@...ngson.cn>,
        linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org
Subject: [QUESTION FOR ARM64 TLB] performance issue and implementation
 difference of TLB flush

Hi all,

I have encountered a performance issue on our ARM64 machine, which seems
to be caused by the flush_tlb_kernel_range.

Here is the stack on the ARM64 machine:

# ARM64:
```
     ghes_unmap
     clear_fixmap
     __set_fixmap
     flush_tlb_kernel_range
```

As we can see, the ARM64 implementation eventually calls
flush_tlb_kernel_range, which flushes the TLB on all cores. However, on
AMD64, the implementation calls flush_tlb_one_kernel instead.

# AMD64:
```
     ghes_unmap
     clear_fixmap
     __set_fixmap
     mmu.set_fixmap
     native_set_fixmap
     __native_set_fixmap
     set_pte_vaddr
     set_pte_vaddr_p4d
     __set_pte_vaddr
     flush_tlb_one_kernel
```

On our ARM64 machine, flush_tlb_kernel_range is causing a noticeable
performance degradation.

This arm64 patch said:
https://lore.kernel.org/all/20161201135112.15396-1-fu.wei@linaro.org/
(commit 9f9a35a7b654e006250530425eb1fb527f0d32e9)

```
/*
  * Despite its name, this function must still broadcast the TLB
  * invalidation in order to ensure other CPUs don't end up with junk
  * entries as a result of speculation. Unusually, its also called in
  * IRQ context (ghes_iounmap_irq) so if we ever need to use IPIs for
  * TLB broadcasting, then we're in trouble here.
  */
static inline void arch_apei_flush_tlb_one(unsigned long addr)
{
     flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
}
```

1. I am curious to know the reason behind the design choice of flushing
the TLB on all cores for ARM64's clear_fixmap, while AMD64 only flushes
the TLB on a single core. Are there any TLB design details that make a
difference here?

2. Is it possible to let the ARM64 to flush the TLB on just one core,
similar to the AMD64?

3. If so, would there be any potential drawbacks or limitations to
making such a change?

Thanks,

Gang Li