Message-ID: <2eb026b8-9e13-2b60-9e14-06417b142ac9@bytedance.com>
Date:   Thu, 27 Apr 2023 11:26:50 +0800
From:   Gang Li <ligang.bdlg@...edance.com>
To:     Will Deacon <will@...nel.org>,
        Tomasz Nowicki <tomasz.nowicki@...aro.org>,
        Laura Abbott <lauraa@...eaurora.org>
Cc:     Catalin Marinas <catalin.marinas@....com>,
        Will Deacon <will@...nel.org>,
        Ard Biesheuvel <ardb@...nel.org>,
        Anshuman Khandual <anshuman.khandual@....com>,
        Mark Rutland <mark.rutland@....com>,
        Kefeng Wang <wangkefeng.wang@...wei.com>,
        Feiyang Chen <chenfeiyang@...ngson.cn>,
        linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org
Subject: [QUESTION FOR ARM64 TLB] performance issue and implementation
 difference of TLB flush

Hi all,

I have encountered a performance issue on our ARM64 machine, which seems
to be caused by flush_tlb_kernel_range.

Here is the stack on the ARM64 machine:

# ARM64:
```
     ghes_unmap
     clear_fixmap
     __set_fixmap
     flush_tlb_kernel_range
```

As we can see, the ARM64 implementation eventually calls
flush_tlb_kernel_range, which flushes the TLB on all cores. However, on
AMD64, the implementation calls flush_tlb_one_kernel instead.

# AMD64:
```
     ghes_unmap
     clear_fixmap
     __set_fixmap
     mmu.set_fixmap
     native_set_fixmap
     __native_set_fixmap
     set_pte_vaddr
     set_pte_vaddr_p4d
     __set_pte_vaddr
     flush_tlb_one_kernel
```

On our ARM64 machine, flush_tlb_kernel_range is causing a noticeable
performance degradation.

The comment added by this arm64 patch says:
https://lore.kernel.org/all/20161201135112.15396-1-fu.wei@linaro.org/
(commit 9f9a35a7b654e006250530425eb1fb527f0d32e9)

```
/*
  * Despite its name, this function must still broadcast the TLB
  * invalidation in order to ensure other CPUs don't end up with junk
  * entries as a result of speculation. Unusually, its also called in
  * IRQ context (ghes_iounmap_irq) so if we ever need to use IPIs for
  * TLB broadcasting, then we're in trouble here.
  */
static inline void arch_apei_flush_tlb_one(unsigned long addr)
{
     flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
}
```
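
As a rough illustration of the concern in that comment, here is a toy
userspace model in C. It is not kernel code: the toy_* names, the
four-CPU count and the single-slot "TLB" are all hypothetical, and it
does not model how either architecture really behaves. It only shows
that if another core has cached a translation for the fixmap slot, a
flush on the unmapping core alone leaves that cached entry behind.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS    4      /* hypothetical core count for the model */
#define NO_MAPPING 0UL    /* page-table entry value meaning "unmapped" */

/* One cached translation per CPU for a single fixmap slot. */
struct toy_tlb {
	bool valid;
	unsigned long phys;
};

static struct toy_tlb tlb[NR_CPUS];
static unsigned long fixmap_pte = NO_MAPPING;	/* shared page-table entry */

/* Update the shared page-table entry; per-CPU TLBs are not touched. */
void toy_set_fixmap(unsigned long phys)
{
	fixmap_pte = phys;
}

/* A CPU walks the page table (possibly speculatively) and caches the result. */
void toy_tlb_fill(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	if (fixmap_pte != NO_MAPPING) {
		tlb[cpu].valid = true;
		tlb[cpu].phys = fixmap_pte;
	}
}

/* Invalidate the entry on one CPU only. */
void toy_flush_local(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	tlb[cpu].valid = false;
}

/* Invalidate the entry on every CPU. */
void toy_flush_broadcast(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		tlb[cpu].valid = false;
}

/* Count CPUs whose cached translation no longer matches the page table. */
int toy_count_stale(void)
{
	int stale = 0;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (tlb[cpu].valid && tlb[cpu].phys != fixmap_pte)
			stale++;
	return stale;
}

/* CPU 0 maps the slot, CPU 1 also caches it, then CPU 0 unmaps and flushes. */
int toy_scenario(bool broadcast)
{
	toy_flush_broadcast();		/* start from empty TLBs */
	toy_set_fixmap(0x1000UL);
	toy_tlb_fill(0);
	toy_tlb_fill(1);		/* e.g. a speculative walk on another core */
	toy_set_fixmap(NO_MAPPING);	/* stands in for clear_fixmap */
	if (broadcast)
		toy_flush_broadcast();
	else
		toy_flush_local(0);
	return toy_count_stale();
}
```

In this model, toy_scenario(false) reports one stale entry (on CPU 1),
while toy_scenario(true) reports none.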

1. I am curious to know the reason behind the design choice of flushing
the TLB on all cores for ARM64's clear_fixmap, while AMD64 only flushes
the TLB on a single core. Are there any TLB design details that make a
difference here?

2. Is it possible for ARM64 to flush the TLB on just one core, as AMD64
does?

3. If so, would there be any potential drawbacks or limitations to
making such a change?

Thanks,

Gang Li
