Message-ID: <20250120024104.1924753-1-riel@surriel.com>
Date: Sun, 19 Jan 2025 21:40:08 -0500
From: Rik van Riel <riel@...riel.com>
To: x86@...nel.org
Cc: linux-kernel@...r.kernel.org,
bp@...en8.de,
peterz@...radead.org,
dave.hansen@...ux.intel.com,
zhengqi.arch@...edance.com,
nadav.amit@...il.com,
thomas.lendacky@....com,
kernel-team@...a.com,
linux-mm@...ck.org,
akpm@...ux-foundation.org,
jannh@...gle.com,
mhklinux@...look.com,
andrew.cooper3@...rix.com
Subject: [PATCH v6 00/12] AMD broadcast TLB invalidation

Add support for broadcast TLB invalidation using AMD's INVLPGB instruction.
This allows the kernel to invalidate TLB entries on remote CPUs without
needing to send IPIs, without having to wait for remote CPUs to handle
those interrupts, and with less interruption to what was running on
those CPUs.

Because x86 PCID space is limited, and there are some very large
systems out there, broadcast TLB invalidation is only used for
processes that are active on 3 or more CPUs, with the threshold
being gradually increased the more the PCID space gets exhausted.
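
To illustrate the threshold policy described above, here is a minimal
sketch in C. It is not the code from this series: the pool size, the
doubling rule, and every sketch_* name are assumptions made up for the
example. Only the base threshold of 3 CPUs comes from the text above.

```c
#include <limits.h>
#include <stdbool.h>

/* From the cover letter: broadcast flushing starts at 3 or more CPUs. */
#define SKETCH_BASE_THRESHOLD	3U
/* Assumed number of global ASIDs; chosen for illustration only. */
#define SKETCH_TOTAL_ASIDS	2048U

/*
 * Hypothetical policy: double the CPU threshold every time the pool of
 * free global ASIDs drops below another half of its previous size.
 */
static unsigned int sketch_broadcast_threshold(unsigned int free_asids)
{
	unsigned int threshold = SKETCH_BASE_THRESHOLD;
	unsigned int pool = SKETCH_TOTAL_ASIDS;

	/* Nothing left to hand out: no process qualifies. */
	if (free_asids == 0)
		return UINT_MAX;

	while (pool > 1 && free_asids < pool / 2) {
		threshold *= 2;
		pool /= 2;
	}
	return threshold;
}

/* A process qualifies once it is active on at least 'threshold' CPUs. */
static bool sketch_use_broadcast(unsigned int active_cpus,
				 unsigned int free_asids)
{
	return active_cpus >= sketch_broadcast_threshold(free_asids);
}
```

Under these assumptions, a process active on 3 CPUs qualifies while the
pool is at least half free, and would need 6 CPUs once fewer than half
of the assumed ASIDs remain.
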
Combined with the removal of unnecessary lru_add_drain calls
(see https://lkml.org/lkml/2024/12/19/1388) this results in a
nice performance boost for the will-it-scale tlb_flush2_threads
test on an AMD Milan system with 36 cores:
- vanilla kernel: 527k loops/second
- lru_add_drain removal: 731k loops/second
- only INVLPGB: 527k loops/second
- lru_add_drain + INVLPGB: 1157k loops/second

Profiling with only the INVLPGB changes showed that while
TLB invalidation went down from 40% of the total CPU
time to only around 4% of CPU time, the contention
simply moved to the LRU lock.

Fixing both at the same time roughly doubles the
number of iterations per second in this case
(527k to 1157k loops/second, about 2.2x).

Some numbers closer to real world performance
can be found at Phoronix, thanks to Michael:
https://www.phoronix.com/news/AMD-INVLPGB-Linux-Benefits

My current plan is to implement support for Intel's RAR
(Remote Action Request) TLB flushing in a follow-up series,
after this thing has been merged into -tip. Making things
any larger would just be unwieldy for reviewers.

v6:
- fix info->end check in flush_tlb_kernel_range (Michael)
- disable broadcast TLB flushing on 32 bit x86
v5:
- use byte assembly for compatibility with older toolchains (Borislav, Michael)
- ensure a panic on an invalid number of extra pages (Dave, Tom)
- add cant_migrate() assertion to tlbsync (Jann)
- a bunch more cleanups (Nadav)
- key TCE enabling off X86_FEATURE_TCE (Andrew)
- fix a race between reclaim and ASID transition (Jann)
v4:
- Use only bitmaps to track free global ASIDs (Nadav)
- Improved AMD initialization (Borislav & Tom)
- Various naming and documentation improvements (Peter, Nadav, Tom, Dave)
- Fixes for subtle race conditions (Jann)
v3:
- Remove paravirt tlb_remove_table call (thank you Qi Zheng)
- More suggested cleanups and changelog fixes by Peter and Nadav
v2:
- Apply suggestions by Peter and Borislav (thank you!)
- Fix bug in arch_tlbbatch_flush, where we need to do both
  the TLBSYNC, and flush the CPUs that are in the cpumask.
- Some updates to comments and changelogs based on questions.