Message-Id: <20251030-mte-tighten-tco-v2-0-e259dda9d5b3@os.amperecomputing.com>
Date: Thu, 15 Jan 2026 15:07:16 -0800
From: Carl Worth <carl@...amperecomputing.com>
To: Catalin Marinas <catalin.marinas@....com>,
Will Deacon <will@...nel.org>
Cc: linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org,
Taehyun Noh <taehyun@...xas.edu>, Carl Worth <carl@...amperecomputing.com>
Subject: [PATCH v2 0/2] arm64: mte: Improve performance by explicitly
disabling unwanted tag checking
[Thanks to Taehyun Noh from UT Austin for originally reporting this
bug. In this cover letter, "we" refers to a collaborative effort
between individuals at both Ampere Computing and UT Austin.]
We measured severe performance overhead (25-50%) when enabling
userspace MTE and running memcached on an AmpereOne machine (detailed
benchmark results are provided below).
We identified excessive tag checking taking place in the kernel
(though only userspace tag checking was requested) as the culprit for
the performance slowdown. The existing kernel implementation expects
that if tag check faults are not requested, then the implementation
will not perform tag checking. We found (empirically) that this is not
the case for at least some implementations, and verified that there's
no architectural requirement that tag checking be disabled when tag
check faults are not requested.
This patch series addresses the slowdown by using TCMA1 to explicitly
disable unwanted tag checking.
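
As a rough illustration of the mechanism (not code from the patches), the
architectural effect of TCMA1 can be modelled in a few lines of C. The
sketch assumes the Arm architecture definition of TCR_EL1.TCMA1 (bit 58):
when it is set, accesses whose address bits [59:55] are all ones are
Unchecked. The macro and function names below are hypothetical and exist
only for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: TCR_EL1.TCMA1 is bit 58, per the Arm architecture. */
#define EXAMPLE_TCR_EL1_TCMA1 (UINT64_C(1) << 58)

/*
 * Illustrative model only (hypothetical helper, not a kernel function).
 * Returns true if TCMA1 by itself makes an access to 'addr' Unchecked:
 * TCMA1 must be set and address bits [59:55] must all be ones, i.e.
 * bit 55 selects the upper (kernel) range and the logical tag is 0xf.
 */
static bool example_tcma1_makes_unchecked(uint64_t tcr_el1, uint64_t addr)
{
	uint64_t bits_59_55 = (addr >> 55) & 0x1f;

	return (tcr_el1 & EXAMPLE_TCR_EL1_TCMA1) != 0 && bits_59_55 == 0x1f;
}
```

Under this model, a kernel address with all-ones top bits is Unchecked
once TCMA1 is set, while the same address carrying a different logical
tag in bits [59:56] is not affected by TCMA1.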
The effect of this patch series is most readily seen by using perf to
count tag-checked accesses in both kernel and userspace, for example
while running "perf bench futex hash" with MTE enabled.
Prior to the patch series, we see:
# GLIBC_TUNABLES=glibc.mem.tagging=3 perf stat -e mem_access_checked_rd:u,mem_access_checked_wr:u,mem_access_checked_rd:k,mem_access_checked_wr:k perf bench futex hash
...
 Performance counter stats for 'perf bench futex hash':
     4,246,651,954      mem_access_checked_rd:u
        29,375,167      mem_access_checked_wr:u
   246,588,717,771      mem_access_checked_rd:k
    78,805,316,911      mem_access_checked_wr:k
And after the patch series we see (for the same command):
 Performance counter stats for 'perf bench futex hash':
     4,337,091,554      mem_access_checked_rd:u
            23,487      mem_access_checked_wr:u
     4,342,774,550      mem_access_checked_rd:k
               788      mem_access_checked_wr:k
As can be seen above, with roughly equivalent counts of userspace
tag-checked accesses, over 98% of the kernel-space tag-checked
accesses are eliminated.
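
The reduction figure can be re-derived from the counters with a small
helper. This is only a sketch of the arithmetic; the function name is
hypothetical, and it assumes the kernel total is the sum of the
mem_access_checked_rd:k and mem_access_checked_wr:k counts.

```c
#include <stdint.h>

/*
 * Hypothetical helper: percentage of events eliminated, given the
 * event totals before and after a change. 'before' must be nonzero.
 */
static double example_percent_reduction(uint64_t before, uint64_t after)
{
	return 100.0 * (1.0 - (double)after / (double)before);
}
```

Passing 246588717771 + 78805316911 as the "before" total and
4342774550 + 788 as the "after" total yields a reduction of roughly
98.7%.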
As for performance, the patch series should have no behavioral impact
if the kernel is not compiled with MTE support. The series also has not
been observed to have any impact when the kernel includes MTE support
but the workloads have MTE disabled in userspace.
For workloads with MTE enabled, we measured the series giving a 2%
improvement for "perf bench futex hash" at 95% confidence.
We also used the Phoronix Test Suite pts/memcached benchmark with a
get-heavy workload (1:10 Set:Get ratio), which is where the slowdown
appears most clearly. The slowdown worsens with increased core count,
levelling out above 32 cores. The numbers below are based on averages
from 50 runs each, with 96 cores on each run. For "MTE on",
GLIBC_TUNABLES was set to "glibc.mem.tagging=3". For "MTE off",
GLIBC_TUNABLES was unset.
The numbers below are ops/sec (higher is better), normalized to the
baseline case (unpatched kernel, MTE off).
Before the patch series (upstream v6.19-rc5+):
	MTE off: 1.000
	MTE on:  0.742
	MTE overhead: 25.8% +/- 1.6%
After applying this patch series:
	MTE off: 0.991
	MTE on:  0.990
	MTE overhead: No difference proven at 95.0% confidence
-Carl
---
Changes in v2:
- Fixed to correctly pass 'current' vs. 'next' in set_kernel_mte_policy
  (thanks to Will Deacon)
- Changed approach to use TCMA1 rather than toggling PSTATE.TCO
  (thanks to Catalin Marinas)
- Link to v1: https://lore.kernel.org/r/20251030-mte-tighten-tco-v1-0-88c92e7529d9@os.amperecomputing.com
---
Carl Worth (1):
arm64: mte: Set TCMA1 whenever MTE is present in the kernel
Taehyun Noh (1):
arm64: mte: Clarify kernel MTE policy and manipulation of TCO
 arch/arm64/include/asm/mte.h     | 40 +++++++++++++++++++++++++++++++++-------
 arch/arm64/kernel/entry-common.c |  4 ++--
 arch/arm64/kernel/mte.c          |  2 +-
 arch/arm64/mm/proc.S             | 10 +++++-----
 4 files changed, 41 insertions(+), 15 deletions(-)
---
base-commit: 944aacb68baf7624ab8d277d0ebf07f025ca137c