Message-Id: <20251030-mte-tighten-tco-v1-0-88c92e7529d9@os.amperecomputing.com>
Date: Thu, 30 Oct 2025 20:49:30 -0700
From: Carl Worth <carl@...amperecomputing.com>
To: Catalin Marinas <catalin.marinas@....com>, 
 Will Deacon <will@...nel.org>
Cc: linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org, 
 Taehyun Noh <taehyun@...xas.edu>, Carl Worth <carl@...amperecomputing.com>
Subject: [PATCH 0/2] arm64: mte: Improve performance by tightening handling
 of PSTATE.TCO

[Thanks to Taehyun Noh from UT Austin for originally reporting this
bug. In this cover letter, "we" refers to a collaborative effort
between individuals at both Ampere Computing and UT Austin.]

We measured severe performance overhead (30-50%) when enabling
userspace MTE and running memcached on an AmpereOne machine (detailed
benchmark results are provided below).

We identified excessive tag checking taking place in the kernel (even
though only userspace tag checking was requested) as the culprit for
the performance slowdown. The existing code enables tag checking (by
_disabling_ PSTATE.TCO, "tag check override") at kernel entry
regardless of whether kernel-side MTE (via KASAN_HW_TAGS) or
userspace MTE is being requested.
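
For illustration only, the pre-series entry-path behaviour can be
sketched roughly as follows. The helper name and its boolean arguments
are hypothetical and simplified; the real logic lives in
arch/arm64/kernel/entry-common.c and arch/arm64/include/asm/mte.h, and
SET_PSTATE_TCO() is the existing arm64 sysreg macro:

  /* Simplified sketch, not the actual kernel code. */
  #include <linux/types.h>     /* bool */
  #include <asm/sysreg.h>      /* SET_PSTATE_TCO() */

  static inline void tco_on_kernel_entry_sketch(bool kernel_mte, bool user_mte)
  {
          /*
           * Tag checking is switched on (TCO cleared to 0) whenever MTE
           * is in use at all, so a task that only asked for userspace
           * MTE still has every kernel-side memory access tag checked.
           */
          if (kernel_mte || user_mte)
                  asm volatile(SET_PSTATE_TCO(0));
  }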

This patch series addresses the slowdown (in the case where only
userspace MTE is requested) by deferring the enabling of tag checking
until the kernel is about to access userspace memory: tag checking is
enabled in user_access_begin() and disabled again in
user_access_end().
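
A minimal sketch of the deferred scheme (the helper names below are
illustrative rather than the exact ones introduced by the patches):

  #include <asm/sysreg.h>      /* SET_PSTATE_TCO() */

  /* Called from user_access_begin(): the kernel is about to touch user
   * memory, so clear TCO and let tag checks apply to the uaccess. */
  static inline void tco_uaccess_begin_sketch(void)
  {
          asm volatile(SET_PSTATE_TCO(0));
  }

  /* Called from user_access_end(): back to kernel-only accesses, so
   * set TCO again and skip tag checking inside the kernel. */
  static inline void tco_uaccess_end_sketch(void)
  {
          asm volatile(SET_PSTATE_TCO(1));
  }

Keeping TCO set between these two points means kernel-internal loads
and stores are never tag checked when only userspace requested MTE.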

The effect of this patch series is most readily seen by using perf to
count tag-checked accesses in both kernel and userspace, for example
while running "perf bench futex hash" with MTE enabled.

Prior to the patch series, we see:

 # GLIBC_TUNABLES=glibc.mem.tagging=3 perf stat -e mem_access_checked_rd:u,mem_access_checked_wr:u,mem_access_checked_rd:k,mem_access_checked_wr:k perf bench futex hash
...
 Performance counter stats for 'perf bench futex hash':
     4,046,872,020      mem_access_checked_rd:u
            23,580      mem_access_checked_wr:u
   251,248,813,102      mem_access_checked_rd:k
    87,256,021,241      mem_access_checked_wr:k

And after the patch series we see (for the same command):

 Performance counter stats for 'perf bench futex hash':
     3,866,346,822      mem_access_checked_rd:u
            23,499      mem_access_checked_wr:u
     7,725,072,314      mem_access_checked_rd:k
               424      mem_access_checked_wr:k

As can be seen above, with roughly equivalent counts of userspace
tag-checked accesses, over 97% of the kernel-space tag-checked
accesses are eliminated.

As to performance, the patch series has been observed to have no
impact for workloads with MTE disabled.

For workloads with MTE enabled, we measured the series causing a 5-8%
slowdown for "perf bench futex hash". Presumably this results from
code paths that now include two writes to PSTATE.TCO where previously
there was only one. Given that this is a synthetic micro-benchmark, we
argue that this slowdown is acceptable in light of the results with
more realistic workloads described below.

We used the Phoronix Test Suite pts/memcached benchmark with a
get-heavy workload (1:10 Set:Get ratio) which is where the slowdown
appears most clearly. The slowdown worsens with increased core count,
levelling out above 32 cores. The numbers below are based on averages
from 50 runs each, with 96 cores on each run. For "MTE on",
GLIBC_TUNABLES was set to "glibc.mem.tagging=3". For "MTE off",
GLIBC_TUNABLES was unset.

The numbers below are ops./sec. normalized to the baseline case
(unpatched kernel, MTE off); higher is better.

Before the patch series (unpatched v6.18-rc1):

	MTE off: 1.000
	MTE  on: 0.455

	MTE overhead: 54.5% +/- 2.3%

After applying this patch series:

	MTE off: 0.997
	MTE  on: 1.002

	MTE overhead: No difference proven at 95.0% confidence

Changes since v1:

  * Reordered patches to put the cleanup patch before the performance fix.

Signed-off-by: Carl Worth <carl@...amperecomputing.com>
---
Carl Worth (1):
      arm64: mte: Defer disabling of TCO until user_access_begin/end

Taehyun Noh (1):
      arm64: mte: Unify kernel MTE policy and manipulation of TCO

 arch/arm64/include/asm/mte.h     | 53 +++++++++++++++++++++++++++++++---------
 arch/arm64/include/asm/uaccess.h | 32 +++++++++++++++++++++++-
 arch/arm64/kernel/entry-common.c |  4 +--
 arch/arm64/kernel/mte.c          |  2 +-
 4 files changed, 76 insertions(+), 15 deletions(-)
---
base-commit: 3a8660878839faadb4f1a6dd72c3179c1df56787

