Message-Id: <20251030-mte-tighten-tco-v2-0-e259dda9d5b3@os.amperecomputing.com>
Date: Thu, 15 Jan 2026 15:07:16 -0800
From: Carl Worth <carl@...amperecomputing.com>
To: Catalin Marinas <catalin.marinas@....com>, 
 Will Deacon <will@...nel.org>
Cc: linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org, 
 Taehyun Noh <taehyun@...xas.edu>, Carl Worth <carl@...amperecomputing.com>
Subject: [PATCH v2 0/2] arm64: mte: Improve performance by explicitly
 disabling unwanted tag checking

[Thanks to Taehyun Noh from UT Austin for originally reporting this
bug. In this cover letter, "we" refers to a collaborative effort
between individuals at both Ampere Computing and UT Austin.]

We measured severe performance overhead (25-50%) when enabling
userspace MTE and running memcached on an AmpereOne machine (detailed
benchmark results are provided below).

We identified excessive tag checking taking place in the kernel (even
though only userspace tag checking was requested) as the culprit for
the performance slowdown. The existing kernel code assumes that if tag
check faults are not requested, then the hardware will not perform tag
checking. We found (empirically) that this is not the case for at
least some implementations, and verified that there is no
architectural requirement that tag checking be disabled when tag check
faults are not requested.
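
For reference (not part of this series), "userspace tag checking" here
means roughly the following per-task request, which is approximately
what GLIBC_TUNABLES=glibc.mem.tagging=3 arranges via prctl() at
startup; the exact flags below are illustrative:

  /* build: cc -o mte-user mte-user.c (arm64, recent UAPI headers) */
  #include <stdio.h>
  #include <sys/prctl.h>
  #include <linux/prctl.h>

  int main(void)
  {
          /*
           * Request synchronous tag check faults for userspace (EL0)
           * accesses and allow all tags to be generated by IRG.  Note
           * that nothing here asks for tag checking of kernel (EL1)
           * accesses performed on the task's behalf.
           */
          if (prctl(PR_SET_TAGGED_ADDR_CTRL,
                    PR_TAGGED_ADDR_ENABLE | PR_MTE_TCF_SYNC |
                    PR_MTE_TAG_MASK, 0, 0, 0)) {
                  perror("prctl(PR_SET_TAGGED_ADDR_CTRL)");
                  return 1;
          }
          puts("userspace MTE enabled (sync tag check faults)");
          return 0;
  }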

This patch series addresses the slowdown by using TCMA1 to explicitly
disable unwanted tag checking.
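
For illustration only (this is not the code from the series, which per
the diffstat lives in mte.h, entry-common.c, mte.c and proc.S): setting
TCR_EL1.TCMA1 (bit 58, TCR_TCMA1 in <asm/pgtable-hwdef.h>) makes EL1
accesses via TTBR1 addresses whose logical tag (bits [59:56]) is 0b1111
-- i.e. ordinary kernel pointers carrying the match-all tag --
Unchecked, so the hardware never performs a tag comparison for them,
while pointers carrying any other tag remain checkable. A minimal
sketch of the idea, with a hypothetical helper name:

  #include <asm/barrier.h>        /* isb() */
  #include <asm/pgtable-hwdef.h>  /* TCR_TCMA1 */
  #include <asm/sysreg.h>         /* sysreg_clear_set() */

  /* Hypothetical helper, illustrating the TCMA1 idea only. */
  static inline void mte_set_tcma1(void)
  {
          sysreg_clear_set(tcr_el1, 0, TCR_TCMA1);
          isb();
  }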

The effect of this patch series is most readily seen by using perf to
count tag-checked accesses in both kernel and userspace, for example
while running "perf bench futex hash" with MTE enabled.

Prior to the patch series, we see:

 # GLIBC_TUNABLES=glibc.mem.tagging=3 perf stat -e mem_access_checked_rd:u,mem_access_checked_wr:u,mem_access_checked_rd:k,mem_access_checked_wr:k perf bench futex hash
...
 Performance counter stats for 'perf bench futex hash':
     4,246,651,954      mem_access_checked_rd:u
        29,375,167      mem_access_checked_wr:u
   246,588,717,771      mem_access_checked_rd:k
    78,805,316,911      mem_access_checked_wr:k

And after the patch series we see (for the same command):

 Performance counter stats for 'perf bench futex hash':
     4,337,091,554      mem_access_checked_rd:u
            23,487      mem_access_checked_wr:u
     4,342,774,550      mem_access_checked_rd:k
               788      mem_access_checked_wr:k

As can be seen above, with roughly equivalent counts of userspace
tag-checked accesses, over 98% of the kernel-space tag-checked
accesses are eliminated.

As to performance, the patch series should have no impact if the
kernel is not compiled with MTE support, and it has not been observed
to have any impact when the kernel includes MTE support but the
workload has MTE disabled in userspace.

For workloads with MTE enabled, we measured the series giving a 2%
improvement for "perf bench futex hash" at 95% confidence.

Also, we used the Phoronix Test Suite pts/memcached benchmark with a
get-heavy workload (1:10 Set:Get ratio), which is where the slowdown
appears most clearly. The slowdown worsens with increased core count,
levelling out above 32 cores. The numbers below are based on averages
from 50 runs each, with 96 cores on each run. For "MTE on",
GLIBC_TUNABLES was set to "glibc.mem.tagging=3". For "MTE off",
GLIBC_TUNABLES was unset.

The numbers below are ops./sec. normalized to the baseline case
(unpatched kernel, MTE off); higher is better.

Before the patch series (upstream v6.19-rc5+):

	MTE off: 1.000
	MTE  on: 0.742

	MTE overhead: 25.8% +/- 1.6%

After applying this patch series:

	MTE off: 0.991
	MTE  on: 0.990

	MTE overhead: No difference proven at 95.0% confidence
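
The overhead percentages above follow directly from the normalized
throughput; a quick check of the arithmetic, with the values copied
from this cover letter:

  #include <stdio.h>

  int main(void)
  {
          double before_off = 1.000, before_on = 0.742;
          double after_off  = 0.991, after_on  = 0.990;

          /* overhead = 1 - (MTE on)/(MTE off), as a percentage */
          printf("before: %.1f%%\n", (1.0 - before_on / before_off) * 100.0);
          printf("after:  %.1f%%\n", (1.0 - after_on / after_off) * 100.0);
          return 0;
  }

which prints 25.8% before the series and roughly 0.1% after, the
latter being within the measurement noise.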

-Carl

---
Changes in v2:
- Fixed to correctly pass 'current' vs. 'next' in set_kernel_mte_policy
  (thanks to Will Deacon)
- Changed approach to use TCMA1 rather than toggling PSTATE.TCO
  (thanks to Catalin Marinas)
- Link to v1: https://lore.kernel.org/r/20251030-mte-tighten-tco-v1-0-88c92e7529d9@os.amperecomputing.com
---
Carl Worth (1):
      arm64: mte: Set TCMA1 whenever MTE is present in the kernel

Taehyun Noh (1):
      arm64: mte: Clarify kernel MTE policy and manipulation of TCO

 arch/arm64/include/asm/mte.h     | 40 +++++++++++++++++++++++++++++++++-------
 arch/arm64/kernel/entry-common.c |  4 ++--
 arch/arm64/kernel/mte.c          |  2 +-
 arch/arm64/mm/proc.S             | 10 +++++-----
 4 files changed, 41 insertions(+), 15 deletions(-)
---
base-commit: 944aacb68baf7624ab8d277d0ebf07f025ca137c

