Message-ID: <992edb0fed403333d237350b0a6730f2@dogleg.nslick.com>
Date: Fri, 02 May 2025 01:19:17 -0500
From: Nicholas Sielicki <opensource@...ick.com>
To: herton@...hat.com
Cc: aokuliar@...hat.com, atomasov@...hat.com, bp@...en8.de, dave.hansen@...ux.intel.com, hpa@...or.com, linux-kernel@...r.kernel.org, mingo@...hat.com, mjguzik@...il.com, olichtne@...hat.com, tglx@...utronix.de, torvalds@...ux-foundation.org, x86@...nel.org
Subject: Re: [PATCH] x86: write aligned to 8 bytes in copy_user_generic (when without FSRM/ERMS)

On Thu, Mar 20, 2025 at 3:22 PM Herton R. Krzesinski <herton@...hat.com> wrote:
> History of the performance regression:
> ======================================
> 
> Since the following series of user copy updates were merged upstream
> ~2 years ago via:
> 
>   a5624566431d ("Merge branch 'x86-rep-insns': x86 user copy clarifications")
> 
> ... copy_user_generic() on x86_64 stopped aligning writes to the
> destination to an 8-byte boundary in the non-FSRM case.
> 
> Previously, this was done through the ALIGN_DESTINATION macro used in
> the now-removed copy_user_generic_unrolled() function.
> 
> It turns out this change causes a loss of performance/throughput in
> some use cases on specific CPUs/platforms without FSRM and ERMS.
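
(Aside, for anyone reading along without the old source handy: the
removed ALIGN_DESTINATION logic amounted to something like the plain-C
sketch below. Illustrative only; the names are mine, not the kernel's,
and the real code was hand-written asm with user-access fault handling.)

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch of the ALIGN_DESTINATION idea: byte-copy until the
 * destination is 8-byte aligned, move the bulk 8 bytes at a time,
 * then finish the tail. */
static void copy_align_dst8(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;
	uint64_t tmp;

	/* Head: byte copies until dst sits on an 8-byte boundary. */
	while (len && ((uintptr_t)d & 7)) {
		*d++ = *s++;
		len--;
	}

	/* Body: aligned 8-byte stores (the rep-movsq-like fast path;
	 * the source is allowed to stay misaligned). */
	while (len >= 8) {
		memcpy(&tmp, s, 8);
		memcpy(d, &tmp, 8);
		d += 8;
		s += 8;
		len -= 8;
	}

	/* Tail: whatever is left. */
	while (len--)
		*d++ = *s++;
}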
> 
> Recently I received two reports of performance/throughput issues after
> a RHEL 9 kernel pulled in the same upstream series of user copy
> updates. Both reports involved specific networking/TCP tests run with
> iperf3.
> 
> Partial upstream fix
> ====================
> 
> The first report involved Linux bridge testing with VMs on a specific
> machine with an AMD CPU (EPYC 7402), and after a brief investigation it
> turned out that the later change via:
> 
>   ca96b162bfd2 ("x86: bring back rep movsq for user access on CPUs without ERMS")
> 
> ... helped/fixed the performance issue.
> 
> However, even after that later commit/fix was applied, I got another
> regression report, this time from a multistream TCP test on a 100Gbit
> mlx5 NIC, also running on an AMD-based platform (EPYC 7302 CPU) and
> again using iperf3 to run the test. So the later fix alone didn't tell
> the whole story.

I went down the rabbit hole of this issue in late 2022 at $work, where
we hit it running the same iperf3 single-flow workload described above
on a Milan system. It took me some time (much longer than I'd like to
admit), but eventually I started asking questions about FSRM and ERMS,
and stumbled across the lkml threads surrounding the +FSRM/-ERMS
alternative.

Before arriving at that root cause, I had noticed that tx-nocache-copy /
NETIF_F_NOCACHE_COPY was able to considerably improve perf compared to
baseline. Is that interesting to anyone?
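
(For anyone who wants to experiment: NETIF_F_NOCACHE_COPY is the
feature ethtool exposes as "tx-nocache-copy", so it can be toggled at
runtime with "ethtool -K <dev> tx-nocache-copy on|off".)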

I did a bit of research on the history of tx-nocache-copy, and how it
might relate to znver3, and walked away believing the following story to
be mostly true:

1. tx-nocache-copy was introduced in 2011 and was initially enabled by
default on all non-loopback interfaces. When it was tested on an AMD
Bulldozer-like system, it showed a significant improvement in tail
latency and a 5%-10% improvement in transactions per second.

2. A year later, in products released in 2012, Intel introduced
something called DDIO. My entire understanding of DDIO comes from a
single PDF [1], so take this with a massive grain of salt, but from what
I understand it was intended to solve largely the same problem that
tx-nocache-copy was optimizing for, and I think it did so in a way that
broke many of the underlying assumptions tx-nocache-copy relied on.

In other words, it didn't just make tx-nocache-copy unnecessary; it
made its use actively harmful, for two reasons:

+ DDIO operates atop dedicated cache ways. If that reserved cache
  space goes unused, other users don't see any less cache contention;
  the space is simply wasted. Remote reads by the NIC of data held in
  cache also stopped triggering a write-back from cache to main memory,
  which you might otherwise expect to occur under most coherency
  protocols; the coherency state machine grew a carve-out for this
  specific flow.

  So if your motivation for issuing a non-temporal write (a plain-C
  sketch of such a copy follows after these bullets) is any of:

   a) to keep your write from evicting other useful data from the cache,
   b) to avoid the coherency traffic triggered by the remote read, or
   c) that you anticipate a significant amount of time and/or cache
      churn between now and when the remote read takes place, and you
      think it's reasonable to expect the data to have been evicted from
      cache to main memory by then anyway,

  then all of them make less sense on a system with DDIO.

+ Because reads by the NIC are expected to usually be serviced directly
  from cache, Intel also stopped issuing speculative reads against main
  memory as early as it otherwise could, on the assumption that their
  results would rarely be needed.

  This means that on systems with DDIO, if you elect to use non-temporal
  hints, those remote reads become extra slow, because the main-memory
  access is serialized behind the (now usually pointless) cache lookup.

Putting these two together, tx-nocache-copy stopped making sense, and
the best thing the kernel could do was to play dumb.
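
To make the motivations above concrete, here is a minimal plain-C
sketch of the kind of non-temporal ("nocache") copy that
tx-nocache-copy enables. Illustrative only: the names are mine, it
assumes x86-64 with SSE2, and it has none of the fault handling the
kernel's real (asm) implementation needs.

#include <emmintrin.h>	/* _mm_stream_si64, _mm_sfence (SSE2) */
#include <stddef.h>
#include <string.h>

/* Copy len bytes from src to dst using non-temporal stores. */
static void nt_copy(void *dst, const void *src, size_t len)
{
	long long *d = dst;
	const unsigned char *s = src;
	long long v;

	while (len >= 8) {
		memcpy(&v, s, 8);	/* tolerate a misaligned source */
		_mm_stream_si64(d, v);	/* movnti: bypasses the cache */
		d++;
		s += 8;
		len -= 8;
	}
	if (len)
		memcpy(d, s, len);	/* plain cached copy for the tail */

	/* Order the NT stores ahead of any later "data ready" signal
	 * to a consumer (e.g. a NIC doing DMA). */
	_mm_sfence();
}

On a system without DDIO, the data then sits in main memory, where the
device will read it anyway; on a DDIO system, it has been pushed out of
exactly the cache the NIC was going to read from.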

3. By 2017, after many complaints, and in a world where almost everyone
in the high-performance networking/server space was using an Intel
platform with DDIO, tx-nocache-copy was changed to be disabled by
default.

I didn't see DDIO mentioned in any of the on-list discussions of
tx-nocache-copy that I could find, and it led me to wonder if this is
the real story behind tx-nocache-copy: a reasonable feature, introduced
with unlucky timing, which ultimately fell victim to a hardware
monoculture without anyone fully realizing it.

If that story is true, then it might suggest that tx-nocache-copy still
has merit on current Zen systems (which have nothing similar to DDIO,
as far as I'm aware). At least in late 2022, under the original
unpatched kernels available at the time, I can report that it did. I no
longer work at $work, so I have no way to retest myself, but I'd be
curious to hear the results from anyone who finds this interesting
enough to look into it.

[1]: https://www.intel.com/content/dam/www/public/us/en/documents/technology-briefs/data-direct-i-o-technology-brief.pdf
The relevant section for tx-nocache-copy is "2.2".
