Message-ID: <fe231f2d-fcb1-05c9-49c3-405c533a0200@suse.de>
Date: Mon, 28 Oct 2024 14:45:01 +0100 (CET)
From: Michael Matz <matz@...e.de>
To: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
cc: Vlastimil Babka <vbabka@...e.cz>,
Andrew Morton <akpm@...ux-foundation.org>,
"Liam R. Howlett" <Liam.Howlett@...cle.com>, Jann Horn <jannh@...gle.com>,
Thorsten Leemhuis <regressions@...mhuis.info>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, Petr Tesarik <ptesarik@...e.com>,
Gabriel Krisman Bertazi <gabriel@...sman.be>,
Matthias Bodenbinder <matthias@...enbinder.de>, stable@...r.kernel.org,
Rik van Riel <riel@...riel.com>, Yang Shi <yang@...amperecomputing.com>
Subject: Re: [PATCH hotfix 6.12] mm, mmap: limit THP alignment of anonymous
 mappings to PMD-aligned sizes

Hello,

On Thu, 24 Oct 2024, Lorenzo Stoakes wrote:
> > benchmark seems to create many mappings of 4632kB, which would have
> > merged into a large THP-backed area before commit efa7df3e3bb5, and
> > now they are fragmented into multiple areas, each aligned to a PMD
> > boundary with gaps between them. The regression then seems to be
> > caused mainly by the benchmark's memory access pattern suffering from
> > TLB or cache aliasing due to the aligned boundaries of the individual
> > areas.
>
> Any more details on precisely why?
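
As an aside, the fragmentation arithmetic in the quoted paragraph is easy
to make concrete (a minimal sketch; the 4632kB figure is from the report,
the 2 MB PMD size is the usual x86-64 value):

  /* Sketch: a 4632 kB mapping is not a multiple of 2 MB, so PMD-aligning
   * each one leaves a gap to the next and the VMAs can never merge. */
  #include <stdio.h>

  #define KB       1024UL
  #define PMD_SIZE (2048 * KB)

  int main(void)
  {
          unsigned long len = 4632 * KB;
          unsigned long footprint = (len + PMD_SIZE - 1) & ~(PMD_SIZE - 1);

          printf("len=%lukB aligned footprint=%lukB gap=%lukB\n",
                 len / KB, footprint / KB, (footprint - len) / KB);
          /* -> len=4632kB aligned footprint=6144kB gap=1512kB */
          return 0;
  }
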
Everything we found out and theorized about is in the SUSE bug report. I
think the best theory is TLB aliasing: the mixing^Whash function in the
given hardware uses too few bits, and most of them from the low 21-12
bits of an address (exactly the bits that PMD-aligned mappings mostly
share at any given array offset). Of course it then still depends on the
particular access pattern. cactuBSSN has about 20 memory streams in the
hot loops, and the accesses are fairly regular from step to step (plus or
minus certain strides in 3D arrays). When the start addresses of the
streams differ only in the upper bits, you hit TLB aliasing from time to
time; when the dimensions/strides are just right, it occurs often, the
N-way associativity no longer saves you, and you hit it very, very hard.
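
To illustrate that theory with a hypothetical sketch (the set count and
index bits below are assumptions for illustration, not the actual hash of
any of the affected CPUs): suppose a TLB indexes its 512 sets purely by
address bits 20-12. Twenty PMD-aligned streams accessed at the same array
offset then all land in the same set, and no realistic associativity can
absorb that:

  /* Hypothetical: a TLB with 512 sets indexed only by address bits
   * 20-12.  All 2 MB (PMD) aligned streams accessed at the same offset
   * map to the same set. */
  #include <stdio.h>

  #define SET_BITS 9
  #define SET_MASK ((1UL << SET_BITS) - 1)

  static unsigned long tlb_set(unsigned long vaddr)
  {
          return (vaddr >> 12) & SET_MASK;   /* drop the 4 kB page offset */
  }

  int main(void)
  {
          unsigned long off = 0x12345;   /* same array offset in every stream */
          int i;

          for (i = 0; i < 20; i++) {     /* ~20 streams, as in cactuBSSN */
                  unsigned long base = 0x700000000000UL + i * (2UL << 20);
                  printf("stream %2d: addr=%#lx set=%lu\n",
                         i, base + off, tlb_set(base + off));
          }
          return 0;   /* every stream prints set 18: total aliasing */
  }
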
It was interesting to see how broad the range of CPUs and vendors
exhibiting the problem was (with varying degrees of severity, from 50% to
600% slowdown), and how more recent CPUs no longer show the symptom. I
guess the micro-arch guys eventually convinced P&R management that
hashing another bit or two is worth the silicon :-)

Ciao,
Michael.