linux-kernel - Re: [RFC PATCH v1 00/57] Boot-time page size selection for arm64


Message-ID: <20241113152259.57983855@mordecai.tesarici.cz>
Date: Wed, 13 Nov 2024 15:22:59 +0100
From: Petr Tesarik <ptesarik@...e.com>
To: Ryan Roberts <ryan.roberts@....com>
Cc: Andrew Morton <akpm@...ux-foundation.org>, Anshuman Khandual
 <anshuman.khandual@....com>, Ard Biesheuvel <ardb@...nel.org>, Catalin
 Marinas <catalin.marinas@....com>, David Hildenbrand <david@...hat.com>,
 Greg Marsden <greg.marsden@...cle.com>, Ivan Ivanov <ivan.ivanov@...e.com>,
 Kalesh Singh <kaleshsingh@...gle.com>, Marc Zyngier <maz@...nel.org>, Mark
 Rutland <mark.rutland@....com>, Matthias Brugger <mbrugger@...e.com>,
 Miroslav Benes <mbenes@...e.cz>, Will Deacon <will@...nel.org>,
 linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org,
 linux-mm@...ck.org
Subject: Re: [RFC PATCH v1 00/57] Boot-time page size selection for arm64

On Wed, 13 Nov 2024 12:56:24 +0000
Ryan Roberts <ryan.roberts@....com> wrote:

> On 13/11/2024 12:40, Petr Tesarik wrote:
> > On Tue, 12 Nov 2024 11:50:39 +0100
> > Petr Tesarik <ptesarik@...e.com> wrote:
> >   
> >> On Tue, 12 Nov 2024 10:19:34 +0000
> >> Ryan Roberts <ryan.roberts@....com> wrote:
> >>  
> >>> On 12/11/2024 09:45, Petr Tesarik wrote:    
> >>>> On Mon, 11 Nov 2024 12:25:35 +0000
> >>>> Ryan Roberts <ryan.roberts@....com> wrote:
> >>>>       
> >>>>> Hi Petr,
> >>>>>
> >>>>> On 11/11/2024 12:14, Petr Tesarik wrote:      
> >>>>>> Hi Ryan,
> >>>>>>
> >>>>>> On Thu, 17 Oct 2024 13:32:43 +0100
> >>>>>> Ryan Roberts <ryan.roberts@....com> wrote:      
> >>>>> [...]      
> >>>>>> Third, a few micro-benchmarks saw a significant regression.
> >>>>>>
> >>>>>> Most notably, getenv and getenvT2 tests from libMicro were 18% and 20%
> >>>>>> slower with variable page size. I don't know why, but I'm looking into
> >>>>>> it. The system() library call was also about 18% slower, but that might
> >>>>>> be related.        
> >>>>>
> >>>>> OK, ouch. I think there are some things we can try to optimize the
> >>>>> implementation further. But I'll wait for your analysis before digging myself.      
> >>>>
> >>>> This turned out to be a false positive. The way this microbenchmark was
> >>>> invoked did not get enough samples, so it was mostly dependent on
> >>>> whether caches were hot or cold, and the timing on this specific system
> >>>> with the specific sequence of benchmarks in the suite happens to favour
> >>>> my baseline kernel.
> >>>>
> >>>> After increasing the batch count, I'm getting pretty much the same
> >>>> performance for 6.11 vanilla and patched kernels:
> >>>>
> >>>>                         prc thr   usecs/call      samples   errors cnt/samp 
> >>>> getenv (baseline)         1   1      0.14975           99        0   100000 
> >>>> getenv (patched)          1   1      0.14981           92        0   100000       
> >>>
> >>> Oh that's good news! Does this account for all 3 of the above tests (getenv,
> >>> getenvT2 and system())?    
> >>
> >> It does for getenvT2 (a variant of the test with 2 threads), but not
> >> for system. Thanks for asking, I forgot about that one.
> >>
> >> I'm getting a substantial difference there (+29% on average over 100 runs):
> >>
> >>                         prc thr   usecs/call      samples   errors cnt/samp  command
> >> system (baseline)         1   1   6937.18016          102        0      100     A=$$
> >> system (patched)          1   1   8959.48032          102        0      100     A=$$
> >>
> >> So, yeah, this should in fact be my priority #1.  
> > 
> > Further testing reveals the workload is bimodal, that is to say the
> > distribution of results has two peaks. The first peak around 3.2 ms
> > covers 30% of runs, the second peak around 15.7 ms covers 11%. Two per
> > cent are faster than the fast peak, 5% are slower than the slow peak,
> > and the rest is distributed almost evenly between them.  
> 
> FWIW, one source of bimodality I've seen on Ampere systems with 2 NUMA nodes is
> placement of the kernel image vs placement of the running thread. If they are
> remote from each other, you'll see a slowdown. I've hacked this source away in
> the past by effectively using only a single NUMA node (with the help of
> 'maxcpus' and 'mem' kernel cmdline options).

This system has only one NUMA node. But your comment leads in the right
direction. CPU placement does play a role here.

I can consistently get the fast results if I pin the benchmark process
to a single CPU core, or more generally to a CPU set which shares the
L2 cache (as found on eMAG). But the scheduler only considers LLC,
which (with CONFIG_SCHED_CLUSTER=y) follows the complex affinity of the
SLC.
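
For illustration only, here is a minimal sketch of how one could look up
which CPUs share an L2 cache with a given CPU. It assumes the usual sysfs
cacheinfo layout (cpuN/cache/indexM/level and shared_cpu_list); the helper
names parse_cpu_list() and l2_siblings() are made up for this example and
this is not necessarily how the CPU sets were determined for the runs
above.

```python
import os

def parse_cpu_list(text):
    """Parse a kernel CPU list such as '0-1,4' into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus

def l2_siblings(cpu, sysfs="/sys/devices/system/cpu"):
    """Return the CPUs sharing a level-2 cache with `cpu`, or None if unknown.

    The sysfs path is an assumption about the running system; if the
    cacheinfo directory is missing, the function simply returns None.
    """
    cache_dir = os.path.join(sysfs, f"cpu{cpu}", "cache")
    try:
        entries = sorted(os.listdir(cache_dir))
    except OSError:
        return None
    for entry in entries:
        if not entry.startswith("index"):
            continue
        base = os.path.join(cache_dir, entry)
        try:
            with open(os.path.join(base, "level")) as f:
                if f.read().strip() != "2":
                    continue
            with open(os.path.join(base, "shared_cpu_list")) as f:
                return parse_cpu_list(f.read())
        except OSError:
            continue
    return None
```

The result of l2_siblings(n) can then be used as the CPU set for pinning.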

Long story short, without explicit affinity, the scheduler may place a
forked child onto a CPU with a cold L2 cache, which harms short-lived
processes (like the ones created by this benchmark).
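
A rough way to reproduce the effect outside libMicro, as a sketch under
stated assumptions: time repeated system() calls with and without an
explicit affinity mask. The function name time_system() and its
parameters are hypothetical; the sketch relies on the affinity mask being
inherited by forked children, and it is not the harness that produced the
numbers quoted above.

```python
import os
import time

def time_system(command, runs=100, cpus=None):
    """Return per-call wall-clock times (seconds) of os.system(command).

    If `cpus` is given, the calling process is pinned to that CPU set for
    the duration of the measurement; children created by os.system()
    inherit the mask. The original mask is restored afterwards.
    """
    saved = os.sched_getaffinity(0)
    if cpus is not None:
        os.sched_setaffinity(0, cpus)
    try:
        samples = []
        for _ in range(runs):
            start = time.perf_counter()
            os.system(command)
            samples.append(time.perf_counter() - start)
        return samples
    finally:
        os.sched_setaffinity(0, saved)
```

For example, time_system("A=$$", cpus={0}) can be compared against
time_system("A=$$") to see whether the slow peak disappears when the
process is pinned. Running the benchmark under taskset(1) should have the
same effect.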

Now it all makes sense and it is totally unrelated to dynamic page size
selection. :-)

Petr T