Message-ID: <Zn8FSqqoLIIB/LnG@tpad>
Date: Fri, 28 Jun 2024 15:47:38 -0300
From: Marcelo Tosatti <mtosatti@...hat.com>
To: Leonardo Bras <leobras@...hat.com>
Cc: Boqun Feng <boqun.feng@...il.com>, Vlastimil Babka <vbabka@...e.cz>,
Johannes Weiner <hannes@...xchg.org>,
Michal Hocko <mhocko@...nel.org>,
Roman Gushchin <roman.gushchin@...ux.dev>,
Shakeel Butt <shakeel.butt@...ux.dev>,
Muchun Song <muchun.song@...ux.dev>,
Andrew Morton <akpm@...ux-foundation.org>,
Christoph Lameter <cl@...ux.com>, Pekka Enberg <penberg@...nel.org>,
David Rientjes <rientjes@...gle.com>,
Joonsoo Kim <iamjoonsoo.kim@....com>,
Hyeonggon Yoo <42.hyeyoo@...il.com>,
Thomas Gleixner <tglx@...utronix.de>,
Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...hat.com>, Will Deacon <will@...nel.org>,
Waiman Long <longman@...hat.com>, linux-kernel@...r.kernel.org,
cgroups@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [RFC PATCH v1 0/4] Introduce QPW for per-cpu operations
On Mon, Jun 24, 2024 at 11:57:57PM -0300, Leonardo Bras wrote:
> On Mon, Jun 24, 2024 at 03:54:14PM -0700, Boqun Feng wrote:
> > On Mon, Jun 24, 2024 at 09:31:51AM +0200, Vlastimil Babka wrote:
> > > Hi,
> > >
> > > you've included tglx, which is great, but there's also LOCKING PRIMITIVES
> > > section in MAINTAINERS so I've added folks from there in my reply.
> >
> > Thanks!
> >
> > > Link to full series:
> > > https://lore.kernel.org/all/20240622035815.569665-1-leobras@redhat.com/
> > >
> >
> > And apologies to Leonardo... I think this is a follow-up of:
> >
> > https://lpc.events/event/17/contributions/1484/
> >
> > and I do remember we had a quick chat after that in which I suggested it
> > would be better to change to a different name; sorry that I never found
> > time to write a proper reply to your previous series [1] as promised.
> >
> > [1]: https://lore.kernel.org/lkml/20230729083737.38699-2-leobras@redhat.com/
>
> That's correct, I commented on this at the end of the above presentation.
> Don't worry, and thanks for suggesting the per-cpu naming; it was very
> helpful in designing this solution.
>
> >
> > > On 6/22/24 5:58 AM, Leonardo Bras wrote:
> > > > The problem:
> > > > Some places in the kernel implement a parallel programming strategy
> > > > consisting of local_locks() for most of the work, while some rare remote
> > > > operations are scheduled on the target cpu. This keeps cache bouncing low,
> > > > since the cacheline tends to stay mostly local, and avoids the cost of
> > > > locks in non-RT kernels, even though the very few remote operations will
> > > > be expensive due to scheduling overhead.
> > > >
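(For readers new to the pattern, a rough sketch of what this looks like;
the names and structure are made up purely for illustration and are not
taken from any particular call site:)

struct my_pcpu {
	local_lock_t		lock;
	struct work_struct	remote_work;	/* assumed INIT_WORK()ed at init */
	long			counter;
};
static DEFINE_PER_CPU(struct my_pcpu, my_pcpu) = {
	.lock = INIT_LOCAL_LOCK(lock),
};

/* Fast path, always on the local CPU: on !PREEMPT_RT, local_lock()
 * only disables preemption, so this stays cheap. */
static void local_update(long val)
{
	local_lock(&my_pcpu.lock);
	this_cpu_add(my_pcpu.counter, val);
	local_unlock(&my_pcpu.lock);
}

/* Rare remote operation: queued with schedule_work_on(), so by the
 * time this runs, the "remote" data is local to the executing CPU. */
static void remote_work_fn(struct work_struct *work)
{
	local_lock(&my_pcpu.lock);
	this_cpu_write(my_pcpu.counter, 0);
	local_unlock(&my_pcpu.lock);
}

static void drain_cpu(int cpu)
{
	schedule_work_on(cpu, &per_cpu_ptr(&my_pcpu, cpu)->remote_work);
}
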
> > > > On the other hand, for RT workloads this can represent a problem: getting
> > > > an important workload scheduled out to deal with remote requests is
> > > > sure to introduce unexpected deadline misses.
> > > >
> > > > The idea:
> > > > Currently with PREEMPT_RT=y, local_locks() become per-cpu spinlocks.
> > > > In this case, instead of scheduling work on a remote cpu, it should
> > > > be safe to grab that remote cpu's per-cpu spinlock and run the required
> > > > work locally. The major cost, the un/locking in every local function,
> > > > is already paid in PREEMPT_RT.
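(Rough sketch of the idea only, deliberately not using whatever helpers the
series actually introduces: on PREEMPT_RT the per-cpu lock is effectively a
spinlock, so the remote CPU's instance can be taken directly and the work
done in place:)

/* RT-flavoured variant of the structure above: the per-cpu lock is an
 * ordinary spinlock (roughly what local_lock_t amounts to on PREEMPT_RT),
 * assumed spin_lock_init()ed at boot. */
struct my_pcpu_rt {
	spinlock_t	lock;
	long		counter;
};
static DEFINE_PER_CPU(struct my_pcpu_rt, my_pcpu_rt);

/* Fast path: the same un/locking cost an RT kernel already pays today. */
static void local_update_rt(long val)
{
	struct my_pcpu_rt *p;

	migrate_disable();
	p = this_cpu_ptr(&my_pcpu_rt);
	spin_lock(&p->lock);
	p->counter += val;
	spin_unlock(&p->lock);
	migrate_enable();
}

/* Remote operation: lock the target CPU's instance and do the work
 * right here, instead of queueing it to run on that CPU. */
static void drain_cpu_remote(int cpu)
{
	struct my_pcpu_rt *p = per_cpu_ptr(&my_pcpu_rt, cpu);

	spin_lock(&p->lock);
	p->counter = 0;
	spin_unlock(&p->lock);
}
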
> > >
> > > I've also noticed this a while ago (likely in the context of rewriting SLUB
> > > to use local_lock) and asked about it on IRC, and IIRC tglx wasn't fond of
> > > the idea. But I forgot the details about why, so I'll let the locking
> > > experts reply...
> > >
> >
> > I think it's a good idea, especially since the new name is less confusing ;-)
> > So I wonder about Thomas' thoughts as well.
>
> Thanks!
>
> >
> > And I think a few (micro-)benchmark numbers will help.
>
> Last year I got some numbers on how replacing local_locks with
> spinlocks would impact memcontrol.c cache operations:
>
> https://lore.kernel.org/all/20230125073502.743446-1-leobras@redhat.com/
>
> tl;dr: It increased clocks spent in the most common this_cpu operations,
> while reducing clocks spent in remote operations (drain_all_stock).
>
> In the RT case, since local locks are already spinlocks, this cost is
> already paid, so we can get results like these:
>
> drain_all_stock
> cpus   Upstream       Patched        Diff (cycles)   Diff(%)
> 1      44331.10831    38978.03581    -5353.072507    -12.07520567
> 8      43992.96512    39026.76654    -4966.198572    -11.2886198
> 128    156274.6634    58053.87421    -98220.78915    -62.85138425
>
> Upstream: clocks to schedule the work on the remote CPU (the work itself
> not accounted).
> Patched:  clocks to grab the remote cpu's spinlock and perform the needed
> work locally.
>
> Do you have other suggestions to use as (micro-) benchmarking?
>
> Thanks!
> Leo
One improvement noted when mm/page_alloc.c was converted to
spinlock + remote drain was that it can bypass waiting for the kworker
to be scheduled (on heavily loaded CPUs).
commit 443c2accd1b6679a1320167f8f56eed6536b806e
Author: Nicolas Saenz Julienne <nsaenzju@...hat.com>
Date:   Fri Jun 24 13:54:22 2022 +0100

    mm/page_alloc: remotely drain per-cpu lists

    Some setups, notably NOHZ_FULL CPUs, are too busy to handle the per-cpu
    drain work queued by __drain_all_pages(). So introduce a new mechanism to
    remotely drain the per-cpu lists. It is made possible by remotely locking
    'struct per_cpu_pages' new per-cpu spinlocks. A benefit of this new
    scheme is that drain operations are now migration safe.

    There was no observed performance degradation vs. the previous scheme.
    Both netperf and hackbench were run in parallel to triggering the
    __drain_all_pages(NULL, true) code path around ~100 times per second. The
    new scheme performs a bit better (~5%), although the important point here
    is there are no performance regressions vs. the previous mechanism.
    Per-cpu lists draining happens only in slow paths.

    Minchan Kim tested an earlier version and reported;

        My workload is not NOHZ CPUs but run apps under heavy memory
        pressure so they goes to direct reclaim and be stuck on
        drain_all_pages until work on workqueue run.

        unit: nanosecond
        max(dur)        avg(dur)                count(dur)
        166713013       487511.77786438033      1283

        From traces, system encountered the drain_all_pages 1283 times and
        worst case was 166ms and avg was 487us.

        The other problem was alloc_contig_range in CMA. The PCP draining
        takes several hundred millisecond sometimes though there is no
        memory pressure or a few of pages to be migrated out but CPU were
        fully booked.

        Your patch perfectly removed those wasted time.
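
To make the "no waiting for kworkers" point concrete, a rough sketch reusing
the hypothetical structures from the two snippets earlier in the thread
(again, not the actual page_alloc or QPW code):

/* Queued-work scheme: the caller has to wait until a kworker on every
 * target CPU gets around to running the drain work, which can take a
 * long time on heavily loaded (e.g. NOHZ_FULL) CPUs. */
static void drain_all_queued(void)
{
	int cpu;

	for_each_online_cpu(cpu)
		schedule_work_on(cpu, &per_cpu_ptr(&my_pcpu, cpu)->remote_work);
	for_each_online_cpu(cpu)
		flush_work(&per_cpu_ptr(&my_pcpu, cpu)->remote_work);
}

/* Remote-locking scheme: the caller drains every CPU itself, with no
 * dependency on the target CPUs' kworkers at all. */
static void drain_all_remote(void)
{
	int cpu;

	for_each_online_cpu(cpu)
		drain_cpu_remote(cpu);
}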