linux-kernel - Re: [PATCH v2 0/5] Introduce memcg_stock

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <Y+AIOQy0HdVXCw8m@P9FQF9L96D>
Date:   Sun, 5 Feb 2023 11:49:13 -0800
From:   Roman Gushchin <roman.gushchin@...ux.dev>
To:     Leonardo Brás <leobras@...hat.com>
Cc:     Michal Hocko <mhocko@...e.com>,
        Marcelo Tosatti <mtosatti@...hat.com>,
        Johannes Weiner <hannes@...xchg.org>,
        Shakeel Butt <shakeelb@...gle.com>,
        Muchun Song <muchun.song@...ux.dev>,
        Andrew Morton <akpm@...ux-foundation.org>,
        cgroups@...r.kernel.org, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2 0/5] Introduce memcg_stock_pcp remote draining

Hi Leonardo!

> Yes, but we are exchanging an "always schedule_work_on()", which is a kind of
> contention, for a "sometimes we hit spinlock contention".
> 
> For the spinlock proposal, on the local cpu side, the *worst case* contention
> is:
> 1 - wait the spin_unlock() for a complete <percpu cache drain process>,
> 2 - wait a cache hit for local per-cpu cacheline 
> 
> What is current implemented (schedule_work_on() approach), for the local
> cpu side there is *always* this contention:
> 1 - wait for a context switch,
> 2 - wait a cache hit from it's local per-cpu cacheline,
> 3 - wait a complete <percpu cache drain process>, 
> 4 - then for a new context switch to the current thread.

I think both Michal and me are thinking of a more generic case in which the cpu
is not exclusively consumed by 1 special process, so that the draining work can
be executed during an idle time. In this case the work is basically free.

And the introduction of a spin_lock() on the hot path is what we're are concerned
about. I agree, that on some hardware platforms it won't be that expensive, but
in general not having any spinlocks is so much better.

> 
> So moving from schedule_work_on() to spinlocks will save 2 context switches per
> cpu every time drain_all_stock() is called.
> 
> On the remote cpu side, my tests point that doing the remote draining is faster
> than scheduling a local draining, so it's also a gain.
> 
> Also, IIUC the possible contention in the spinlock approach happens only on
> page-faulting and syscalls, versus the schedule_work_on() approach that can
> interrupt user workload at any time. 
> 
> In fact, not interrupting the user workload in isolated cpus is just a bonus of
> using spinlocks.

I believe it significantly depends on the preemption model: you're right regarding
fully preemptive kernels, but with voluntary/none preemption it's exactly opposite:
the draining work will be executed at some point later (probably with 0 cost),
while the remote access from another cpu will potentially cause delays on the
spin lock as well as a need to refill the stock.

Overall I'd expect a noticeable performance regression from an introduction of
spin locks and remote draining. Maybe not on all platforms, but at least on some.
That's my main concern. And I don't think the problem we're aiming to solve here
justifies this potential regression.

Thanks!