linux-kernel - Re: [RFC 15/16] sched/fair: Account kthread runtime debt for CFS bandwidth


Message-ID: <Yd83iDzoUOWPB6yH@slm.duckdns.org>
Date:   Wed, 12 Jan 2022 10:18:16 -1000
From:   Tejun Heo <tj@...nel.org>
To:     Daniel Jordan <daniel.m.jordan@...cle.com>
Cc:     Peter Zijlstra <peterz@...radead.org>,
        Alexander Duyck <alexanderduyck@...com>,
        Alex Williamson <alex.williamson@...hat.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Ben Segall <bsegall@...gle.com>,
        Cornelia Huck <cohuck@...hat.com>,
        Dan Williams <dan.j.williams@...el.com>,
        Dave Hansen <dave.hansen@...ux.intel.com>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Herbert Xu <herbert@...dor.apana.org.au>,
        Ingo Molnar <mingo@...hat.com>,
        Jason Gunthorpe <jgg@...dia.com>,
        Johannes Weiner <hannes@...xchg.org>,
        Josh Triplett <josh@...htriplett.org>,
        Michal Hocko <mhocko@...e.com>, Nico Pache <npache@...hat.com>,
        Pasha Tatashin <pasha.tatashin@...een.com>,
        Steffen Klassert <steffen.klassert@...unet.com>,
        Steve Sistare <steven.sistare@...cle.com>,
        Tim Chen <tim.c.chen@...ux.intel.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        linux-mm@...ck.org, kvm@...r.kernel.org,
        linux-kernel@...r.kernel.org, linux-crypto@...r.kernel.org
Subject: Re: [RFC 15/16] sched/fair: Account kthread runtime debt for CFS
 bandwidth

Hello,

On Tue, Jan 11, 2022 at 11:29:50AM -0500, Daniel Jordan wrote:
...
> This problem arises with multithreaded jobs, but is also an issue in other
> places.  CPU activity from async memory reclaim (kswapd, cswapd?[5]) should be
> accounted to the cgroup that the memory belongs to, and similarly CPU activity
> from net rx should be accounted to the task groups that correspond to the
> packets being received.  There are also vague complaints from Android[6].

These are pretty big holes in CPU cycle accounting right now, and I think
spend-first-and-backcharge is the right solution for most of them, given
experience with other controllers. That said,

> Each use case has its own requirements[7].  In padata and reclaim, the task
> group to account to is known ahead of time, but net rx has to spend cycles
> processing a packet before its destination task group is known, so any solution
> should be able to work without knowing the task group in advance.  Furthermore,
> the CPU controller shouldn't throttle reclaim or net rx in real time since both
> are doing high priority work.  These make approaches that run kthreads directly
> in a task group, like cgroup-aware workqueues[8] or a kernel path for
> CLONE_INTO_CGROUP, infeasible.  Running kthreads directly in cgroups also has a
> downside for padata because helpers' MAX_NICE priority is "shadowed" by the
> priority of the group entities they're running under.
> 
> The proposed solution of remote charging can accrue debt to a task group to be
> paid off or forgiven later, addressing all these issues.  A kthread calls the
> interface
> 
>     void cpu_cgroup_remote_begin(struct task_struct *p,
>                                  struct cgroup_subsys_state *css);
> 
> to begin remote charging to @css, causing @p's current sum_exec_runtime to be
> updated and saved.  The @css arg isn't required and can be removed later to
> facilitate the unknown cgroup case mentioned above.  Then the kthread calls
> another interface
> 
>     void cpu_cgroup_remote_charge(struct task_struct *p,
>                                   struct cgroup_subsys_state *css);
> 
> to account the sum_exec_runtime that @p has used since the first call.
> Internally, a new field cfs_bandwidth::debt is added to keep track of unpaid
> debt that's only used when the debt exceeds the quota in the current period.
> 
> Weight-based control isn't implemented for now since padata helpers run at
> MAX_NICE and so always yield to anything higher priority, meaning they would
> rarely compete with other task groups.
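
For readers following the quoted proposal, the begin/charge/debt flow can be
modeled in user space roughly as below. This is only a sketch under stated
assumptions, not code from the patch: struct bw_model, struct task_model,
remote_begin(), remote_charge() and period_refill() are hypothetical names,
and the rule that a refill pays down debt before granting new runtime is an
assumed reading of "paid off ... later". Only the ideas of snapshotting
sum_exec_runtime and keeping a debt field come from the quoted text.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical user-space model; NOT the kernel's struct cfs_bandwidth. */
struct bw_model {
	uint64_t quota;    /* runtime granted per period, in ns */
	uint64_t runtime;  /* runtime left in the current period, in ns */
	uint64_t debt;     /* remote charges not yet paid off, in ns */
};

/* Hypothetical stand-in for the fields of task_struct used here. */
struct task_model {
	uint64_t sum_exec_runtime;  /* total ns the task has run */
	uint64_t remote_start;      /* snapshot taken when charging begins */
};

/* Begin remote charging: remember how much the task has run so far. */
static void remote_begin(struct task_model *p)
{
	p->remote_start = p->sum_exec_runtime;
}

/*
 * Charge the runtime used since the last snapshot. Whatever does not fit
 * into the runtime left in this period is recorded as debt.
 */
static void remote_charge(struct task_model *p, struct bw_model *b)
{
	assert(p->sum_exec_runtime >= p->remote_start);
	uint64_t delta = p->sum_exec_runtime - p->remote_start;

	p->remote_start = p->sum_exec_runtime;
	if (delta <= b->runtime) {
		b->runtime -= delta;
	} else {
		b->debt += delta - b->runtime;
		b->runtime = 0;
	}
}

/* Assumed policy: a new period pays down debt first, then grants the rest. */
static void period_refill(struct bw_model *b)
{
	uint64_t pay = b->debt < b->quota ? b->debt : b->quota;

	b->debt -= pay;
	b->runtime = b->quota - pay;
}
```

In this model, a helper that runs 250 ns against a group with 100 ns of
quota leaves 150 ns of debt, which takes two refills to clear. The helper is
never throttled while it runs; only the group's later periods shrink, which
is the spend-first-and-backcharge behavior discussed in this thread.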

If we're gonna do this, let's please do it right and make weight-based
control work too. Otherwise, its usefulness is pretty limited.

Thanks.

-- 
tejun