Message-ID: <20170802154135.GI2311718@devbig577.frc2.facebook.com>
Date: Wed, 2 Aug 2017 08:41:35 -0700
From: Tejun Heo <tj@...nel.org>
To: Peter Zijlstra <peterz@...radead.org>
Cc: lizefan@...wei.com, hannes@...xchg.org, mingo@...hat.com,
longman@...hat.com, cgroups@...r.kernel.org,
linux-kernel@...r.kernel.org, kernel-team@...com, pjt@...gle.com,
luto@...capital.net, efault@....de, torvalds@...ux-foundation.org,
guro@...com
Subject: Re: [PATCH 2/2] sched: Implement interface for cgroup unified
hierarchy

Hello, Peter.

On Tue, Aug 01, 2017 at 11:40:38PM +0200, Peter Zijlstra wrote:
> > * On cgroup2, there is only one hierarchy. It'd be great to have
> > basic resource accounting enabled by default on all cgroups. Note
> > that we couldn't do that on v1 because there could be any number of
> > hierarchies and the cost would increase with the number of
> > hierarchies.
>
> Yes, the whole single hierarchy thing makes doing away with the double
> accounting possible.

Yeah, we can either do that or make it cheaper so that we can have
basic stats by default.

> > * It is bothersome that we're walking up the tree each time for
> > cpuacct although being percpu && just walking up the tree makes it
> > relatively cheap.
>
> So even if its only CPU local accounting, you still have all the pointer
> chasing and misses, not to mention that a faster O(depth) is still
> O(depth).
>
> > Anyways, I'm thinking about shifting the
> > aggregation to the reader side so that the hot path always only
> > updates local counters in a way which can scale even when there are
> > a lot of (idle) cgroups. Will follow up on this later.
>
> Not entirely sure I follow, we currently only update the current cgroup
> and its immediate parents, no? Or are you looking to only account into
> the current cgroup and propagate into the parents on reading?

Yeah, shifting the cost to the readers and being smart with
propagation so that reading isn't O(nr_descendants) but
O(nr_descendants_which_have_run_since_last_read). That way, we can
show the basic stats with reasonable scalability without taxing the
hot paths.

I have a couple of questions about cpuacct, though.

* The stat file is sampling-based while the usage files are calculated
from actual scheduling events. Is this because the latter is more
accurate?

* Why do we have a user/sys breakdown in the usage numbers? It tries
to distinguish user from sys by looking at task_pt_regs(). I can't
see how this would work (e.g. interrupt handlers never schedule) and
w/o kernel preemption the sys part is always zero. What is this
number supposed to mean?

Thanks.

--
tejun