linux-kernel - Re: [PATCH 2/2] sched: Implement interface for cgroup unified hierarchy

Message-ID: <20170802160511.cb7yl65t4nmctf3y@hirez.programming.kicks-ass.net>
Date:   Wed, 2 Aug 2017 18:05:11 +0200
From:   Peter Zijlstra <peterz@...radead.org>
To:     Tejun Heo <tj@...nel.org>
Cc:     lizefan@...wei.com, hannes@...xchg.org, mingo@...hat.com,
        longman@...hat.com, cgroups@...r.kernel.org,
        linux-kernel@...r.kernel.org, kernel-team@...com, pjt@...gle.com,
        luto@...capital.net, efault@....de, torvalds@...ux-foundation.org,
        guro@...com
Subject: Re: [PATCH 2/2] sched: Implement interface for cgroup unified
 hierarchy

On Wed, Aug 02, 2017 at 08:41:35AM -0700, Tejun Heo wrote:
> > Not entirely sure I follow, we currently only update the current cgroup
> > and its immediate parents, no? Or are you looking to only account into
> > the current cgroup and propagate into the parents on reading?
> 
> Yeah, shifting the cost to the readers and being smart with
> propagation so that reading isn't O(nr_descendants) but
> O(nr_descendants_which_have_run_since_last_read).  That way, we can
> show the basic stats without taxing the hot paths with reasonable
> scalability.

Right, that would be good.
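
To make the read-side propagation idea above concrete, here is a minimal single-threaded sketch. It is not code from the patch series: struct cg, cg_charge(), cg_flush() and cg_read() are hypothetical names, and the per-CPU data and locking a real implementation would need are left out. Charging only touches the local counter and marks the path to the root as dirty; a read walks only the dirty part of the subtree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_CHILDREN 8	/* arbitrary limit for this sketch */

/* Hypothetical stand-in for a cgroup's usage counters. */
struct cg {
	struct cg *parent;
	struct cg *children[MAX_CHILDREN];
	size_t nr_children;
	uint64_t pending_ns;	/* not yet folded into total_ns */
	uint64_t total_ns;	/* own + descendants, as of the last flush */
	bool dirty;		/* this node or a descendant has pending time */
};

static void cg_add_child(struct cg *parent, struct cg *child)
{
	assert(parent->nr_children < MAX_CHILDREN);
	child->parent = parent;
	parent->children[parent->nr_children++] = child;
}

/*
 * Hot path: account locally, then mark ancestors dirty.  The walk stops
 * at the first node that is already dirty, because a dirty node implies
 * that all of its ancestors are dirty too.
 */
static void cg_charge(struct cg *cg, uint64_t delta_ns)
{
	cg->pending_ns += delta_ns;
	for (; cg && !cg->dirty; cg = cg->parent)
		cg->dirty = true;
}

/*
 * Read path: fold pending time from dirty descendants into this node,
 * then hand the folded amount to the parent as pending time.  Clean
 * subtrees are skipped without being walked.
 */
static void cg_flush(struct cg *cg)
{
	size_t i;

	if (!cg->dirty)
		return;

	for (i = 0; i < cg->nr_children; i++)
		cg_flush(cg->children[i]);

	cg->total_ns += cg->pending_ns;
	if (cg->parent)
		cg->parent->pending_ns += cg->pending_ns;
	cg->pending_ns = 0;
	cg->dirty = false;
}

static uint64_t cg_read(struct cg *cg)
{
	cg_flush(cg);
	return cg->total_ns;
}
```

In this sketch the cost of cg_read() depends on how many descendants were charged since the last read, not on the size of the subtree, while cg_charge() usually stops after one or two levels once the path is marked.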

> I have a couple questions about cpuacct tho.
> 
> * The stat file is sampling based and the usage files are calculated
>   from actual scheduling events.  Is this because the latter is more
>   accurate?

So I actually don't know the history of this stuff too well. But I would
think so. This all looks rather dodgy.

> * Why do we have user/sys breakdown in usage numbers?  It tries to
>   distinguish user or sys by looking at task_pt_regs().  I can't see
>   how this would work (e.g. interrupt handlers never schedule) and w/o
>   kernel preemption, the sys part is always zero.  What is this number
>   supposed to mean?

For normal scheduler stuff we account the total runtime in ns and use
the user/kernel tick samples to divide it into user/kernel time parts.
See cputime_adjust().
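
Roughly, the idea is the following. This is a simplified sketch of the proportional split only, not the kernel's cputime_adjust(): split_runtime() and struct cputime_split are hypothetical names, the monotonicity handling is left out, and unsigned __int128 assumes GCC or Clang.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type for this sketch. */
struct cputime_split {
	uint64_t utime_ns;
	uint64_t stime_ns;
};

/*
 * Divide a precise total runtime (ns) into user and system parts using
 * the ratio of tick samples.  The ticks only supply the ratio; the sum
 * of the two parts always equals rtime_ns.
 */
static struct cputime_split split_runtime(uint64_t rtime_ns,
					  uint64_t user_ticks,
					  uint64_t sys_ticks)
{
	struct cputime_split out = { 0, 0 };
	uint64_t total;

	/* The tick sum must not wrap. */
	assert(user_ticks <= UINT64_MAX - sys_ticks);
	total = user_ticks + sys_ticks;

	if (total == 0) {
		/* No samples at all: this sketch calls it all user time. */
		out.utime_ns = rtime_ns;
		return out;
	}

	/* 128-bit intermediate so rtime_ns * sys_ticks cannot overflow. */
	out.stime_ns = (uint64_t)(((unsigned __int128)rtime_ns * sys_ticks) / total);
	out.utime_ns = rtime_ns - out.stime_ns;
	return out;
}
```

For example, 1000 ns of runtime with 3 user ticks and 1 system tick comes out as 750 ns user and 250 ns system.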

But looking at cpuacct I have no clue; that looks wonky at best.

Ideally we'd reuse the normal cputime code and do the same thing
per-cgroup, but clearly that isn't happening now.

I never really looked further than noticing that cpuacct_charge() does
_another_ cgroup iteration, even though we already account that delta to
each cgroup (modulo scheduling class crud).