Date:   Wed, 26 Feb 2020 11:13:36 +0800
From:   Aaron Lu <aaron.lwe@...il.com>
To:     Vineeth Remanan Pillai <vpillai@...italocean.com>
Cc:     Aubrey Li <aubrey.intel@...il.com>,
        Tim Chen <tim.c.chen@...ux.intel.com>,
        Julien Desfossez <jdesfossez@...italocean.com>,
        Nishanth Aravamudan <naravamudan@...italocean.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...nel.org>,
        Thomas Gleixner <tglx@...utronix.de>,
        Paul Turner <pjt@...gle.com>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        Linux List Kernel Mailing <linux-kernel@...r.kernel.org>,
        Dario Faggioli <dfaggioli@...e.com>,
        Frédéric Weisbecker <fweisbec@...il.com>,
        Kees Cook <keescook@...omium.org>,
        Greg Kerr <kerrnel@...gle.com>, Phil Auld <pauld@...hat.com>,
        Valentin Schneider <valentin.schneider@....com>,
        Mel Gorman <mgorman@...hsingularity.net>,
        Pawan Gupta <pawan.kumar.gupta@...ux.intel.com>,
        Paolo Bonzini <pbonzini@...hat.com>
Subject: Re: [RFC PATCH v4 00/19] Core scheduling v4

On Tue, Feb 25, 2020 at 03:51:37PM -0500, Vineeth Remanan Pillai wrote:
> Hi Aaron,
> We tried reproducing this with a sample script here:
> https://gist.github.com/vineethrp/4356e66694269d1525ff254d7f213aef

Nice script.

> But the set1 cgroup processes always get their share of cpu time in
> our test. Could you please verify if it's the same test that you were
> also doing? The only difference is that we run on a 2-socket,
> 16-core/32-thread bare metal machine, using only socket 0. We also
> tried threads instead of processes, but the results are the same.

Sorry for missing one detail: I always start the noise workload first,
and then start the real workload. This is critical for this test, since
only then can the noise workload occupy all CPUs and present a
challenge to the load balancer when it balances the real workload's
tasks. If both workloads are started at the same time, the initial task
placement might mitigate the problem.

BTW, your script gives 12 cores/24 CPUs to the workloads, and cgA
spawns 16 tasks while cgB spawns 32. This is an even more complex
scenario to test, since the real workload already has more tasks than
there are available cores. Perhaps just starting 12 tasks for cgA and
24 tasks for cgB is enough for now. As for the start sequence, simply
sleep 5 seconds after the cgB workload is started, then start cgA (see
the sketch below). I have left a comment on the script's gist page.
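
To be concrete, the sequence I have in mind is roughly the following.
This is just a sketch, not what I actually ran; it assumes your script
has already created the cgA/cgB groups in the cpu and cpuset
hierarchies and set cpu.shares to 10240/2, that cgroup-tools' cgexec is
available (otherwise echo the PIDs into the tasks files), and
sysbench 1.0 style options -- adjust to taste:

  # noise workload first: 24 sysbench cpu threads in cgB (cpu.shares=2)
  cgexec -g cpu,cpuset:cgB sysbench --threads=24 --time=60 cpu run &

  # give the noise tasks time to spread out over all 24 CPUs
  sleep 5

  # then the real workload: 12 sysbench cpu threads in cgA (cpu.shares=10240)
  cgexec -g cpu,cpuset:cgA sysbench --threads=12 --time=60 cpu run &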

> 
> 
> > On a 2sockets/16cores/32threads VM, I grouped 8 sysbench(cpu mode)
> > threads into one cgroup(cgA) and another 16 sysbench(cpu mode) threads
> > into another cgroup(cgB). cgA and cgB's cpusets are set to the same
> > socket's 8 cores/16 CPUs and cgA's cpu.shares is set to 10240 while cgB's
> > cpu.shares is set to 2(so consider cgB as noise workload and cgA as
> > the real workload).
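
For reference, the setup described above amounts to roughly the
following. This is a sketch only, assuming cgroup v1 and that the first
socket's CPUs happen to be 0-7,16-23 on this VM -- the actual CPU list
depends on the topology:

  # confine both cgroups to one socket's 8 cores/16 CPUs
  mkdir -p /sys/fs/cgroup/cpuset/cgA /sys/fs/cgroup/cpuset/cgB
  for g in cgA cgB; do
      echo 0-7,16-23 > /sys/fs/cgroup/cpuset/$g/cpuset.cpus
      echo 0         > /sys/fs/cgroup/cpuset/$g/cpuset.mems
  done

  # cgA = real workload (huge weight), cgB = noise (tiny weight)
  mkdir -p /sys/fs/cgroup/cpu/cgA /sys/fs/cgroup/cpu/cgB
  echo 10240 > /sys/fs/cgroup/cpu/cgA/cpu.shares
  echo 2     > /sys/fs/cgroup/cpu/cgB/cpu.shares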
> >
> > I had expected cgA to occupy 8 cpus(with each cpu on a different core)
> 
> The expected behaviour could also be that the 8 processes share 4
> cores and 8 hw threads, right? This is what we are seeing mostly.
> 
> > most of the time since it has way more weight than cgB, while cgB should
> > occupy almost no CPUs since:
> >  - when cgB's task is in the same CPU queue as cgA's task, then cgB's
> >    task is given very little CPU due to its small weight;
> >  - when cgB's task is in a CPU queue whose sibling's queue has cgA's
> >    task, cgB's task should be forced idle(again, due to its small weight).
> >
> We are seeing cgA taking half the cores and cgB taking the other half
> of the cores. Looks like the scheduler ultimately groups each cgroup's
> tasks onto its own cores.
> 
> 
> >
> > But testing shows cgA occupies only 2 cpus during the entire run while
> > cgB enjoys the remaining 14 cpus. As a comparison, when coresched is off,
> > cgA can occupy 8 cpus during its run.
> >
> Not sure why we are not able to reproduce this. I have a quick patch
> which might fix this. The idea is that we allow migration if p's
> hierarchical load or estimated utilization is higher than that of
> dest_rq->curr. While thinking about this fix, I noticed that we are
> not holding the dest_rq lock in any of the migration patches. The
> migration patches would probably need a rework. Attaching my patch
> below, but it also does not take the dest_rq lock. I have also added a
> case for the dest core being forced idle; I think that would be an
> opportunity to migrate. Ideally we should check whether the forced
> idle task has the same cookie as p.
> 
> https://gist.github.com/vineethrp/887743608f42a6ce96bf7847b5b119ae

Is this on top of Aubrey's coresched_v4-v5.5.2 branch?
