Date:	Mon, 4 Apr 2016 15:31:50 -0400
From:	Chris Metcalf <cmetcalf@...lanox.com>
To:	Rik van Riel <riel@...hat.com>,
	Frederic Weisbecker <fweisbec@...il.com>,
	Christoph Lameter <cl@...ux.com>,
	Ingo Molnar <mingo@...nel.org>,
	Luiz Capitulino <lcapitulino@...hat.com>,
	Peter Zijlstra <peterz@...radead.org>,
	Thomas Gleixner <tglx@...utronix.de>,
	Viresh Kumar <viresh.kumar@...aro.org>,
	<linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] nohz_full: Make sched_should_stop_tick() more
 conservative

On 4/4/2016 3:12 PM, Rik van Riel wrote:
> On Fri, 2016-04-01 at 15:42 -0400, Chris Metcalf wrote:
>> On arm64, when calling enqueue_task_fair() from migration_cpu_stop(),
>> we find the nr_running value updated by add_nr_running(), but the
>> cfs.nr_running value has not always yet been updated.  Accordingly,
>> the sched_can_stop_tick() falsely returns true when we are migrating a
>> second task onto a core.
> I don't get it.
>
> Looking at the enqueue_task_fair(), I see this:
>
>          for_each_sched_entity(se) {
>                  cfs_rq = cfs_rq_of(se);
>                  cfs_rq->h_nr_running++;
> 		...
> 	}
>
>          if (!se)
>                  add_nr_running(rq, 1);
>
> What is the difference between cfs_rq->h_nr_running,
> and rq->cfs.nr_running?
>
> Why do we have two?
> Are we simply testing against the wrong one in
> sched_can_stop_tick?

It seems that using the non-CFS count (rq->nr_running) is what we want.
I don't know whether using a different CFS count instead might be more
correct.

Since I'm not sure what causes the difference I see between tile (correct)
and arm64 (incorrect), it's hard for me to speculate.

>> Correct this by using rq->nr_running instead of rq->cfs.nr_running.
>> This should always be more conservative, and reverts the test to the
>> form it had before commit 76d92ac305f2 ("sched: Migrate sched to use
>> new tick dependency mask model").
> That would cause us to run the timer tick while running
> a single SCHED_RR real time task, with a single
> SCHED_OTHER task sitting in the background (which will
> not get run until the SCHED_RR task is done).

No, because in sched_can_stop_tick(), we first handle the special
cases of RR or FIFO tasks present.  For example, RR:

         if (rq->rt.rr_nr_running) {
                 if (rq->rt.rr_nr_running == 1)
                         return true;
                 else
                         return false;
         }

Once we see there are any RR tasks running, the return value
ignores any possible SCHED_OTHER tasks.  Only after the code
concludes there are no RR/FIFO tasks do we even look at
the overall nr_running value.
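
To make the ordering concrete, here is a rough userspace sketch of
that logic.  It is a simplified model, not the kernel source: the
struct rq_model type and the can_stop_tick_model() name are made up
for illustration, the deadline and FIFO checks are left out, and the
final test assumes the proposed rq->nr_running form.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the few runqueue counters discussed here. */
struct rq_model {
	unsigned int rr_nr_running;	/* runnable SCHED_RR tasks */
	unsigned int nr_running;	/* all runnable tasks on the runqueue */
};

/*
 * Simplified model of the check order: the RR special case returns
 * before the overall count is ever consulted.
 */
static bool can_stop_tick_model(const struct rq_model *rq)
{
	/* RR tasks time-slice against each other; one alone needs no tick. */
	if (rq->rr_nr_running)
		return rq->rr_nr_running == 1;

	/* No RR tasks: more than one runnable task needs the tick. */
	if (rq->nr_running > 1)
		return false;

	return true;
}
```

With one SCHED_RR task and one SCHED_OTHER task (rr_nr_running == 1,
nr_running == 2), this model returns true from the RR branch, so the
overall count never comes into play.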

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com
