lists.openwall.net | lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening PHC | |
Open Source and information security mailing list archives
| ||
|
Date: Thu, 23 Sep 2021 08:24:03 -0400 From: Phil Auld <pauld@...hat.com> To: Vincent Guittot <vincent.guittot@...aro.org> Cc: Mike Galbraith <efault@....de>, Mel Gorman <mgorman@...hsingularity.net>, Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...nel.org>, Valentin Schneider <valentin.schneider@....com>, Aubrey Li <aubrey.li@...ux.intel.com>, Barry Song <song.bao.hua@...ilicon.com>, Srikar Dronamraju <srikar@...ux.vnet.ibm.com>, LKML <linux-kernel@...r.kernel.org> Subject: Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running On Thu, Sep 23, 2021 at 10:40:48AM +0200 Vincent Guittot wrote: > On Thu, 23 Sept 2021 at 03:47, Mike Galbraith <efault@....de> wrote: > > > > On Wed, 2021-09-22 at 20:22 +0200, Vincent Guittot wrote: > > > On Wed, 22 Sept 2021 at 19:38, Mel Gorman <mgorman@...hsingularity.net> wrote: > > > > > > > > > > > > I'm not seeing an alternative suggestion that could be turned into > > > > an implementation. The current value for sched_wakeup_granularity > > > > was set 12 years ago was exposed for tuning which is no longer > > > > the case. The intent was to allow some dynamic adjustment between > > > > sysctl_sched_wakeup_granularity and sysctl_sched_latency to reduce > > > > over-scheduling in the worst case without disabling preemption entirely > > > > (which the first version did). > > > > I don't think those knobs were ever _intended_ for general purpose > > tuning, but they did get used that way by some folks. > > > > > > > > > > Should we just ignore this problem and hope it goes away or just let > > > > people keep poking silly values into debugfs via tuned? > > > > > > We should certainly not add a bandaid because people will continue to > > > poke silly value at the end. And increasing > > > sysctl_sched_wakeup_granularity based on the number of running threads > > > is not the right solution. > > > > Watching my desktop box stack up large piles of very short running > > threads, I agree, instantaneous load looks like a non-starter. > > > > > According to the description of your > > > problem that the current task doesn't get enough time to move forward, > > > sysctl_sched_min_granularity should be part of the solution. Something > > > like below will ensure that current got a chance to move forward > > > > Nah, progress is guaranteed, the issue is a zillion very similar short > > running threads preempting each other with no win to be had, thus > > spending cycles in the scheduler that are utterly wasted. It's a valid > > issue, trouble is teaching the scheduler to recognize that situation > > without mucking up other situations where there IS a win for even very > > short running threads say, doing a synchronous handoff; preemption is > > cheaper than scheduling off if the waker is going be awakened again in > > very short order. > > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > > index 9bf540f04c2d..39d4e4827d3d 100644 > > > --- a/kernel/sched/fair.c > > > +++ b/kernel/sched/fair.c > > > @@ -7102,6 +7102,7 @@ static void check_preempt_wakeup(struct rq *rq, > > > struct task_struct *p, int wake_ > > > int scale = cfs_rq->nr_running >= sched_nr_latency; > > > int next_buddy_marked = 0; > > > int cse_is_idle, pse_is_idle; > > > + unsigned long delta_exec; > > > > > > if (unlikely(se == pse)) > > > return; > > > @@ -7161,6 +7162,13 @@ static void check_preempt_wakeup(struct rq *rq, > > > struct task_struct *p, int wake_ > > > return; > > > > > > update_curr(cfs_rq_of(se)); > > > + delta_exec = se->sum_exec_runtime - se->prev_sum_exec_runtime; > > > + /* > > > + * Ensure that current got a chance to move forward > > > + */ > > > + if (delta_exec < sysctl_sched_min_granularity) > > > + return; > > > + > > > if (wakeup_preempt_entity(se, pse) == 1) { > > > /* > > > * Bias pick_next to pick the sched entity that is > > > > Yikes! If you do that, you may as well go the extra nanometer and rip > > wakeup preemption out entirely, same result, impressive diffstat. > > This patch is mainly there to show that there are other ways to ensure > progress without using some load heuristic. > sysctl_sched_min_granularity has the problem of scaling with the > number of cpus and this can generate large values. At least we should > use the normalized_sysctl_sched_min_granularity or even a smaller > value but wakeup preemption still happens with this change. It only > ensures that we don't waste time preempting each other without any > chance to do actual stuff. > It's capped at 8 cpus, which is pretty easy to reach these days, so the values don't get too large. That scaling is almost a no-op these days. Cheers, Phil > a 100us value should even be enough to fix Mel's problem without > impacting common wakeup preemption cases. > > > > > > -Mike > --
Powered by blists - more mailing lists