lists.openwall.net | lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening PHC | |
Open Source and information security mailing list archives
| ||
|
Date: Thu, 23 Sep 2021 10:40:48 +0200 From: Vincent Guittot <vincent.guittot@...aro.org> To: Mike Galbraith <efault@....de> Cc: Mel Gorman <mgorman@...hsingularity.net>, Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...nel.org>, Valentin Schneider <valentin.schneider@....com>, Aubrey Li <aubrey.li@...ux.intel.com>, Barry Song <song.bao.hua@...ilicon.com>, Srikar Dronamraju <srikar@...ux.vnet.ibm.com>, LKML <linux-kernel@...r.kernel.org> Subject: Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running On Thu, 23 Sept 2021 at 03:47, Mike Galbraith <efault@....de> wrote: > > On Wed, 2021-09-22 at 20:22 +0200, Vincent Guittot wrote: > > On Wed, 22 Sept 2021 at 19:38, Mel Gorman <mgorman@...hsingularity.net> wrote: > > > > > > > > > I'm not seeing an alternative suggestion that could be turned into > > > an implementation. The current value for sched_wakeup_granularity > > > was set 12 years ago was exposed for tuning which is no longer > > > the case. The intent was to allow some dynamic adjustment between > > > sysctl_sched_wakeup_granularity and sysctl_sched_latency to reduce > > > over-scheduling in the worst case without disabling preemption entirely > > > (which the first version did). > > I don't think those knobs were ever _intended_ for general purpose > tuning, but they did get used that way by some folks. > > > > > > > Should we just ignore this problem and hope it goes away or just let > > > people keep poking silly values into debugfs via tuned? > > > > We should certainly not add a bandaid because people will continue to > > poke silly value at the end. And increasing > > sysctl_sched_wakeup_granularity based on the number of running threads > > is not the right solution. > > Watching my desktop box stack up large piles of very short running > threads, I agree, instantaneous load looks like a non-starter. > > > According to the description of your > > problem that the current task doesn't get enough time to move forward, > > sysctl_sched_min_granularity should be part of the solution. Something > > like below will ensure that current got a chance to move forward > > Nah, progress is guaranteed, the issue is a zillion very similar short > running threads preempting each other with no win to be had, thus > spending cycles in the scheduler that are utterly wasted. It's a valid > issue, trouble is teaching the scheduler to recognize that situation > without mucking up other situations where there IS a win for even very > short running threads say, doing a synchronous handoff; preemption is > cheaper than scheduling off if the waker is going be awakened again in > very short order. > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > index 9bf540f04c2d..39d4e4827d3d 100644 > > --- a/kernel/sched/fair.c > > +++ b/kernel/sched/fair.c > > @@ -7102,6 +7102,7 @@ static void check_preempt_wakeup(struct rq *rq, > > struct task_struct *p, int wake_ > > int scale = cfs_rq->nr_running >= sched_nr_latency; > > int next_buddy_marked = 0; > > int cse_is_idle, pse_is_idle; > > + unsigned long delta_exec; > > > > if (unlikely(se == pse)) > > return; > > @@ -7161,6 +7162,13 @@ static void check_preempt_wakeup(struct rq *rq, > > struct task_struct *p, int wake_ > > return; > > > > update_curr(cfs_rq_of(se)); > > + delta_exec = se->sum_exec_runtime - se->prev_sum_exec_runtime; > > + /* > > + * Ensure that current got a chance to move forward > > + */ > > + if (delta_exec < sysctl_sched_min_granularity) > > + return; > > + > > if (wakeup_preempt_entity(se, pse) == 1) { > > /* > > * Bias pick_next to pick the sched entity that is > > Yikes! If you do that, you may as well go the extra nanometer and rip > wakeup preemption out entirely, same result, impressive diffstat. This patch is mainly there to show that there are other ways to ensure progress without using some load heuristic. sysctl_sched_min_granularity has the problem of scaling with the number of cpus and this can generate large values. At least we should use the normalized_sysctl_sched_min_granularity or even a smaller value but wakeup preemption still happens with this change. It only ensures that we don't waste time preempting each other without any chance to do actual stuff. a 100us value should even be enough to fix Mel's problem without impacting common wakeup preemption cases. > > -Mike
Powered by blists - more mailing lists