linux-kernel - Re: [PATCH 4/6] sched/isolation: Residual 1Hz scheduler tick offload

Date:   Mon, 29 Jan 2018 16:38:39 +0100
From:   Peter Zijlstra <peterz@...radead.org>
To:     Frederic Weisbecker <frederic@...nel.org>
Cc:     Ingo Molnar <mingo@...nel.org>,
        LKML <linux-kernel@...r.kernel.org>,
        Chris Metcalf <cmetcalf@...lanox.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Luiz Capitulino <lcapitulino@...hat.com>,
        Christoph Lameter <cl@...ux.com>,
        "Paul E . McKenney" <paulmck@...ux.vnet.ibm.com>,
        Wanpeng Li <kernellwp@...il.com>,
        Mike Galbraith <efault@....de>, Rik van Riel <riel@...hat.com>
Subject: Re: [PATCH 4/6] sched/isolation: Residual 1Hz scheduler tick offload

On Fri, Jan 19, 2018 at 01:02:18AM +0100, Frederic Weisbecker wrote:
> When a CPU runs in full dynticks mode, a 1Hz tick remains in order to
> keep the scheduler stats alive. However this residual tick is a burden
> for bare metal tasks that can't stand any interruption at all, or want
> to minimize them.
> 
> The usual boot parameters "nohz_full=" or "isolcpus=nohz" will now
> outsource these scheduler ticks to the global workqueue so that a
> housekeeping CPU handles those remotely.
> 
> Note that in the case of using isolcpus, it's still up to the user to
> affine the global workqueues to the housekeeping CPUs through
> /sys/devices/virtual/workqueue/cpumask or domains isolation
> "isolcpus=nohz,domain".

I would very much like a few words on why sched_class::task_tick() is
safe to call remote -- from a quick look I think it actually is, but it
would be good to have some words here.

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index d72d0e9..c79500c 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3062,7 +3062,82 @@ u64 scheduler_tick_max_deferment(void)
>  
>  	return jiffies_to_nsecs(next - now);
>  }
> -#endif
> +
> +struct tick_work {
> +	int			cpu;
> +	struct delayed_work	work;
> +};
> +
> +static struct tick_work __percpu *tick_work_cpu;
> +
> +static void sched_tick_remote(struct work_struct *work)
> +{
> +	struct delayed_work *dwork = to_delayed_work(work);
> +	struct tick_work *twork = container_of(dwork, struct tick_work, work);
> +	int cpu = twork->cpu;
> +	struct rq *rq = cpu_rq(cpu);
> +	struct rq_flags rf;
> +
> +	/*
> +	 * Handle the tick only if it appears the remote CPU is running
> +	 * in full dynticks mode. The check is racy by nature, but
> +	 * missing a tick or having one too much is no big deal.
> +	 */
> +	if (!idle_cpu(cpu) && tick_nohz_tick_stopped_cpu(cpu)) {
> +		rq_lock_irq(rq, &rf);
> +		update_rq_clock(rq);
> +		rq->curr->sched_class->task_tick(rq, rq->curr, 0);
> +		rq_unlock_irq(rq, &rf);
> +	}
> +
> +	queue_delayed_work(system_unbound_wq, dwork, HZ);

Do we want something that tracks the actual inter-arrival time of
this work, such that we can detect and warn if the book-keeping thing is
failing to keep up?
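
To make the question concrete, here is a minimal userspace sketch of such
tracking. It is not part of the patch: struct tick_track,
tick_track_update() and the limit_ns threshold are hypothetical names, and
in the kernel the timestamp would presumably come from something like
ktime_get_ns() inside sched_tick_remote(), with the "late" result feeding a
WARN_ON_ONCE() or similar.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-CPU bookkeeping for the remote tick; not in the patch. */
struct tick_track {
	bool		seen;		/* has the remote tick run at least once? */
	uint64_t	last_ns;	/* timestamp of the previous run */
	uint64_t	max_gap_ns;	/* largest inter-arrival gap observed */
};

/*
 * Record one run of the remote tick at time now_ns. Returns true if the
 * gap since the previous run exceeded limit_ns, i.e. the housekeeping
 * side is not keeping up. The first run has no predecessor and never
 * reports lateness.
 */
static bool tick_track_update(struct tick_track *t, uint64_t now_ns,
			      uint64_t limit_ns)
{
	bool late = false;

	if (t->seen) {
		uint64_t gap = now_ns - t->last_ns;

		if (gap > t->max_gap_ns)
			t->max_gap_ns = gap;
		late = gap > limit_ns;
	}

	t->seen = true;
	t->last_ns = now_ns;
	return late;
}
```

Since the work is requeued with a delay of HZ jiffies, a limit of a few
seconds would tolerate ordinary workqueue latency while still catching a
housekeeping CPU that has stopped servicing the work.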

> +}
> +
> +static void sched_tick_start(int cpu)
> +{
> +	struct tick_work *twork;
> +
> +	if (housekeeping_cpu(cpu, HK_FLAG_TICK))
> +		return;

This all looks very static :-(, you can't reconfigure this nohz_full
crud after boot?

> +	WARN_ON_ONCE(!tick_work_cpu);
> +
> +	twork = per_cpu_ptr(tick_work_cpu, cpu);
> +	twork->cpu = cpu;
> +	INIT_DELAYED_WORK(&twork->work, sched_tick_remote);
> +	queue_delayed_work(system_unbound_wq, &twork->work, HZ);
> +}

Similarly, I think we want a few words about how unbound workqueues are
expected to behave vs NUMA.

AFAICT unbound workqueues by default prefer to run on a CPU in the same
node, but if no CPU is available there, they don't go looking for the
nearest node that does have one; they just punt to whatever random CPU.
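
For illustration only, this is roughly what "go looking for the nearest
node" could mean, as a self-contained userspace sketch. It does not
describe how the workqueue code behaves today: NR_NODES, the distance table
and pick_housekeeping_node() are all made up for the example, and the
distances merely follow the SLIT convention where 10 means local.

```c
#include <limits.h>
#include <stdbool.h>

#define NR_NODES 4	/* illustrative size, not a kernel constant */

/*
 * Choose the node on which to run the remote tick for a CPU whose home
 * node is 'home': the home node itself if it has a housekeeping CPU,
 * otherwise the node with the smallest distance from 'home' that has one.
 * Returns -1 if no node has a housekeeping CPU.
 */
static int pick_housekeeping_node(int home,
				  const int dist[NR_NODES][NR_NODES],
				  const bool has_hk[NR_NODES])
{
	int best = -1;
	int best_dist = INT_MAX;

	if (has_hk[home])
		return home;

	for (int n = 0; n < NR_NODES; n++) {
		if (has_hk[n] && dist[home][n] < best_dist) {
			best = n;
			best_dist = dist[home][n];
		}
	}

	return best;
}
```

With housekeeping CPUs only on nodes 2 and 3, a tick for a CPU on node 0
would go to whichever of the two is closer to node 0, rather than to an
arbitrary CPU.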