Message-ID: <20200917142438.GH1362448@hirez.programming.kicks-ass.net>
Date:   Thu, 17 Sep 2020 16:24:38 +0200
From:   peterz@...radead.org
To:     Thomas Gleixner <tglx@...utronix.de>
Cc:     LKML <linux-kernel@...r.kernel.org>,
        Sebastian Siewior <bigeasy@...utronix.de>,
        Qais Yousef <qais.yousef@....com>,
        Scott Wood <swood@...hat.com>,
        Valentin Schneider <valentin.schneider@....com>,
        Ingo Molnar <mingo@...nel.org>,
        Juri Lelli <juri.lelli@...hat.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
        Daniel Bristot de Oliveira <bristot@...hat.com>,
        Vincent Donnefort <vincent.donnefort@....com>
Subject: Re: [patch 09/10] sched/core: Add migrate_disable/enable()

On Thu, Sep 17, 2020 at 11:42:11AM +0200, Thomas Gleixner wrote:

> +static inline void update_nr_migratory(struct task_struct *p, long delta)
> +{
> +	if (p->nr_cpus_allowed > 1 && p->sched_class->update_migratory)
> +		p->sched_class->update_migratory(p, delta);
> +}

Right, so as you know, I totally hate this thing :-) It adds a second
(and radically different) version of changing affinity. I'm working on a
version that uses the normal *set_cpus_allowed*() interface.

> +/*
> + * The migrate_disable/enable() fastpath updates only the tasks migrate
> + * disable count which is sufficient as long as the task stays on the CPU.
> + *
> + * When a migrate disabled task is scheduled out it can become subject to
> + * load balancing. To prevent this, update task::cpus_ptr to point to the
> + * current CPUs cpumask and set task::nr_cpus_allowed to 1.
> + *
> + * If task::cpus_ptr does not point to task::cpus_mask then the update has
> + * been done already. This check is also used in migrate_enable() as an
> + * indicator to restore task::cpus_ptr to point to task::cpus_mask
> + */
> +static inline void sched_migration_ctrl(struct task_struct *prev, int cpu)
> +{
> +	if (!prev->migration_ctrl.disable_cnt ||
> +	    prev->cpus_ptr != &prev->cpus_mask)
> +		return;
> +
> +	prev->cpus_ptr = cpumask_of(cpu);
> +	update_nr_migratory(prev, -1);
> +	prev->nr_cpus_allowed = 1;
> +}

So this thing is called from schedule(), with only rq->lock held, and
that violates the locking rules for changing the affinity.

I have a comment that explains how it's broken and why it's sort-of
working.
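
To make the rule concrete: the affinity fields (cpus_ptr,
nr_cpus_allowed) are only supposed to be written with both p->pi_lock
and rq->lock held, while holding either one is enough to read them.
Below is a minimal userspace sketch of that rule, not kernel code. The
struct and every function name in it are hypothetical, and the locks
are modelled as plain flags.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names are kernel API. */
struct model_task {
	bool pi_lock_held;	/* stands in for p->pi_lock */
	bool rq_lock_held;	/* stands in for the task's rq->lock */
	int  nr_cpus_allowed;
};

/* Writers of the affinity fields must hold both locks. */
static bool affinity_write_allowed(const struct model_task *p)
{
	return p->pi_lock_held && p->rq_lock_held;
}

/* Readers only need one of the two locks. */
static bool affinity_read_allowed(const struct model_task *p)
{
	return p->pi_lock_held || p->rq_lock_held;
}

/*
 * Models the store done by the quoted sched_migration_ctrl().
 * Returns false and leaves the task untouched if the rule is violated.
 */
static bool model_pin_task(struct model_task *p)
{
	if (!affinity_write_allowed(p))
		return false;
	p->nr_cpus_allowed = 1;
	return true;
}
```

With only rq_lock_held set, which is the schedule() situation,
affinity_read_allowed() is true but model_pin_task() refuses the
write.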

> +void migrate_disable(void)
> +{
> +	unsigned long flags;
> +
> +	if (!current->migration_ctrl.disable_cnt) {
> +		raw_spin_lock_irqsave(&current->pi_lock, flags);
> +		current->migration_ctrl.disable_cnt++;
> +		raw_spin_unlock_irqrestore(&current->pi_lock, flags);
> +	} else {
> +		current->migration_ctrl.disable_cnt++;
> +	}
> +}

That pi_lock seems unfortunate, and it isn't obvious what the point of
it is.
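
For reference, what the quoted pair does with the lock can be modelled
as a nesting counter in which only the outermost 0 -> 1 and 1 -> 0
transitions take pi_lock. This is a hedged userspace sketch with
hypothetical names. It counts lock acquisitions instead of taking a
real lock, and it says nothing about why the lock is there.

```c
#include <assert.h>

/* Hypothetical model of task::migration_ctrl; not kernel code. */
struct model_ctrl {
	int disable_cnt;	/* nesting depth of migrate_disable() */
	int lock_acquisitions;	/* how often pi_lock would be taken */
};

static void model_migrate_disable(struct model_ctrl *c)
{
	/* Only the outermost call wraps the increment in the lock. */
	if (!c->disable_cnt)
		c->lock_acquisitions++;
	c->disable_cnt++;
}

static void model_migrate_enable(struct model_ctrl *c)
{
	assert(c->disable_cnt > 0);
	/* Nested calls just decrement; the final one takes the lock. */
	if (c->disable_cnt == 1)
		c->lock_acquisitions++;
	c->disable_cnt--;
}
```

Three nested disable/enable pairs therefore cost two lock
acquisitions in this model, one on each outermost transition.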

> +void migrate_enable(void)
> +{
> +	struct task_migrate_data *pending;
> +	struct task_struct *p = current;
> +	struct rq_flags rf;
> +	struct rq *rq;
> +
> +	if (WARN_ON_ONCE(p->migration_ctrl.disable_cnt <= 0))
> +		return;
> +
> +	if (p->migration_ctrl.disable_cnt > 1) {
> +		p->migration_ctrl.disable_cnt--;
> +		return;
> +	}
> +
> +	raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
> +	p->migration_ctrl.disable_cnt = 0;
> +	pending = p->migration_ctrl.pending;
> +	p->migration_ctrl.pending = NULL;
> +
> +	/*
> +	 * If the task was never scheduled out while in the migrate
> +	 * disabled region and there is no migration request pending,
> +	 * return.
> +	 */
> +	if (!pending && p->cpus_ptr == &p->cpus_mask) {
> +		raw_spin_unlock_irqrestore(&p->pi_lock, rf.flags);
> +		return;
> +	}
> +
> +	rq = __task_rq_lock(p, &rf);
> +	/* Was it scheduled out while in a migrate disabled region? */
> +	if (p->cpus_ptr != &p->cpus_mask) {
> +		/* Restore the tasks CPU mask and update the weight */
> +		p->cpus_ptr = &p->cpus_mask;
> +		p->nr_cpus_allowed = cpumask_weight(&p->cpus_mask);
> +		update_nr_migratory(p, 1);
> +	}
> +
> +	/* If no migration request is pending, no further action required. */
> +	if (!pending) {
> +		task_rq_unlock(rq, p, &rf);
> +		return;
> +	}
> +
> +	/* Migrate self to the requested target */
> +	pending->res = set_cpus_allowed_ptr_locked(p, pending->mask,
> +						   pending->check, rq, &rf);
> +	complete(pending->done);
> +}

So, what I'm missing with all this are the design constraints for this
trainwreck. Because the 'sane' solution was having migrate_disable()
imply cpus_read_lock(). But that didn't fly because we can't have
migrate_disable() / migrate_enable() schedule for raisins.

And if I'm not mistaken, the above migrate_enable() *does* require being
able to schedule, and our favourite piece of futex:

	raw_spin_lock_irq(&q.pi_state->pi_mutex.wait_lock);
	spin_unlock(q.lock_ptr);

is broken. Consider that spin_unlock() doing migrate_enable() with a
pending sched_setaffinity().
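
The sequence can be walked through in a small userspace model. This is
a sketch under stated assumptions, not kernel code: all names are
hypothetical, the task is assumed to already be in a migrate-disabled
region from the earlier spin_lock(q.lock_ptr), and the final
migrate_enable() is assumed to need to schedule whenever an affinity
request is pending.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the execution context; not kernel API. */
struct model_ctx {
	bool irqs_disabled;
	int  raw_locks_held;
};

struct model_task {
	int  disable_cnt;
	bool affinity_pending;	/* a sched_setaffinity() is waiting */
};

static bool may_schedule(const struct model_ctx *ctx)
{
	return !ctx->irqs_disabled && ctx->raw_locks_held == 0;
}

/* Returns true if this call has to migrate the task (and so schedule). */
static bool model_migrate_enable(struct model_task *p)
{
	bool pending;

	if (--p->disable_cnt > 0)
		return false;
	pending = p->affinity_pending;
	p->affinity_pending = false;
	return pending;
}

/* Returns true if the sequence needs to schedule where it must not. */
static bool futex_sequence_is_broken(bool affinity_pending)
{
	struct model_ctx ctx = { false, 0 };
	struct model_task p = { 1, affinity_pending };
	bool must_migrate;

	/* raw_spin_lock_irq(&q.pi_state->pi_mutex.wait_lock); */
	ctx.irqs_disabled = true;
	ctx.raw_locks_held++;

	/* spin_unlock(q.lock_ptr); assumed to end in migrate_enable() */
	must_migrate = model_migrate_enable(&p);

	return must_migrate && !may_schedule(&ctx);
}
```

In this model futex_sequence_is_broken(true) reports the problem,
while the same sequence without a pending request is fine.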

Let me ponder this more..