linux-kernel - Re: [PATCH] sched/rt: Make rt_rq->pushable_tasks updates drive rto


Message-ID: <xhsmhpm2prnd1.mognet@vschneid.remote.csb>
Date:   Mon, 11 Sep 2023 12:54:50 +0200
From:   Valentin Schneider <vschneid@...hat.com>
To:     Sebastian Andrzej Siewior <bigeasy@...utronix.de>
Cc:     linux-kernel@...r.kernel.org,
        Daniel Bristot de Oliveira <bristot@...hat.com>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Ingo Molnar <mingo@...hat.com>,
        Juri Lelli <juri.lelli@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Steven Rostedt <rostedt@...dmis.org>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Thomas Gleixner <tglx@...utronix.de>
Subject: Re: [PATCH] sched/rt: Make rt_rq->pushable_tasks updates drive
 rto_mask

Ok, back to this :)

On 15/08/23 16:21, Sebastian Andrzej Siewior wrote:
> What I still observe is:
> - CPU0 is idle. CPU0 gets a task assigned from CPU1. That task receives
>   a wakeup. CPU0 returns from idle and schedules the task.
>   pull_rt_task() on CPU1 and sometimes on other CPU observe this, too.
>   CPU1 sends irq_work to CPU0 while at the time rto_next_cpu() sees that
>   has_pushable_tasks() return 0. That bit was cleared earlier (as per
>   tracing).
>
> - CPU0 is idle. CPU0 gets a task assigned from CPU1. The task on CPU0 is
>   woken up without an IPI (yay). But then pull_rt_task() decides to
>   send irq_work, and has_pushable_tasks() said that it has tasks left
>   so….
>   Now: rto_push_irq_work_func() runs once on CPU0, does nothing,
>   rto_next_cpu() returns CPU0 again and enqueues itself again on CPU0.
>   Usually after the second or third round the scheduler on CPU0 makes
>   enough progress to remove the task/ clear the CPU from mask.
>

If CPU0 is selected for the push IPI, then we should have

  rd->rto_cpu == CPU0

So per the

  cpumask_next(rd->rto_cpu, rd->rto_mask);

in rto_next_cpu(), it shouldn't be able to re-select itself.
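
To spell out the reasoning, here is a minimal user-space sketch of the
"strictly greater than" scan that cpumask_next() performs. The name
model_cpumask_next(), the NR_CPU_IDS value and the 64-bit mask are all
made up for illustration; this is not the kernel implementation.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPU_IDS 8	/* assumption: small fixed CPU count for the model */

/*
 * Hypothetical stand-in for cpumask_next(n, mask): returns the first set
 * bit strictly greater than n, or NR_CPU_IDS if there is none.
 */
static int model_cpumask_next(int n, uint64_t mask)
{
	assert(n >= -1 && n < NR_CPU_IDS);

	for (int cpu = n + 1; cpu < NR_CPU_IDS; cpu++) {
		if (mask & (UINT64_C(1) << cpu))
			return cpu;
	}
	return NR_CPU_IDS;
}
```

With CPUs 0 and 3 in the mask and n == 0, the scan yields 3; with only
CPU0 in the mask it runs off the end and yields NR_CPU_IDS. This only
models a single scan. The surrounding loop in rto_next_cpu(), which
resets rd->rto_cpu to -1 as seen in the diff below, is not covered.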

Do you have a simple enough reproducer I could use to poke at this?

> I understand that there is a race and the CPU is cleared from rto_mask
> shortly after checking. Therefore I would suggest to look at
> has_pushable_tasks() before returning a CPU in rto_next_cpu() as I did
> just to avoid the interruption which does nothing.
>
> For the second case the irq_work seems to make no progress. I don't see
> any trace_events in hardirq, the mask is cleared outside hardirq (idle
> code). The NEED_RESCHED bit is set for current therefore it doesn't make
> sense to send irq_work to reschedule if the current already has this on
> its agenda.
>
> So what about something like:
>
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index 00e0e50741153..d963408855e25 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -2247,8 +2247,23 @@ static int rto_next_cpu(struct root_domain *rd)
>
>               rd->rto_cpu = cpu;
>
> -		if (cpu < nr_cpu_ids)
> +		if (cpu < nr_cpu_ids) {
> +			struct task_struct *t;
> +
> +			if (!has_pushable_tasks(cpu_rq(cpu)))
> +				continue;
> +

IIUC that's just to plug the race between the CPU emptying its
pushable_tasks list and it removing itself from the rto_mask - that looks
fine to me.
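
As a rough user-space model of what that check buys us (all names here,
model_rd, model_rto_next_cpu() and the pushable[] array, are
hypothetical, and the rto_loop/rto_loop_next restart logic is left out
on purpose):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_NR_CPUS 8	/* assumption: small fixed CPU count */

/* Hypothetical, stripped-down stand-in for the root_domain fields used. */
struct model_rd {
	uint64_t rto_mask;	/* bit set => CPU advertised as RT-overloaded */
	int rto_cpu;		/* last CPU picked, -1 when no scan is active */
};

/*
 * Single-pass model of the scan with the proposed check: pushable[cpu]
 * stands in for has_pushable_tasks(cpu_rq(cpu)). A CPU whose bit is still
 * set but whose pushable list is already empty is skipped.
 */
static int model_rto_next_cpu(struct model_rd *rd, const bool *pushable)
{
	assert(rd->rto_cpu >= -1 && rd->rto_cpu < MODEL_NR_CPUS);

	for (;;) {
		int cpu = MODEL_NR_CPUS;

		/* First set bit strictly above rd->rto_cpu, if any. */
		for (int c = rd->rto_cpu + 1; c < MODEL_NR_CPUS; c++) {
			if (rd->rto_mask & (UINT64_C(1) << c)) {
				cpu = c;
				break;
			}
		}

		rd->rto_cpu = cpu;
		if (cpu >= MODEL_NR_CPUS)
			break;
		if (!pushable[cpu])
			continue;	/* stale bit: nothing to push, no IPI */
		return cpu;
	}

	rd->rto_cpu = -1;
	return -1;
}
```

So a CPU that has drained its list but not yet cleared its bit costs one
extra loop iteration here instead of an IPI that does nothing.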

> +			rcu_read_lock();
> +			t = rcu_dereference(rq->curr);
> +			/* if (test_preempt_need_resched_cpu(cpu_rq(cpu))) */
> +			if (test_tsk_need_resched(t)) {

We need to make sure this doesn't cause us to lose IPIs we actually need.

Entering __schedule() does get us a call to put_prev_task_balance() if the
previous task is RT/DL, and balance_rt() can then issue a push IPI, but
AFAICT only if the previous task was the last DL task. So I don't think we
can skip the IPI just because NEED_RESCHED is set.

> +				rcu_read_unlock();
> +				continue;
> +			}
> +			rcu_read_unlock();
> +
>                       return cpu;
> +		}
>
>               rd->rto_cpu = -1;
>
> Sebastian