linux-kernel - Re: [PATCH] sched/rt: Make rt_rq->pushable_tasks updates drive rto


Message-ID: <20230920133806.HyAqFKOa@linutronix.de>
Date:   Wed, 20 Sep 2023 15:38:06 +0200
From:   Sebastian Andrzej Siewior <bigeasy@...utronix.de>
To:     Valentin Schneider <vschneid@...hat.com>
Cc:     linux-kernel@...r.kernel.org,
        Daniel Bristot de Oliveira <bristot@...hat.com>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Ingo Molnar <mingo@...hat.com>,
        Juri Lelli <juri.lelli@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Steven Rostedt <rostedt@...dmis.org>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Thomas Gleixner <tglx@...utronix.de>
Subject: Re: [PATCH] sched/rt: Make rt_rq->pushable_tasks updates drive
 rto_mask

On 2023-09-11 12:54:50 [+0200], Valentin Schneider wrote:
> Ok, back to this :)
> 
> On 15/08/23 16:21, Sebastian Andrzej Siewior wrote:
> > What I still observe is:
> > - CPU0 is idle. CPU0 gets a task assigned from CPU1. That task receives
> >   a wakeup. CPU0 returns from idle and schedules the task.
> >   pull_rt_task() on CPU1, and sometimes on other CPUs, observes this, too.
> >   CPU1 sends irq_work to CPU0 while at the time rto_next_cpu() sees that
> >   has_pushable_tasks() returns 0. That bit was cleared earlier (as per
> >   tracing).
> >
> > - CPU0 is idle. CPU0 gets a task assigned from CPU1. The task on CPU0 is
> >   woken up without an IPI (yay). But then pull_rt_task() decides to
> >   send irq_work and has_pushable_tasks() said that it has tasks left
> >   so….
> >   Now: rto_push_irq_work_func() runs once on CPU0, does nothing,
> >   rto_next_cpu() returns CPU0 again and enqueues itself again on CPU0.
> >   Usually after the second or third round the scheduler on CPU0 makes
> >   enough progress to remove the task / clear the CPU from the mask.
> >
> 
> If CPU0 is selected for the push IPI, then we should have
> 
>   rd->rto_cpu == CPU0
> 
> So per the
> 
>   cpumask_next(rd->rto_cpu, rd->rto_mask);
> 
> in rto_next_cpu(), it shouldn't be able to re-select itself.
> 
> Do you have a simple enough reproducer I could use to poke at this?

Not really a reproducer. What I had earlier was a high-priority RT task
(ntpsec at prio 99) with cyclictest below it (prio 90), running on
PREEMPT_RT, which adds a few tasks (due to threaded interrupts).
Then I added trace_printk()s to observe. Initially I had latency spikes
due to ntpsec but also a bunch of IRQ-work IPIs, which I decided to look at.
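
To make the cpumask_next() argument quoted above concrete, here is a
minimal userspace sketch. It is not kernel code: model_cpumask_next() and
MODEL_NR_CPUS are hypothetical names, and a plain unsigned int stands in
for rd->rto_mask. It only assumes the documented behaviour that the search
starts strictly after the given CPU.

```c
#include <assert.h>

#define MODEL_NR_CPUS 8	/* hypothetical CPU count for this model */

/*
 * Userspace stand-in for cpumask_next(n, mask): return the lowest set
 * bit strictly above n, or MODEL_NR_CPUS if there is none.
 * n == -1 starts the search at CPU0.
 */
static int model_cpumask_next(int n, unsigned int mask)
{
	assert(n >= -1 && n < MODEL_NR_CPUS);

	for (int cpu = n + 1; cpu < MODEL_NR_CPUS; cpu++) {
		if (mask & (1u << cpu))
			return cpu;
	}
	return MODEL_NR_CPUS;
}
```

In this model, a search that starts from CPU0 with only CPU0 set in the
mask finds nothing; CPU0 can only come back if a later search starts from
-1 again.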

> > I understand that there is a race and the CPU is cleared from rto_mask
> > shortly after checking. Therefore I would suggest to look at
> > has_pushable_tasks() before returning a CPU in rto_next_cpu() as I did
> > just to avoid the interruption which does nothing.
> >
> > For the second case the irq_work seems to make no progress. I don't see
> > any trace_events in hardirq, the mask is cleared outside hardirq (idle
> > code). The NEED_RESCHED bit is set for current therefore it doesn't make
> > sense to send irq_work to reschedule if the current already has this on
> > its agenda.
> >
> > So what about something like:
> >
> > diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> > index 00e0e50741153..d963408855e25 100644
> > --- a/kernel/sched/rt.c
> > +++ b/kernel/sched/rt.c
> > @@ -2247,8 +2247,23 @@ static int rto_next_cpu(struct root_domain *rd)
> >
> >               rd->rto_cpu = cpu;
> >
> > -		if (cpu < nr_cpu_ids)
> > +		if (cpu < nr_cpu_ids) {
> > +			struct task_struct *t;
> > +
> > +			if (!has_pushable_tasks(cpu_rq(cpu)))
> > +				continue;
> > +
> 
> IIUC that's just to plug the race between the CPU emptying its
> pushable_tasks list and it removing itself from the rto_mask - that looks
> fine to me.
> 
> > +			rcu_read_lock();
> > +			t = rcu_dereference(cpu_rq(cpu)->curr);
> > +			/* if (test_preempt_need_resched_cpu(cpu_rq(cpu))) */
> > +			if (test_tsk_need_resched(t)) {
> 
> We need to make sure this doesn't cause us to lose IPIs we actually need.
> 
> We do have a call to put_prev_task_balance() through entering __schedule()
> if the previous task is RT/DL, and balance_rt() can issue a push
> IPI, but AFAICT only if the previous task was the last DL task. So I don't
> think we can do this.

I observed that the CPU / the task on that CPU already had the need-resched
bit set, so a task switch is in progress. Therefore it looks like any
further IPIs are needless because the IRQ-work IPIs just "leave early"
via resched_curr() and don't do anything useful. So they contribute
nothing except stalling the CPU from making progress and performing
the actual context switch.

> > +				rcu_read_unlock();
> > +				continue;
> > +			}
> > +			rcu_read_unlock();
> > +
> >                       return cpu;
> > +		}
> >
> >               rd->rto_cpu = -1;
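
For readers who want to poke at the selection logic outside the kernel,
the two checks from the diff above can be modeled roughly as follows. This
is a simplified userspace sketch, not the actual patch: struct sketch_rq,
struct sketch_rd, sketch_next_in_mask() and sketch_rto_next_cpu() are
hypothetical names, two booleans stand in for has_pushable_tasks() and
test_tsk_need_resched(), and RCU, locking and the rto_loop restart are
left out.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_NR_CPUS 4	/* hypothetical CPU count for this model */

/* Per-CPU state, reduced to the two conditions the diff checks. */
struct sketch_rq {
	bool has_pushable_tasks;	/* models has_pushable_tasks(rq) */
	bool curr_need_resched;		/* models test_tsk_need_resched(rq->curr) */
};

/* Root-domain state, reduced to the mask and the iteration cursor. */
struct sketch_rd {
	unsigned int rto_mask;
	int rto_cpu;
};

/* Lowest set bit strictly above n, or SKETCH_NR_CPUS if none. */
static int sketch_next_in_mask(int n, unsigned int mask)
{
	for (int cpu = n + 1; cpu < SKETCH_NR_CPUS; cpu++) {
		if (mask & (1u << cpu))
			return cpu;
	}
	return SKETCH_NR_CPUS;
}

/* Return the next CPU worth sending the push IPI to, or -1. */
static int sketch_rto_next_cpu(struct sketch_rd *rd,
			       const struct sketch_rq *rqs)
{
	for (;;) {
		int cpu = sketch_next_in_mask(rd->rto_cpu, rd->rto_mask);

		rd->rto_cpu = cpu;
		if (cpu >= SKETCH_NR_CPUS)
			break;

		/* Bit still set in the mask, but nothing left to push. */
		if (!rqs[cpu].has_pushable_tasks)
			continue;

		/* A reschedule is already pending on that CPU. */
		if (rqs[cpu].curr_need_resched)
			continue;

		return cpu;
	}

	rd->rto_cpu = -1;
	return -1;
}
```

With CPU0 lacking pushable tasks and CPU1 already marked for rescheduling,
the model skips both and picks CPU2. Whether skipping the need-resched
case is safe in the real scheduler is exactly the open question in this
thread; the sketch takes no position on it.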

Sebastian