linux-kernel - Re: [PATCH] Fix tasks being forgotten for a long time on SMP

Message-ID: <20160923081504.GB5008@twins.programming.kicks-ass.net>
Date:   Fri, 23 Sep 2016 10:15:04 +0200
From:   Peter Zijlstra <peterz@...radead.org>
To:     Yuriy Romanenko <yromanenko@...rietech.com>
Cc:     Ingo Molnar <mingo@...hat.com>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] Fix tasks being forgotten for a long time on SMP

On Thu, Sep 22, 2016 at 12:06:03PM -0700, Yuriy Romanenko wrote:
> Hello,
> 
> I just spent three hours studying TOT scheduler code, and I am leaning
> towards this patch no longer being relevant (or never having been
> relevant, really).
> If it ever fixed a real issue, I think it fixed it incorrectly. However, I would
> love to understand what exactly we were seeing and why this helped at all.
> 
> So a little background on this. I found this change I made that I failed to
> upstream from a few years ago, so my memory is a little rusty. This
> was definitely an issue on the setup we had, which was an arch/arm
> mach-msm (now known as mach-qcom, I believe) fork of 3.4.0

Urgh, 3.4 is a _long_ time ago ;-) Let me check it out to remind myself
what it looked like.

> > WTH is SWFI and which arch has that?
> 
> SWFI is "Stop and Wait For Interrupt".  This appears to be a qcom
> specific term that they stopped using, so I apologize for that. It
> specifically refers to a CPU that is sitting in some sort of a "wait"
> or "wfi" instruction and not executing anything until an interrupt
> occurs.

Ah, ok. So I'm somewhat familiar with the ARM 'WFI' instruction and had
a vague idea this might be an ARM related part.

> > If the remote cpu is running the idle task, check_preempt_curr() should
> > very much wake it up; if it's not the idle class, it should never get
> > there because there is now an actually runnable task on it.
> 
> > Please explain in detail what happens and/or provide traces of this
> > happening.
> 
> We observed two relevant scenarios that are similar but distinct.
> Looking through top-of-tree code more carefully I don't see either as
> possible anymore, so the patch probably does not apply, but I would
> still defer to your expertise
> 
> We have a SCHED_FIFO thread that falls asleep in a poll() waiting for
> an interrupt from a device driver.
> 
> 1)  It is the last thread on that CPU, the CPU enters idle and goes
> into a 'wfi' instruction because of cpuidle_idle_call(), and will not
> wake up until it receives an interrupt of some sort (like IPI).  The
> interrupt occurs on a different CPU, the task is woken, but doesn't
> run until much later.
> 
> 2)  It is not the last thread on that CPU. The CPU switches to a lower
> priority task in a different class.  The interrupt occurs on a
> different CPU, the task is woken, but doesn't preempt until much
> later.
> 
> Unfortunately I don't have the ftrace logs from back then, and it
> would be hard to recreate the scenario.
> 
> If I recall correctly, the problem seemed to occur because
> cpus_share_cache() was true, so the task was never put on the wake_list,
> and correspondingly the scheduler_ipi() that was triggered by
> check_preempt_curr() did nothing. When it returned, it would somehow
> wind up in schedule(), where it did nothing because the task was not
> TASK_RUNNING.  Does that make any amount of sense at all?

So if cpus_share_cache() is true, we'll do the remote enqueue, that is,
the waking CPU will acquire the rq->lock of the CPU we want to wake the
task on and enqueue.

The code is like:

   raw_spin_lock(&rq->lock);
   ttwu_do_activate(rq, p, 0);
     ttwu_activate(...); // does the actual enqueue
     ttwu_do_wakeup(...);
       check_preempt_curr() // <-- this is supposed to issue the IPI

   raw_spin_unlock(&rq->lock);

In either scenario, 'current' is of a lower scheduling class (idle or
fair) than the newly woken task (fifo), so check_preempt_curr() should
find (class == p->sched_class) true and issue resched_task(rq->curr).
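
To make the class walk concrete, here is a minimal user-space sketch of the
idea, not the kernel code itself. The enum, its ordering and the name
class_preempts() are hypothetical stand-ins; the real function also defers
to the class's own check_preempt_curr() hook when both tasks are in the
same class, which this sketch simply reports as "no preemption".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for scheduling classes, highest priority first. */
enum sched_class_id { CLASS_STOP, CLASS_RT, CLASS_FAIR, CLASS_IDLE, CLASS_COUNT };

/*
 * Walk the classes from highest to lowest priority.  Whichever of the two
 * classes is met first wins: if it is the woken task's class, the current
 * task should be rescheduled.  Same-class wakeups return false here; the
 * kernel hands those to the class-specific preemption check instead.
 */
static bool class_preempts(enum sched_class_id curr, enum sched_class_id woken)
{
    assert(curr < CLASS_COUNT && woken < CLASS_COUNT);

    for (int class = 0; class < CLASS_COUNT; class++) {
        if (class == (int)curr)
            return false;   /* current's class comes first: leave it alone */
        if (class == (int)woken)
            return true;    /* woken task's class comes first: resched */
    }
    return false;
}
```

With this model, class_preempts(CLASS_IDLE, CLASS_RT) and
class_preempts(CLASS_FAIR, CLASS_RT) are both true, which matches the two
scenarios described above.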

resched_task() in turn tests TIF_NEED_RESCHED; if it is not set, it will
set it and issue an IPI (let's ignore the polling thing for now).
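
As a rough illustration of that test-then-set behaviour, the sketch below
models the flag with a C11 atomic and counts simulated IPIs. All names
(struct fake_cpu, fake_cpu_init(), fake_resched_task()) are made up for the
example, and it assumes the target is always a remote, non-polling CPU. The
kernel does the test and the set as separate steps under rq->lock; the
single atomic exchange here is a simplification.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a remote CPU. */
struct fake_cpu {
    atomic_bool need_resched;   /* stands in for TIF_NEED_RESCHED */
    int ipis_received;          /* number of simulated reschedule IPIs */
};

static void fake_cpu_init(struct fake_cpu *cpu)
{
    assert(cpu != NULL);
    atomic_init(&cpu->need_resched, false);
    cpu->ipis_received = 0;
}

/*
 * Set the flag and "send" an IPI only if the flag was previously clear.
 * Returns true when an IPI was sent, false when the flag was already set.
 */
static bool fake_resched_task(struct fake_cpu *cpu)
{
    assert(cpu != NULL);

    /* atomic_exchange() stores true and returns the previous value. */
    if (atomic_exchange(&cpu->need_resched, true))
        return false;           /* already pending: no second IPI */

    cpu->ipis_received++;       /* simulated smp_send_reschedule() */
    return true;
}
```

Calling fake_resched_task() twice on a freshly initialised fake_cpu sends
one IPI, not two; the second call sees the flag already set and returns.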

The IPI will execute scheduler_ipi(), and as you say, fail to find work
to do -- this is expected and OK.

_However_, on the return-to-user path (from interrupt), you're supposed
to check TIF_NEED_RESCHED and call schedule(). _THIS_ is what should
effect the actual task switch and get our freshly woken FIFO task
running.
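
The division of labour between the IPI handler and the interrupt return
path can be sketched in user space as follows. This is only a toy model
under stated assumptions: the fake_* names are hypothetical, the run queue
holds at most one waiting task, and a lower prio number means more urgent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct fake_task {
    const char *name;
    int prio;                   /* lower value = more urgent (assumption) */
};

struct fake_rq {
    struct fake_task *curr;     /* task currently on the CPU */
    struct fake_task *queued;   /* at most one waiting task, for brevity */
    bool need_resched;          /* stands in for TIF_NEED_RESCHED */
};

/* The IPI handler finds nothing on the wake_list and does no work. */
static void fake_scheduler_ipi(struct fake_rq *rq)
{
    (void)rq;
}

/* Clear the flag and switch to the queued task if it is more urgent. */
static void fake_schedule(struct fake_rq *rq)
{
    rq->need_resched = false;
    if (rq->queued && (!rq->curr || rq->queued->prio < rq->curr->prio)) {
        struct fake_task *prev = rq->curr;
        rq->curr = rq->queued;
        rq->queued = prev;
    }
}

/* Return-from-interrupt path: this is where the flag is acted upon. */
static void fake_irq_return_to_user(struct fake_rq *rq)
{
    assert(rq != NULL);
    if (rq->need_resched)
        fake_schedule(rq);
}
```

In this model, running fake_scheduler_ipi() alone leaves the fair task on
the CPU; only fake_irq_return_to_user() switches to the queued FIFO task.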

This is not to be confused with CONFIG_PREEMPT, which should check
TIF_NEED_RESCHED on any return-from-interrupt path and call
preempt_schedule().