linux-kernel - Re: [PATCH] sched/rt.c: pick and check task if double_lock

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20170925231843.140d3a21@vmware.local.home>
Date:   Mon, 25 Sep 2017 23:18:43 -0400
From:   Steven Rostedt <rostedt@...dmis.org>
To:     zhouchengming <zhouchengming1@...wei.com>
Cc:     <linux-kernel@...r.kernel.org>, <mingo@...hat.com>,
        <peterz@...radead.org>, <huawei.libin@...wei.com>
Subject: Re: [PATCH] sched/rt.c: pick and check task if
 double_lock_balance() unlock the rq

On Tue, 26 Sep 2017 09:23:20 +0800
zhouchengming <zhouchengming1@...wei.com> wrote:

> On 2017/9/26 3:40, Steven Rostedt wrote:
> > On Mon, 11 Sep 2017 14:51:49 +0800
> > Zhou Chengming<zhouchengming1@...wei.com>  wrote:
> >  
> >> push_rt_task() pick the first pushable task and find an eligible
> >> lowest_rq, then double_lock_balance(rq, lowest_rq). So if
> >> double_lock_balance() unlock the rq (when double_lock_balance() return 1),
> >> we have to check if this task is still on the rq.
> >>
> >> The problem is that the check conditions are not sufficient:
> >>
> >> if (unlikely(task_rq(task) != rq ||
> >> 	     !cpumask_test_cpu(lowest_rq->cpu,&task->cpus_allowed) ||
> >> 	     task_running(rq, task) ||
> >> 	     !rt_task(task) ||
> >> 	     !task_on_rq_queued(task))) {
> >>
> >> cpu2				cpu1			cpu0
> >> push_rt_task(rq1)
> >>    pick task_A on rq1
> >>    find rq0
> >>      double_lock_balance(rq1, rq0)
> >>        unlock(rq1)
> >> 				rq1 __schedule
> >> 				  pick task_A run
> >> 				task_A sleep (dequeued)
> >>        lock(rq0)
> >>        lock(rq1)
> >>      do_above_check(task_A)
> >>        task_rq(task_A) == rq1
> >>        cpus_allowed unchanged
> >>        task_running == false
> >>        rt_task(task_A) == true
> >> 							try_to_wake_up(task_A)
> >> 							  select_cpu = cpu3
> >> 							  enqueue(rq3, task_A)  
> > How can this happen? The try_to_wake_up(task_A) needs to grab the rq
> > that task A is on, and we have that rq lock.
> >
> > /me confused.
> >
> > -- Steve  
> 
> Thanks for the reply!
> After the task_A sleep on cpu1, the try_to_wake_up(task_A) on cpu0 select a different cpu3,
> so it will grab the rq3 lock, not the rq1 lock.

Ah crap. This is caused by 7608dec2ce20 ("sched: Drop the rq argument
to sched_class::select_task_rq()"). Because this code depends on
try_to_wake_up() grabbing the task's rq lock. But it no longer does
that, and it causes this race.

OK, I need to look at this deeper when I'm not so jetlagged and typing
this because I can't sleep at 5am.

Thanks for pointing this out!

It may be fixed by simply grabbing the run queue lock on migration, as
that would sync things up.

Peter?


-- Steve