lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <53270C01.6090209@linux.vnet.ibm.com>
Date:	Mon, 17 Mar 2014 10:51:45 -0400
From:	"Jason J. Herne" <jjherne@...ux.vnet.ibm.com>
To:	Peter Zijlstra <peterz@...radead.org>, Tejun Heo <tj@...nel.org>
CC:	Lai Jiangshan <laijs@...fujitsu.com>, linux-kernel@...r.kernel.org,
	Ingo Molnar <mingo@...hat.com>
Subject: Re: Subject: Warning in workqueue.c

On 03/10/2014 10:37 AM, Jason J. Herne wrote:
> On 02/25/2014 05:37 AM, Peter Zijlstra wrote:
>> On Mon, Feb 24, 2014 at 01:35:01PM -0500, Tejun Heo wrote:
>>
>>> That's a bummer but it at least isn't a very new regression.  Peter,
>>> any ideas on debugging this?  I can make workqueue to play block /
>>> unblock dance to try to work around the issue but that'd be very
>>> yucky.  It'd be great to root cause where the cpu selection anomaly is
>>> coming from.
>>
>> I'm assuming you're using set_cpus_allowed_ptr() to flip them between
>> CPUs; the below adds some error paths to that code. In particular we
>> propagate the __migrate_task() fail (returns the number of tasks
>> migrated) through the stop_one_cpu() into set_cpus_allowed_ptr().
>>
>> This way we can see if there was a problem with the migration.
>>
>> You should be able to now reliably use the return value of
>> set_cpus_allowed_ptr() to tell if it is now running on a CPU in its
>> allowed mask.
>>
>> I've also included an #if 0 retry loop for the fail case; but I suspect
>> that that might end up deadlocking your machine if you hit that just
>> wrong, something like the waking CPU endlessly trying to migrate the
>> task over while the wakee CPU is waiting for completion of something
>> from the waking CPU.
>>
>> But its worth a prod I suppose.
>>
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 84b23cec0aeb..4c384efac8b3 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -4554,18 +4554,28 @@ int set_cpus_allowed_ptr(struct task_struct
>> *p, const struct cpumask *new_mask)
>>
>>       do_set_cpus_allowed(p, new_mask);
>>
>> +again: __maybe_unused
>> +
>>       /* Can the task run on the task's current CPU? If so, we're done */
>> -    if (cpumask_test_cpu(task_cpu(p), new_mask))
>> +    if (cpumask_test_cpu(task_cpu(p), tsk_cpus_allowed(p)))
>>           goto out;
>>
>> -    dest_cpu = cpumask_any_and(cpu_active_mask, new_mask);
>> +    dest_cpu = cpumask_any_and(cpu_active_mask, tsk_cpus_allowed(p));
>>       if (p->on_rq) {
>>           struct migration_arg arg = { p, dest_cpu };
>> +
>>           /* Need help from migration thread: drop lock and wait. */
>>           task_rq_unlock(rq, p, &flags);
>> -        stop_one_cpu(cpu_of(rq), migration_cpu_stop, &arg);
>> +        ret = stop_one_cpu(cpu_of(rq), migration_cpu_stop, &arg);
>> +#if 0
>> +        if (ret) {
>> +            rq = task_rq_lock(p, &flags);
>> +            goto again;
>> +        }
>> +#endif
>>           tlb_migrate_finish(p->mm);
>> -        return 0;
>> +
>> +        return ret;
>>       }
>>   out:
>>       task_rq_unlock(rq, p, &flags);
>> @@ -4679,15 +4689,18 @@ void sched_setnuma(struct task_struct *p, int
>> nid)
>>   static int migration_cpu_stop(void *data)
>>   {
>>       struct migration_arg *arg = data;
>> +    int ret = 0;
>>
>>       /*
>>        * The original target cpu might have gone down and we might
>>        * be on another cpu but it doesn't matter.
>>        */
>>       local_irq_disable();
>> -    __migrate_task(arg->task, raw_smp_processor_id(), arg->dest_cpu);
>> +    if (!__migrate_task(arg->task, raw_smp_processor_id(),
>> arg->dest_cpu))
>> +        ret = -EAGAIN;
>>       local_irq_enable();
>> -    return 0;
>> +
>> +    return ret;
>>   }
>>
>>   #ifdef CONFIG_HOTPLUG_CPU
>>
>
> Peter,
>
> Did you intend for me to run with this patch or was it posted for
> discussion only? If you want it run, please tell me what to look for.
> Also, if I should run this, should I include any other patches, either
> the last one you posted in this thread or any of Tejun's?
>
> Thanks.
>

Ping?

-- 
-- Jason J. Herne (jjherne@...ux.vnet.ibm.com)

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ