Date:	Tue, 21 May 2013 22:14:54 +0200
From:	Frederic Weisbecker <fweisbec@...il.com>
To:	Steven Rostedt <rostedt@...dmis.org>
Cc:	LKML <linux-kernel@...r.kernel.org>,
	Ingo Molnar <mingo@...nel.org>,
	Peter Zijlstra <peterz@...radead.org>,
	Thomas Gleixner <tglx@...utronix.de>,
	"Paul E. McKenney" <paulmck@...ibm.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Paul Gortmaker <paul.gortmaker@...driver.com>,
	Tejun Heo <tj@...nel.org>
Subject: Re: [PATCH][3.10] nohz: Fix lockup on restart from wrong error code

2013/5/21 Steven Rostedt <rostedt@...dmis.org>:
> commit a382bf934449 "nohz: Assign timekeeping duty to a CPU outside the
> full dynticks range" added a cpu notifier callback that would prevent
> the time keeping CPU from going offline if the have_nohz_full_mask was
> set.
>
> This also prevents the CPU from going offline on system reboot.
>
> Worse yet, the return code was -EINVAL, but the notifier does not
> recognize error codes, and it must be wrapped by a notifier_from_errno()
> function. This means that even though the CPU would fail to go down, the
> notifier would think it succeeded, and the cpu down process would
> continue.
>
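To make the return-code point above concrete, here is a minimal userspace sketch of the notifier encoding. The constants and the two helpers are meant to mirror what include/linux/notifier.h does as far as I recall it; the sketch_ and SKETCH_ names are made up for this example and are not kernel symbols. It shows that a raw -EINVAL decodes back to 0 (success), while a value wrapped by the encoding helper round-trips to the original errno.

```c
#include <errno.h>

/* Assumed to mirror the kernel's notifier return values. */
#define SKETCH_NOTIFY_OK        0x0001
#define SKETCH_NOTIFY_STOP_MASK 0x8000

/* Encode an errno (0 or negative) into a notifier return value. */
static int sketch_notifier_from_errno(int err)
{
    if (err)
        return SKETCH_NOTIFY_STOP_MASK | (SKETCH_NOTIFY_OK - err);
    return SKETCH_NOTIFY_OK;
}

/* Decode a notifier return value back into an errno (0 or negative). */
static int sketch_notifier_to_errno(int ret)
{
    ret &= ~SKETCH_NOTIFY_STOP_MASK;
    /* Only values above NOTIFY_OK carry an encoded error. */
    return ret > SKETCH_NOTIFY_OK ? SKETCH_NOTIFY_OK - ret : 0;
}

/* What the caller sees when a callback returns a bare -EINVAL. */
static int sketch_decode_raw_einval(void)
{
    /* -EINVAL stays negative after masking, so it decodes to 0. */
    return sketch_notifier_to_errno(-EINVAL);
}

/* What the caller sees when the callback wraps the error first. */
static int sketch_decode_wrapped_einval(void)
{
    return sketch_notifier_to_errno(sketch_notifier_from_errno(-EINVAL));
}
```

With this encoding, sketch_decode_raw_einval() yields 0 and sketch_decode_wrapped_einval() yields -EINVAL, which matches the description above: the unwrapped error is read as success and the cpu down process carries on.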
> This caused two different problems. One, the migration thread after
> moving tasks from the CPU would park itself and then a task, namely the
> reboot task, could migrate onto that CPU. Then the reboot task spins
> waiting for the cpu to go idle. But because the reboot task happens to
> be spinning on the cpu it's waiting for, the system hangs.

Can that happen if that CPU is the boot CPU? Note this is the only
possible timekeeper with the upstream code.

>
> The other error that happened was that the sched_domain re-setup would
> get confused, and in get_group() the cpu = cpumask_first() would process
> a mask that had nothing set, and return cpu >= nr_cpu_ids. Later it would
> reference the per_cpu sg with this CPU and get a bogus pointer and
> crash.
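The empty-mask failure described above can be illustrated with a small userspace sketch. This is not the scheduler code: SKETCH_NR_CPU_IDS, sketch_cpumask_first() and sketch_lookup() are hypothetical stand-ins, and the only behaviour borrowed from the kernel is the convention that a first-bit search over an empty mask returns a value at or past the number of CPU ids. The guard in sketch_lookup() is the kind of check whose absence turns that sentinel into a bogus per-cpu pointer.

```c
#include <stddef.h>

#define SKETCH_NR_CPU_IDS 8u    /* hypothetical CPU count for the demo */

/* Index of the first set bit, or SKETCH_NR_CPU_IDS if the mask is empty. */
static unsigned int sketch_cpumask_first(unsigned long mask)
{
    for (unsigned int cpu = 0; cpu < SKETCH_NR_CPU_IDS; cpu++)
        if (mask & (1UL << cpu))
            return cpu;
    return SKETCH_NR_CPU_IDS;
}

/* Stand-in for a per-cpu array indexed by CPU number. */
static int sketch_per_cpu_data[SKETCH_NR_CPU_IDS];

/* Look up the per-cpu slot for the first CPU in the mask. */
static int *sketch_lookup(unsigned long mask)
{
    unsigned int cpu = sketch_cpumask_first(mask);

    /* Without this check an empty mask would index past the array. */
    if (cpu >= SKETCH_NR_CPU_IDS)
        return NULL;
    return &sketch_per_cpu_data[cpu];
}
```

Here sketch_lookup(0) returns NULL instead of an out-of-range address, while a mask with bit 2 set resolves to slot 2.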

Ouch, when are we doing this domain re-setup? I remember we
repartition the domains after cpu down/up but I don't understand how
that can interfere with this issue.