lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <1343929732.3010.614.camel@misato.fc.hp.com>
Date:	Thu, 02 Aug 2012 11:48:52 -0600
From:	Toshi Kani <toshi.kani@...com>
To:	Igor Mammedov <imammedo@...hat.com>
Cc:	linux-kernel@...r.kernel.org, prarit@...hat.com, oleg@...hat.com,
	rob@...dley.net, tglx@...utronix.de, mingo@...hat.com,
	hpa@...or.com, x86@...nel.org, luto@....edu,
	suresh b siddha <suresh.b.siddha@...el.com>, avi@...hat.com,
	a p zijlstra <a.p.zijlstra@...llo.nl>, johnstul@...ibm.com,
	a17711@...orola.com, shuahkhan@...il.com
Subject: Re: [PATCH 1/2] x86: abort secondary CPU bring-up gracefully if
 do_boot_cpu timed out on cpu_callin_mask

On Thu, 2012-08-02 at 11:34 +0200, Igor Mammedov wrote:
> Hi Toshi,
> 
> I'm sorry for delayed response, I was on vacation. Thanks for looking at 
> the patch, my comments are below.

Hi Igor,

Welcome back!

> PS:
> I'm not happy with introducing one more sync point and bitmat. Well, 
> it's necessary to somehow notify being on-lined CPU that master CPU will 
> wait for it, but perhaps it could be done even earlier than in this 
> patch and less stuff should be backed-out.
> 
> Currently master CPU spins on cpu_callin_mask till secondary CPU sets it
> in smp_callin(). Then master CPU leaves do_boot_cpu() and does nothing 
> in native_cpu_up() first waiting for secondary CPU in 
> check_tsc_sync_source() or if that is skipped then immediately spinning 
> on 'while (!cpu_online(cpu))'.
> Master CPU will do nothing and will not call any CPU notifiers until 
> secondary CPU reports that it is ONLINE by setting bit in 
> cpu_online_mask at the end of start_secondary().
> 
> So questions to experts are:
> 
>   1. what is purpose of cpu_callin_mask
> 
>   2. why master CPU spins on cpu_callin_mask but not on cpu_initialized_mask
> 
> In current state of code, it looks like cpu_callin_mask is not necessary 
> and we could remove it completely and spin on cpu_initialized_mask in 
> do_boot_cpu() instead. Then when master CPU
> sees secondary CPU in cpu_initialized_mask it could set cpu_callout_mask 
> to ack its intention not to cancel and wait until
> secondary CPU is booted.

I agree and I'd like to know the answers as well.  This way, the master
does not have to deal with secondary's back-back out procedure.

> PS2:
> I wish x86 maintainers were more responsive to the topic and in 
> discussion we could find a way to fix problem. With their expertise, 
> It's surely easier to spot potential issues and see correct approach for 
> the fix.
> 

(snip)

> > > +
> > > +die:
> > > +#ifdef CONFIG_HOTPLUG_CPU
> > > +	numa_remove_cpu(cpuid);
> >
> > smp_callin() calls numa_add_cpu(), so it makes sense to perform this
> > back-out from here.  However, do_boot_cpu() also calls this function
> > in
> > its error path.  I think we should change do_boot_cpu() to perform its
> > back-out to the master CPU's initialization code only.  This would keep
> > their responsibility clear and avoid any race condition.
> Reason to keep numa_remove_cpu() in do_boot_cpu() is for the case where
> being onlined CPU is permanently stuck on boot. In this case 
> numa_remove_cpu() would not be called from smp_callin().
> Anyway race is still there:
>   master CPU: numa_remove_cpu()
>    ... window with incorrect numa state
>   secondary CPU: numa_add_cpu()
>   secondary CPU: numa_remove_cpu()

Right.  Another example is that the master can call numa_remove_cpu()
after a secondary called numa_add_cpu().  If the secondary's next
procedure relies on numa_add_cpu() be done, it causes a problem.

Anyway, I like your idea of the master to wait for cpu_initialized_mask.
This should eliminate the need of the master to perform secondary's
back-out procedure.

> > Also, it would be helpful to have a comment like /* was set by
> > smp_store_cpu_info() */ to state this responsibility clearly.
> I'll fix it in next version.
> 
> >
> > > +	remove_siblinginfo(cpuid);
> > > +	clear_local_APIC();
> > > +	/* was set by cpu_init() */
> > > +	cpumask_clear_cpu(cpuid, cpu_initialized_mask);
> >
> > This is also called by do_boot_cpu().  Same comment as above.
> >
> > > +	cpumask_clear_cpu(cpuid, cpu_callin_mask);
> > > +	play_dead();
> > > +#endif
> > > +	panic("%s: Failed to online CPU%d!\n", __func__, cpuid);
> >
> > Why does it panic in case of !CONFIG_HOTPLUG_CPU?  Is this because user
> > cannot online this CPU later on, so it might be better off rebooting
> > with a panic?  Can I also assume that user can try to on-line this
> > failed CPU if CONFIG_HOTPLUG_CPU is set?  Some comment would be helpful
> > to clarify this behavior.
> User isn't able to online/offline CPUs if kernel is built without 
> CONFIG_HOTPLUG_CPU.
> Define is here to cover failed on boot CPU for non hotplug capable 
> kernel. A bunch of code to stop CPU is just not built for non hotplug
> kernel so what else to do except of panicking?

Another option is to simply boot-up the system with a reduced number of
CPUs for all cases.  This way has advantage when:

 - If resume/suspend works without CONFIG_HOTPLUG_CPU, it avoids
crashing the system at resume.
 - User can boot-up and uses the system with a reduced number of CPUs
even if the error persists after a reboot.


Thanks,
-Toshi 




--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ