Message-ID: <20150517052612.GA16607@gmail.com>
Date:	Sun, 17 May 2015 07:26:48 +0200
From:	Ingo Molnar <mingo@...nel.org>
To:	Len Brown <lenb@...nel.org>
Cc:	Jan H. Schönherr <jschoenh@...zon.de>,
	Thomas Gleixner <tglx@...utronix.de>, X86 ML <x86@...nel.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	Anthony Liguori <aliguori@...zon.com>,
	Ingo Molnar <mingo@...hat.com>,
	"H. Peter Anvin" <hpa@...or.com>, Tim Deegan <tim@....org>,
	Gang Wei <gang.wei@...el.com>,
	Linus Torvalds <torvalds@...ux-foundation.org>
Subject: Re: [PATCH] x86: skip delays during SMP initialization similar to Xen


* Len Brown <lenb@...nel.org> wrote:

> On Thu, May 14, 2015 at 1:57 PM, Ingo Molnar <mingo@...nel.org> wrote:
> >
> > * "Jan H. Schönherr" <jschoenh@...zon.de> wrote:
> >
> >> Ingo, do you want an updated version of the original patch, which
> >> takes care not to get stuck when the INIT deassertion is skipped, or
> >> do you prefer to address delays "one by one" as you wrote elsewhere?
> >
> > So I'm not against improving this code at all, but instead of this
> > hard to follow mixing of old and new code, I'd find the following
> > approach cleaner and more acceptable: create a 'modern' and a 'legacy'
> > SMP-bootup variant function, and do a clean separation based on the
> > CPU model cutoff condition used by Len's patches:
> >
> >         /* if modern processor, use no delay */
> >         if (((boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) && (boot_cpu_data.x86 == 6)) ||
> >             ((boot_cpu_data.x86_vendor == X86_VENDOR_AMD) && (boot_cpu_data.x86 >= 0xF)))
> >                 init_udelay = 0;
> >
> > Then in the modern variant we can become even more aggressive and
> > remove these kinds of delays as well:
> 
> Not sure it is worth two versions, since this is not where the big 
> time is spent.

There isn't much low-hanging fruit left, so reducing boot time is 
going to be 'lots of hard work', and part of that is making the SMP 
bringup path easier to review and tweak.

That said, we could pass in a delay scaling parameter, which could 
be zero on modern hardware. So something like:

	udelay(100*scale);

would automatically turn into a zero delay on modern hardware. On old 
hardware the scale could be 1.
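
One way to do it (totally untested sketch; 'init_delay_scale' is a 
made-up global rather than a literal parameter), reusing the CPU model 
cutoff from Len's patches:

	static unsigned int init_delay_scale = 1;	/* legacy default */

	static void __init smp_set_delay_scale(void)
	{
		/* Same CPU model cutoff as Len's patches: */
		if (((boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) &&
		     (boot_cpu_data.x86 == 6)) ||
		    ((boot_cpu_data.x86_vendor == X86_VENDOR_AMD) &&
		     (boot_cpu_data.x86 >= 0xF)))
			init_delay_scale = 0;	/* modern: no delays */
	}

and then every delay site becomes:

	udelay(100 * init_delay_scale);	/* zero delay on modern hw */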

OTOH there are other legacy details here as well, so maybe a split 
version would still be the most robust approach: a completely 
separate legacy path that we don't change all that often.

OTOH that would also mean 'we don't test it all that often', which 
could result in more breakages than sharing the path with the modern 
bootup code would.

So I'm a bit torn about that.

> See below.
> 
> >
> >                 udelay(300);
> 
> FWIW, MPS 1.4 suggests this should be 200, not 300.
> 
> >                 udelay(200);
> >
> > plus I'd suggest making these poll loops in smpboot.c loops narrower:
> >
> >                         udelay(100);
> 
> FWIW, on my desktop, this one executed 17 times (1700 usec).
> This is the time for the remote CPU to wake and get to cpu_init().
> Why is it a benefit to have any udelay() before invoking schedule()?

See further below.

> 
> >                         udelay(100);
> 
> This one didn't execute at all. Indeed, I don't understand why it
> exists, per the question above.
> 
>                 /*
>                  * Wait till AP completes initial initialization
>                  */
>                 while (!cpumask_test_cpu(cpu, cpu_callin_mask)) {
>                         /*
>                          * Allow other tasks to run while we wait for the
>                          * AP to come online. This also gives a chance
>                          * for the MTRR work (triggered by the AP coming online)
>                          * to be completed in the stop machine context.
>                          */
>                         udelay(100);
>                         schedule();
>                 }

So this is essentially a poor man's sched_yield().

I'd really love to see this moved to a 
wait_for_completion_timeout()/complete() loop instead. For that it 
would have to move out from under the irqs-off section on the boot CPU 
(so that timer irqs can come in), but that should be OK: the fragile 
part is on the AP CPU, not on the BP, I think.
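
Rough sketch of what I mean (untested; 'cpu_callin_done' is a made-up 
per-CPU completion, each init_completion()'d during early boot, which 
the AP would complete() where it currently sets cpu_callin_mask):

	static DEFINE_PER_CPU(struct completion, cpu_callin_done);

	/* BP side, replacing the udelay()+schedule() poll loop: */
	if (!wait_for_completion_timeout(&per_cpu(cpu_callin_done, cpu),
					 msecs_to_jiffies(1000)))
		pr_err("CPU%d did not call in\n", cpu);

	/* AP side, early in the cpu_init() path: */
	complete(&per_cpu(cpu_callin_done, smp_processor_id()));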

That would allow us, on truly modern hardware, to implement parallel 
bringup of all 120 CPUs, i.e. to first do a loop of triggering bootup 
on all the secondary CPUs, then collect them in a completion loop.
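
I.e. something like this (sketch; 'trigger_cpu_bringup' is a 
hypothetical helper that only sends the INIT/SIPI sequence and does 
not wait):

	/* Pass 1: kick all secondary CPUs, don't wait: */
	for_each_present_cpu(cpu)
		if (cpu != smp_processor_id())
			trigger_cpu_bringup(cpu);

	/* Pass 2: collect them via the per-CPU completions: */
	for_each_present_cpu(cpu)
		if (cpu != smp_processor_id())
			wait_for_completion(&per_cpu(cpu_callin_done, cpu));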

I'd add an easy debug boot option as well to serialize it all again, 
in case something breaks.
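
The debug switch could be as simple as (sketch, option name made up):

	static bool serialize_smp_bringup;

	/* hypothetical 'smpboot=serial' debug boot option */
	static int __init smpboot_setup(char *str)
	{
		if (!strcmp(str, "serial"))
			serialize_smp_bringup = true;
		return 1;
	}
	__setup("smpboot=", smpboot_setup);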

This would help parallelize much of the initialization overhead on the 
secondary CPUs:

> So, the latest TIP has the INIT udelay(10,000) removed,
> but cpu_up() still takes nearly 19,000 usec on a HSW desktop.
> 
> A quick scan of the ftrace output shows some high runners:
> 
> 18949.45 us cpu_up()
>         2450.580 us notifier_call_chain
>                  102.751 us thermal_throttle_cpu_callback
>                  289.313 us dpm_sysfs_add
>                 1019.594 us msr_class_cpu_callback
>                 ...
>         8455.462 us native_cpu_up()
>                  500.000 us = udelay(300) + udelay(200) Startup IPI
>                  500.000 us = udelay(300) + udelay(200) Startup IPI
>                 1700.000 us = 17 x udelay(100) waiting for AP in initialized_map
>                 2004.172 us  check_tsc_warp()
> 
>         7977.799 us cpu_notify()
>                 1588.108 us cpuset_cpu_active
>                 3043.955 us  cacheinfo_cpu_callback
>                 1146.234 us  mce_cpu_callback
>                  541.105 us  cpufreq_cpu_callback
>                  213.685 us  coretemp_cpu_callback
> 
> 
> cacheinfo_cpu_callback() time appears to be spent creating a bunch
> of sysfs nodes, which is apparently an expensive operation.
> 
> check_tsc_warp() is hard-coded to take 2 ms. I don't know if 2 ms is 
> a magic number or if a shorter delay would work just as well. It 
> seems a bit sad to do this serially for every CPU at boot, when we 
> could do all the CPUs in parallel after they are online. Perhaps 
> this should be invoked only at boot time and hot-add time; it 
> shouldn't be needed at all for soft online and resume.

So how come the TSC isn't X86_FEATURE_CONSTANT_TSC? That should skip 
the TSC sync-test.
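
I.e. the warp test could bail out early, something like this (sketch; 
whether CONSTANT_TSC alone is a strong enough guarantee here is an 
open question):

	/* in check_tsc_sync_source()/check_tsc_sync_target(): */
	if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC))
		return;	/* assumption: constant TSC needs no warp test */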

> Startup IPI delays.
> MPS 1.4 actually says 200+200, not the 300+200 that Linux uses.
> I don't know where the 300 came from; maybe it was a typo?

No, I think it was simple paranoia: our udelay() used to be imprecise 
(and still can be).

I'd not touch the delays on legacy hardware; I'd create a 'modern' 
bootup path with aggressive optimizations that new hardware should try 
hard to hit. If that causes problems on some hardware we can quirk 
those systems out. If the quirks look too widespread, we can make 
things less aggressive.

> msr_class_cpu_callback -- making device nodes is not fast.
> 
> I don't know if anything can be done about the 1700 usec wait
> for the remote processor to mark itself initialized.
> That is the first thing it does when it enters cpu_init().

So that 1.7 msec delay is essentially the firmware?

Thanks,

	Ingo
