lists.openwall.net - Open Source and information security mailing list archives
Date:	Thu, 18 Oct 2012 12:08:05 +0900 (JST)
From:	HATAYAMA Daisuke <d.hatayama@...fujitsu.com>
To:	vgoyal@...hat.com
Cc:	linux-kernel@...r.kernel.org, kexec@...ts.infradead.org,
	x86@...nel.org, mingo@...e.hu, tglx@...utronix.de, hpa@...or.com,
	len.brown@...el.com, fenghua.yu@...el.com, ebiederm@...ssion.com,
	grant.likely@...retlab.ca, rob.herring@...xeda.com
Subject: Re: [PATCH v1 0/2] x86, apic: Disable BSP if boot cpu is AP

From: Vivek Goyal <vgoyal@...hat.com>
Subject: Re: [PATCH v1 0/2] x86, apic: Disable BSP if boot cpu is AP
Date: Wed, 17 Oct 2012 10:12:34 -0400

> On Tue, Oct 16, 2012 at 01:35:17PM +0900, HATAYAMA Daisuke wrote:
>> Multiple CPUs are useful for CPU-bound processing like compression and
>> I do want to use compression to generate crash dump quickly. But now
>> we cannot wakeup the 2nd and later cpus in the kdump 2nd kernel if
>> crash happens on AP. If crash happens on AP, kexec enters the 2nd
>> kernel with the AP, and there BSP in the 1st kernel is expected to be
>> halting in the 1st kernel or possibly in some fatal system error state.
> 
> Hatayama san,
> 

Hello Vivek,

> Do you have any rough numbers on what kind of speed up we are looking
> at. IOW, what % of time is gone compressing a filtered dump. On large
> memory machines, saving huge dump files is anyway not an option due to
> time it takes. So we need to filter it to bare minimum and after that
> vmcore size should be reasonable and compression time might not be a
> big factor. Hence I am curious what kind of gains we are looking at.
> 

I ran two kinds of benchmarks: 1) to evaluate how well compression
combined with writing the dump to multiple disks performs for crash
dump, and 2) to compare three compression algorithms --- zlib, lzo and
snappy --- for crash dump use.

From 1), 4 disks with 4 cpus reach 300 MB/s of compressed throughput
with snappy, i.e. about 1 hour for 1 TB. Note that in this benchmark
the sample data is intentionally randomized so that compression barely
reduces its size; most real dumps should be quicker. See also
bench_comp_multi_IO.tar.xz for an image of the graph.
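As a back-of-the-envelope check of the figures above (the 300 MB/s and
1 TB values are taken from the benchmark, not re-measured here):

```python
# Sanity-check the "1 hour for 1 TB" estimate: time to push a 1 TB
# unfiltered dump through a 300 MB/s aggregate compress-and-write path.
throughput_mb_s = 300          # aggregate rate from the 4-disk benchmark
dump_size_mb = 1024 * 1024     # 1 TB expressed in MB

seconds = dump_size_mb / throughput_mb_s
minutes = seconds / 60
print(f"{dump_size_mb} MB at {throughput_mb_s} MB/s -> {minutes:.0f} minutes")
# roughly one hour, matching the estimate above
```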

In the future, I plan to redo this benchmark with faster SSDs once I
can get them.

From 2), zlib, which is used by makedumpfile -c, turns out to be too
slow for crash dump use. lzo and snappy are fast and achieve
compression ratios close to zlib's. In particular, snappy's speed
stays stable regardless of the proportion of randomized data. See also
bench_compare_zlib_lzo_snappy.tar.xz for an image of the graph.
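A minimal sketch of benchmark 2), using only the stdlib: it measures
how compression ratio and speed change as the fraction of
incompressible (random) data grows. lzo and snappy would need
third-party bindings (e.g. python-lzo, python-snappy), so only zlib is
shown; the buffer size is an arbitrary choice for illustration.

```python
import os
import time
import zlib

BUF_MB = 4
buf_size = BUF_MB * 1024 * 1024

for random_fraction in (0.0, 0.5, 1.0):
    random_len = int(buf_size * random_fraction)
    # The random part models pages that do not compress; the zero part
    # models easily compressible memory.
    sample = os.urandom(random_len) + bytes(buf_size - random_len)
    start = time.perf_counter()
    compressed = zlib.compress(sample, level=6)
    elapsed = time.perf_counter() - start
    ratio = len(compressed) / len(sample)
    print(f"random={random_fraction:.0%} ratio={ratio:.2f} "
          f"speed={BUF_MB / elapsed:.1f} MB/s")
```

On fully random input the ratio approaches 1.0 (no size reduction),
which is the worst case the multi-disk benchmark above was exercising.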

BTW, makedumpfile has supported lzo since v1.4.4 and is going to
support snappy in v1.5.1.

OTOH, we have some requirements where we cannot use filtering.
Examples are:

- A high-availability cluster system where the application triggers a
  crash dump in order to switch from the active node to the inactive
  node quickly. We later retrieve the application image as a process
  core dump and analyze it, so we cannot filter user-space memory.

- On a qemu/kvm environment, we sometimes face a complicated bug
  caused by interaction between the guest and the host.

  For example, there was previously a bug causing the guest machine to
  hang: the IO scheduler wrongly treated a guest's request as smaller
  than it actually was, so the guest waited endlessly for IO
  completion, repeating VMenter-VMexit forever.

  To address this kind of bug, we first reproduce it, capture the
  situation in the host's crash dump, and then analyze the bug by
  investigating the situation from both the host's and the guest's
  views. In the bug above, we saw that the guest machine was waiting
  for IO, and we could resolve the issue relatively quickly. For
  complicated bugs involving qemu/kvm, having both the host and guest
  views is very helpful.

  The guest image lives in the user-space memory of the qemu process,
  so again we cannot filter user-space memory.

- Filesystem people say the page cache is often necessary for
  analyzing a crash dump.

Of course, we do use filtering on systems where no such requirement
applies.
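The filtering trade-off above maps onto makedumpfile's -d dump_level
bitmask; this sketch just composes the bits (values as documented in
makedumpfile(8)). On systems that need user-space memory (process
cores, qemu guest images), the user-pages bit has to stay clear:

```python
# makedumpfile -d dump_level bits, per makedumpfile(8):
ZERO_PAGES  = 1   # pages filled with zero
CACHE       = 2   # non-private page cache
CACHE_PRIV  = 4   # private page cache
USER_PAGES  = 8   # user process data pages
FREE_PAGES  = 16  # free pages

filter_everything = ZERO_PAGES | CACHE | CACHE_PRIV | USER_PAGES | FREE_PAGES
keep_user_memory = filter_everything & ~USER_PAGES       # the cluster/qemu cases
keep_page_cache = filter_everything & ~(CACHE | CACHE_PRIV)  # the fs-analysis case

print(f"-d {filter_everything}")  # -d 31: maximum filtering
print(f"-d {keep_user_memory}")   # -d 23: user-space memory stays in the dump
print(f"-d {keep_page_cache}")    # -d 25: page cache stays in the dump
```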

>> 
>> To wake up AP, we use the method called INIT-INIT-SIPI. INIT causes
>> BSP to jump into BIOS init code. A typical visible behaviour is hang
>> or immediate reset, depending on the BIOS init code.
>> 
>> AP can be initiated by INIT even in a fatal state: MP spec explains
>> that processor-specific INIT can be used to recover AP from a fatal
>> system error. On the other hand, there's no method for BSP to recover;
>> it might be possible to do so by NMI plus any hand-coded reset code
>> that is carefully designed, but at least I have no idea in this
>> direction now.
>> 
>> Therefore, the idea I do in this patch set is simply to disable BSP if
>> boot cpu is AP.
> 
> So in regular boot BSP still works as we boot on BSP. So this will take
> effect only in kdump kernel?
> 

Yes, this patch takes effect only when the boot cpu is not the BSP but
an AP, and that happens only in the kexec case.

> How well does it work with nr_cpus kernel parameter. Currently we boot
> with nr_cpus=1 to save upon amount of memory to be reserved. I guess
> you might want to boot with nr_cpus=2 or nr_cpus=4 in your case to
> speed up compression?

Exactly. It seems reasonable to specify at most nr_cpus=4 on usual
machines, because reserved memory is severely limited and it is hard
to justify attaching many disks solely for crash dump use without a
special requirement.

But there may be systems where the crash dump must definitely complete
quickly, and for them more reserved memory and more disks are not a
problem. On such systems, I think it must be possible to set up more
reserved memory and more cpus.
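The reason extra cpus help at all is that compression is CPU-bound, so
the dump can be split into chunks compressed on a worker pool, scaling
roughly with nr_cpus as long as the disks absorb the output. A minimal
sketch of that scheme (chunk size and worker count are arbitrary
illustration values, not what makedumpfile does internally):

```python
import os
import zlib
from concurrent.futures import ProcessPoolExecutor

CHUNK_SIZE = 1 * 1024 * 1024  # 1 MB chunks; an arbitrary choice


def compress_chunk(chunk: bytes) -> bytes:
    # Fast compression level, in the spirit of lzo/snappy.
    return zlib.compress(chunk, level=1)


def compress_parallel(data: bytes, workers: int) -> list:
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    # Each chunk compresses independently, so the pool keeps all
    # `workers` cpus busy.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(compress_chunk, chunks))


if __name__ == "__main__":
    # Mix of incompressible and compressible "memory".
    data = os.urandom(2 * CHUNK_SIZE) + bytes(6 * CHUNK_SIZE)
    compressed = compress_parallel(data, workers=4)
    restored = b"".join(zlib.decompress(c) for c in compressed)
    assert restored == data
    print(f"{len(data)} bytes -> {sum(map(len, compressed))} bytes "
          f"in {len(compressed)} chunks")
```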

> 
> 
> [..]
>> Note: recent upstream kernel fails reserving memory for kdump 2nd
>> kernel. To run kdump, please apply the patch below on top of this
>> patch set:
>> https://lkml.org/lkml/2012/8/31/238
> 
> Above is a big issue. 3.6 kernel is broken and I can't take dump on F18
> either. (works only on one machine). I have not looked enough into it
> the issue to figure out what's the issue at hand, but we really need
> atleast a stop gap fix (assuming others are working on longer term 
> fix).
> 

To be honest, I haven't followed this for some months, but I expect
the thread below to fix this issue. What is the status of the bzImage
limitation?

http://lists.infradead.org/pipermail/kexec/2012-September/006855.html

Thanks.
HATAYAMA, Daisuke

Download attachment "bench_comp_multi_IO.tar.xz" of type "Application/Octet-Stream" (62112 bytes)

Download attachment "bench_compare_zlib_lzo_snappy.tar.xz" of type "Application/Octet-Stream" (147120 bytes)
