linux-kernel - Re: [PATCH v12 7/7] x86/crash: Add x86 crash hotplug support


Message-ID: <Y0d+mFivS+88+Chr@MiWiFi-R3L-srv>
Date:   Thu, 13 Oct 2022 10:57:28 +0800
From:   Baoquan He <bhe@...hat.com>
To:     Borislav Petkov <bp@...en8.de>
Cc:     Eric DeVolder <eric.devolder@...cle.com>,
        Oscar Salvador <osalvador@...e.de>,
        Andrew Morton <akpm@...ux-foundation.org>, david@...hat.com,
        linux-kernel@...r.kernel.org, x86@...nel.org,
        kexec@...ts.infradead.org, ebiederm@...ssion.com,
        dyoung@...hat.com, vgoyal@...hat.com, tglx@...utronix.de,
        mingo@...hat.com, dave.hansen@...ux.intel.com, hpa@...or.com,
        nramas@...ux.microsoft.com, thomas.lendacky@....com,
        robh@...nel.org, efault@....de, rppt@...nel.org,
        sourabhjain@...ux.ibm.com, linux-mm@...ck.org
Subject: Re: [PATCH v12 7/7] x86/crash: Add x86 crash hotplug support

On 10/12/22 at 10:41pm, Borislav Petkov wrote:
> On Wed, Oct 12, 2022 at 03:19:19PM -0500, Eric DeVolder wrote:
> > We run here QEMU with the ability for 1024 DIMM slots.
> 
> QEMU, haha.
> 
> What is the highest count of DIMM slots which are hotpluggable on a
> real, *physical* system today? Are you saying you can have 1K DIMM slots
> on a board?

The concern about the number of ranges is mainly for virt guest systems. On
bare metal, basically only very high-end servers support memory hotplug.
I once visited a customer's lab and saw one such server: it has 8 slots, and
into each slot a box with about 20 CPUs and at most 2T of memory can be
plugged at one time. So people won't build in too many slots for
hotplugging, since it's too expensive.

I checked the user space kexec code: the maximum number of memory ranges
was raised for x86_64 because of an HPE SGI system. Since then, nobody
has complained about it. Please see the user space kexec-tools commit below, from
https://git.kernel.org/pub/scm/utils/kernel/kexec/kexec-tools.git

The memory ranges may not all come from different DIMM slots; some could be
firmware reservations, e.g. physical memory carved out by EFI/BIOS, or
parts of the CPU logical address space occupied by PCI or other stuff. I don't
have an HPE SGI system at hand to check.

commit 4a6d67d9e938a7accf128aff23f8ad4bda67f729
Author: Xunlei Pang <xlpang@...hat.com>
Date:   Thu Mar 23 19:16:59 2017 +0800

    x86: Support large number of memory ranges

    We got a problem on one SGI 64TB machine, the current kexec-tools
    failed to work due to the insufficient ranges(MAX_MEMORY_RANGES)
    allowed which is defined as 1024(less than the ranges on the machine).
    The kcore header is insufficient due to the same reason as well.

    To solve this, this patch simply doubles "MAX_MEMORY_RANGES" and
    "KCORE_ELF_HEADERS_SIZE".

    Signed-off-by: Xunlei Pang <xlpang@...hat.com>
    Tested-by: Frank Ramsay <frank.ramsay@....com>
    Signed-off-by: Simon Horman <horms@...ge.net.au>

diff --git a/kexec/arch/i386/kexec-x86.h b/kexec/arch/i386/kexec-x86.h
index 33df3524f4e2..51855f8db762 100644
--- a/kexec/arch/i386/kexec-x86.h
+++ b/kexec/arch/i386/kexec-x86.h
@@ -1,7 +1,7 @@
 #ifndef KEXEC_X86_H
 #define KEXEC_X86_H

-#define MAX_MEMORY_RANGES 1024
+#define MAX_MEMORY_RANGES 2048

> 
> I hardly doubt that.

The question is reasonable. 32K really does look like too much.

Currently CONFIG_NR_CPUS has a maximum of 8192, and user space
kexec-tools has a maximum of 2048 memory ranges. We could conservatively
take the current 8192 + 2048 = 10K as the default value. Or
take 8192 + 2048 * 2 = 12K, which allows twice the maximum number of
memory ranges in kexec-tools. What do you think?
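
To make the arithmetic easy to re-check, here is a minimal sketch in C. It is
only an illustration, not code from the patch series: the macro and function
names are hypothetical, and the worst-case helper assumes every memblock ends
up as its own memory region, as in Eric's 128MiB example quoted below.

```c
#include <stdint.h>

/* Values quoted in this thread; the macro names are made up for this sketch. */
#define SKETCH_NR_CPUS_MAX        8192u /* current CONFIG_NR_CPUS maximum */
#define SKETCH_KEXEC_MAX_RANGES   2048u /* kexec-tools MAX_MEMORY_RANGES */

/*
 * Hypothetical helper: proposed default element count, i.e. the maximum
 * CPU count plus range_factor times the kexec-tools memory range limit.
 */
static unsigned int sketch_default_elements(unsigned int range_factor)
{
	return SKETCH_NR_CPUS_MAX + range_factor * SKETCH_KEXEC_MAX_RANGES;
}

/*
 * Hypothetical helper: worst-case number of memory regions, assuming each
 * memblock of memblock_bytes becomes a separate region.
 */
static uint64_t sketch_worst_case_regions(uint64_t mem_bytes,
					  uint64_t memblock_bytes)
{
	return mem_bytes / memblock_bytes;
}
```

With range_factor 1 and 2 the first helper gives 10240 (10K) and 12288 (12K).
The second gives 8192 regions for 1TiB and 32768 for 4TiB with 128MiB
memblocks, which is where the 32K figure comes from.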

> 
> > So, for example, 1TiB requires 1024 DIMMs of 1GiB each with 128MiB
> > memblocks, that results in 8K possible memory regions. So just going
> > to 4TiB reaches 32K memory regions.
> 
> Lemme see if I understand this correctly: when a system like that
> crashes, you want to kdump *all* those 4TiB in a vmcore? How long would
> that dump take to complete? A day?

That is not a problem. The time to dump a vmcore depends mainly on the
actual memory size, not on the number of memory ranges. When dumping a vmcore,
people use makedumpfile to filter out zero pages, free pages, cache pages, or
user data pages according to configuration. If memory is huge, they can use
nr_cpus=x to enable multiple CPUs for multi-threaded dumping. Kdump now
supports dumping vmcores of more than 10 TB.