Message-ID: <mafs0h64dk6wf.fsf@kernel.org>
Date: Fri, 28 Feb 2025 23:04:48 +0000
From: Pratyush Yadav <pratyush@...nel.org>
To: Mike Rapoport <rppt@...nel.org>
Cc: Pratyush Yadav <pratyush@...nel.org>, linux-kernel@...r.kernel.org,
Alexander Graf <graf@...zon.com>, Andrew Morton
<akpm@...ux-foundation.org>, Andy Lutomirski <luto@...nel.org>, Anthony
Yznaga <anthony.yznaga@...cle.com>, Arnd Bergmann <arnd@...db.de>,
Ashish Kalra <ashish.kalra@....com>, Benjamin Herrenschmidt
<benh@...nel.crashing.org>, Borislav Petkov <bp@...en8.de>, Catalin
Marinas <catalin.marinas@....com>, Dave Hansen
<dave.hansen@...ux.intel.com>, David Woodhouse <dwmw2@...radead.org>,
Eric Biederman <ebiederm@...ssion.com>, Ingo Molnar <mingo@...hat.com>,
James Gowans <jgowans@...zon.com>, Jonathan Corbet <corbet@....net>,
Krzysztof Kozlowski <krzk@...nel.org>, Mark Rutland
<mark.rutland@....com>, Paolo Bonzini <pbonzini@...hat.com>, Pasha
Tatashin <pasha.tatashin@...een.com>, "H. Peter Anvin" <hpa@...or.com>,
Peter Zijlstra <peterz@...radead.org>, Rob Herring <robh+dt@...nel.org>,
Rob Herring <robh@...nel.org>, Saravana Kannan <saravanak@...gle.com>,
Stanislav Kinsburskii <skinsburskii@...ux.microsoft.com>, Steven Rostedt
<rostedt@...dmis.org>, Thomas Gleixner <tglx@...utronix.de>, Tom
Lendacky <thomas.lendacky@....com>, Usama Arif
<usama.arif@...edance.com>, Will Deacon <will@...nel.org>,
devicetree@...r.kernel.org, kexec@...ts.infradead.org,
linux-arm-kernel@...ts.infradead.org, linux-doc@...r.kernel.org,
linux-mm@...ck.org, x86@...nel.org
Subject: Re: [PATCH v4 00/14] kexec: introduce Kexec HandOver (KHO)
On Fri, Feb 28 2025, Mike Rapoport wrote:
> Hi Pratyush,
>
> On Wed, Feb 26, 2025 at 08:08:27PM +0000, Pratyush Yadav wrote:
>> Hi Mike,
>>
>> On Thu, Feb 06 2025, Mike Rapoport wrote:
>>
>> > From: "Mike Rapoport (Microsoft)" <rppt@...nel.org>
>> >
>> > Hi,
>> >
>> > This is the next version of Alex's "kexec: Allow preservation of ftrace buffers"
>> > series (https://lore.kernel.org/all/20240117144704.602-1-graf@amazon.com);
>> > to make things simpler, instead of ftrace we decided to preserve
>> > "reserve_mem" regions.
>> [...]
>>
>> I applied the patches on top of v6.14-rc1 and tried them out on an x86
>> QEMU machine. When I do a plain KHO activate and kexec, I get the below
>> errors on boot. This causes networking to fail on the VM. The errors are
>> consistent and happen on every kexec-reboot, though fairly late in boot,
>> after systemd tries to bring up the network. The same setup has worked fine
>> with Alex's v3 of the KHO patches.
>>
>> Do you see anything obvious that might cause this? I can try to debug
>> this tomorrow, but if it rings any loud bells it would be nice to know.
>
> Thanks for the report!
> It didn't ring any bells, but I've found the issue and a
> fast-and-dirty fix.
>
> The scratch areas are allocated from high addresses and there is no scratch
> memory to satisfy memblock_alloc_low() in swiotlb, so the second kernel produces
> a couple of
>
> software IO TLB: swiotlb_memblock_alloc: Failed to allocate 67108864 bytes for tlb structure
I also did some digging today and ended up finding the same thing,
but it seems you got there before me :-)
>
> and without those buffers e1000 can't do DMA :(
>
> A quick fix would be to add another scratch area in the lower memory
> (below). I'll work on a better fix.
I have already written a less-quick fix (patch pasted below) so I
suppose we can use that to review the idea instead. It adds a dedicated
scratch area for lowmem, similar to your patch, and adds some tracking
to calculate the size.
I am not sure if the size estimation is completely right though, since
it is possible that allocations that don't _need_ to be in lowmem end up
being there, causing the scratch area to be too big (or perhaps even
causing allocation failures if the scale is big enough). Maybe we would
be better off tracking lowmem allocation requests separately?
----- 8< -----
From d60aeb2c4a1c0eea05e1a13b48b268d6192a615e Mon Sep 17 00:00:00 2001
From: Pratyush Yadav <ptyadav@...zon.de>
Date: Fri, 28 Feb 2025 22:36:06 +0000
Subject: [PATCH] KHO: always have a lowmem scratch region
During initialization, some callers need to allocate low memory from
memblock. One such caller is swiotlb_memblock_alloc() on x86. The global
and per-node scratch regions are allocated without any constraints on
the range. This can lead to having no scratch region in lowmem. If that
happens, the lowmem allocations can fail, leading to failures during
boot or later down the line.
Always ensure there is some scratch memory available in low memory by
having a separate scratch area for it, along with the global and
per-node ones, and allow specifying its size via the command line.
To more accurately guess suitable scratch sizes, add
memblock_reserved_kern_lowmem_size() and
memblock_reserved_kern_highmem_size() which calculate how much memory
was allocated in low and high memory, along with some helper functions
to calculate scratch sizes.
Signed-off-by: Pratyush Yadav <ptyadav@...zon.de>
---
.../admin-guide/kernel-parameters.txt | 9 +-
Documentation/kho/usage.rst | 10 +--
include/linux/memblock.h | 2 +
kernel/kexec_handover.c | 83 ++++++++++++++-----
mm/memblock.c | 28 +++++++
5 files changed, 103 insertions(+), 29 deletions(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index ed656e2fb05ef..7c5afd45ad9dc 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2705,7 +2705,7 @@
"1" | "on" | "y" - kexec handover is enabled
kho_scratch= [KEXEC,EARLY]
- Format: nn[KMG],mm[KMG] | nn%
+ Format: ll[KMG],nn[KMG],mm[KMG] | nn%
Defines the size of the KHO scratch region. The KHO
scratch regions are physically contiguous memory
ranges that can only be used for non-kernel
@@ -2715,9 +2715,10 @@
bootstrap itself.
It is possible to specify the exact amount of
- memory in the form of "nn[KMG],mm[KMG]" where the
- first parameter defines the size of a global
- scratch area and the second parameter defines the
+ memory in the form of "ll[KMG],nn[KMG],mm[KMG]" where the
+ first parameter defines the size of a low memory scratch
+ area, the second parameter defines the size of a global
+ scratch area and the third parameter defines the
size of additional per-node scratch areas.
The form "nn%" defines scale factor (in percents)
of memory that was used during boot.
diff --git a/Documentation/kho/usage.rst b/Documentation/kho/usage.rst
index e7300fbb309c1..6a6011809795d 100644
--- a/Documentation/kho/usage.rst
+++ b/Documentation/kho/usage.rst
@@ -19,11 +19,11 @@ at compile time. Every KHO producer may have its own config option that you
need to enable if you would like to preserve their respective state across
kexec.
-To use KHO, please boot the kernel with the ``kho=on`` command line
-parameter. You may use ``kho_scratch`` parameter to define size of the
-scratch regions. For example ``kho_scratch=512M,512M`` will reserve a 512
-MiB for a global scratch region and 512 MiB per NUMA node scratch regions
-on boot.
+To use KHO, please boot the kernel with the ``kho=on`` command line parameter.
+You may use ``kho_scratch`` parameter to define size of the scratch regions. For
+example ``kho_scratch=128M,512M,512M`` will reserve a 128 MiB low memory scratch
+region, a 512 MiB global scratch region and 512 MiB per NUMA node scratch
+regions on boot.
Perform a KHO kexec
-------------------
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 20887e199cdbd..9f5c5aec4b1d4 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -504,6 +504,8 @@ static inline __init_memblock bool memblock_bottom_up(void)
phys_addr_t memblock_phys_mem_size(void);
phys_addr_t memblock_reserved_size(void);
phys_addr_t memblock_reserved_kern_size(int nid);
+phys_addr_t memblock_reserved_kern_lowmem_size(void);
+phys_addr_t memblock_reserved_kern_highmem_size(void);
unsigned long memblock_estimated_nr_free_pages(void);
phys_addr_t memblock_start_of_DRAM(void);
phys_addr_t memblock_end_of_DRAM(void);
diff --git a/kernel/kexec_handover.c b/kernel/kexec_handover.c
index c26753d613cbc..29eeed09ceb31 100644
--- a/kernel/kexec_handover.c
+++ b/kernel/kexec_handover.c
@@ -547,20 +547,21 @@ late_initcall(kho_init);
*
* kho_scratch=N%
*
- * It is also possible to explicitly define size for a global and per-node
- * scratch areas:
+ * It is also possible to explicitly define size for a lowmem, a global and
+ * per-node scratch areas:
*
- * kho_scratch=n[KMG],m[KMG]
+ * kho_scratch=l[KMG],n[KMG],m[KMG]
*
* The explicit size definition takes precedence over scale definition.
*/
static unsigned int scratch_scale __initdata = 200;
static phys_addr_t scratch_size_global __initdata;
static phys_addr_t scratch_size_pernode __initdata;
+static phys_addr_t scratch_size_lowmem __initdata;
static int __init kho_parse_scratch_size(char *p)
{
- unsigned long size, size_pernode;
+ unsigned long size, size_pernode, size_global;
char *endptr, *oldp = p;
if (!p)
@@ -578,15 +579,25 @@ static int __init kho_parse_scratch_size(char *p)
if (*p != ',')
return -EINVAL;
+ oldp = p;
+ size_global = memparse(p + 1, &p);
+ if (!size_global || p == oldp)
+ return -EINVAL;
+
+ if (*p != ',')
+ return -EINVAL;
+
size_pernode = memparse(p + 1, &p);
if (!size_pernode)
return -EINVAL;
- scratch_size_global = size;
+ scratch_size_lowmem = size;
+ scratch_size_global = size_global;
scratch_size_pernode = size_pernode;
scratch_scale = 0;
- pr_notice("scratch areas: global: %lluMB pernode: %lldMB\n",
+ pr_notice("scratch areas: lowmem: %lluMB global: %lluMB pernode: %lluMB\n",
+ (u64)(scratch_size_lowmem >> 20),
(u64)(scratch_size_global >> 20),
(u64)(scratch_size_pernode >> 20));
}
@@ -595,18 +606,38 @@ static int __init kho_parse_scratch_size(char *p)
}
early_param("kho_scratch", kho_parse_scratch_size);
-static phys_addr_t __init scratch_size(int nid)
+static phys_addr_t __init scratch_size_low(void)
+{
+ phys_addr_t size;
+
+ if (scratch_scale)
+ size = memblock_reserved_kern_lowmem_size() * scratch_scale / 100;
+ else
+ size = scratch_size_lowmem;
+
+ return round_up(size, CMA_MIN_ALIGNMENT_BYTES);
+}
+
+static phys_addr_t __init scratch_size_high(void)
+{
+ phys_addr_t size;
+
+ if (scratch_scale)
+ size = memblock_reserved_kern_highmem_size() * scratch_scale / 100;
+ else
+ size = scratch_size_global;
+
+ return round_up(size, CMA_MIN_ALIGNMENT_BYTES);
+}
+
+static phys_addr_t __init scratch_size_node(int nid)
{
phys_addr_t size;
- if (scratch_scale) {
+ if (scratch_scale)
size = memblock_reserved_kern_size(nid) * scratch_scale / 100;
- } else {
- if (numa_valid_node(nid))
- size = scratch_size_pernode;
- else
- size = scratch_size_global;
- }
+ else
+ size = scratch_size_pernode;
return round_up(size, CMA_MIN_ALIGNMENT_BYTES);
}
@@ -623,29 +654,41 @@ static phys_addr_t __init scratch_size(int nid)
static void kho_reserve_scratch(void)
{
phys_addr_t addr, size;
- int nid, i = 1;
+ int nid, i = 0;
if (!kho_enable)
return;
/* FIXME: deal with node hot-plug/remove */
- kho_scratch_cnt = num_online_nodes() + 1;
+ kho_scratch_cnt = num_online_nodes() + 2;
size = kho_scratch_cnt * sizeof(*kho_scratch);
kho_scratch = memblock_alloc(size, PAGE_SIZE);
if (!kho_scratch)
goto err_disable_kho;
- /* reserve large contiguous area for allocations without nid */
- size = scratch_size(NUMA_NO_NODE);
- addr = memblock_phys_alloc(size, CMA_MIN_ALIGNMENT_BYTES);
+ /* reserve area for lowmem allocations. */
+ size = scratch_size_low();
+ addr = memblock_phys_alloc_range(size, CMA_MIN_ALIGNMENT_BYTES, 0,
+ ARCH_LOW_ADDRESS_LIMIT);
if (!addr)
goto err_free_scratch_desc;
kho_scratch[0].addr = addr;
kho_scratch[0].size = size;
+ i++;
+
+ /* reserve large contiguous area for allocations without nid */
+ size = scratch_size_high();
+ addr = memblock_phys_alloc(size, CMA_MIN_ALIGNMENT_BYTES);
+ if (!addr)
+ goto err_free_scratch_areas;
+
+ kho_scratch[1].addr = addr;
+ kho_scratch[1].size = size;
+ i++;
for_each_online_node(nid) {
- size = scratch_size(nid);
+ size = scratch_size_node(nid);
addr = memblock_alloc_range_nid(size, CMA_MIN_ALIGNMENT_BYTES,
0, MEMBLOCK_ALLOC_ACCESSIBLE,
nid, true);
diff --git a/mm/memblock.c b/mm/memblock.c
index fdb08b60efc17..da7abf5e5e504 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1837,6 +1837,34 @@ phys_addr_t __init_memblock memblock_reserved_size(void)
return memblock.reserved.total_size;
}
+phys_addr_t __init_memblock memblock_reserved_kern_lowmem_size(void)
+{
+ struct memblock_region *r;
+ phys_addr_t total = 0;
+
+ for_each_reserved_mem_region(r) {
+ if ((r->flags & MEMBLOCK_RSRV_KERN) &&
+ (r->base + r->size <= ARCH_LOW_ADDRESS_LIMIT))
+ total += r->size;
+ }
+
+ return total;
+}
+
+phys_addr_t __init_memblock memblock_reserved_kern_highmem_size(void)
+{
+ struct memblock_region *r;
+ phys_addr_t total = 0;
+
+ for_each_reserved_mem_region(r) {
+ if ((r->flags & MEMBLOCK_RSRV_KERN) &&
+ (r->base + r->size > ARCH_LOW_ADDRESS_LIMIT))
+ total += r->size;
+ }
+
+ return total;
+}
+
phys_addr_t __init_memblock memblock_reserved_kern_size(int nid)
{
struct memblock_region *r;
--
Regards,
Pratyush Yadav