lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CACSyD1OU1zjvc2YdQpi-F_qZ0EWyb6jLNODJkRDPQa6m1eObJg@mail.gmail.com>
Date: Wed, 21 May 2025 15:47:28 +0800
From: Zhongkun He <hezhongkun.hzk@...edance.com>
To: Alexander Graf <graf@...zon.com>
Cc: kexec@...ts.infradead.org, linux-kernel@...r.kernel.org, 
	Andrew Morton <akpm@...ux-foundation.org>, Baoquan He <bhe@...hat.com>, 
	Pasha Tatashin <pasha.tatashin@...een.com>, nh-open-source@...zon.com
Subject: Re: [External] [PATCH v3] kexec: Enable CMA based contiguous allocation

On Wed, May 14, 2025 at 2:40 AM Alexander Graf <graf@...zon.com> wrote:
>
> When booting a new kernel with kexec_file, the kernel picks a target
> location that the kernel should live at, then allocates random pages,
> checks whether any of those patches magically happens to coincide with
> a target address range and if so, uses them for that range.
>
> For every page allocated this way, it then creates a page list that the
> relocation code - code that executes while all CPUs are off and we are
> just about to jump into the new kernel - copies to their final memory
> location. We can not put them there before, because chances are pretty
> good that at least some page in the target range is already in use by
> the currently running Linux environment. Copying is happening from a
> single CPU at RAM rate, which takes around 4-50 ms per 100 MiB.
>
> All of this is inefficient and error prone.
>
> To successfully kexec, we need to quiesce all devices of the outgoing
> kernel so they don't scribble over the new kernel's memory. We have seen
> cases where that does not happen properly (*cough* GIC *cough*) and hence
> the new kernel was corrupted. This started a month long journey to root
> cause failing kexecs to eventually see memory corruption, because the new
> kernel was corrupted severely enough that it could not emit output to
> tell us about the fact that it was corrupted. By allocating memory for the
> next kernel from a memory range that is guaranteed scribbling free, we can
> boot the next kernel up to a point where it is at least able to detect
> corruption and maybe even stop it before it becomes severe. This increases
> the chance for successful kexecs.
>
> Since kexec got introduced, Linux has gained the CMA framework which
> can perform physically contiguous memory mappings, while keeping that
> memory available for movable memory when it is not needed for contiguous
> allocations. The default CMA allocator is for DMA allocations.
>
> This patch adds logic to the kexec file loader to attempt to place the
> target payload at a location allocated from CMA. If successful, it uses
> that memory range directly instead of creating copy instructions during
> the hot phase. To ensure that there is a safety net in case anything goes
> wrong with the CMA allocation, it also adds a flag for user space to force
> disable CMA allocations.
>
> Using CMA allocations has two advantages:
>
>   1) Faster by 4-50 ms per 100 MiB. There is no more need to copy in the
>      hot phase.
>   2) More robust. Even if by accident some page is still in use for DMA,
>      the new kernel image will be safe from that access because it resides
>      in a memory region that is considered allocated in the old kernel and
>      has a chance to reinitialize that component.
>
> Signed-off-by: Alexander Graf <graf@...zon.com>
>
> ---
>
> v1 -> v2:
>
>   - Clarify patch description
>   - Move cma pointer out of kexec_segment. That is a sneaky UAPI struct we
>     can not modify. Fixes non kexec_file path
>   - Coding style
>   - Move memset(0) to only clear remainder
>   - Move kexec_alloc_contig() into kexec_locate_mem_hole(). Makes the code
>     flow easier to read.
>   - Sanitize return values
>
> v2 -> v3:
>
>   - Fix refactoring bug which meant we never exercised the new code path
> ---
>  arch/riscv/kernel/elf_kexec.c |   1 +
>  include/linux/kexec.h         |  10 ++++
>  include/uapi/linux/kexec.h    |   1 +
>  kernel/kexec.c                |   2 +-
>  kernel/kexec_core.c           | 100 +++++++++++++++++++++++++++++++---
>  kernel/kexec_file.c           |  47 +++++++++++++++-
>  kernel/kexec_internal.h       |   2 +-
>  7 files changed, 152 insertions(+), 11 deletions(-)
>
> diff --git a/arch/riscv/kernel/elf_kexec.c b/arch/riscv/kernel/elf_kexec.c
> index e783a72d051f..d81647c98c92 100644
> --- a/arch/riscv/kernel/elf_kexec.c
> +++ b/arch/riscv/kernel/elf_kexec.c
> @@ -109,6 +109,7 @@ static int elf_find_pbase(struct kimage *image, unsigned long kernel_len,
>         kbuf.mem = KEXEC_BUF_MEM_UNKNOWN;
>         kbuf.memsz = ALIGN(kernel_len, PAGE_SIZE);
>         kbuf.top_down = false;
> +       kbuf.cma = NULL;
>         ret = arch_kexec_locate_mem_hole(&kbuf);
>         if (!ret) {
>                 *old_pbase = lowest_paddr;
> diff --git a/include/linux/kexec.h b/include/linux/kexec.h
> index c8971861521a..7821b23bd1e9 100644
> --- a/include/linux/kexec.h
> +++ b/include/linux/kexec.h
> @@ -75,6 +75,12 @@ extern note_buf_t __percpu *crash_notes;
>
>  typedef unsigned long kimage_entry_t;
>
> +/*
> + * This is a copy of the UAPI struct kexec_segment and must be identical
> + * to it because it gets copied straight from user space into kernel
> + * memory. Do not modify this structure unless you change the way segments
> + * get ingested from user space.
> + */
>  struct kexec_segment {
>         /*
>          * This pointer can point to user memory if kexec_load() system
> @@ -169,6 +175,7 @@ int kexec_image_post_load_cleanup_default(struct kimage *image);
>   * @buf_min:   The buffer can't be placed below this address.
>   * @buf_max:   The buffer can't be placed above this address.
>   * @top_down:  Allocate from top of memory.
> + * @cma:       CMA page if the buffer is backed by CMA.
>   */
>  struct kexec_buf {
>         struct kimage *image;
> @@ -180,6 +187,7 @@ struct kexec_buf {
>         unsigned long buf_min;
>         unsigned long buf_max;
>         bool top_down;
> +       struct page *cma;
>  };
>
>  int kexec_load_purgatory(struct kimage *image, struct kexec_buf *kbuf);
> @@ -310,6 +318,7 @@ struct kimage {
>
>         unsigned long nr_segments;
>         struct kexec_segment segment[KEXEC_SEGMENT_MAX];
> +       struct page *segment_cma[KEXEC_SEGMENT_MAX];
>
>         struct list_head control_pages;
>         struct list_head dest_pages;
> @@ -331,6 +340,7 @@ struct kimage {
>          */
>         unsigned int hotplug_support:1;
>  #endif
> +       unsigned int no_cma:1;
>
>  #ifdef ARCH_HAS_KIMAGE_ARCH
>         struct kimage_arch arch;
> diff --git a/include/uapi/linux/kexec.h b/include/uapi/linux/kexec.h
> index 5ae1741ea8ea..8958ebfcff94 100644
> --- a/include/uapi/linux/kexec.h
> +++ b/include/uapi/linux/kexec.h
> @@ -27,6 +27,7 @@
>  #define KEXEC_FILE_ON_CRASH    0x00000002
>  #define KEXEC_FILE_NO_INITRAMFS        0x00000004
>  #define KEXEC_FILE_DEBUG       0x00000008
> +#define KEXEC_FILE_NO_CMA      0x00000010
>
>  /* These values match the ELF architecture values.
>   * Unless there is a good reason that should continue to be the case.
> diff --git a/kernel/kexec.c b/kernel/kexec.c
> index a6b3f96bb50c..28008e3d462e 100644
> --- a/kernel/kexec.c
> +++ b/kernel/kexec.c
> @@ -152,7 +152,7 @@ static int do_kexec_load(unsigned long entry, unsigned long nr_segments,
>                 goto out;
>
>         for (i = 0; i < nr_segments; i++) {
> -               ret = kimage_load_segment(image, &image->segment[i]);
> +               ret = kimage_load_segment(image, i);
>                 if (ret)
>                         goto out;
>         }
> diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
> index 3e62b944c883..b5b680dd1796 100644
> --- a/kernel/kexec_core.c
> +++ b/kernel/kexec_core.c
> @@ -40,6 +40,7 @@
>  #include <linux/hugetlb.h>
>  #include <linux/objtool.h>
>  #include <linux/kmsg_dump.h>
> +#include <linux/dma-map-ops.h>
>
>  #include <asm/page.h>
>  #include <asm/sections.h>
> @@ -553,6 +554,24 @@ static void kimage_free_entry(kimage_entry_t entry)
>         kimage_free_pages(page);
>  }
>
> +static void kimage_free_cma(struct kimage *image)
> +{
> +       unsigned long i;
> +
> +       for (i = 0; i < image->nr_segments; i++) {
> +               struct page *cma = image->segment_cma[i];
> +               u32 nr_pages = image->segment[i].memsz >> PAGE_SHIFT;
> +
> +               if (!cma)
> +                       continue;
> +
> +               arch_kexec_pre_free_pages(page_address(cma), nr_pages);
> +               dma_release_from_contiguous(NULL, cma, nr_pages);
> +               image->segment_cma[i] = NULL;
> +       }
> +
> +}
> +
>  void kimage_free(struct kimage *image)
>  {
>         kimage_entry_t *ptr, entry;
> @@ -591,6 +610,9 @@ void kimage_free(struct kimage *image)
>         /* Free the kexec control pages... */
>         kimage_free_page_list(&image->control_pages);
>
> +       /* Free CMA allocations */
> +       kimage_free_cma(image);
> +
>         /*
>          * Free up any temporary buffers allocated. This might hit if
>          * error occurred much later after buffer allocation.
> @@ -716,9 +738,69 @@ static struct page *kimage_alloc_page(struct kimage *image,
>         return page;
>  }
>
> -static int kimage_load_normal_segment(struct kimage *image,
> -                                        struct kexec_segment *segment)
> +static int kimage_load_cma_segment(struct kimage *image, int idx)
> +{
> +       struct kexec_segment *segment = &image->segment[idx];
> +       struct page *cma = image->segment_cma[idx];
> +       char *ptr = page_address(cma);
> +       unsigned long maddr;
> +       size_t ubytes, mbytes;
> +       int result = 0;
> +       unsigned char __user *buf = NULL;
> +       unsigned char *kbuf = NULL;
> +
> +       if (image->file_mode)
> +               kbuf = segment->kbuf;
> +       else
> +               buf = segment->buf;
> +       ubytes = segment->bufsz;
> +       mbytes = segment->memsz;
> +       maddr = segment->mem;
> +
> +       /* Then copy from source buffer to the CMA one */
> +       while (mbytes) {
> +               size_t uchunk, mchunk;
> +
> +               ptr += maddr & ~PAGE_MASK;
> +               mchunk = min_t(size_t, mbytes,
> +                               PAGE_SIZE - (maddr & ~PAGE_MASK));
> +               uchunk = min(ubytes, mchunk);
> +
> +               if (uchunk) {
> +                       /* For file based kexec, source pages are in kernel memory */
> +                       if (image->file_mode)
> +                               memcpy(ptr, kbuf, uchunk);
> +                       else
> +                               result = copy_from_user(ptr, buf, uchunk);
> +                       ubytes -= uchunk;
> +                       if (image->file_mode)
> +                               kbuf += uchunk;
> +                       else
> +                               buf += uchunk;
> +               }
> +
> +               if (result) {
> +                       result = -EFAULT;
> +                       goto out;
> +               }
> +
> +               ptr    += mchunk;
> +               maddr  += mchunk;
> +               mbytes -= mchunk;
> +
> +               cond_resched();
> +       }
> +
> +       /* Clear any remainder */
> +       memset(ptr, 0, mbytes);
> +
> +out:
> +       return result;
> +}
> +
> +static int kimage_load_normal_segment(struct kimage *image, int idx)
>  {
> +       struct kexec_segment *segment = &image->segment[idx];
>         unsigned long maddr;
>         size_t ubytes, mbytes;
>         int result;
> @@ -733,6 +815,9 @@ static int kimage_load_normal_segment(struct kimage *image,
>         mbytes = segment->memsz;
>         maddr = segment->mem;
>
> +       if (image->segment_cma[idx])
> +               return kimage_load_cma_segment(image, idx);
> +
>         result = kimage_set_destination(image, maddr);
>         if (result < 0)
>                 goto out;
> @@ -787,13 +872,13 @@ static int kimage_load_normal_segment(struct kimage *image,
>  }
>
>  #ifdef CONFIG_CRASH_DUMP
> -static int kimage_load_crash_segment(struct kimage *image,
> -                                       struct kexec_segment *segment)
> +static int kimage_load_crash_segment(struct kimage *image, int idx)
>  {
>         /* For crash dumps kernels we simply copy the data from
>          * user space to it's destination.
>          * We do things a page at a time for the sake of kmap.
>          */
> +       struct kexec_segment *segment = &image->segment[idx];
>         unsigned long maddr;
>         size_t ubytes, mbytes;
>         int result;
> @@ -858,18 +943,17 @@ static int kimage_load_crash_segment(struct kimage *image,
>  }
>  #endif
>
> -int kimage_load_segment(struct kimage *image,
> -                               struct kexec_segment *segment)
> +int kimage_load_segment(struct kimage *image, int idx)
>  {
>         int result = -ENOMEM;
>
>         switch (image->type) {
>         case KEXEC_TYPE_DEFAULT:
> -               result = kimage_load_normal_segment(image, segment);
> +               result = kimage_load_normal_segment(image, idx);
>                 break;
>  #ifdef CONFIG_CRASH_DUMP
>         case KEXEC_TYPE_CRASH:
> -               result = kimage_load_crash_segment(image, segment);
> +               result = kimage_load_crash_segment(image, idx);
>                 break;
>  #endif
>         }
> diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
> index fba686487e3b..916beae68fb6 100644
> --- a/kernel/kexec_file.c
> +++ b/kernel/kexec_file.c
> @@ -27,6 +27,7 @@
>  #include <linux/kernel_read_file.h>
>  #include <linux/syscalls.h>
>  #include <linux/vmalloc.h>
> +#include <linux/dma-map-ops.h>
>  #include "kexec_internal.h"
>
>  #ifdef CONFIG_KEXEC_SIG
> @@ -230,6 +231,8 @@ kimage_file_prepare_segments(struct kimage *image, int kernel_fd, int initrd_fd,
>                 ret = 0;
>         }
>
> +       image->no_cma = !!(flags & KEXEC_FILE_NO_CMA);
> +
>         if (cmdline_len) {
>                 image->cmdline_buf = memdup_user(cmdline_ptr, cmdline_len);
>                 if (IS_ERR(image->cmdline_buf)) {
> @@ -406,7 +409,7 @@ SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd,
>                               i, ksegment->buf, ksegment->bufsz, ksegment->mem,
>                               ksegment->memsz);
>
> -               ret = kimage_load_segment(image, &image->segment[i]);
> +               ret = kimage_load_segment(image, i);
>                 if (ret)
>                         goto out;
>         }
> @@ -632,6 +635,39 @@ static int kexec_walk_resources(struct kexec_buf *kbuf,
>                 return walk_system_ram_res(0, ULONG_MAX, kbuf, func);
>  }
>
> +static int kexec_alloc_contig(struct kexec_buf *kbuf)
> +{
> +       size_t nr_pages = kbuf->memsz >> PAGE_SHIFT;
> +       unsigned long mem;
> +       struct page *p;
> +
> +       /* User space disabled CMA allocations, bail out. */
> +       if (kbuf->image->no_cma)
> +               return -EPERM;
> +
> +       p = dma_alloc_from_contiguous(NULL, nr_pages, get_order(kbuf->buf_align), true);
> +       if (!p)
> +               return -ENOMEM;
> +
> +       pr_debug("allocated %zu DMA pages at 0x%lx", nr_pages, page_to_boot_pfn(p));
> +
> +       mem = page_to_boot_pfn(p) << PAGE_SHIFT;
> +
> +       if (kimage_is_destination_range(kbuf->image, mem, mem + kbuf->memsz)) {
> +               /* Our region is already in use by a statically defined one. Bail out. */
> +               pr_debug("CMA overlaps existing mem: 0x%lx+0x%lx\n", mem, kbuf->memsz);
> +               dma_release_from_contiguous(NULL, p, nr_pages);
> +               return -EBUSY;
> +       }
> +
> +       kbuf->mem = page_to_boot_pfn(p) << PAGE_SHIFT;
> +       kbuf->cma = p;
> +
> +       arch_kexec_post_alloc_pages(page_address(p), (int)nr_pages, 0);
> +
> +       return 0;
> +}
> +
>  /**
>   * kexec_locate_mem_hole - find free memory for the purgatory or the next kernel
>   * @kbuf:      Parameters for the memory search.
> @@ -648,6 +684,13 @@ int kexec_locate_mem_hole(struct kexec_buf *kbuf)
>         if (kbuf->mem != KEXEC_BUF_MEM_UNKNOWN)
>                 return 0;
>
> +       /*
> +        * Try to find a free physically contiguous block of memory first. With that, we
> +        * can avoid any copying at kexec time.
> +        */
> +       if (!kexec_alloc_contig(kbuf))
> +               return 0;

IIUC, The kexec_locate_mem_hole() function is also used in
the KEXEC_TYPE_CRASH kimage, but kexec_alloc_contig() does not skip it.
This can cause kdump to fail and lead to CMA memory leakage.

So I ran some tests, as listed below:
The CMA memory decreases with each execution of the kdump-config
reload operation.

/home/hezhongkun.hzk# kdump-config reload
unloaded kdump kernel.
Creating symlink /var/lib/kdump/vmlinuz.
Creating symlink /var/lib/kdump/initrd.img.
kexec_file_load failed: Cannot assign requested address
failed to load kdump kernel ... failed!

[ 1387.536346] kexec_file: allocated 1817 DMA pages at 0x11b16e
[ 1399.147677] kexec_file: allocated 113 DMA pages at 0x119087
[ 1399.148915] kexec_file: allocated 5 DMA pages at 0x1140f8
[ 1399.150266] kexec_file: allocated 2 DMA pages at 0x1118fd
[ 1399.151474] kexec_file: allocated 8302 DMA pages at 0x11b900

/home/hezhongkun.hzk# cat /proc/meminfo | grep Cma
CmaTotal:        1048576 kB
CmaFree:          679972 kB
/home/hezhongkun.hzk# cat /proc/meminfo | grep Cma
CmaTotal:        1048576 kB
CmaFree:          639016 kB
/home/hezhongkun.hzk# cat /proc/meminfo | grep Cma
CmaTotal:        1048576 kB
CmaFree:          557104 kB

cat /proc/meminfo | grep Cma
CmaTotal:        1048576 kB
CmaFree:          557104 kB

> +
>         if (!IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK))
>                 ret = kexec_walk_resources(kbuf, locate_mem_hole_callback);
>         else
> @@ -693,6 +736,7 @@ int kexec_add_buffer(struct kexec_buf *kbuf)
>         /* Ensure minimum alignment needed for segments. */
>         kbuf->memsz = ALIGN(kbuf->memsz, PAGE_SIZE);
>         kbuf->buf_align = max(kbuf->buf_align, PAGE_SIZE);
> +       kbuf->cma = NULL;
>
>         /* Walk the RAM ranges and allocate a suitable range for the buffer */
>         ret = arch_kexec_locate_mem_hole(kbuf);
> @@ -705,6 +749,7 @@ int kexec_add_buffer(struct kexec_buf *kbuf)
>         ksegment->bufsz = kbuf->bufsz;
>         ksegment->mem = kbuf->mem;
>         ksegment->memsz = kbuf->memsz;
> +       kbuf->image->segment_cma[kbuf->image->nr_segments] = kbuf->cma;
>         kbuf->image->nr_segments++;
>         return 0;
>  }
> diff --git a/kernel/kexec_internal.h b/kernel/kexec_internal.h
> index d35d9792402d..29e6cebe0c43 100644
> --- a/kernel/kexec_internal.h
> +++ b/kernel/kexec_internal.h
> @@ -10,7 +10,7 @@ struct kimage *do_kimage_alloc_init(void);
>  int sanity_check_segment_list(struct kimage *image);
>  void kimage_free_page_list(struct list_head *list);
>  void kimage_free(struct kimage *image);
> -int kimage_load_segment(struct kimage *image, struct kexec_segment *segment);
> +int kimage_load_segment(struct kimage *image, int idx);
>  void kimage_terminate(struct kimage *image);
>  int kimage_is_destination_range(struct kimage *image,
>                                 unsigned long start, unsigned long end);
> --
> 2.34.1
>
>
>
>
> Amazon Web Services Development Center Germany GmbH
> Tamara-Danz-Str. 13
> 10243 Berlin
> Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
> Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
> Sitz: Berlin
> Ust-ID: DE 365 538 597
>
>

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ