Message-ID: <CAAi7L5dWpzfodg3J4QqGP564qDnLmqPCKHJ-1BTmzwMUhz6rLg@mail.gmail.com>
Date: Wed, 1 Oct 2025 16:14:25 +0200
From: Michał Cłapiński <mclapinski@...gle.com>
To: dan.j.williams@...el.com
Cc: Mike Rapoport <rppt@...nel.org>, Ira Weiny <ira.weiny@...el.com>,
Dave Jiang <dave.jiang@...el.com>, Vishal Verma <vishal.l.verma@...el.com>, jane.chu@...cle.com,
Pasha Tatashin <pasha.tatashin@...een.com>, Tyler Hicks <code@...icks.com>,
linux-kernel@...r.kernel.org, nvdimm@...ts.linux.dev
Subject: Re: [PATCH 1/1] nvdimm: allow exposing RAM carveouts as NVDIMM DIMM devices
On Fri, Sep 26, 2025 at 8:45 PM <dan.j.williams@...el.com> wrote:
>
> Michał Cłapiński wrote:
> [..]
> > > As Mike says you would lose 128K at the end, but that indeed becomes
> > > losing that 1GB given alignment constraints.
> > >
> > > However, I think that could be solved by just separately vmalloc'ing the
> > > label space for this. Then instead of kernel parameters to sub-divide a
> > > region, you just have an initramfs script to do the same.
> > >
> > > Does that meet your needs?
> >
> > Sorry, I'm having trouble imagining this.
> > If I wanted 500 1GB chunks, I would request a region of 500GB+space
> > for the label? Or is that a label and info-blocks?
>
> You would specify a memmap= range of 500GB+128K*.
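> For example (start address purely illustrative; this just assumes the
> carveout sits above 4GB):
>
> memmap=0x7D00020000!0x100000000   # 500GB+128K of type-12 pmem at 4GB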
>
> Force attach that range to Mike's RAMDAX driver.
>
> [ modprobe -r nd_e820, don't build nd_e820, or modprobe policy blocks nd_e820 ]
> echo ramdax > /sys/bus/platform/devices/e820_pmem/driver_override
> echo e820_pmem > /sys/bus/platform/drivers/ramdax
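> # optionally verify the bind took (same e820_pmem platform device as
> # above); the link should resolve to the ramdax driver:
> readlink /sys/bus/platform/devices/e820_pmem/driver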
>
> * forget what I said about vmalloc() previously, not needed
>
> > Then on each boot the kernel would check if there is an actual
> > label/info-blocks in that space and if yes, it would recreate my
> > devices (including the fsdax/devdax type)?
>
> Right, if that range is persistent the kernel would automatically parse
> the label space each boot and divide up the 500GB region space into
> namespaces.
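> A first-boot initramfs script could then do the one-time carve-up,
> e.g. (region name and sizes illustrative; assumes ndctl is available):
>
> for i in $(seq 1 500); do
>     ndctl create-namespace -r region0 -s 1G -m devdax
> done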
>
> 128K of label space gives you 509 potential namespaces.
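Just to check where 509 comes from (assuming 256-byte labels per the
v1.2 namespace label format, with the index blocks eating a few slots
up front): 131072 / 256 = 512 slots, minus index overhead ~= 509.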
That's not enough for us. We would need roughly an order of magnitude
more. Sorry I'm being vague about this, but I can't discuss the actual
machine sizes.
> > One of the requirements for live update is that the kexec reboot has
> > to be fast. My solution introduced a delay of tens of milliseconds
> > since the actual device creation is asynchronous. Manually dividing a
> > region into thousands of devices from userspace would be very slow but
>
> Wait, 500GB Region / 1GB Namespace = thousands of Namespaces?
I was talking about devices; AFAIK one namespace currently equals five
devices for us (nd/{namespace, pfn, btt, dax}, plus dax/dax). Though
device creation is asynchronous, so I guess the actual device count is
not important.
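For reference, the five devices I mean show up roughly as (instance
names illustrative):
  /sys/bus/nd/devices/{namespace0.0,pfn0.0,btt0.0,dax0.0}
  /sys/bus/dax/devices/dax0.0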
> > I would have to do that only on the first boot, right?
>
> Yes, the expectation is to only incur that overhead once. It also allows
> VMs to be able to look up their capacity by name. So you do not need
> a separate mapping of 1GB Namespace blocks to VMs. Just give some VMs
> bigger Namespaces than others by name.
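> Illustratively, something like (region, size, and name made up;
> -n/--name is the ndctl flag that stores a friendly name in the label):
>
> ndctl create-namespace -r region0 -s 16G -m devdax -n vm-alpha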
Sure, I can do that at first. But after some time fragmentation will
set in, right? At some point I will have to give VMs a bunch of
smaller namespaces scattered here and there.
Btw, one more thing I don't understand: why are maintainers so opposed
to adding new kernel parameters?