Message-ID: <CAAi7L5cpkYrf5KjKdX0tsRS2Q=3B_6NZ4urNG6EfjNYADqhYSA@mail.gmail.com>
Date: Tue, 9 Dec 2025 21:10:26 +0100
From: Michał Cłapiński <mclapinski@...gle.com>
To: dan.j.williams@...el.com
Cc: Mike Rapoport <rppt@...nel.org>, Ira Weiny <ira.weiny@...el.com>,
Dave Jiang <dave.jiang@...el.com>, Vishal Verma <vishal.l.verma@...el.com>, jane.chu@...cle.com,
Pasha Tatashin <pasha.tatashin@...een.com>, Tyler Hicks <code@...icks.com>,
linux-kernel@...r.kernel.org, nvdimm@...ts.linux.dev
Subject: Re: [PATCH 1/1] nvdimm: allow exposing RAM carveouts as NVDIMM DIMM devices
On Thu, Oct 2, 2025 at 12:28 AM <dan.j.williams@...el.com> wrote:
>
> Michał Cłapiński wrote:
> > On Fri, Sep 26, 2025 at 8:45 PM <dan.j.williams@...el.com> wrote:
> > >
> > > Michał Cłapiński wrote:
> > > [..]
> > > > > As Mike says you would lose 128K at the end, but that indeed becomes
> > > > > losing that 1GB given alignment constraints.
> > > > >
> > > > > However, I think that could be solved by just separately vmalloc'ing the
> > > > > label space for this. Then instead of kernel parameters to sub-divide a
> > > > > region, you just have an initramfs script to do the same.
> > > > >
> > > > > Does that meet your needs?
> > > >
> > > > Sorry, I'm having trouble imagining this.
> > > > If I wanted 500 1GB chunks, I would request a region of 500GB+space
> > > > for the label? Or is that a label and info-blocks?
> > >
> > > You would specify a memmap= range of 500GB+128K*.
> > >
> > > Force attach that range to Mike's RAMDAX driver.
> > >
> > > [ modprobe -r nd_e820, don't build nd_e820, or modprobe policy blocks nd_e820 ]
> > > echo ramdax > /sys/bus/platform/devices/e820_pmem/driver_override
> > > echo e820_pmem > /sys/bus/platform/drivers/ramdax
> > >
> > > * forget what I said about vmalloc() previously, not needed
> > >
> > > > Then on each boot the kernel would check if there is an actual
> > > > label/info-blocks in that space and if yes, it would recreate my
> > > > devices (including the fsdax/devdax type)?
> > >
> > > Right, if that range is persistent the kernel would automatically parse
> > > the label space each boot and divide up the 500GB region space into
> > > namespaces.
> > >
> > > 128K of label space gives you 509 potential namespaces.
> >
> > That's not enough for us. We would need ~1 order of magnitude more.
> > Sorry I'm being vague about this but I can't discuss the actual
> > machine sizes.
>
> Sure, then make it 1280K of label space. There's no practical limit in
> the implementation.
Hi Dan,
I finally had time to try this out. I modified the code to increase the
label space to 2M and was able to create the namespaces; the metadata
ended up in volatile memory.
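For context, the rough label-space arithmetic behind these sizes (a sketch only: it assumes 256-byte label slots and ignores index-block overhead, so the real capacity is slightly lower, e.g. 509 rather than 512 for a 128K label area):

```shell
# Back-of-the-envelope label capacity for a 2M label area, assuming
# 256-byte label slots (per the v1.2 namespace label format). Index
# blocks are ignored here, so this is an upper bound.
LABEL_AREA=$((2 * 1024 * 1024))   # 2M label space
SLOT=256                          # assumed bytes per label slot
echo $(( LABEL_AREA / SLOT ))     # upper bound on namespace count
```

So a 2M label area comfortably covers thousands of namespaces.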
But the infoblocks are still stored within the namespaces, right? If I
create a 3G namespace with alignment set to 1G, its actual usable size
is only 2G, so I can't divide the whole pmem range into 1G devices with
1G alignment.
If I modify the code to remove the infoblocks, the namespace mode
won't be persistent, right? In my solution I get that information from
the kernel command line, so I don't need the infoblocks.
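To make the size accounting concrete, here is the arithmetic I mean (a sketch of the effect, not actual ndctl output):

```shell
# A 3G namespace created with 1G alignment: because the info block
# lives inside the namespace and the usable range must stay 1G-aligned,
# metadata consumes a full 1G alignment unit.
SIZE_G=3
ALIGN_G=1
echo "$(( SIZE_G - ALIGN_G ))G usable"   # metadata costs one alignment unit
```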
> > > > One of the requirements for live update is that the kexec reboot has
> > > > to be fast. My solution introduced a delay of tens of milliseconds
> > > > since the actual device creation is asynchronous. Manually dividing a
> > > > region into thousands of devices from userspace would be very slow but
> > >
> > > Wait, 500GB Region / 1GB Namespace = thousands of Namespaces?
> >
> > I was talking about devices and AFAIK 1 namespace equals 5 devices for
> > us currently (nd/{namespace, pfn, btt, dax}, dax/dax). Though the
> > device creation is asynchronous so I guess the actual device count is
> > not important.
>
> I do not see how it is relevant. You also get 1000s of devices with
> plain memory block devices.
>
> > > > I would have to do that only on the first boot, right?
> > >
> > > Yes, the expectation is to incur that overhead only once. It also allows
> > > VMs to look up their capacity by name, so you do not need a separate
> > > mapping of 1GB Namespace blocks to VMs. Just give some VMs bigger
> > > Namespaces than others by name.
> >
> > Sure, I can do that at first. But after some time fragmentation will
> > happen, right?
>
> Why would fragmentation be more of a problem with labels vs the command
> line if the expectation is maintaining a persistent namespace layout
> over time?
>
> > At some point I will have to give VMs a bunch of smaller namespaces
> > here and there.
> >
> > Btw. one more thing I don't understand. Why are maintainers so much
> > against adding new kernel parameters?
>
> This label code is already written, and it is less of a burden to maintain
> a new use of existing code than a new mechanism for a niche use case. Also,
> memmap= has long been a footgun; making that problem worse for questionable
> benefit to the wider Linux project does not feel like the right tradeoff.
>
> The other alternative to labels is ACPI NFIT table injection. Again, the
> tradeoff is that it is just another reuse of an existing, well-worn
> mechanism for delineating PMEM.