Message-ID: <68ddab118dcd4_1fa21007f@dwillia2-mobl4.notmuch>
Date: Wed, 1 Oct 2025 15:28:33 -0700
From: <dan.j.williams@...el.com>
To: Michał Cłapiński <mclapinski@...gle.com>,
	<dan.j.williams@...el.com>
CC: Mike Rapoport <rppt@...nel.org>, Ira Weiny <ira.weiny@...el.com>, "Dave
 Jiang" <dave.jiang@...el.com>, Vishal Verma <vishal.l.verma@...el.com>,
	<jane.chu@...cle.com>, Pasha Tatashin <pasha.tatashin@...een.com>, "Tyler
 Hicks" <code@...icks.com>, <linux-kernel@...r.kernel.org>,
	<nvdimm@...ts.linux.dev>
Subject: Re: [PATCH 1/1] nvdimm: allow exposing RAM carveouts as NVDIMM DIMM
 devices

Michał Cłapiński wrote:
> On Fri, Sep 26, 2025 at 8:45 PM <dan.j.williams@...el.com> wrote:
> >
> > Michał Cłapiński wrote:
> > [..]
> > > > As Mike says you would lose 128K at the end, but that indeed becomes
> > > > losing that 1GB given alignment constraints.
> > > >
> > > > However, I think that could be solved by just separately vmalloc'ing the
> > > > label space for this. Then instead of kernel parameters to sub-divide a
> > > > region, you just have an initramfs script to do the same.
> > > >
> > > > Does that meet your needs?
> > >
> > > Sorry, I'm having trouble imagining this.
> > > If I wanted 500 1GB chunks, I would request a region of 500GB+space
> > > for the label? Or is that a label and info-blocks?
> >
> > You would specify a memmap= range of 500GB+128K*.
> >
> > Force attach that range to Mike's RAMDAX driver.
> >
> > [ modprobe -r nd_e820, don't build nd_820, or modprobe policy blocks nd_e820 ]
> > echo ramdax > /sys/bus/platform/devices/e820_pmem/driver_override
> > echo e820_pmem > /sys/bus/platform/drivers/ramdax/bind
> >
> > * forget what I said about vmalloc() previously, not needed
> >
> > > Then on each boot the kernel would check if there is an actual
> > > label/info-blocks in that space and if yes, it would recreate my
> > > devices (including the fsdax/devdax type)?
> >
> > Right, if that range is persistent the kernel would automatically parse
> > the label space each boot and divide up the 500GB region space into
> > namespaces.
> >
> > 128K of label space gives you 509 potential namespaces.
> 
> That's not enough for us. We would need ~1 order of magnitude more.
> Sorry I'm being vague about this but I can't discuss the actual
> machine sizes.

Sure, then make it 1280K of label space. There's no practical limit in
the implementation.
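For a sense of the arithmetic, slot count scales linearly with the label
area. A minimal sketch, assuming 256-byte label slots (an assumption
consistent with the quoted 128K -> 509 figure, once the driver reserves
room for index blocks):

```python
LABEL_SIZE = 256  # bytes per label slot -- assumed; consistent with the
                  # quoted "128K -> 509 namespaces" (509 ~= 512 minus the
                  # slots consumed by index-block overhead)

def raw_label_slots(label_area_bytes: int) -> int:
    """Upper bound on namespace labels, ignoring index-block overhead."""
    return label_area_bytes // LABEL_SIZE

print(raw_label_slots(128 * 1024))   # 512 raw slots, ~509 usable
print(raw_label_slots(1280 * 1024))  # 5120 raw slots, roughly 10x more
```

So growing the label area from 128K to 1280K moves the ceiling from
hundreds to thousands of namespaces, matching the "no practical limit"
point above.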

> > > One of the requirements for live update is that the kexec reboot has
> > > to be fast. My solution introduced a delay of tens of milliseconds
> > > since the actual device creation is asynchronous. Manually dividing a
> > > region into thousands of devices from userspace would be very slow but
> >
> > Wait, 500GB Region / 1GB Namespace = thousands of Namespaces?
> 
> I was talking about devices and AFAIK 1 namespace equals 5 devices for
> us currently (nd/{namespace, pfn, btt, dax}, dax/dax). Though the
> device creation is asynchronous so I guess the actual device count is
> not important.

I do not see how that is relevant. You also get thousands of devices with
plain memory block devices.

> > > I would have to do that only on the first boot, right?
> >
> > Yes, the expectation is only incur that overhead once. It also allows
> > for VMs to be able to lookup their capacity by name. So you do not need
> > a separate mapping of 1GB Namespace blocks to VMs. Just give some VMs
> > bigger Namespaces than others by name.
> 
> Sure, I can do that at first. But after some time fragmentation will
> happen, right?

Why would fragmentation be more of a problem with labels vs the command
line if the expectation is maintaining a persistent namespace layout
over time?

> At some point I will have to give VMs a bunch of smaller namespaces
> here and there.
> 
> Btw. one more thing I don't understand. Why are maintainers so much
> against adding new kernel parameters?

This label code is already written, and it is less of a maintenance
burden to carry a new use of existing code than a new mechanism for a
niche use case. Also, memmap= has long been a footgun; making that
problem worse for questionable benefit to the wider Linux project does
not feel like the right tradeoff.

The other alternative to labels is ACPI NFIT table injection. Again, the
tradeoff favors reusing an existing, well-worn mechanism for delineating
PMEM.
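The first-boot provisioning discussed above could be scripted with
ndctl. A hypothetical sketch, not from the thread: region name, count,
and namespace sizes are illustrative, and the resulting layout is
recorded in the label area so later kexec boots only re-parse labels.

```shell
REGION=region0     # illustrative region name
TOTAL_GB=500
NS_SIZE_GB=1
NS_COUNT=$((TOTAL_GB / NS_SIZE_GB))   # 500 namespaces in this sketch

# Only attempt creation where ndctl and the region actually exist.
if command -v ndctl >/dev/null 2>&1 && [ -d "/sys/bus/nd/devices/$REGION" ]; then
    i=0
    while [ "$i" -lt "$NS_COUNT" ]; do
        ndctl create-namespace --region="$REGION" --mode=devdax \
            --size="${NS_SIZE_GB}G" --name="vm$i" || break
        i=$((i + 1))
    done
fi
```

Naming each namespace (--name="vm$i") is what lets VMs look up their
capacity by name instead of keeping a separate block-to-VM mapping.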
