Message-ID: <CA+CK2bD9QF-8dxd92UBoyvO0rBJ3uTN27pXzO2bALw4v_2D_8g@mail.gmail.com>
Date: Tue, 22 Apr 2025 09:10:39 -0400
From: Pasha Tatashin <pasha.tatashin@...een.com>
To: Dan Williams <dan.j.williams@...el.com>
Cc: Michal Clapinski <mclapinski@...gle.com>, Vishal Verma <vishal.l.verma@...el.com>,
Dave Jiang <dave.jiang@...el.com>, Ira Weiny <ira.weiny@...el.com>,
Jonathan Corbet <corbet@....net>, nvdimm@...ts.linux.dev, linux-doc@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2 1/1] libnvdimm/e820: Add a new parameter to configure
 many regions per e820 entry

On Mon, Apr 21, 2025 at 7:21 PM Dan Williams <dan.j.williams@...el.com> wrote:
>
> Michal Clapinski wrote:
> > Currently, the user has to specify each memory region to be used with
> > nvdimm via the memmap parameter. Due to the character limit of the
> > command line, this makes it impossible to have a lot of pmem devices.
> > This new parameter solves this issue by allowing users to divide
> > one e820 entry into many nvdimm regions.
> >
> > This change is needed for the hypervisor live update. VMs' memory will
> > be backed by those emulated pmem devices. To support various VM shapes
> > I want to create devdax devices at 1GB granularity similar to hugetlb.
>
> This looks fairly straightforward, but if this moves forward I would
> explicitly call the parameter something like "split" instead of "pmem"
> to align it better with its usage.
>
> However, while this is expedient I wonder if you would be better
> served with ACPI table injection to get more control and configuration
> options...
>
> > It's also possible to expand this parameter in the future,
> > e.g. to specify the type of the device (fsdax/devdax).
>
> ...for example, if you injected or customized your BIOS to supply an
> ACPI NFIT table you could get to deeper degrees of customization without
> wrestling with command lines. Supply an ACPI NFIT that carves up a large
> memory-type range into an arbitrary number of regions. In the NFIT there
> is a natural place to specify whether the range gets sent to PMEM. See
> the call to nvdimm_pmem_region_create() near NFIT_SPA_PM in
> acpi_nfit_register_region(), and "simply" pick a new guid to signify
> direct routing to device-dax. I say simply, but that implies new ACPI
> NFIT driver plumbing for the new mode.
>
> Another overlooked detail about NFIT is that there is an opportunity to
> determine cases where the platform might have changed the physical
> address map from one boot to the next. In other words, I cringe at the
> fragility of memmap=, but I understand that it has the benefit of being
> simple. See the "nd_set cookie" concept in
> acpi_nfit_init_interleave_set().

I also dislike the potential fragility of the memmap= parameter;
however, in our environment, kernel parameters are specifically
crafted for target machine configurations and supplied separately from
the kernel binary, giving us good control.

Regarding the ACPI NFIT suggestion: our use case involves reusing the
same physical machines (with unchanged firmware) for various
configurations (similar to loaning them out). An advantage for us is
that switching the machine's role only requires changing the kernel
parameters. The ACPI approach, potentially requiring firmware changes,
would break this dynamic reconfiguration.

As I understand it, using ACPI table injection instead of a firmware
change doesn't eliminate the fragility concerns either. We would still
need to carefully reserve the specific physical range for a particular
machine configuration, and it also adds a dependency on managing and
packaging an external NFIT injection file and process. We have a
process for kernel parameters, but doing this externally would
complicate things for us.

Also, I might be missing something, but I haven't found a standard way
to automatically create devdax devices using NFIT injection. Our
current plan is to expand the proposed kernel parameter. We are
working on making it default to creating either fsdax or devdax type
regions, without requiring explicit labels, and ensuring these regions
remain stable across kexec as long as the kernel parameter itself
doesn't change (in a way, the kernel parameters take the role of the
labels).

Pasha
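
For readers following the thread, the command-line pressure and the
"stable as long as the parameter doesn't change" property can be
illustrated with a minimal Python sketch. It is not taken from the
patch: the 256 GiB range starting at 4 GiB, the helper names
split_range() and memmap_args(), and the use of the x86 limit of 2048
characters are assumptions made for illustration only. It uses the
existing memmap=<size>!<start> syntax and does not attempt to show the
syntax of the newly proposed parameter.

```python
GIB = 1 << 30
COMMAND_LINE_SIZE = 2048  # x86 kernel command-line limit; other arches differ


def split_range(start, size, chunk=GIB):
    """Tile [start, start + size) with chunk-sized (start, size) pairs."""
    if size % chunk:
        raise ValueError("size must be a multiple of chunk")
    return [(start + i * chunk, chunk) for i in range(size // chunk)]


def memmap_args(regions):
    """Render each region as an existing-style memmap=<size>!<start> argument."""
    return " ".join(f"memmap={size // GIB}G!{start:#x}" for start, size in regions)


# Hypothetical layout: one 256 GiB range starting at 4 GiB.
BASE, TOTAL = 0x1_0000_0000, 256 * GIB

regions = split_range(BASE, TOTAL)        # one region per 1 GiB
per_region = memmap_args(regions)         # one memmap= entry per region
single = memmap_args([(BASE, TOTAL)])     # one entry covering the whole range

print(f"{len(regions)} regions need {len(per_region)} characters "
      f"(limit {COMMAND_LINE_SIZE}); a single entry needs {len(single)}")
```

Because the split is a pure function of the start address, size, and
granularity, the same parameter always yields the same region layout,
which is the sense in which the parameter can stand in for labels.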