linux-kernel - Re: [PATCH 3/3] dax/kmem: Always enroll hotplugged memory for memmap_on

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <5a8e9b1b6c8d6d9e5405ca35abb9be3ed09761c3.camel@intel.com>
Date:   Thu, 13 Jul 2023 06:45:20 +0000
From:   "Verma, Vishal L" <vishal.l.verma@...el.com>
To:     "david@...hat.com" <david@...hat.com>,
        "akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
        "rafael@...nel.org" <rafael@...nel.org>,
        "osalvador@...e.de" <osalvador@...e.de>,
        "aneesh.kumar@...ux.ibm.com" <aneesh.kumar@...ux.ibm.com>,
        "Williams, Dan J" <dan.j.williams@...el.com>,
        "lenb@...nel.org" <lenb@...nel.org>,
        "Jiang, Dave" <dave.jiang@...el.com>
CC:     "Huang, Ying" <ying.huang@...el.com>,
        "linux-mm@...ck.org" <linux-mm@...ck.org>,
        "linux-cxl@...r.kernel.org" <linux-cxl@...r.kernel.org>,
        "nvdimm@...ts.linux.dev" <nvdimm@...ts.linux.dev>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "linux-acpi@...r.kernel.org" <linux-acpi@...r.kernel.org>,
        "dave.hansen@...ux.intel.com" <dave.hansen@...ux.intel.com>
Subject: Re: [PATCH 3/3] dax/kmem: Always enroll hotplugged memory for
 memmap_on_memory

On Tue, 2023-07-11 at 17:21 +0200, David Hildenbrand wrote:
> On 11.07.23 16:30, Aneesh Kumar K.V wrote:
> > David Hildenbrand <david@...hat.com> writes:
> > > 
> > > Maybe the better alternative is teach
> > > add_memory_resource()/try_remove_memory() to do that internally.
> > > 
> > > In the add_memory_resource() case, it might be a loop around that
> > > memmap_on_memory + arch_add_memory code path (well, and the error path
> > > also needs adjustment):
> > > 
> > >         /*
> > >          * Self hosted memmap array
> > >          */
> > >         if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
> > >                 if (!mhp_supports_memmap_on_memory(size)) {
> > >                         ret = -EINVAL;
> > >                         goto error;
> > >                 }
> > >                 mhp_altmap.free = PHYS_PFN(size);
> > >                 mhp_altmap.base_pfn = PHYS_PFN(start);
> > >                 params.altmap = &mhp_altmap;
> > >         }
> > > 
> > >         /* call arch's memory hotadd */
> > >         ret = arch_add_memory(nid, start, size, &params);
> > >         if (ret < 0)
> > >                 goto error;
> > > 
> > > 
> > > Note that we want to handle that on a per memory-block basis, because we
> > > don't want the vmemmap of memory block #2 to end up on memory block #1.
> > > It all gets messy with memory onlining/offlining etc otherwise ...
> > > 
> > 
> > I tried to implement this inside add_memory_driver_managed() and also
> > within dax/kmem. IMHO doing the error handling inside dax/kmem is
> > better. Here is how it looks:
> > 
> > 1. If any blocks got added before (mapped > 0) we loop through all successful request_mem_regions
> > 2. For each succesful request_mem_regions if any blocks got added, we
> > keep the resource. If none got added, we will kfree the resource
> > 
> 
> Doing this unconditional splitting outside of 
> add_memory_driver_managed() is undesirable for at least two reasons
> 
> 1) You end up always creating individual entries in the resource tree
>     (/proc/iomem) even if MHP_MEMMAP_ON_MEMORY is not effective.
> 2) As we call arch_add_memory() in memory block granularity (e.g., 128
>     MiB on x86), we might not make use of large PUDs (e.g., 1 GiB) in the
>     identify mapping -- even if MHP_MEMMAP_ON_MEMORY is not effective.
> 
> While you could sense for support and do the split based on that, it 
> will be beneficial for other users (especially DIMMs) if we do that 
> internally -- where we already know if MHP_MEMMAP_ON_MEMORY can be 
> effective or not.

I'm taking a shot at implementing the splitting internally in
memory_hotplug.c. The caller (kmem) side does become trivial with this
approach, but there's a slight complication if I don't have the module
param override (patch 1 of this series).

The kmem diff now looks like:

   diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
   index 898ca9505754..8be932f63f90 100644
   --- a/drivers/dax/kmem.c
   +++ b/drivers/dax/kmem.c
   @@ -105,6 +105,8 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
           data->mgid = rc;
    
           for (i = 0; i < dev_dax->nr_range; i++) {
   +               mhp_t mhp_flags = MHP_NID_IS_MGID | MHP_MEMMAP_ON_MEMORY |
   +                                 MHP_SPLIT_MEMBLOCKS;
                   struct resource *res;
                   struct range range;
    
   @@ -141,7 +143,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
                    * this as RAM automatically.
                    */
                   rc = add_memory_driver_managed(data->mgid, range.start,
   -                               range_len(&range), kmem_name, MHP_NID_IS_MGID);
   +                               range_len(&range), kmem_name, mhp_flags);
    
                   if (rc) {
                           dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n",
   

However this begins to fail if the memmap_on_memory modparam is not
set, as add_memory_driver_managed EINVALs from the
mhp_supports_memmap_on_memory() check.

The way to work around this would probably include doing the
mhp_supports_memmap_on_memory() check in kmem, in a loop to check for
each memblock sized chunk, and that feels like a leak of the
implementation details into the caller.

Any suggestions on how to go about this?
> 
> In general, we avoid placing important kernel data-structures on slow
> memory. That's one of the reasons why PMEM decided to mostly always use 
> ZONE_MOVABLE such that exactly what this patch does would not happen. So 
> I'm wondering if there would be demand for an additional toggle.
> 
> Because even with memmap_on_memory enabled in general, you might not 
> want to do that for dax/kmem.
> 
> IMHO, this patch should be dropped from your ppc64 series, as it's an
> independent change that might be valuable for other architectures as well.
> 
Sure thing, I can pick this back up and Aneesh can drop this from his set.