Message-ID: <CA+CK2bCpay7V=4_AGiTEyX2OG_6rvqR42MkVSJwEhGa7h+5R4g@mail.gmail.com>
Date: Fri, 9 Aug 2024 14:09:53 -0400
From: Pasha Tatashin <pasha.tatashin@...een.com>
To: David Hildenbrand <david@...hat.com>
Cc: akpm@...ux-foundation.org, linux-kernel@...r.kernel.org, 
	linux-mm@...ck.org, linux-cxl@...r.kernel.org, cerasuolodomenico@...il.com, 
	hannes@...xchg.org, j.granados@...sung.com, lizhijian@...itsu.com, 
	muchun.song@...ux.dev, nphamcs@...il.com, rientjes@...gle.com, 
	rppt@...nel.org, souravpanda@...gle.com, vbabka@...e.cz, willy@...radead.org, 
	dan.j.williams@...el.com, yi.zhang@...hat.com, alison.schofield@...el.com, 
	yosryahmed@...gle.com
Subject: Re: [PATCH v4 3/3] mm: don't account memmap per-node

On Fri, Aug 9, 2024 at 3:31 AM David Hildenbrand <david@...hat.com> wrote:
>
> On 08.08.24 23:34, Pasha Tatashin wrote:
> > Fix invalid access to pgdat during hot-remove operation:
> > ndctl users reported a GPF when trying to destroy a namespace:
> > $ ndctl destroy-namespace all -r all -f
> >   Segmentation fault
> >   dmesg:
> >   Oops: general protection fault, probably for
> >   non-canonical address 0xdffffc0000005650: 0000 [#1] PREEMPT SMP KASAN
> >   PTI
> >   KASAN: probably user-memory-access in range
> >   [0x000000000002b280-0x000000000002b287]
> >   CPU: 26 UID: 0 PID: 1868 Comm: ndctl Not tainted 6.11.0-rc1 #1
> >   Hardware name: Dell Inc. PowerEdge R640/08HT8T, BIOS
> >   2.20.1 09/13/2023
> >   RIP: 0010:mod_node_page_state+0x2a/0x110
> >
> > cxl-test users report a GPF when trying to unload the test module:
> > $ modprobe -r cxl-test
> >   dmesg
> >   BUG: unable to handle page fault for address: 0000000000004200
> >   #PF: supervisor read access in kernel mode
> >   #PF: error_code(0x0000) - not-present page
> >   PGD 0 P4D 0
> >   Oops: Oops: 0000 [#1] PREEMPT SMP PTI
> >   CPU: 0 UID: 0 PID: 1076 Comm: modprobe Tainted: G O N 6.11.0-rc1 #197
> >   Tainted: [O]=OOT_MODULE, [N]=TEST
> >   Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/15
> >   RIP: 0010:mod_node_page_state+0x6/0x90
> >
> > Currently, when memory is hot-plugged or hot-removed the accounting is
> > done based on the assumption that memmap is allocated from the same node
> > as the hot-plugged/hot-removed memory, which is not always the case.
> >
> > In addition, there are challenges with keeping the node id of the memory
> > that is being removed until the time when memmap accounting is actually
> > performed, since this is done after remove_pfn_range_from_zone() and
> > also after remove_memory_block_devices(). This means that we cannot use
> > pgdat or walk through memblocks to get the nid.
> >
> > Given all of that, account the memmap overhead system wide instead.
> >
> > For this we are going to be using global atomic counters. Given that
> > memmap size is rarely modified, and normally is only modified either
> > during early boot when there is only one CPU, or under a hotplug global
> > mutex lock, there is no need for per-cpu optimizations.
> >
> > Also, while we are here rename nr_memmap to nr_memmap_pages, and
> > nr_memmap_boot to nr_memmap_boot_pages to be self explanatory that the
> > units are in page count.
> >
> > Reported-by: Yi Zhang <yi.zhang@...hat.com>
> > Closes: https://lore.kernel.org/linux-cxl/CAHj4cs9Ax1=CoJkgBGP_+sNu6-6=6v=_L-ZBZY0bVLD3wUWZQg@mail.gmail.com
> > Reported-by: Alison Schofield <alison.schofield@...el.com>
> > Closes: https://lore.kernel.org/linux-mm/Zq0tPd2h6alFz8XF@aschofie-mobl2/#t
> >
> > Fixes: 15995a352474 ("mm: report per-page metadata information")
> > Signed-off-by: Pasha Tatashin <pasha.tatashin@...een.com>
> > Tested-by: Dan Williams <dan.j.williams@...el.com>
> > ---
>
> [...]
>
> In general
>
> Acked-by: David Hildenbrand <david@...hat.com>
>
> Two nits below:
>
>
> >   static void free_map_bootmem(struct page *memmap)
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index 6f8aa4766f16..ad82c1bf0e63 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -1033,6 +1033,23 @@ unsigned long node_page_state(struct pglist_data *pgdat,
> >   }
> >   #endif
> >
> > +/*
> > + * Count number of pages "struct page" and "struct page_ext" consume.
> > + * nr_memmap_boot: # of pages allocated by boot allocator & not part of MemTotal
> > + * nr_memmap: # of pages that were allocated by buddy allocator
> > + */
> > +static atomic_long_t nr_memmap_boot, nr_memmap;
>
> I *think* the clean and portable way to do it is use ATOMIC_INIT(0) for
> both. [even though what you have likely works on all archs]

Yeah, it is not necessary, but I will add ATOMIC_LONG_INIT(0).

>
> > +
> > +void mod_memmap_boot(long delta)
> > +{
> > +     atomic_long_add(delta, &nr_memmap_boot);
> > +}
> > +
> > +void mod_memmap(long delta)
> > +{
> > +     atomic_long_add(delta, &nr_memmap);
> > +}
> > +
>
> Nit picking: (up to you)
>
> I'd do it similar to totalram_pages_add():
>
> memmap_pages_add()
> memmap_boot_pages_add()
>
> And call the variables something like
>
> static atomic_long_t memmap_pages_boot, memmap_pages;

Sure, I will rename them.
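
For illustration only, here is a rough userspace sketch of what the renamed counters and helpers could look like. It assumes C11 <stdatomic.h> as a stand-in for the kernel's atomic_long_t API (atomic_fetch_add() in place of atomic_long_add(), plain zero initializers in place of ATOMIC_LONG_INIT(0)). The *_read() helpers and memmap_selftest() are hypothetical names added for the example and are not part of the patch.

```c
#include <assert.h>
#include <stdatomic.h>

/*
 * Userspace stand-in for the global memmap counters.
 * memmap_pages_boot: pages allocated by the boot allocator.
 * memmap_pages:      pages allocated by the buddy allocator.
 * In the kernel these would be atomic_long_t with ATOMIC_LONG_INIT(0).
 */
static atomic_long memmap_pages_boot = 0;
static atomic_long memmap_pages = 0;

/* Adjust the boot-time counter; delta may be negative. */
void memmap_boot_pages_add(long delta)
{
	atomic_fetch_add(&memmap_pages_boot, delta);
}

/* Adjust the runtime counter; delta may be negative (e.g. hot-remove). */
void memmap_pages_add(long delta)
{
	atomic_fetch_add(&memmap_pages, delta);
}

/* Hypothetical read helpers, only here so the example can be checked. */
long memmap_boot_pages_read(void)
{
	return atomic_load(&memmap_pages_boot);
}

long memmap_pages_read(void)
{
	return atomic_load(&memmap_pages);
}

/*
 * Hypothetical self-check: a hot-add followed by a matching hot-remove
 * must leave the runtime counter where it started, with no node id needed.
 */
void memmap_selftest(void)
{
	long before = memmap_pages_read();

	memmap_pages_add(512);		/* memmap allocated on hot-add */
	assert(memmap_pages_read() == before + 512);
	memmap_pages_add(-512);		/* memmap freed on hot-remove */
	assert(memmap_pages_read() == before);
}
```

Because the counters are global, the remove path only needs the page count, which is the point of dropping the per-node accounting.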

Pasha
