Message-ID: <df8220ac-4214-5ff6-0048-35553fea8c8c@redhat.com>
Date: Fri, 16 Apr 2021 12:33:34 +0200
From: David Hildenbrand <david@...hat.com>
To: Oscar Salvador <osalvador@...e.de>,
Andrew Morton <akpm@...ux-foundation.org>
Cc: Michal Hocko <mhocko@...nel.org>,
Anshuman Khandual <anshuman.khandual@....com>,
Pavel Tatashin <pasha.tatashin@...een.com>,
Vlastimil Babka <vbabka@...e.cz>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH v8 4/8] mm,memory_hotplug: Allocate memmap from the added
memory range
On 16.04.21 12:21, Oscar Salvador wrote:
> Physical memory hotadd has to allocate a memmap (struct page array) for
> the newly added memory section. Currently, alloc_pages_node() is used
> for those allocations.
>
> This has some disadvantages:
> a) existing memory is consumed for that purpose
> (eg: ~2MB per 128MB memory section on x86_64)
> b) if the whole node is movable then we have off-node struct pages
> which has performance drawbacks.
> c) It might be there are no PMD_ALIGNED chunks so memmap array gets
> populated with base pages.
>
> This can be improved when CONFIG_SPARSEMEM_VMEMMAP is enabled.
>
> Vmemmap page tables can map arbitrary memory.
> That means that we can simply use the beginning of each memory section and
> map struct pages there.
> struct pages which back the allocated space then just need to be treated
> carefully.
>
> Implementation wise we will reuse vmem_altmap infrastructure to override
> the default allocator used by __populate_section_memmap.
> Part of the implementation also relies on memory_block structure gaining
> a new field which specifies the number of vmemmap_pages at the beginning.
> This patch also introduces the following functions:
>
> - mhp_init_memmap_on_memory:
> Initializes vmemmap pages by calling move_pfn_range_to_zone(),
> calls kasan_add_zero_shadow(), and onlines as many sections
> as vmemmap pages fully span.
> - mhp_deinit_memmap_on_memory:
> Undoes what mhp_init_memmap_on_memory() did.
>
> The new function memory_block_online() calls mhp_init_memmap_on_memory() before
> doing the actual online_pages(). Should online_pages() fail, we clean up
> by calling mhp_deinit_memmap_on_memory().
> Adjusting of present_pages is done at the end once we know that online_pages()
> succeeded.
>
> On offline, memory_block_offline() needs to unaccount vmemmap pages from
> present_pages before calling offline_pages().
> This is necessary because offline_pages() tears down some structures based
> on whether the node or the zone becomes empty.
> If offline_pages() fails, we account back vmemmap pages.
> If it succeeds, we call mhp_deinit_memmap_on_memory().
>
> Hot-remove:
>
> We need to be careful when removing memory, as adding and
> removing memory needs to be done with the same granularity.
> To check that this assumption is not violated, we check the
> memory range we want to remove and if a) any memory block has
> vmemmap pages and b) the range spans more than a single memory
> block, we scream out loud and refuse to proceed.
>
> If all is good and the range was using memmap on memory (aka vmemmap pages),
> we construct an altmap structure so free_hugepage_table does the right
> thing and calls vmem_altmap_free instead of free_pagetable.
>
> Signed-off-by: Oscar Salvador <osalvador@...e.de>
> ---
> drivers/base/memory.c | 75 ++++++++++++++++--
> include/linux/memory.h | 8 +-
> include/linux/memory_hotplug.h | 17 +++-
> include/linux/memremap.h | 2 +-
> include/linux/mmzone.h | 7 +-
> mm/Kconfig | 5 ++
> mm/memory_hotplug.c | 171 ++++++++++++++++++++++++++++++++++++++---
> mm/sparse.c | 2 -
> 8 files changed, 265 insertions(+), 22 deletions(-)
>
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index f209925a5d4e..179857d53982 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -173,16 +173,76 @@ static int memory_block_online(struct memory_block *mem)
> {
> unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
> unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
> + unsigned long nr_vmemmap_pages = mem->nr_vmemmap_pages;
> + struct zone *zone;
> + int ret;
> +
> + zone = mhp_get_target_zone(start_pfn, nr_pages, mem->nid,
> + mem->online_type);
> +
> + /*
> + * Although vmemmap pages have a different lifecycle than the pages
> + * they describe (they remain until the memory is unplugged), doing
> + * its initialization and accounting at hot-{online,offline} stage
s/its/their/
s/hot-{online,offline} stage/memory onlining\/offlining stage/
> + * simplifies things a lot
> + */
> + if (nr_vmemmap_pages) {
> + ret = mhp_init_memmap_on_memory(start_pfn, nr_vmemmap_pages, zone);
> + if (ret)
> + return ret;
> + }
> +
> + ret = online_pages(start_pfn + nr_vmemmap_pages,
> + nr_pages - nr_vmemmap_pages, zone);
> + if (ret) {
> + if (nr_vmemmap_pages)
> + mhp_deinit_memmap_on_memory(start_pfn, nr_vmemmap_pages);
> + return ret;
> + }
> +
> + /*
> + * Account once onlining succeeded. If the page was unpopulated, it is
s/page/zone/
> + * now already properly populated.
> + */
> + if (nr_vmemmap_pages)
> + adjust_present_page_count(zone, nr_vmemmap_pages);
>
> - return online_pages(start_pfn, nr_pages, mem->online_type, mem->nid);
> + return ret;
> }
>
> static int memory_block_offline(struct memory_block *mem)
> {
> unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
> unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
> + unsigned long nr_vmemmap_pages = mem->nr_vmemmap_pages;
> + struct zone *zone;
> + int ret;
> +
> + zone = page_zone(pfn_to_page(start_pfn));
>
> - return offline_pages(start_pfn, nr_pages);
> + /*
> + * Unaccount before offlining, such that unpopulated zone and kthreads
> + * can properly be torn down in offline_pages().
> + */
> + if (nr_vmemmap_pages)
> + adjust_present_page_count(zone, -nr_vmemmap_pages);
> +
> + ret = offline_pages(start_pfn + nr_vmemmap_pages,
> + nr_pages - nr_vmemmap_pages);
> + if (ret) {
> + /* offline_pages() failed. Account back. */
> + if (nr_vmemmap_pages)
> + adjust_present_page_count(zone, nr_vmemmap_pages);
> + return ret;
> + }
> +
> + /*
> + * Re-adjust present pages if offline_pages() fails.
> + */
That comment is stale. I'd just drop it.
> + if (nr_vmemmap_pages)
> + mhp_deinit_memmap_on_memory(start_pfn, nr_vmemmap_pages);
> +
> + return ret;
> }
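To make the accounting order in memory_block_offline() above easier to follow, here is a minimal userspace model. It is only a sketch: every mock_* function, the two knobs, and the single present_pages counter are hypothetical stand-ins, not kernel code. The model also assumes that a successful offline_pages() subtracts the pages it offlined from the present count.

```c
/* Userspace model only; nothing here is the kernel implementation. */
#include <assert.h>

static long present_pages;      /* models the zone's present page count */
static int offline_should_fail; /* hypothetical knob to force a failure */
static int deinit_calls;        /* how often the deinit stand-in ran */

/* Stand-in for adjust_present_page_count(): apply a signed delta. */
static void mock_adjust_present_page_count(long nr_pages)
{
	present_pages += nr_pages;
}

/* Stand-in for offline_pages(): fails on request, else unaccounts. */
static int mock_offline_pages(long nr_pages)
{
	if (offline_should_fail)
		return -16; /* value of -EBUSY on Linux */
	present_pages -= nr_pages; /* model assumption, see lead-in */
	return 0;
}

/* Stand-in for mhp_deinit_memmap_on_memory(): just count the call. */
static void mock_mhp_deinit_memmap_on_memory(void)
{
	deinit_calls++;
}

/* Mirrors the order of operations in the quoted memory_block_offline(). */
static int mock_memory_block_offline(long nr_pages, long nr_vmemmap_pages)
{
	int ret;

	/* Unaccount vmemmap pages first so the zone can become empty. */
	if (nr_vmemmap_pages)
		mock_adjust_present_page_count(-nr_vmemmap_pages);

	ret = mock_offline_pages(nr_pages - nr_vmemmap_pages);
	if (ret) {
		/* Offlining failed: account the vmemmap pages back. */
		if (nr_vmemmap_pages)
			mock_adjust_present_page_count(nr_vmemmap_pages);
		return ret;
	}

	/* Offlining succeeded: only now tear down the vmemmap pages. */
	if (nr_vmemmap_pages)
		mock_mhp_deinit_memmap_on_memory();

	return 0;
}
```

With a 128MB block of 4K pages (32768 pages, 512 of them vmemmap pages), a failed offline leaves the count at 32768, and a successful one brings it to 0 with exactly one deinit call. No re-adjustment happens on the success path, which is why the comment there no longer applies.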
[...]
> -static void adjust_present_page_count(struct zone *zone, long nr_pages)
> +/*
> + * This function should only be called by memory_block_{online,offline},
> + * and {online,offline}_pages.
> + */
> +void adjust_present_page_count(struct zone *zone, long nr_pages)
> {
> unsigned long flags;
>
> @@ -839,12 +850,64 @@ static void adjust_present_page_count(struct zone *zone, long nr_pages)
> pgdat_resize_unlock(zone->zone_pgdat, &flags);
> }
>
> -int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
> - int online_type, int nid)
> +struct zone *mhp_get_target_zone(unsigned long pfn, unsigned long nr_pages,
> + int nid, int online_type)
> +{
> + return zone_for_pfn_range(online_type, nid, pfn, nr_pages);
> +}
> +
Oh, you can just use zone_for_pfn_range() directly for now. No need for
mhp_get_target_zone(). Sorry for not realizing this.
> +int mhp_init_memmap_on_memory(unsigned long pfn, unsigned long nr_pages,
> + struct zone *zone)
> +{
> + unsigned long end_pfn = pfn + nr_pages;
> + int ret;
> +
> + /*
> + * Initialize vmemmap pages with the corresponding node, zone links set.
> + */
> + move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_UNMOVABLE);
> +
> + ret = kasan_add_zero_shadow(__va(PFN_PHYS(pfn)), PFN_PHYS(nr_pages));
> + if (ret) {
> + remove_pfn_range_from_zone(zone, pfn, nr_pages);
> + return ret;
> + }
IIRC, we have to add the zero shadow first, before touching the memory.
This is also what mm/memremap.c does.
In mhp_deinit_memmap_on_memory(), you already remove in the proper
(reversed) order :)
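A rough userspace sketch of the ordering I mean, with the zero shadow added before the range is touched. The mock_* functions, the event log, and the shadow_should_fail knob are all hypothetical; they only record call order and do not reflect the real signatures. One side effect, at least in this model, is that a shadow failure leaves nothing to undo in the zone.

```c
/* Userspace model only; the stubs merely record the order of calls. */
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_EVENTS 8

static const char *events[MAX_EVENTS]; /* names of calls, in order */
static size_t nr_events;
static int shadow_should_fail;         /* hypothetical failure knob */

static void record(const char *what)
{
	if (nr_events < MAX_EVENTS)
		events[nr_events++] = what;
}

static void reset_events(void)
{
	nr_events = 0;
}

/* Stand-in for kasan_add_zero_shadow(); returns 0 on success. */
static int mock_kasan_add_zero_shadow(void)
{
	record("kasan_add_zero_shadow");
	return shadow_should_fail ? -12 /* value of -ENOMEM on Linux */ : 0;
}

/* Stand-in for move_pfn_range_to_zone(), which initializes the memmap. */
static void mock_move_pfn_range_to_zone(void)
{
	record("move_pfn_range_to_zone");
}

/* Suggested order: shadow first, then initialize the vmemmap pages. */
static int mock_mhp_init_memmap_on_memory(void)
{
	int ret;

	ret = mock_kasan_add_zero_shadow();
	if (ret)
		return ret; /* zone untouched so far, nothing to roll back */

	mock_move_pfn_range_to_zone();
	return 0;
}
```

On success the log reads kasan_add_zero_shadow, then move_pfn_range_to_zone. On a shadow failure only the first entry is recorded.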
> +
> +int __ref online_pages(unsigned long pfn, unsigned long nr_pages, struct zone *zone)
> {
> unsigned long flags;
> - struct zone *zone;
> int need_zonelists_rebuild = 0;
> + int nid;
> int ret;
> struct memory_notify arg;
>
> @@ -860,8 +923,9 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
>
> mem_hotplug_begin();
>
> + nid = zone_to_nid(zone);
I'd do that right above, at the declaration:
const int nid = zone_to_nid(zone);
[...]
--
Thanks,
David / dhildenb