Message-ID: <60afb5ca-230e-265f-9579-dac66a152c33@redhat.com>
Date: Thu, 25 Feb 2021 19:58:01 +0100
From: David Hildenbrand <david@...hat.com>
To: Oscar Salvador <osalvador@...e.de>,
Andrew Morton <akpm@...ux-foundation.org>
Cc: Michal Hocko <mhocko@...nel.org>, Vlastimil Babka <vbabka@...e.cz>,
pasha.tatashin@...een.com, linux-kernel@...r.kernel.org,
linux-mm@...ck.org, Anshuman Khandual <anshuman.khandual@....com>
Subject: Re: [PATCH v2 1/7] mm,memory_hotplug: Allocate memmap from the added
memory range
On 09.02.21 14:38, Oscar Salvador wrote:
> Physical memory hotadd has to allocate a memmap (struct page array) for
> the newly added memory section. Currently, alloc_pages_node() is used
> for those allocations.
>
> This has some disadvantages:
> a) existing memory is consumed for that purpose
> (eg: ~2MB per 128MB memory section on x86_64)
> b) if the whole node is movable then we have off-node struct pages
> which have performance drawbacks.
> c) It might be that there are no PMD_ALIGNED chunks, so the memmap array
> gets populated with base pages.
>
> This can be improved when CONFIG_SPARSEMEM_VMEMMAP is enabled.
>
> Vmemmap page tables can map arbitrary memory.
> That means that we can simply use the beginning of each memory section and
> map struct pages there.
> struct pages which back the allocated space then just need to be treated
> carefully.
>
> Implementation-wise we will reuse the vmem_altmap infrastructure to override
> the default allocator used by __populate_section_memmap.
> Part of the implementation also relies on the memory_block structure gaining
> a new field which specifies the number of vmemmap_pages at the beginning.
> This comes in handy as, in {online,offline}_pages, all the isolation and
> migration is done on the (buddy_start_pfn, end_pfn] range,
> where buddy_start_pfn = start_pfn + nr_vmemmap_pages.
>
> In this way, we have:
>
> (start_pfn, buddy_start_pfn - 1] = Initialized and PageReserved
> (buddy_start_pfn, end_pfn] = Initialized and sent to buddy
nit: shouldn't it be
[start_pfn, buddy_start_pfn - 1]
[buddy_start_pfn, end_pfn - 1]
or
[start_pfn, buddy_start_pfn)
[buddy_start_pfn, end_pfn)
(I remember that "[" means inclusive and "(" means exclusive, I might be wrong :) )
I actually prefer the first variant.
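Unrelated to that nit, let me double-check my understanding of the hot-add path: IIUC, with memmap_on_memory enabled, add_memory_resource() essentially ends up doing something like the following (I'm only paraphrasing the description here, not quoting the patch, so names and fields may differ):
	/* Rough sketch: let the memmap be allocated from the added range itself. */
	struct vmem_altmap mhp_altmap = {
		.base_pfn = PHYS_PFN(start),
		.free	  = PHYS_PFN(size),
	};

	if (mhp_supports_memmap_on_memory(size))
		params.altmap = &mhp_altmap;
and __populate_section_memmap() then allocates the struct pages via the altmap instead of alloc_pages_node().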
>
> Hot-remove:
>
> We need to be careful when removing memory, as adding and
> removing memory needs to be done with the same granularity.
> To check that this assumption is not violated, we check the
> memory range we want to remove and if a) any memory block has
> vmemmap pages and b) the range spans more than a single memory
> block, we scream out loud and refuse to proceed.
>
> If all is good and the range was using memmap on memory (aka vmemmap pages),
> we construct an altmap structure so free_hugepage_table does the right
> thing and calls vmem_altmap_free instead of free_pagetable.
>
> Signed-off-by: Oscar Salvador <osalvador@...e.de>
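Regarding the hot-remove part: I read that as try_remove_memory() doing roughly the following (again, just sketching from the description above, so take the details with a grain of salt):
	/* Rough sketch, not the patch's actual code. */
	struct vmem_altmap mhp_altmap = {};
	struct vmem_altmap *altmap = NULL;
	unsigned long nr_vmemmap_pages = 0;

	walk_memory_blocks(start, size, &nr_vmemmap_pages,
			   get_nr_vmemmap_pages_cb);
	if (nr_vmemmap_pages) {
		/* Adding and removing have to use the same granularity. */
		if (size != memory_block_size_bytes())
			return -EINVAL;
		mhp_altmap.alloc = nr_vmemmap_pages;
		altmap = &mhp_altmap;
	}

	/* free_hugepage_table() then calls vmem_altmap_free() instead. */
	arch_remove_memory(nid, start, size, altmap);
If that matches what you do, great.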
[...]
> /*
> * online_page_callback contains pointer to current page onlining function.
> * Initially it is generic_online_page(). If it is required it could be
> @@ -638,10 +640,20 @@ void generic_online_page(struct page *page, unsigned int order)
> }
> EXPORT_SYMBOL_GPL(generic_online_page);
>
> -static void online_pages_range(unsigned long start_pfn, unsigned long nr_pages)
> +static void online_pages_range(unsigned long start_pfn, unsigned long nr_pages,
> + unsigned long buddy_start_pfn)
> {
> const unsigned long end_pfn = start_pfn + nr_pages;
> - unsigned long pfn;
> + unsigned long pfn = buddy_start_pfn;
> +
> + /*
> + * When using memmap_on_memory, the range might be unaligned as the
> + * first pfns are used for vmemmap pages. Align it in case we need to.
> + */
> + if (pfn & ((1 << (MAX_ORDER - 1)) - 1)) {
if (!IS_ALIGNED(pfn, MAX_ORDER_NR_PAGES))
> + (*online_page_callback)(pfn_to_page(pfn), pageblock_order);
> + pfn += 1 << pageblock_order;
pfn += pageblock_nr_pages;
Can you add a comment explaining why we can be sure that we are off by a single pageblock at most? What about s390x, where MAX_ORDER_NR_PAGES == 4 * pageblock_nr_pages?
Would it make things simpler to just do a
while (!IS_ALIGNED(pfn, MAX_ORDER_NR_PAGES)) {
(*online_page_callback)(pfn_to_page(pfn), 0);
pfn++;
}
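That way we would not have to make any assumptions about the ratio between MAX_ORDER_NR_PAGES and pageblock_nr_pages, so the s390x case above would be covered as well.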
[...]
>
> /*
> * Freshly onlined pages aren't shuffled (e.g., all pages are placed to
> @@ -1064,6 +1085,12 @@ static int online_memory_block(struct memory_block *mem, void *arg)
> return device_online(&mem->dev);
> }
>
> +bool mhp_supports_memmap_on_memory(unsigned long size)
> +{
> + return memmap_on_memory_enabled &&
> + size == memory_block_size_bytes();
Regarding my other comments in reply to the other patches: I'd move all the magic you have when trying to enable this right here.
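IOW (and I'm making the exact conditions up here - untested), I'd expect this helper to end up looking something like:
bool mhp_supports_memmap_on_memory(unsigned long size)
{
	unsigned long vmemmap_size = (size >> PAGE_SHIFT) * sizeof(struct page);

	/*
	 * Rough sketch: feature enabled, a single memory block, the vmemmap
	 * fits into huge (PMD) pages and the remaining part of the block is
	 * pageblock-aligned.
	 */
	return memmap_on_memory_enabled &&
	       size == memory_block_size_bytes() &&
	       IS_ALIGNED(vmemmap_size, PMD_SIZE) &&
	       IS_ALIGNED(size - vmemmap_size, pageblock_nr_pages << PAGE_SHIFT);
}
with whatever additional checks the other patches currently perform when enabling the feature.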
>
> -int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
> +int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
> + unsigned long nr_vmemmap_pages)
> {
> const unsigned long end_pfn = start_pfn + nr_pages;
> - unsigned long pfn, system_ram_pages = 0;
> + unsigned long pfn, buddy_start_pfn, buddy_nr_pages, system_ram_pages = 0;
> unsigned long flags;
> struct zone *zone;
> struct memory_notify arg;
> @@ -1578,6 +1620,9 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
> !IS_ALIGNED(start_pfn | nr_pages, PAGES_PER_SECTION)))
> return -EINVAL;
>
> + buddy_start_pfn = start_pfn + nr_vmemmap_pages;
> + buddy_nr_pages = nr_pages - nr_vmemmap_pages;
> +
> mem_hotplug_begin();
>
> /*
> @@ -1613,7 +1658,7 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
> zone_pcp_disable(zone);
>
> /* set above range as isolated */
> - ret = start_isolate_page_range(start_pfn, end_pfn,
> + ret = start_isolate_page_range(buddy_start_pfn, end_pfn,
> MIGRATE_MOVABLE,
> MEMORY_OFFLINE | REPORT_FAILURE);
Did you take care to properly adjust undo_isolate_page_range() as well? I can't spot it.
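I'd assume at least the undo_isolate_page_range() in the error path has to match the adjusted isolation range, i.e., something like (guessing at the exact label here):
 failed_removal_isolated:
-	undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
+	undo_isolate_page_range(buddy_start_pfn, end_pfn, MIGRATE_MOVABLE);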
> if (ret) {
> @@ -1633,7 +1678,7 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
> }
>
> do {
> - pfn = start_pfn;
> + pfn = buddy_start_pfn;
> do {
> if (signal_pending(current)) {
> ret = -EINTR;
> @@ -1664,18 +1709,18 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
> * offlining actually in order to make hugetlbfs's object
> * counting consistent.
> */
> - ret = dissolve_free_huge_pages(start_pfn, end_pfn);
> + ret = dissolve_free_huge_pages(buddy_start_pfn, end_pfn);
> if (ret) {
> reason = "failure to dissolve huge pages";
> goto failed_removal_isolated;
> }
>
> - ret = test_pages_isolated(start_pfn, end_pfn, MEMORY_OFFLINE);
> + ret = test_pages_isolated(buddy_start_pfn, end_pfn, MEMORY_OFFLINE);
>
> } while (ret);
>
> /* Mark all sections offline and remove free pages from the buddy. */
> - __offline_isolated_pages(start_pfn, end_pfn);
> + __offline_isolated_pages(start_pfn, end_pfn, buddy_start_pfn);
> pr_debug("Offlined Pages %ld\n", nr_pages);
>
> /*
> @@ -1684,13 +1729,13 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
> * of isolated pageblocks, memory onlining will properly revert this.
> */
> spin_lock_irqsave(&zone->lock, flags);
> - zone->nr_isolate_pageblock -= nr_pages / pageblock_nr_pages;
> + zone->nr_isolate_pageblock -= buddy_nr_pages / pageblock_nr_pages;
> spin_unlock_irqrestore(&zone->lock, flags);
>
> zone_pcp_enable(zone);
>
> /* removal success */
> - adjust_managed_page_count(pfn_to_page(start_pfn), -nr_pages);
> + adjust_managed_page_count(pfn_to_page(start_pfn), -buddy_nr_pages);
> zone->present_pages -= nr_pages;
>
> pgdat_resize_lock(zone->zone_pgdat, &flags);
> @@ -1750,6 +1795,14 @@ static int check_memblock_offlined_cb(struct memory_block *mem, void *arg)
> return 0;
> }
>
[...]
> +static int get_nr_vmemmap_pages_cb(struct memory_block *mem, void *arg)
> +{
> + unsigned long *nr_vmemmap_pages = (unsigned long *)arg;
> +
> + *nr_vmemmap_pages += mem->nr_vmemmap_pages;
> + return mem->nr_vmemmap_pages;
> +}
> +
I think you can do this more easily: all you want to know is whether there
is any block that has nr_vmemmap_pages set - and return that value.
static int first_set_nr_vmemmap_pages_cb(struct memory_block *mem, void *arg)
{
/* If not set, continue with the next block. */
return mem->nr_vmemmap_pages;
}
...
> + walk_memory_blocks(start, size, &nr_vmemmap_pages,
> + get_nr_vmemmap_pages_cb);
...
nr_vmemmap_pages = walk_memory_blocks(start ...)
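(walk_memory_blocks() stops walking as soon as the callback returns a non-zero value and forwards that value, so this gives you the nr_vmemmap_pages of the first block that has them set.)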
Looks quite promising, only a couple of things to fine-tune :) Thanks!
--
Thanks,
David / dhildenb