Message-ID: <dbd19aea-fe18-ec42-7932-f03109cb399e@redhat.com>
Date:   Fri, 26 Jul 2019 11:41:46 +0200
From:   David Hildenbrand <david@...hat.com>
To:     Oscar Salvador <osalvador@...e.de>
Cc:     akpm@...ux-foundation.org, dan.j.williams@...el.com,
        pasha.tatashin@...een.com, mhocko@...e.com,
        anshuman.khandual@....com, Jonathan.Cameron@...wei.com,
        vbabka@...e.cz, linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v3 2/5] mm: Introduce a new Vmemmap page-type

On 26.07.19 11:25, Oscar Salvador wrote:
> On Fri, Jul 26, 2019 at 10:48:54AM +0200, David Hildenbrand wrote:
>>> Signed-off-by: Oscar Salvador <osalvador@...e.de>
>>> ---
>>>  include/linux/mm.h         | 17 +++++++++++++++++
>>>  include/linux/mm_types.h   |  5 +++++
>>>  include/linux/page-flags.h | 19 +++++++++++++++++++
>>>  3 files changed, 41 insertions(+)
>>>
>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>> index 45f0ab0ed4f7..432175f8f8d2 100644
>>> --- a/include/linux/mm.h
>>> +++ b/include/linux/mm.h
>>> @@ -2904,6 +2904,23 @@ static inline bool debug_guardpage_enabled(void) { return false; }
>>>  static inline bool page_is_guard(struct page *page) { return false; }
>>>  #endif /* CONFIG_DEBUG_PAGEALLOC */
>>>  
>>> +static __always_inline struct page *vmemmap_head(struct page *page)
>>> +{
>>> +	return (struct page *)page->vmemmap_head;
>>> +}
>>> +
>>> +static __always_inline unsigned long vmemmap_nr_sections(struct page *page)
>>> +{
>>> +	struct page *head = vmemmap_head(page);
>>> +	return head->vmemmap_sections;
>>> +}
>>> +
>>> +static __always_inline unsigned long vmemmap_nr_pages(struct page *page)
>>> +{
>>> +	struct page *head = vmemmap_head(page);
>>> +	return head->vmemmap_pages - (page - head);
>>> +}
>>> +
>>>  #if MAX_NUMNODES > 1
>>>  void __init setup_nr_node_ids(void);
>>>  #else
>>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>>> index 6a7a1083b6fb..51dd227f2a6b 100644
>>> --- a/include/linux/mm_types.h
>>> +++ b/include/linux/mm_types.h
>>> @@ -170,6 +170,11 @@ struct page {
>>>  			 * pmem backed DAX files are mapped.
>>>  			 */
>>>  		};
>>> +		struct {        /* Vmemmap pages */
>>> +			unsigned long vmemmap_head;
>>> +			unsigned long vmemmap_sections; /* Number of sections */
>>> +			unsigned long vmemmap_pages;    /* Number of pages */
>>> +		};
>>>  
>>>  		/** @rcu_head: You can use this to free a page by RCU. */
>>>  		struct rcu_head rcu_head;
>>> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
>>> index f91cb8898ff0..75f302a532f9 100644
>>> --- a/include/linux/page-flags.h
>>> +++ b/include/linux/page-flags.h
>>> @@ -708,6 +708,7 @@ PAGEFLAG_FALSE(DoubleMap)
>>>  #define PG_kmemcg	0x00000200
>>>  #define PG_table	0x00000400
>>>  #define PG_guard	0x00000800
>>> +#define PG_vmemmap     0x00001000
>>>  
>>>  #define PageType(page, flag)						\
>>>  	((page->page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE)
>>> @@ -764,6 +765,24 @@ PAGE_TYPE_OPS(Table, table)
>>>   */
>>>  PAGE_TYPE_OPS(Guard, guard)
>>>  
>>> +/*
>>> + * Vmemmap pages refer to those pages that are used to create the memmap
>>> + * array and reside within the same memory range that was hotplugged, so
>>> + * they are self-hosted. (see include/linux/memory_hotplug.h)
>>> + */
>>> +PAGE_TYPE_OPS(Vmemmap, vmemmap)
>>> +static __always_inline void SetPageVmemmap(struct page *page)
>>> +{
>>> +	__SetPageVmemmap(page);
>>> +	__SetPageReserved(page);
>>
>> So, the issue with some vmemmap pages is that their "struct pages" reside
>> on the memory they manage (which is nice, but complicated - e.g. when
>> onlining/offlining).
> 
> Hi David,
> 
> Not really.
> Vmemmap pages are simply skipped during onlining/offlining handling.
> We do not need to a) send them to the buddy allocator or b) migrate them over.
> A look at patch#4 will probably help, as the crux of the matter is there.

Right, but you have to prevent the onlining code from trying to reinitialize
the vmemmap when you online the first memory block. Will dive
into the details (patch #4) next (maybe not today, but early next week) :)

> 
>>
>> I would expect that you properly initialize the struct pages for the
>> vmemmap pages (now it gets confusing :) ) when adding memory. The other
>> struct pages are initialized when onlining/offlining.
>>
>> So, at this point, the pages should already be marked reserved, no? Or
>> are the struct pages for the vmemmap never initialized?
>>
>> What zone do these vmemmap pages have? They are not assigned to any zone
>> and will never be :/
> 
> This patch is only a preparation, the real "fun" is in patch#4.
> 
> Vmemmap pages initialization occurs in mhp_mark_vmemmap_pages, called from
> __add_pages() (patch#4).
> In there we a) mark the page as vmemmap and b) initialize the fields we need to
> track some metadata (sections, pages and head).
> 
> In __init_single_page(), when onlining, the rest of the fields will be set up
> properly (zone, refcount, etc).
> 
> Chunk from patch#4:
> 
> static void __meminit __init_single_page(struct page *page, unsigned long pfn,
>                                 unsigned long zone, int nid)
> {
>         if (PageVmemmap(page))
>                 /*
>                  * Vmemmap pages need to preserve their state.
>                  */
>                 goto preserve_state;

Can you be sure there are no false positives? (if I remember correctly,
this memory might be completely uninitialized - I might be wrong)

> 
>         mm_zero_struct_page(page);
>         page_mapcount_reset(page);
>         INIT_LIST_HEAD(&page->lru);
> preserve_state:
>         init_page_count(page);
>         set_page_links(page, zone, nid, pfn);
>         page_cpupid_reset_last(page);
>         page_kasan_tag_reset(page);
> 
> So, vmemmap pages will fall within the same zone as the range we are adding;
> that does not change.

I wonder if that is the right thing to do, hmmmm, because they are
effectively not part of that zone (not online)

Will have a look at the details :)

> 
>>> +}
>>> +
>>> +static __always_inline void ClearPageVmemmap(struct page *page)
>>> +{
>>> +	__ClearPageVmemmap(page);
>>> +	__ClearPageReserved(page);
>>
>> You sure you want to clear the reserved flag here? Is this function
>> really needed?
>>
>> (when you add memory, you can mark all relevant pages as vmemmap pages,
>> which is valid until removing the memory)
>>
>> Let's draw a picture so I am not confused
>>
>> [ ------ added memory ------ ]
>> [ vmemmap]
>>
>> The first page of the added memory is a vmemmap page AND contains its
>> own vmemmap, right?
> 
> Not only the first page.
> Depending on how large the chunk you are adding is, the number of vmemmap
> pages will vary, because we need to cover the memmap for the range.
> 
> e.g:
> 
>  - 128MB (1 section) = 512 vmemmap pages at the beginning of the range
>  - 256MB (2 sections) = 1024 vmemmap pages at the beginning of the range
>  ...
> 

Right.

>> When adding memory, you would initialize all struct pages of the
>> vmemmap (residing on itself) and mark them with SetPageVmemmap().
>>
>> When removing memory, there is nothing to do, all struct pages are
>> dropped. So why do we need the ClearPageVmemmap() ?
> 
> Well, it is not really needed as we only call ClearPageVmemmap when we are
> actually removing the memory with vmemmap_free()->...
> So one could argue that since the memory is going away, there is no need
> to clear anything in there.
> 
> I just made it for consistency purposes.
> 
> I can drop it if you feel strongly about it.

Not strong, was just wondering why that is needed at all in the big
picture :)

-- 

Thanks,

David / dhildenb
