Message-ID: <46b8e215-97d4-9097-9941-58e050c5efcc@redhat.com>
Date:   Thu, 29 Aug 2019 18:55:42 +0200
From:   David Hildenbrand <david@...hat.com>
To:     Alexander Duyck <alexander.duyck@...il.com>,
        Dan Williams <dan.j.williams@...el.com>
Cc:     LKML <linux-kernel@...r.kernel.org>, linux-mm <linux-mm@...ck.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Oscar Salvador <osalvador@...e.com>,
        Michal Hocko <mhocko@...e.com>,
        Pavel Tatashin <pasha.tatashin@...een.com>,
        Wei Yang <richard.weiyang@...il.com>, Qian Cai <cai@....pw>,
        "Matthew Wilcox (Oracle)" <willy@...radead.org>,
        "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
        Stephen Rothwell <sfr@...b.auug.org.au>,
        Dave Airlie <airlied@...hat.com>,
        Andrey Konovalov <andreyknvl@...gle.com>,
        Alexander Duyck <alexander.h.duyck@...ux.intel.com>,
        Ira Weiny <ira.weiny@...el.com>,
        John Hubbard <jhubbard@...dia.com>,
        Arun KS <arunks@...eaurora.org>,
        Souptick Joarder <jrdr.linux@...il.com>,
        Robin Murphy <robin.murphy@....com>,
        Yang Shi <yang.shi@...ux.alibaba.com>,
        Jason Gunthorpe <jgg@...pe.ca>,
        Logan Gunthorpe <logang@...tatee.com>,
        Vlastimil Babka <vbabka@...e.cz>,
        Mel Gorman <mgorman@...hsingularity.net>,
        Mike Rapoport <rppt@...ux.vnet.ibm.com>,
        Alexander Potapenko <glider@...gle.com>
Subject: Re: [PATCH v3 01/11] mm/memremap: Get rid of
 memmap_init_zone_device()

On 29.08.19 18:39, Alexander Duyck wrote:
> On Thu, Aug 29, 2019 at 12:00 AM David Hildenbrand <david@...hat.com> wrote:
>>
>> As far as I can see, the original split wanted to speed up a duplicate
>> initialization. We now only initialize once - and we really should
>> initialize under the lock, when resizing the zones.
> 
> What do you mean by duplicate initialization? Basically the issue was
> that we can have systems with massive memory footprints and I was just
> trying to get the initialization time under a minute. The compromise I
> made was to split the initialization so that we only initialized the
> pages in the altmap and defer the rest so that they can be initialized
> in parallel.
> 
> What this patch does is serialize the initialization so it will likely
> take 2 to 4 minutes or more to initialize memory on a system where I
> had brought the init time under about 30 seconds.
> 
>> As soon as we drop the lock we might have memory unplug running, trying
>> to shrink the zone again. In case the memmap was not properly initialized,
>> the zone/node shrinking code might get false negatives when searching for
>> the new zone/node boundaries - bad. We suddenly could no longer span the
>> memory we just added.
> 
> The issue as I see it is that we can have systems with multiple nodes
> and each node can populate a fairly significant amount of data. By
> pushing this all back under the hotplug lock you are serializing the
> initialization for each node resulting in a 4x or 8x increase in
> initialization time on some of these bigger systems.
> 
>> Also, once we want to fix set_zone_contiguous(zone) inside
>> move_pfn_range_to_zone() to actually work with ZONE_DEVICE (instead of
>> always immediately stopping and never setting zone->contiguous) we have
>> to have the whole memmap initialized at that point. (not sure if we
>> want that in the future, just a note)
>>
>> Let's just keep things simple and initialize the memmap when resizing
>> the zones under the lock.
>>
>> If this is a real performance issue, we have to watch out for
>> alternatives.
> 
> My concern is that this is going to become a performance issue, but I
> don't have access to the hardware at the moment to test how much of
> one. I'll have to check to see if I can get access to a system to test
> this patch set.
> 

Thanks for having a look - I already dropped this patch. We will
instead stop shrinking ZONE_DEVICE, so assume this patch is gone.

[...]

> So if you are moving this all under the lock then this is going to
> serialize initialization and it is going to be quite expensive on big
> systems. I was only ever able to get the init time down to something
> like ~20s with the optimized function. Since that has been torn apart
> and you are back to doing things with memmap_init_zone we are probably
> looking at more like 25-30s for each node, and on a 4 node system we
> are looking at 2 minutes or so which may lead to issues if people are
> mounting this.
> 
> Instead of forcing this all to be done under the hotplug lock is there
> some way we could do this under the zone span_seqlock instead to
> achieve the same result?

I guess the right approach really is, as Michal suggests, to not shrink
at all (at least ZONE_DEVICE) :)

> 
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index c5d62f1c2851..44038665fe8e 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -5845,7 +5845,7 @@ overlap_memmap_init(unsigned long zone, unsigned long *pfn)
>>   */
>>  void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
>>                 unsigned long start_pfn, enum memmap_context context,
>> -               struct vmem_altmap *altmap)
>> +               struct dev_pagemap *pgmap)
>>  {
>>         unsigned long pfn, end_pfn = start_pfn + size;
>>         struct page *page;
>> @@ -5853,24 +5853,6 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
>>         if (highest_memmap_pfn < end_pfn - 1)
>>                 highest_memmap_pfn = end_pfn - 1;
>>
>> -#ifdef CONFIG_ZONE_DEVICE
>> -       /*
>> -        * Honor reservation requested by the driver for this ZONE_DEVICE
>> -        * memory. We limit the total number of pages to initialize to just
>> -        * those that might contain the memory mapping. We will defer the
>> -        * ZONE_DEVICE page initialization until after we have released
>> -        * the hotplug lock.
>> -        */
>> -       if (zone == ZONE_DEVICE) {
>> -               if (!altmap)
>> -                       return;
>> -
>> -               if (start_pfn == altmap->base_pfn)
>> -                       start_pfn += altmap->reserve;
>> -               end_pfn = altmap->base_pfn + vmem_altmap_offset(altmap);
>> -       }
>> -#endif
>> -
>>         for (pfn = start_pfn; pfn < end_pfn; pfn++) {
>>                 /*
>>                  * There can be holes in boot-time mem_map[]s handed to this
>> @@ -5892,6 +5874,20 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
>>                 if (context == MEMMAP_HOTPLUG)
>>                         __SetPageReserved(page);
>>
>> +#ifdef CONFIG_ZONE_DEVICE
>> +               if (zone == ZONE_DEVICE) {
>> +                       WARN_ON_ONCE(!pgmap);
>> +                       /*
>> +                        * ZONE_DEVICE pages union ->lru with a ->pgmap back
>> +                        * pointer and zone_device_data. It is a bug if a
>> +                        * ZONE_DEVICE page is ever freed or placed on a driver
>> +                        * private list.
>> +                        */
>> +                       page->pgmap = pgmap;
>> +                       page->zone_device_data = NULL;
>> +               }
>> +#endif
>> +
> 
> So I am not sure this is right. Part of the reason for the split is
> that the pages that were a part of the altmap had an LRU setup, not
> the pgmap/zone_device_data setup. This is changing that.

Yeah, you might be right, we would have to handle the altmap part
separately.

> 
> Also, I am more a fan of just testing pgmap and if it is present then
> assign page->pgmap and reset zone_device_data. Then you can avoid the
> test for zone on every iteration and the WARN_ON_ONCE check, or at
> least you could move it outside the loop so we don't incur the cost
> with each page. Are there situations where you would have pgmap but
> not a ZONE_DEVICE page?

Don't think so. But I am definitely not a ZONE_DEVICE expert.

> 
>>                 /*
>>                  * Mark the block movable so that blocks are reserved for
>>                  * movable at startup. This will force kernel allocations
>> @@ -5951,14 +5947,6 @@ void __ref memmap_init_zone_device(struct zone *zone,
>>                  */
>>                 __SetPageReserved(page);
>>
>> -               /*
>> -                * ZONE_DEVICE pages union ->lru with a ->pgmap back pointer
>> -                * and zone_device_data.  It is a bug if a ZONE_DEVICE page is
>> -                * ever freed or placed on a driver-private list.
>> -                */
>> -               page->pgmap = pgmap;
>> -               page->zone_device_data = NULL;
>> -
>>                 /*
>>                  * Mark the block movable so that blocks are reserved for
>>                  * movable at startup. This will force kernel allocations
> 
> Shouldn't you be removing this function instead of just gutting it?
> I'm kind of surprised you aren't getting warnings about unused code
> since you also pulled the declaration for it from the header.
> 

Don't ask me what went wrong here, I guess I messed this up while rebasing.

-- 

Thanks,

David / dhildenb
