Message-ID: <CAPcyv4jo91jKjwn-M7cOhG=6vJ3c-QCyp0W+T+CtmiKGyZP1ng@mail.gmail.com>
Date:   Mon, 16 Jul 2018 13:30:50 -0700
From:   Dan Williams <dan.j.williams@...el.com>
To:     Pavel Tatashin <pasha.tatashin@...cle.com>
Cc:     Andrew Morton <akpm@...ux-foundation.org>,
        "Luck, Tony" <tony.luck@...el.com>,
        Huaisheng Ye <yehs1@...ovo.com>,
        Vishal L Verma <vishal.l.verma@...el.com>,
        Jan Kara <jack@...e.cz>, Matthew Wilcox <willy@...radead.org>,
        Dave Jiang <dave.jiang@...el.com>,
        "H. Peter Anvin" <hpa@...or.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Rich Felker <dalias@...c.org>,
        Fenghua Yu <fenghua.yu@...el.com>,
        Daniel Jordan <daniel.m.jordan@...cle.com>,
        Yoshinori Sato <ysato@...rs.sourceforge.jp>,
        Benjamin Herrenschmidt <benh@...nel.crashing.org>,
        Michal Hocko <mhocko@...e.com>,
        Paul Mackerras <paulus@...ba.org>,
        Christoph Hellwig <hch@....de>,
        Jérôme Glisse <jglisse@...hat.com>,
        Ingo Molnar <mingo@...hat.com>,
        Michael Ellerman <mpe@...erman.id.au>,
        Heiko Carstens <heiko.carstens@...ibm.com>,
        X86 ML <x86@...nel.org>, Logan Gunthorpe <logang@...tatee.com>,
        Ross Zwisler <ross.zwisler@...ux.intel.com>,
        jmoyer <jmoyer@...hat.com>,
        Johannes Thumshirn <jthumshirn@...e.de>,
        Martin Schwidefsky <schwidefsky@...ibm.com>,
        Linux Memory Management List <linux-mm@...ck.org>,
        linux-nvdimm <linux-nvdimm@...ts.01.org>,
        LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v2 00/14] mm: Asynchronous + multithreaded memmap init for ZONE_DEVICE

On Mon, Jul 16, 2018 at 12:12 PM, Pavel Tatashin
<pasha.tatashin@...cle.com> wrote:
> On Mon, Jul 16, 2018 at 1:10 PM Dan Williams <dan.j.williams@...el.com> wrote:
>>
>> Changes since v1 [1]:
>> * Teach memmap_sync() to take over a subset of memmap initialization in
>>   the foreground. This foreground work still needs to await the
>>   completion of vmemmap_populate_hugepages(), but it will otherwise
>>   steal 1/1024th of the 'struct page' init work for the given range.
>>   (Jan)
>> * Add kernel-doc for all the new 'async' structures.
>> * Split foreach_order_pgoff() to its own patch.
>> * Add Pavel and Daniel to the cc as they have been active in the memory
>>   hotplug code.
>> * Fix a typo that prevented CONFIG_DAX_DRIVER_DEBUG=y from performing
>>   early pfn retrieval at dax-filesystem mount time.
>> * Improve some of the changelogs
>>
>> [1]: https://lwn.net/Articles/759117/
>>
>> ---
>>
>> In order to keep pfn_to_page() a simple offset calculation, the
>> 'struct page' memmap needs to be mapped and initialized in advance of
>> any usage of a page. This poses a problem for large memory systems, as
>> it delays full availability of memory resources for tens to hundreds
>> of seconds.
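>>
>> With CONFIG_SPARSEMEM_VMEMMAP, for example, that offset calculation is
>> just pointer arithmetic against the virtually contiguous memmap,
>> roughly:
>>
>>     #define __pfn_to_page(pfn)      (vmemmap + (pfn))
>>     #define __page_to_pfn(page)     (unsigned long)((page) - vmemmap)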
>>
>> For typical 'System RAM' the problem is mitigated by the fact that large
>> memory allocations tend to happen after the kernel has fully initialized
>> and userspace services / applications are launched. A small amount, 2GB
>> of memory, is initialized up front. The remainder is initialized in the
>> background and freed to the page allocator over time.
>>
>> Unfortunately, that scheme is not directly reusable for persistent
>> memory and dax because userspace has visibility into the entire
>> resource pool and can directly access any offset it chooses. In other
>> words, there is no allocator indirection where the kernel can satisfy
>> requests with arbitrary pages as they become initialized.
>>
>> That said, we can approximate the optimization by performing the
>> initialization in the background: let the kernel fully boot the
>> platform, start up pmem block devices, and mount filesystems in dax
>> mode, and only incur a delay at the first userspace dax fault. When
>> that initial fault occurs, the faulting process is delegated a portion
>> of the memmap to initialize in the foreground, so that it need not
>> wait for initialization of resources it does not immediately need.
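>>
>> In outline (the helper names and signatures below are illustrative
>> only, not the actual functions in these patches):
>>
>>     static vm_fault_t dev_dax_fault(struct vm_fault *vmf)
>>     {
>>             unsigned long pfn = dax_vmf_to_pfn(vmf);        /* made up */
>>
>>             /*
>>              * Foreground-initialize just the slice of 'struct page'
>>              * covering this pfn instead of waiting for the background
>>              * memmap initialization to reach it.
>>              */
>>             memmap_sync(pfn, 1, dax_async_state(vmf));      /* made up */
>>
>>             return vmf_insert_mixed(vmf->vma, vmf->address,
>>                             phys_to_pfn_t(PFN_PHYS(pfn), PFN_DEV|PFN_MAP));
>>     }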
>>
>> With this change, an 8-socket system was observed to initialize pmem
>> namespaces in ~4 seconds, whereas it previously took ~4 minutes.
>
> Hi Dan,
>
> I am worried that this work adds another way to multi-thread struct
> page initialization without reusing the already existing method. The
> code is already a mess and leads to bugs [1] because of the number of
> different memory layouts, architecture-specific quirks, and different
> struct page initialization methods.

Yes, the lamentations about the complexity of the memory hotplug code
are well known. I didn't think this set made it irretrievably worse,
but I'm biased, and in any case I certainly want to build consensus
with other mem-hotplug folks.

>
> So, when DEFERRED_STRUCT_PAGE_INIT is used, we initialize struct pages
> on demand until page_alloc_init_late() is called, and at that time we
> initialize all the remaining struct pages by calling:
>
> page_alloc_init_late()
>   deferred_init_memmap() (a thread per node)
>     deferred_init_pages()
>        __init_single_page()
>
> This is because memmap_init_zone() is not multi-threaded. However,
> this work makes memmap_init_zone() multi-threaded. So, I think we
> should really either be using deferred_init_memmap() here, or teaching
> DEFERRED_STRUCT_PAGE_INIT to use the new multi-threaded
> memmap_init_zone(), but not both.

I agree it would be good to look at unifying the two async
initialization approaches; however, they have distinct constraints.
All of the ZONE_DEVICE memmap initialization work happens as a hotplug
event, where the deferred_init_memmap() threads have already been torn
down. For the memory capacities where it takes minutes to initialize
the memmap, it is painful to incur a global flush of all
initialization work. So, I think a move to rework
deferred_init_memmap() in terms of memmap_init_async() is warranted,
because memmap_init_async() avoids a global sync and supports the
hotplug case.
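
Roughly the shape I have in mind (illustrative signatures only, not
what is in the current set):

    /*
     * Per-range async init that both boot-time deferred init and
     * ZONE_DEVICE hotplug could share; a caller that needs a given pfn
     * range early calls memmap_sync() on just that range rather than
     * waiting on a global barrier.
     */
    void memmap_init_async(unsigned long start_pfn, unsigned long nr_pages,
                           struct memmap_async_state *async);
    void memmap_sync(unsigned long pfn, unsigned long nr_pages,
                     struct memmap_async_state *async);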

Unfortunately, the work to unify these two mechanisms is going to be
4.20 material, at least for me, since I'm taking an extended leave and
there is little time for me to get this into shape for 4.19. I
wouldn't be opposed to someone judiciously stealing from this set and
taking a shot at the integration; I likely will not get back to this
until September.
