Message-ID: <CAPcyv4j3X7vQb0t3FzN0c6yEicZC6LDCPyJVJud1y+vusMUBbw@mail.gmail.com>
Date:   Mon, 9 Jul 2018 09:53:40 -0700
From:   Dan Williams <dan.j.williams@...el.com>
To:     Jan Kara <jack@...e.cz>
Cc:     Andrew Morton <akpm@...ux-foundation.org>,
        Tony Luck <tony.luck@...el.com>,
        Huaisheng Ye <yehs1@...ovo.com>,
        Vishal Verma <vishal.l.verma@...el.com>,
        Dave Jiang <dave.jiang@...el.com>,
        "H. Peter Anvin" <hpa@...or.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Rich Felker <dalias@...c.org>,
        Fenghua Yu <fenghua.yu@...el.com>,
        Yoshinori Sato <ysato@...rs.sourceforge.jp>,
        Benjamin Herrenschmidt <benh@...nel.crashing.org>,
        Michal Hocko <mhocko@...e.com>,
        Paul Mackerras <paulus@...ba.org>,
        Christoph Hellwig <hch@....de>,
        Jérôme Glisse <jglisse@...hat.com>,
        Ingo Molnar <mingo@...hat.com>,
        Johannes Thumshirn <jthumshirn@...e.de>,
        Michael Ellerman <mpe@...erman.id.au>,
        Heiko Carstens <heiko.carstens@...ibm.com>,
        X86 ML <x86@...nel.org>, Logan Gunthorpe <logang@...tatee.com>,
        Ross Zwisler <ross.zwisler@...ux.intel.com>,
        Jeff Moyer <jmoyer@...hat.com>,
        Vlastimil Babka <vbabka@...e.cz>,
        Martin Schwidefsky <schwidefsky@...ibm.com>,
        linux-nvdimm <linux-nvdimm@...ts.01.org>,
        Linux MM <linux-mm@...ck.org>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 00/13] mm: Asynchronous + multithreaded memmap init for ZONE_DEVICE

On Mon, Jul 9, 2018 at 5:56 AM, Jan Kara <jack@...e.cz> wrote:
> On Wed 04-07-18 23:49:02, Dan Williams wrote:
>> In order to keep pfn_to_page() a simple offset calculation the 'struct
>> page' memmap needs to be mapped and initialized in advance of any usage
>> of a page. This poses a problem for large memory systems as it delays
>> full availability of memory resources for 10s to 100s of seconds.
>>
>> For typical 'System RAM' the problem is mitigated by the fact that large
>> memory allocations tend to happen after the kernel has fully initialized
>> and userspace services / applications are launched. A small amount, 2GB
>> of memory, is initialized up front. The remainder is initialized in the
>> background and freed to the page allocator over time.
>>
>> Unfortunately, that scheme is not directly reusable for persistent
>> memory and dax because userspace has visibility into the entire resource
>> pool and can access any offset directly at its choosing. In other
>> words, there is no allocator indirection where the kernel can
>> satisfy requests with arbitrary pages as they become initialized.
>>
>> That said, we can approximate the optimization by performing the
>> initialization in the background, allowing the kernel to fully boot the
>> platform, start up pmem block devices, and mount filesystems in dax
>> mode, while only incurring the delay at the first userspace dax fault.
>>
>> With this change, an 8-socket system was observed to initialize pmem
>> namespaces in ~4 seconds, whereas it was previously taking ~4 minutes.
>>
>> These patches apply on top of the HMM + devm_memremap_pages() reworks
>> [1]. Andrew, once the reviews come back, please consider this series for
>> -mm as well.
>>
>> [1]: https://lkml.org/lkml/2018/6/19/108
>
> One question: Why not (in addition to background initialization) have
> ->direct_access() initialize a block of struct pages around the pfn it
> needs if it finds it's not initialized yet? That would make devices usable
> immediately without waiting for init to complete...

Hmm, yes, relatively immediately... it would depend on the granularity
at which we can reliably steal initialization work from the background
thread. I'll give it a shot; I'm thinking of dividing each thread's work
into 64 sub-units and tracking those units with a bitmap. The worst-case
init time then becomes the time to initialize the pages for a range of
namespace-size / (NR_MEMMAP_THREADS * 64).
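Roughly, the sketch I have in mind looks like the below (untested, and
all of the names -- memmap_init_work, memmap_init_pfn_range(), etc. --
are placeholders, not what the patches would actually use):

/*
 * Untested sketch only.  Each background init thread covers
 * [base_pfn, base_pfn + nr_pfns) and splits it into 64 sub-units whose
 * completion is tracked in a bitmap.  A dax fault (or ->direct_access())
 * can steal the sub-unit covering @pfn by initializing it synchronously;
 * test_and_set_bit() ensures each sub-unit is only claimed once.
 */
struct memmap_init_work {
	unsigned long base_pfn;
	unsigned long nr_pfns;
	unsigned long unit_pfns;		/* nr_pfns / 64 */
	DECLARE_BITMAP(initialized, 64);
};

static void memmap_init_unit(struct memmap_init_work *w, unsigned int unit)
{
	if (test_and_set_bit(unit, w->initialized))
		return;	/* someone else already claimed this sub-unit */

	/* stand-in for the real 'struct page' init loop */
	memmap_init_pfn_range(w->base_pfn + unit * w->unit_pfns,
			      w->unit_pfns);
}

/* foreground: a pfn is needed before background init has finished */
static void memmap_init_sync(struct memmap_init_work *w, unsigned long pfn)
{
	memmap_init_unit(w, (pfn - w->base_pfn) / w->unit_pfns);
}

/* background: walk all sub-units, skipping any that were stolen */
static void memmap_init_async(struct memmap_init_work *w)
{
	unsigned int unit;

	for (unit = 0; unit < 64; unit++)
		memmap_init_unit(w, unit);
}

A real version would also need the synchronous path to wait when another
thread has claimed a sub-unit but not yet finished initializing it. For
scale, if NR_MEMMAP_THREADS were 8, a 512GB namespace would work out to
1GB per sub-unit, i.e. the first-fault delay becomes the time to
initialize the 'struct page' entries for 1GB of the namespace.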
