linux-kernel - Re: [PATCH 2/2] mm/dax: Don't enable huge dax mapping by default

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Message-ID: <CAPcyv4hiAE9Y3Jeudr=Ys=eu2gei088xGyTCJGOoz04iUExUfw@mail.gmail.com>
Date:   Wed, 20 Mar 2019 20:12:34 -0700
From:   Dan Williams <dan.j.williams@...el.com>
To:     Oliver <oohall@...il.com>
Cc:     "Aneesh Kumar K.V" <aneesh.kumar@...ux.ibm.com>,
        Jan Kara <jack@...e.cz>,
        linux-nvdimm <linux-nvdimm@...ts.01.org>,
        Michael Ellerman <mpe@...erman.id.au>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        Linux MM <linux-mm@...ck.org>,
        Ross Zwisler <zwisler@...nel.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        linuxppc-dev <linuxppc-dev@...ts.ozlabs.org>,
        "Kirill A . Shutemov" <kirill.shutemov@...ux.intel.com>
Subject: Re: [PATCH 2/2] mm/dax: Don't enable huge dax mapping by default

On Wed, Mar 20, 2019 at 8:09 PM Oliver <oohall@...il.com> wrote:
>
> On Thu, Mar 21, 2019 at 7:57 AM Dan Williams <dan.j.williams@...el.com> wrote:
> >
> > On Wed, Mar 20, 2019 at 8:34 AM Dan Williams <dan.j.williams@...el.com> wrote:
> > >
> > > On Wed, Mar 20, 2019 at 1:09 AM Aneesh Kumar K.V
> > > <aneesh.kumar@...ux.ibm.com> wrote:
> > > >
> > > > Aneesh Kumar K.V <aneesh.kumar@...ux.ibm.com> writes:
> > > >
> > > > > Dan Williams <dan.j.williams@...el.com> writes:
> > > > >
> > > > >>
> > > > >>> Now what will be page size used for mapping vmemmap?
> > > > >>
> > > > >> That's up to the architecture's vmemmap_populate() implementation.
> > > > >>
> > > > >>> Architectures
> > > > >>> possibly will use PMD_SIZE mapping if supported for vmemmap. Now a
> > > > >>> device-dax with struct page in the device will have pfn reserve area aligned
> > > > >>> to PAGE_SIZE with the above example? We can't map that using
> > > > >>> PMD_SIZE page size?
> > > > >>
> > > > >> IIUC, that's a different alignment. Currently that's handled by
> > > > >> padding the reservation area up to a section (128MB on x86) boundary,
> > > > >> but I'm working on patches to allow sub-section sized ranges to be
> > > > >> mapped.
> > > > >
> > > > > I am missing something w.r.t code. The below code align that using nd_pfn->align
> > > > >
> > > > >       if (nd_pfn->mode == PFN_MODE_PMEM) {
> > > > >               unsigned long memmap_size;
> > > > >
> > > > >               /*
> > > > >                * vmemmap_populate_hugepages() allocates the memmap array in
> > > > >                * HPAGE_SIZE chunks.
> > > > >                */
> > > > >               memmap_size = ALIGN(64 * npfns, HPAGE_SIZE);
> > > > >               offset = ALIGN(start + SZ_8K + memmap_size + dax_label_reserve,
> > > > >                               nd_pfn->align) - start;
> > > > >       }
> > > > >
> > > > > IIUC that is finding the offset where to put vmemmap start. And that has
> > > > > to be aligned to the page size with which we may end up mapping vmemmap
> > > > > area right?
> > >
> > > Right, that's the physical offset of where the vmemmap ends, and the
> > > memory to be mapped begins.
> > >
> > > > > Yes we find the npfns by aligning up using PAGES_PER_SECTION. But that
> > > > > is to compute howmany pfns we should map for this pfn dev right?
> > > > >
> > > >
> > > > Also i guess those 4K assumptions there is wrong?
> > >
> > > Yes, I think to support non-4K-PAGE_SIZE systems the 'pfn' metadata
> > > needs to be revved and the PAGE_SIZE needs to be recorded in the
> > > info-block.
> >
> > How often does a system change page-size. Is it fixed or do
> > environment change it from one boot to the next? I'm thinking through
> > the behavior of what do when the recorded PAGE_SIZE in the info-block
> > does not match the current system page size. The simplest option is to
> > just fail the device and require it to be reconfigured. Is that
> > acceptable?
>
> The kernel page size is set at build time and as far as I know every
> distro configures their ppc64(le) kernel for 64K. I've used 4K kernels
> a few times in the past to debug PAGE_SIZE dependent problems, but I'd
> be surprised if anyone is using 4K in production.

Ah, ok.

> Anyway, my view is that using 4K here isn't really a problem since
> it's just the accounting unit of the pfn superblock format. The kernel
> reading form it should understand that and scale it to whatever
> accounting unit it wants to use internally. Currently we don't so that
> should probably be fixed, but that doesn't seem to cause any real
> issues. As far as I can tell the only user of npfns in
> __nvdimm_setup_pfn() whih prints the "number of pfns truncated"
> message.
>
> Am I missing something?

No, I don't think so. The only time it would break is if a system with
64K page size laid down an info-block with not enough reserved
capacity when the page-size is 4K (npfns too small). However, that
sounds like an exceptional case which is why no problems have been
reported to date.