[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAPcyv4gXaUswZS_a4-oiQZWVZ4QDJrKps4XGs=xxLE0Ls=PSmg@mail.gmail.com>
Date: Tue, 12 Jan 2021 01:18:20 -0800
From: Dan Williams <dan.j.williams@...el.com>
To: David Hildenbrand <david@...hat.com>
Cc: Michal Hocko <mhocko@...e.com>, Linux MM <linux-mm@...ck.org>,
Andrew Morton <akpm@...ux-foundation.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] mm: Teach pfn_to_online_page() about ZONE_DEVICE section collisions
On Thu, Jan 7, 2021 at 1:16 AM David Hildenbrand <david@...hat.com> wrote:
>
> [...]
>
> >>> Well, I would love to have no surprises either. So far there was not
> >>> actual argument why the pmem reserved space cannot be fully initialized.
> >>
> >> Yes, I'm still hoping Dan can clarify that.
> >
> > Complexity and effective utility (once pfn_to_online_page() is fixed)
> > are the roadblocks in my mind. The altmap is there to allow for PMEM
> > capacity to be used as memmap space, so there would need to be code to
> > break that circular dependency and allocate a memmap for the metadata
> > space from DRAM and the rest of the memmap space for the data capacity
> > from pmem itself. That memmap-for-pmem-metadata will still represent
> > offline pages. So once pfn_to_online_page() is fixed, what pfn-walker
> > is going to be doing pfn_to_page() on PMEM metadata? Secondly, there
>
> Assume I do
>
> pgmap = get_dev_pagemap(pfn, NULL);
> if (pgmap)
> return pfn_to_page(pfn);
> return NULL;
>
> on a random pfn because I want to inspect ZONE_DEVICE PFNs.
I keep getting hung up on the motivation to do random pfn inspection?
The problems we have found to date have required different solutions.
The KVM bug didn't use get_dev_pagemap() to inspect the pfn because it
could rely on the fact that the page already had an elevated reference
count. The get_user_pages() path only looks up ZONE_DEVICE pfns when
it see {pte,pmd,pud}_devmap set in the page table entry. pfn walkers
have been a problem, but with pfn_to_online_page() fixed what is the
remaining motivation to inspect ZONE_DEVICE pfns?
> IIUC, the memmap I get might usually be initialized, except we're
> hitting a PFN in the reserved altmap space. Correct?
The pagemap currently returns true for every pfn in the range
including those in the altmap.
>
> Do we expect this handling to not be valid - i.e., initialization to be
> of no utility? (to make it fly we would have to explicitly check for
> PFNs in the altmap reserved space, which wouldn't be too problematic)
>
> > is a PMEM namespace mode called "raw" that eschews DAX and 'struct
> > page' for pmem and just behaves like a memory-backed block device. The
> > end result is still that pfn walkers need to discover if a PMEM pfn
> > has a page, so I don't see what "sometimes there's an
> > memmap-for-pmem-metadata" buys us?
>
> Right, but that's as easy as doing a pfn_valid() first.
>
>
> Let me try to express what I (and I think Michal) mean:
>
> In pagemap_range(), we
>
> 1. move_pfn_range_to_zone()->memmap_init_zone(): initialize the memmap
> of the PMEM device used as memmap itself ("self host", confusing). We
> skip initializing the memmap for the the reserved altmap space.
>
> 2. memmap_init_zone_device(): initialize the memmap of everything
> outside of the altmap space.
>
> IIUC, the memmap of the reserved altmap space remains uninitialized.
> What speaks against just initializing that part in e.g., 2. or similarly
> after 1.?
>
>
> I'm planning on documenting the result of this discussion in the code,
> so people don't have to scratch their head whenever stumbling over the
> altmap reserved space.
>
> >
> >>
> >>> On the other hand making sure that pfn_to_online_page sounds like the
> >>> right thing to do. And having an explicit check for zone device there in
> >>> a slow path makes sense to me.
> >>
> >> As I said, I'd favor to simplify and just get rid of the special case,
> >> instead of coming up with increasingly complex ways to deal with it.
> >> pfn_to_online_page() used to be simple, essentially checking a single
> >> flag was sufficient in most setups.
> >
> > I think the logic to throw away System RAM that might collide with
> > PMEM and soft-reserved memory within a section is on the order of the
> > same code complexity as the patch proposed here, no? Certainly the
> > throw-away concept itself is easier to grasp, but I don't think that
> > would be reflected in the code patch to achieve it... willing to be
> > proved wrong with a patch.
>
> Well, at least it sounds easier to go over memblock holes and
> align-up/down some relevant PFNs to section boundaries, ending up with
> no affect to runtime performance later (e.g., pfn_to_online_page()). But
> I agree that most probably the devil is in the detail - e.g., handling
> different kind of holes (different flavors of "reserved") and syncing up
> other data structures (e.g., resource tree).
>
> I don't have time to look into that right now, but might look into it in
> the future. For now I'm fine with this approach, assuming we don't
> discover other corner cases that turn it even more complex. I'm happy
> that we finally talk about it and fix it!
>
>
> --
> Thanks,
>
> David / dhildenb
>
Powered by blists - more mailing lists