[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ca649aad-7b76-4c6d-b513-26b3d58f8e68@redhat.com>
Date: Thu, 7 Dec 2023 12:08:08 +0100
From: David Hildenbrand <david@...hat.com>
To: Ryan Roberts <ryan.roberts@....com>,
Andrew Morton <akpm@...ux-foundation.org>,
Matthew Wilcox <willy@...radead.org>,
Yin Fengwei <fengwei.yin@...el.com>,
Yu Zhao <yuzhao@...gle.com>,
Catalin Marinas <catalin.marinas@....com>,
Anshuman Khandual <anshuman.khandual@....com>,
Yang Shi <shy828301@...il.com>,
"Huang, Ying" <ying.huang@...el.com>, Zi Yan <ziy@...dia.com>,
Luis Chamberlain <mcgrof@...nel.org>,
Itaru Kitayama <itaru.kitayama@...il.com>,
"Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
John Hubbard <jhubbard@...dia.com>,
David Rientjes <rientjes@...gle.com>,
Vlastimil Babka <vbabka@...e.cz>,
Hugh Dickins <hughd@...gle.com>,
Kefeng Wang <wangkefeng.wang@...wei.com>,
Barry Song <21cnbao@...il.com>,
Alistair Popple <apopple@...dia.com>
Cc: linux-mm@...ck.org, linux-arm-kernel@...ts.infradead.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH v8 04/10] mm: thp: Support allocation of anonymous
multi-size THP
[...]
>>
>> Nit: the orders = ... order = ... looks like this might deserve a helper
>> function that makes this easier to read.
>
> To be honest, the existing function that I've modified is a bit of a mess.
It's all an ugly mess and I hate it.
It would be cleanest if we'd just have "thp_vma_configured_orders()"
that gives us all configured orders for the given VMA+flags combination.
No passing in of orders, try handling the masking in the caller.
Then, we move that nasty "transhuge_vma_suitable" handling for !in_pf
out of there and handle that in the callers. The comment "Huge fault
does the check in fault handlers. And this check is not suitable for
huge PUD fault handlers." already makes me angry, what a mess.
Then, we'd have a thp_vma_fitting_orders() / thp_vma_is_fitting_order()
function that does the filtering only based on the given address + vma
size/alignment. That's roughly "thp_vma_suitable_orders()".
Finding a good name to combine both could be something like
"thp_vma_possible_orders()".
Would make more sense to me (but again, German guy, so it's probably all
wrong).
> thp_vma_allowable_orders() calls thp_vma_suitable_orders() if we are not in a
> page fault, because the page fault handlers already do that check themselves. It
> would be nice to refactor the whole thing so that thp_vma_allowable_orders() is
> a strict superset of thp_vma_suitable_orders(). Then this can just call
> thp_vma_allowable_orders(). But that's going to start touching the PMD and PUD
> handlers, so prefer if we leave that for a separate patch set.
>
>>
>> Nit: Why call thp_vma_suitable_orders if the orders are already 0? Again, some
>> helper might be reasonable where that is handled internally.
>
> Because thp_vma_suitable_orders() will handle it safely and is inline, so it
> should just as efficient? This would go away with the refactoring described above.
Right. Won't win in a beauty contest. Some simple helper might make this
much easier to digest.
>
>>
>> Comment: For order-0 we'll always perform a function call to both
>> thp_vma_allowable_orders() / thp_vma_suitable_orders(). We should perform some
>> fast and efficient check if any <PMD THP are even enabled in the system / for
>> this VMA, and in that case just fallback before doing more expensive checks.
>
> thp_vma_allowable_orders() is inline as you mentioned.
>
> I was deliberately trying to keep all the decision logic in one place
> (thp_vma_suitable_orders) because it's already pretty complicated. But if you
> insist, how about this in the header:
>
> static inline
> unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
> unsigned long vm_flags, bool smaps,
> bool in_pf, bool enforce_sysfs,
> unsigned long orders)
> {
> /* Optimization to check if required orders are enabled early. */
> if (enforce_sysfs && vma_is_anonymous(vma)) {
> unsigned long mask = READ_ONCE(huge_anon_orders_always);
>
> if (vm_flags & VM_HUGEPAGE)
> mask |= READ_ONCE(huge_anon_orders_madvise);
> if (hugepage_global_always() ||
> ((vm_flags & VM_HUGEPAGE) && hugepage_global_enabled()))
> mask |= READ_ONCE(huge_anon_orders_inherit);
>
> orders &= mask;
> if (!orders)
> return 0;
>
> enforce_sysfs = false;
> }
>
> return __thp_vma_allowable_orders(vma, vm_flags, smaps, in_pf,
> enforce_sysfs, orders);
> }
>
> Then the above check can be removed from __thp_vma_allowable_orders() - it will
> still retain the `if (enforce_sysfs && !vma_is_anonymous(vma))` part.
>
Better. I still kind-of hate having to pass in orders here. Such masking
is better done in the caller (see above how it might be done when moving
the transhuge_vma_suitable() check out).
>
>>
>>> +
>>> + if (!orders)
>>> + goto fallback;
>>> +
>>> + pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
>>> + if (!pte)
>>> + return ERR_PTR(-EAGAIN);
>>> +
>>> + order = first_order(orders);
>>> + while (orders) {
>>> + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
>>> + vmf->pte = pte + pte_index(addr);
>>> + if (pte_range_none(vmf->pte, 1 << order))
>>> + break;
>>
>> Comment: Likely it would make sense to scan only once and determine the "largest
>> none range" around that address, having the largest suitable order in mind.
>
> Yes, that's how I used to do it, but Yu Zhou requested simplifying to this,
> IIRC. Perhaps this an optimization opportunity for later?
Yes, definetly.
>
>>
>>> + order = next_order(&orders, order);
>>> + }
>>> +
>>> + vmf->pte = NULL;
>>
>> Nit: Can you elaborate why you are messing with vmf->pte here? A simple helper
>> variable will make this code look less magical. Unless I am missing something
>> important :)
>
> Gahh, I used to pass the vmf to what pte_range_none() was refactored into (an
> approach that was suggested by Yu Zhou IIRC). But since I did some refactoring
> based on some comments from JohnH, I see I don't need that anymore. Agreed; it
> will be much clearer just to use a local variable. Will fix.
>
>>
>>> + pte_unmap(pte);
>>> +
>>> + gfp = vma_thp_gfp_mask(vma);
>>> +
>>> + while (orders) {
>>> + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
>>> + folio = vma_alloc_folio(gfp, order, vma, addr, true);
>>> + if (folio) {
>>> + clear_huge_page(&folio->page, addr, 1 << order);
>>> + return folio;
>>> + }
>>> + order = next_order(&orders, order);
>>> + }
>>> +
>>
>> Queestion: would it make sense to combine both loops? I suspect memory
>> allocations with pte_offset_map()/kmao are problematic.
>
> They are both operating on separate orders; next_order() is "consuming" an order
> by removing the current one from the orders bitfield and returning the next one.
>
> So the first loop starts at the highest order and keeps checking lower orders
> until one fully fits in the VMA. And the second loop starts at the first order
> that was found to fully fits and loops to lower orders until an allocation is
> successful.
Right, but you know from the first loop which order is applicable (and
will be fed to the second loop) and could just pte_unmap(pte) +
tryalloc. If that fails, remap and try with the next orders.
That would make the code certainly easier to understand. That "orders"
magic of constructing, filtering, walking is confusing :)
I might find some time today to see if there is an easy way to cleanup
all what I spelled out above. It really is a mess. But likely that
cleanup could be deferred (but you're touching it, so ... :) ).
--
Cheers,
David / dhildenb
Powered by blists - more mailing lists