[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <83744787-6bc5-4958-bdaa-030c9ed8225a@nvidia.com>
Date: Wed, 29 Nov 2023 11:40:57 -0800
From: John Hubbard <jhubbard@...dia.com>
To: Ryan Roberts <ryan.roberts@....com>,
Andrew Morton <akpm@...ux-foundation.org>,
Matthew Wilcox <willy@...radead.org>,
Yin Fengwei <fengwei.yin@...el.com>,
David Hildenbrand <david@...hat.com>,
Yu Zhao <yuzhao@...gle.com>,
Catalin Marinas <catalin.marinas@....com>,
Anshuman Khandual <anshuman.khandual@....com>,
Yang Shi <shy828301@...il.com>,
"Huang, Ying" <ying.huang@...el.com>, Zi Yan <ziy@...dia.com>,
Luis Chamberlain <mcgrof@...nel.org>,
Itaru Kitayama <itaru.kitayama@...il.com>,
"Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
David Rientjes <rientjes@...gle.com>,
Vlastimil Babka <vbabka@...e.cz>,
Hugh Dickins <hughd@...gle.com>,
Kefeng Wang <wangkefeng.wang@...wei.com>
Cc: linux-mm@...ck.org, linux-arm-kernel@...ts.infradead.org,
linux-kernel@...r.kernel.org
Subject: Re: [RESEND PATCH v7 03/10] mm: thp: Introduce per-size thp sysfs
interface
1On 11/29/23 03:05, Ryan Roberts wrote:
> On 29/11/2023 03:42, John Hubbard wrote:
>> On 11/22/23 08:29, Ryan Roberts wrote:
...
>>> +As well as PMD-sized THP described above, it is also possible to
>>> +configure the system to allocate "small-sized THP" to back anonymous
>>
>> Here's one of the places to change to the new name, which lately is
>> "multi-size THP", or mTHP or m_thp for short. (I've typed "multi-size"
>> instead of "multi-sized", because the 'd' doesn't add significantly to
>> the meaning, and if in doubt, shorter is better.
>
> I was thinking of some light changes to the start of this paragraph, something like:
>
> Modern kernels support "multi-size THP" (mTHP), which introduces the ability to
> allocate memory in blocks that are bigger than a base page but smaller than
> traditional PMD-size (as described above, in increments of a power-of-2 number
> of pages. mTHP can back anonymous
>
Very nice.
...
>> By default, PMD-sized hugepages have enabled="inherited" and all other
>> hugepage sizes have enabled="never".
>
> That all sounds good; will update.
>
> I wonder if "inherit" is even better than "inherited" though?
Yes, I think so. "inherit" was actually my first idea, and after
thinking about it, I just made it worse by adding the "ed". haha. :)
...
>>> Khugepaged controls
>>> -------------------
>>>
>>> +.. note::
>>> + khugepaged currently only searches for opportunities to collapse to
>>> + PMD-sized THP and no attempt is made to collapse to small-sized
>>> + THP.
>>> +
>
> "small-sized THP" used here too; I propose to change it to "... collapse to
> other THP sizes."
OK, looks good.
>
>>> khugepaged runs usually at low frequency so while one may not want to
>>> invoke defrag algorithms synchronously during the page faults, it
>>> should be worth invoking defrag at least in khugepaged. However it's
>>> @@ -282,10 +321,11 @@ force
>>> Need of application restart
>>> ===========================
>>>
>>> -The transparent_hugepage/enabled values and tmpfs mount option only affect
>>> -future behavior. So to make them effective you need to restart any
>>> -application that could have been using hugepages. This also applies to the
>>> -regions registered in khugepaged.
>>> +The transparent_hugepage/enabled and
>>> +transparent_hugepage/hugepages-<size>kB/enabled values and tmpfs mount
>>> +option only affect future behavior. So to make them effective you need
>>> +to restart any application that could have been using hugepages. This
>>> +also applies to the regions registered in khugepaged.
>>>
>>> Monitoring usage
>>> ================
>>> @@ -308,6 +348,10 @@ frequently will incur overhead.
>>> There are a number of counters in ``/proc/vmstat`` that may be used to
>>> monitor how successfully the system is providing huge pages for use.
>>>
>>> +.. note::
>>> + Currently the below counters only record events relating to
>>> + PMD-sized THP. Events relating to small-sized THP are not included.
>>
>> Here's another spot to rename to "multi-size THP".
>
> I propose to change it to "Events relating to other THP sizes..."
>
> I'm also going to move this note to just under the "Monitoring Usage" heading
> since its current position doesn't make it clear that it includes
> "AnonHugePages". I'll also make it clear in the text that mentions AnonHugePages
> counter that it "only applies to traditional PMD-sized THP for historical
> reasons" and that "AnonHugePages should have been called AnonHugePmdMapped" as
> David suggested in the other thread.
OK.
Are we entirely dropping the AnonHugePtePages that was there in v6? I'm
looking for some way to monitor this. I may have missed it because I haven't
gone through all of v7 yet.
...
>>> -"THPeligible" indicates whether the mapping is eligible for allocating THP
>>> -pages as well as the THP is PMD mappable or not - 1 if true, 0 otherwise.
>>> -It just shows the current status.
>>> +"THPeligible" indicates whether the mapping is eligible for allocating
>>> +naturally aligned THP pages of any currently enabled size. 1 if true, 0
>>> +otherwise. It just shows the current status.
>>
>> "It just shows the current status"...as opposed to what? I'm missing an
>> educational opportunity here--not sure what this means. :)
>
> I have no idea either - it seems superfluous. But that sentence was there
> already. I can remove it if you like?
>
Yes, let's remove it, please.
...
>>> +/*
>>> + * Mask of all large folio orders supported for anonymous THP.
>>> + */
>>> +#define THP_ORDERS_ALL_ANON BIT(PMD_ORDER)
>>> +
>>> +/*
>>> + * Mask of all large folio orders supported for file THP.
>>> + */
>>> +#define THP_ORDERS_ALL_FILE (BIT(PMD_ORDER) | BIT(PUD_ORDER))
>>
>> Is there something about file THP that implies PUD_ORDER? This
>> deserves an explanatory comment, I think.
>
> Sorry, I'm not really sure what you are asking? Today's kernel supports
> PUD-mapping file-backed memory (mostly DAX I believe). I'm not sure what comment
> you want me to add, other than "file-backed memory can support PUD-mapping", but
> that's self-evident from the define, isn't it?
>
Well, it's sort of evident, but confusing to me anyway, because it
combines THP and PUD. Which seems to imply that we have PUD-based THPs
supported, but only for file-backed mappings. Which is weird and needs
some explanation, right?
>>
>>> +
>>> +/*
>>> + * Mask of all large folio orders supported for THP.
>>> + */
>>> +#define THP_ORDERS_ALL (THP_ORDERS_ALL_ANON | THP_ORDERS_ALL_FILE)
>>> +
>>> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>> #define HPAGE_PMD_SHIFT PMD_SHIFT
>>> #define HPAGE_PMD_SIZE ((1UL) << HPAGE_PMD_SHIFT)
>>> @@ -78,13 +93,18 @@ extern struct kobj_attribute shmem_enabled_attr;
>>>
>>> extern unsigned long transparent_hugepage_flags;
>>>
>>> -#define hugepage_flags_enabled() \
>>> - (transparent_hugepage_flags & \
>>> - ((1<<TRANSPARENT_HUGEPAGE_FLAG) | \
>>> - (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG)))
>>> -#define hugepage_flags_always() \
>>> - (transparent_hugepage_flags & \
>>> - (1<<TRANSPARENT_HUGEPAGE_FLAG))
>>
>> Are macros actually required? If not, static inlines would be nicer.
>
> I agree static inlines are nicer. Here I'm removing existing macros though, and
Oh, I see that I replied in the wrong part of the email, since that's the
*removal* part. oops. :)
...
>>> -static inline bool transhuge_vma_suitable(struct vm_area_struct *vma,
>>> - unsigned long addr)
>>> -{
>>> - unsigned long haddr;
>>> -
>>> - /* Don't have to check pgoff for anonymous vma */
>>> - if (!vma_is_anonymous(vma)) {
>>> - if (!IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff,
>>> - HPAGE_PMD_NR))
>>> - return false;
>>> +static inline unsigned long transhuge_vma_suitable(struct vm_area_struct *vma,
>>> + unsigned long addr, unsigned long orders)
>>
>> Changing this from a bool to a mask of orders is quite significant, and
>> both the function name and the function-level comment documentation need
>> to also be adjusted. Here, perhaps one of these names would work:
>>
>> transhuge_vma_suitable_orders()
>
> This is my preference, but its getting a bit long. How about:
>
> thp_vma_suitable_orders()
>
> I could then call the other one:
>
> thp_vma_allowable_orders()
>
> So we have a clearly related pair?
Oh yes, that works nicely.
>
>
>> transhuge_vma_orders()>
>>
>>> +{
>>> + int order;
>>> +
>>> + /*
>>> + * Iterate over orders, highest to lowest, removing orders that don't
>>> + * meet alignment requirements from the set. Exit loop at first order
>>> + * that meets requirements, since all lower orders must also meet
>>> + * requirements.
>>
>> Should we add a little note here, to the effect of, "this is unilaterally
>> taking over a certain amount of allocation-like policy, by deciding how
>> to search for folios of a certain size"?
>>
>> Or is this The Only Way To Do It, after all? I know we had some discussion
>> about it, and intuitively it feels reasonable, but wanted to poke at it
>> one last time anyway.
>
> This function isn't trying to implement policy, its just filtering out orders
> that don't fit and therefore definitely cannot be used. The result is a set of
> orders the could be used and its the policy maker's job to decide which one.
> Currently that policy is "biggest one that fits && does not overlap other
> non-none ptes && folio successfully allocated". That's implemented in the next
> patch and would potentially be swapped out in the future for something more
> exotic/optimal.
>
> So I don't think we need any extra comments here.
Ack.
...
>>> -bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags,
>>> - bool smaps, bool in_pf, bool enforce_sysfs)
>>> +#define hugepage_global_enabled() \
>>> + (transparent_hugepage_flags & \
>>> + ((1<<TRANSPARENT_HUGEPAGE_FLAG) | \
>>> + (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG)))
>>> +
>>> +#define hugepage_global_always() \
>>> + (transparent_hugepage_flags & \
>>> + (1<<TRANSPARENT_HUGEPAGE_FLAG))
>>> +
>>
>> Here again, I'd like to request that we avoid macros, as I don't think
>> they are strictly required.
>
> Yeah I agree. I did them this way, because they already existed and I was just
> moving them from the header to here. But I'll change to static inline.
>
>>
>>> +bool hugepage_flags_enabled(void)
>>> {
>>> + /*
>>> + * We cover both the anon and the file-backed case here; we must return
>>> + * true if globally enabled, even when all anon sizes are set to never.
>>> + * So we don't need to look at huge_anon_orders_global.
>>> + */
>>> + return hugepage_global_enabled() ||
>>> + huge_anon_orders_always ||
>>> + huge_anon_orders_madvise;
>>> +}
>>> +
>>> +/**
>>> + * hugepage_vma_check - determine which hugepage orders can be applied to vma
>>> + * @vma: the vm area to check
>>> + * @vm_flags: use these vm_flags instead of vma->vm_flags
>>> + * @smaps: whether answer will be used for smaps file
>>> + * @in_pf: whether answer will be used by page fault handler
>>> + * @enforce_sysfs: whether sysfs config should be taken into account
>>> + * @orders: bitfield of all orders to consider
>>> + *
>>> + * Calculates the intersection of the requested hugepage orders and the allowed
>>> + * hugepage orders for the provided vma. Permitted orders are encoded as a set
>>> + * bit at the corresponding bit position (bit-2 corresponds to order-2, bit-3
>>> + * corresponds to order-3, etc). Order-0 is never considered a hugepage order.
>>> + *
>>> + * Return: bitfield of orders allowed for hugepage in the vma. 0 if no hugepage
>>> + * orders are allowed.
>>> + */
>>> +unsigned long hugepage_vma_check(struct vm_area_struct *vma,
>>> + unsigned long vm_flags, bool smaps, bool in_pf,
>>> + bool enforce_sysfs, unsigned long orders)
>>
>> Here again, a bool return type has been changed to a bitfield. Let's
>> also update the function name. Maybe one of these:
>>
>> hugepage_vma_orders()
>> hugepage_vma_allowable_orders()
>
> thp_vma_allowable_orders()?
>
Even better, yes.
thanks,
--
John Hubbard
NVIDIA
Powered by blists - more mailing lists