Message-ID: <841178ee-d29d-43e3-b977-5796c90bd7ad@arm.com>
Date: Thu, 7 Dec 2023 10:56:16 +0000
From: Ryan Roberts <ryan.roberts@....com>
To: David Hildenbrand <david@...hat.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Matthew Wilcox <willy@...radead.org>,
Yin Fengwei <fengwei.yin@...el.com>,
Yu Zhao <yuzhao@...gle.com>,
Catalin Marinas <catalin.marinas@....com>,
Anshuman Khandual <anshuman.khandual@....com>,
Yang Shi <shy828301@...il.com>,
"Huang, Ying" <ying.huang@...el.com>, Zi Yan <ziy@...dia.com>,
Luis Chamberlain <mcgrof@...nel.org>,
Itaru Kitayama <itaru.kitayama@...il.com>,
"Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
John Hubbard <jhubbard@...dia.com>,
David Rientjes <rientjes@...gle.com>,
Vlastimil Babka <vbabka@...e.cz>,
Hugh Dickins <hughd@...gle.com>,
Kefeng Wang <wangkefeng.wang@...wei.com>,
Barry Song <21cnbao@...il.com>,
Alistair Popple <apopple@...dia.com>
Cc: linux-mm@...ck.org, linux-arm-kernel@...ts.infradead.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH v8 03/10] mm: thp: Introduce multi-size THP sysfs
interface
On 06/12/2023 13:18, Ryan Roberts wrote:
> On 05/12/2023 16:57, David Hildenbrand wrote:
>> On 04.12.23 11:20, Ryan Roberts wrote:
>>> In preparation for adding support for anonymous multi-size THP,
>>> introduce new sysfs structure that will be used to control the new
>>> behaviours. A new directory is added under transparent_hugepage for each
>>> supported THP size, and contains an `enabled` file, which can be set to
>>> "inherit" (to inherit the global setting), "always", "madvise" or
>>> "never". For now, the kernel still only supports PMD-sized anonymous
>>> THP, so only 1 directory is populated.
>>>
>>> The first half of the change converts transhuge_vma_suitable() and
>>> hugepage_vma_check() so that they take a bitfield of orders for which
>>> the user wants to determine support, and the functions filter out all
>>> the orders that can't be supported, given the current sysfs
>>> configuration and the VMA dimensions. If there is only 1 order set in
>>> the input then the output can continue to be treated like a boolean;
>>> this is the case for most call sites. The resulting functions are
>>> renamed to thp_vma_suitable_orders() and thp_vma_allowable_orders()
>>> respectively.
>>>
>>> The second half of the change implements the new sysfs interface. It has
>>> been done so that each supported THP size has a `struct thpsize`, which
>>> describes the relevant metadata and is itself a kobject. This is pretty
>>> minimal for now, but should make it easy to add new per-thpsize files to
>>> the interface if needed in future (e.g. per-size defrag). Rather than
>>> keep the `enabled` state directly in the struct thpsize, I've elected to
>>> directly encode it into huge_anon_orders_[always|madvise|inherit]
>>> bitfields since this reduces the amount of work required in
>>> thp_vma_allowable_orders() which is called for every page fault.
>>>
>>> See Documentation/admin-guide/mm/transhuge.rst, as modified by this
>>> commit, for details of how the new sysfs interface works.
>>>
>>> Signed-off-by: Ryan Roberts <ryan.roberts@....com>
>>> ---
>>
>> Some comments mostly regarding thp_vma_allowable_orders and friends. In general,
>> LGTM. I'll have to go over the order logic once again; I got a bit lost once we
>> started mixing anon and file orders.
>>
>> [...]
>>
>> Doc updates all looked good to me, skimming over them.
>>
>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>> index fa0350b0812a..bd0eadd3befb 100644
>>> --- a/include/linux/huge_mm.h
>>> +++ b/include/linux/huge_mm.h
>>
>> [...]
>>
>>> +static inline unsigned long thp_vma_suitable_orders(struct vm_area_struct *vma,
>>> + unsigned long addr, unsigned long orders)
>>> +{
>>> + int order;
>>> +
>>> + /*
>>> + * Iterate over orders, highest to lowest, removing orders that don't
>>> + * meet alignment requirements from the set. Exit loop at first order
>>> + * that meets requirements, since all lower orders must also meet
>>> + * requirements.
>>> + */
>>> +
>>> + order = first_order(orders);
>>
>> nit: "highest_order" or "largest_order" would be more expressive regarding the
>> actual semantics.
>
> Yep, will call it "highest_order".
>
>>
>>> +
>>> + while (orders) {
>>> + unsigned long hpage_size = PAGE_SIZE << order;
>>> + unsigned long haddr = ALIGN_DOWN(addr, hpage_size);
>>> +
>>> + if (haddr >= vma->vm_start &&
>>> + haddr + hpage_size <= vma->vm_end) {
>>> + if (!vma_is_anonymous(vma)) {
>>> + if (IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) -
>>> + vma->vm_pgoff,
>>> + hpage_size >> PAGE_SHIFT))
>>> + break;
>>> + } else
>>> + break;
>>
>> Comment: Coding style wants you to use if () {} else {}
>>
>> But I'd recommend for the conditions:
>>
>> if (haddr < vma->vm_start ||
>> haddr + hpage_size > vma->vm_end)
>> continue;
>> /* Don't have to check pgoff for anonymous vma */
>> if (!vma_is_anonymous(vma))
>> break;
>> if (IS_ALIGNED((...
>> break;
>
> OK I'll take this structure.
FYI I ended up NOT taking this, because I now have thp_vma_suitable_order() and
thp_vma_suitable_orders(). The former is essentially what was there before
(except I pass the order and derive the nr_pages, alignment, etc. from that,
rather than using the HPAGE_* macros). The latter is just a loop that calls the
former for each order in the bitfield.
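
Roughly the shape it takes now (just a sketch typed out here, so the exact
signatures may differ slightly in the next version):

static inline bool thp_vma_suitable_order(struct vm_area_struct *vma,
		unsigned long addr, int order)
{
	unsigned long hpage_size = PAGE_SIZE << order;
	unsigned long haddr = ALIGN_DOWN(addr, hpage_size);

	if (haddr < vma->vm_start || haddr + hpage_size > vma->vm_end)
		return false;

	/* Anonymous vmas have no pgoff alignment requirement. */
	if (vma_is_anonymous(vma))
		return true;

	return IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff,
			  hpage_size >> PAGE_SHIFT);
}

static inline unsigned long thp_vma_suitable_orders(struct vm_area_struct *vma,
		unsigned long addr, unsigned long orders)
{
	int order = highest_order(orders);

	/*
	 * Remove orders, highest first, until one fits; all remaining
	 * (lower) orders must then fit too.
	 */
	while (orders) {
		if (thp_vma_suitable_order(vma, addr, order))
			break;
		order = next_order(&orders, order);
	}

	return orders;
}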
>
>>
>> [...]
>>
>>
>>> +/**
>>> + * thp_vma_allowable_orders - determine hugepage orders that are allowed for vma
>>> + * @vma: the vm area to check
>>> + * @vm_flags: use these vm_flags instead of vma->vm_flags
>>> + * @smaps: whether answer will be used for smaps file
>>> + * @in_pf: whether answer will be used by page fault handler
>>> + * @enforce_sysfs: whether sysfs config should be taken into account
>>> + * @orders: bitfield of all orders to consider
>>> + *
>>> + * Calculates the intersection of the requested hugepage orders and the allowed
>>> + * hugepage orders for the provided vma. Permitted orders are encoded as a set
>>> + * bit at the corresponding bit position (bit-2 corresponds to order-2, bit-3
>>> + * corresponds to order-3, etc). Order-0 is never considered a hugepage order.
>>> + *
>>> + * Return: bitfield of orders allowed for hugepage in the vma. 0 if no hugepage
>>> + * orders are allowed.
>>> + */
>>> +unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
>>> + unsigned long vm_flags, bool smaps,
>>> + bool in_pf, bool enforce_sysfs,
>>> + unsigned long orders)
>>> +{
>>> + /* Check the intersection of requested and supported orders. */
>>> + orders &= vma_is_anonymous(vma) ?
>>> + THP_ORDERS_ALL_ANON : THP_ORDERS_ALL_FILE;
>>> + if (!orders)
>>> + return 0;
>>
>> Comment: if this is called from some hot path, we might want to move as much as
>> possible into a header, so we can avoid this function call here when e.g., THP
>> are completely disabled etc.
>
> If THP is completely disabled (compiled out) then thp_vma_allowable_orders() is
> defined as a header inline that returns 0. I'm not sure there are any paths in
> practice which are likely to ask for a set of orders which are never supported
> (i.e. where this specific check would return 0). And the "are they run-time
> enabled" check is further down and fairly involved, so I'm not sure that's ideal for
> an inline.
>
> I haven't changed the pattern from how it was previously, so I don't think it
> should be any more expensive. Which parts exactly do you want to move to a header?
As per my response against the next patch (#4), I have now implemented
thp_vma_allowable_orders() so that the order check is in the header with an
early exit if THP is completely disabled.
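
Roughly (sketch only; the CONFIG_TRANSPARENT_HUGEPAGE=n stub still just
returns 0, and everything else stays in the out-of-line part):

unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
					 unsigned long vm_flags, bool smaps,
					 bool in_pf, bool enforce_sysfs,
					 unsigned long orders);

static inline unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
					 unsigned long vm_flags, bool smaps,
					 bool in_pf, bool enforce_sysfs,
					 unsigned long orders)
{
	/* Trim to the set of orders THP can ever support for this vma type. */
	orders &= vma_is_anonymous(vma) ?
			THP_ORDERS_ALL_ANON : THP_ORDERS_ALL_FILE;

	/* Early exit avoids the function call when nothing is possible. */
	if (!orders)
		return 0;

	return __thp_vma_allowable_orders(vma, vm_flags, smaps, in_pf,
					  enforce_sysfs, orders);
}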
>
>
>>
>>> +
>>> if (!vma->vm_mm) /* vdso */
>>> - return false;
>>> + return 0;
>>> /*
>>> * Explicitly disabled through madvise or prctl, or some
>>> @@ -88,16 +141,16 @@ bool hugepage_vma_check(struct vm_area_struct *vma,
>>> unsigned long vm_flags,
>>> * */
>>> if ((vm_flags & VM_NOHUGEPAGE) ||
>>> test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
>>> - return false;
>>> + return 0;
>>> /*
>>> * If the hardware/firmware marked hugepage support disabled.
>>> */
>>> if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_UNSUPPORTED))
>>> - return false;
>>> + return 0;
>>> /* khugepaged doesn't collapse DAX vma, but page fault is fine. */
>>> if (vma_is_dax(vma))
>>> - return in_pf;
>>> + return in_pf ? orders : 0;
>>> /*
>>> * khugepaged special VMA and hugetlb VMA.
>>> @@ -105,17 +158,29 @@ bool hugepage_vma_check(struct vm_area_struct *vma,
>>> unsigned long vm_flags,
>>> * VM_MIXEDMAP set.
>>> */
>>> if (!in_pf && !smaps && (vm_flags & VM_NO_KHUGEPAGED))
>>> - return false;
>>> + return 0;
>>> /*
>>> - * Check alignment for file vma and size for both file and anon vma.
>>> + * Check alignment for file vma and size for both file and anon vma by
>>> + * filtering out the unsuitable orders.
>>> *
>>> * Skip the check for page fault. Huge fault does the check in fault
>>> - * handlers. And this check is not suitable for huge PUD fault.
>>> + * handlers.
>>> */
>>> - if (!in_pf &&
>>> - !transhuge_vma_suitable(vma, (vma->vm_end - HPAGE_PMD_SIZE)))
>>> - return false;
>>> + if (!in_pf) {
>>> + int order = first_order(orders);
>>> + unsigned long addr;
>>> +
>>> + while (orders) {
>>> + addr = vma->vm_end - (PAGE_SIZE << order);
>>> + if (thp_vma_suitable_orders(vma, addr, BIT(order)))
>>> + break;
>>
>> Comment: you'd want a "thp_vma_suitable_order" helper here. But maybe the
>> compiler is smart enough to optimize the loop and everyything else out.
>
> I'm happy to refactor so that thp_vma_suitable_order() is the basic primitive,
> then make thp_vma_suitable_orders() a loop that calls thp_vma_suitable_order()
> (that's basically how it is laid out already, just all in one function). Is that
> what you are requesting?
>
>>
>> [...]
>>
>>> +
>>> +static ssize_t thpsize_enabled_store(struct kobject *kobj,
>>> + struct kobj_attribute *attr,
>>> + const char *buf, size_t count)
>>> +{
>>> + int order = to_thpsize(kobj)->order;
>>> + ssize_t ret = count;
>>> +
>>> + if (sysfs_streq(buf, "always")) {
>>> + set_bit(order, &huge_anon_orders_always);
>>> + clear_bit(order, &huge_anon_orders_inherit);
>>> + clear_bit(order, &huge_anon_orders_madvise);
>>> + } else if (sysfs_streq(buf, "inherit")) {
>>> + set_bit(order, &huge_anon_orders_inherit);
>>> + clear_bit(order, &huge_anon_orders_always);
>>> + clear_bit(order, &huge_anon_orders_madvise);
>>> + } else if (sysfs_streq(buf, "madvise")) {
>>> + set_bit(order, &huge_anon_orders_madvise);
>>> + clear_bit(order, &huge_anon_orders_always);
>>> + clear_bit(order, &huge_anon_orders_inherit);
>>> + } else if (sysfs_streq(buf, "never")) {
>>> + clear_bit(order, &huge_anon_orders_always);
>>> + clear_bit(order, &huge_anon_orders_inherit);
>>> + clear_bit(order, &huge_anon_orders_madvise);
>>
>> Note: I was wondering for a second if some concurrent changes could lead to an
>> inconsistent state. I think in the worst case we'll simply end up with "never"
>> on races.
>
> You mean if different threads try to write different values to this file
> concurrently? Or if there is a concurrent fault that tries to read the flags
> while they are being modified?
>
> I thought about this for a long time too and wasn't sure what was best. The
> existing global enabled store impl clears the bits first then sets the bit. With
> this approach you can end up with multiple bits set if there is a race to set
> different values, and you can end up with a faulting thread seeing "never" if it
> reads the bits after they have been cleared but before setting them.
>
> I decided to set the new bit before clearing the old bits, which is different; a
> racing fault will never see "never" but as you say, a race to set the file could
> result in "never" being set.
>
> On reflection, it's probably best to set the bit *last* like the global control
> does?
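
i.e. something like this for the store, with the set_bit() moved last in
each branch (sketch only):

static ssize_t thpsize_enabled_store(struct kobject *kobj,
				     struct kobj_attribute *attr,
				     const char *buf, size_t count)
{
	int order = to_thpsize(kobj)->order;
	ssize_t ret = count;

	if (sysfs_streq(buf, "always")) {
		/*
		 * Clear the old policy bits before setting the new one,
		 * matching the pattern of the global `enabled` control.
		 */
		clear_bit(order, &huge_anon_orders_inherit);
		clear_bit(order, &huge_anon_orders_madvise);
		set_bit(order, &huge_anon_orders_always);
	} else if (sysfs_streq(buf, "inherit")) {
		clear_bit(order, &huge_anon_orders_always);
		clear_bit(order, &huge_anon_orders_madvise);
		set_bit(order, &huge_anon_orders_inherit);
	} else if (sysfs_streq(buf, "madvise")) {
		clear_bit(order, &huge_anon_orders_always);
		clear_bit(order, &huge_anon_orders_inherit);
		set_bit(order, &huge_anon_orders_madvise);
	} else if (sysfs_streq(buf, "never")) {
		clear_bit(order, &huge_anon_orders_always);
		clear_bit(order, &huge_anon_orders_inherit);
		clear_bit(order, &huge_anon_orders_madvise);
	} else
		ret = -EINVAL;

	return ret;
}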
>
>>
>> [...]
>>
>>> static int __init hugepage_init_sysfs(struct kobject **hugepage_kobj)
>>> {
>>> int err;
>>> + struct thpsize *thpsize;
>>> + unsigned long orders;
>>> + int order;
>>> +
>>> + /*
>>> + * Default to setting PMD-sized THP to inherit the global setting and
>>> + * disable all other sizes. powerpc's PMD_ORDER isn't a compile-time
>>> + * constant so we have to do this here.
>>> + */
>>> + huge_anon_orders_inherit = BIT(PMD_ORDER);
>>> *hugepage_kobj = kobject_create_and_add("transparent_hugepage", mm_kobj);
>>> if (unlikely(!*hugepage_kobj)) {
>>> @@ -434,8 +631,24 @@ static int __init hugepage_init_sysfs(struct kobject
>>> **hugepage_kobj)
>>> goto remove_hp_group;
>>> }
>>> + orders = THP_ORDERS_ALL_ANON;
>>> + order = first_order(orders);
>>> + while (orders) {
>>> + thpsize = thpsize_create(order, *hugepage_kobj);
>>> + if (IS_ERR(thpsize)) {
>>> + pr_err("failed to create thpsize for order %d\n", order);
>>> + err = PTR_ERR(thpsize);
>>> + goto remove_all;
>>> + }
>>> + list_add(&thpsize->node, &thpsize_list);
>>> + order = next_order(&orders, order);
>>> + }
>>> +
>>> return 0;
>>>
>>
>> [...]
>>
>>> page = compound_head(page);
>>> @@ -5116,7 +5116,8 @@ static vm_fault_t __handle_mm_fault(struct
>>> vm_area_struct *vma,
>>> return VM_FAULT_OOM;
>>> retry_pud:
>>> if (pud_none(*vmf.pud) &&
>>> - hugepage_vma_check(vma, vm_flags, false, true, true)) {
>>> + thp_vma_allowable_orders(vma, vm_flags, false, true, true,
>>> + BIT(PUD_ORDER))) {
>>> ret = create_huge_pud(&vmf);
>>> if (!(ret & VM_FAULT_FALLBACK))
>>> return ret;
>>> @@ -5150,7 +5151,8 @@ static vm_fault_t __handle_mm_fault(struct
>>> vm_area_struct *vma,
>>> goto retry_pud;
>>> if (pmd_none(*vmf.pmd) &&
>>> - hugepage_vma_check(vma, vm_flags, false, true, true)) {
>>> + thp_vma_allowable_orders(vma, vm_flags, false, true, true,
>>> + BIT(PMD_ORDER))) {
>>
>> Comment: A helper like "thp_vma_allowable_order(vma, PMD_ORDER)" might make this
>> easier to read -- and the implementation will be faster.
>
> I'm happy to do this and use it to improve readability:
>
> #define thp_vma_allowable_order(..., order) \
> thp_vma_allowable_orders(..., BIT(order))
>
> This wouldn't make the implementation any faster though; are you suggesting a
> completely separate impl? Even then, I don't think there is much scope to make
> it faster for the case where there is only 1 order in the bitfield.
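
Concretely, something like (sketch):

#define thp_vma_allowable_order(vma, vm_flags, smaps, in_pf, enforce_sysfs, order) \
	thp_vma_allowable_orders(vma, vm_flags, smaps, in_pf, enforce_sysfs, \
				 BIT(order))

so the PMD fault site above would read:

	if (pmd_none(*vmf.pmd) &&
	    thp_vma_allowable_order(vma, vm_flags, false, true, true, PMD_ORDER)) {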
>
>>
>>> ret = create_huge_pmd(&vmf);
>>> if (!(ret & VM_FAULT_FALLBACK))
>>> return ret;
>>> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
>>> index e0b368e545ed..64da127cc267 100644
>>> --- a/mm/page_vma_mapped.c
>>> +++ b/mm/page_vma_mapped.c
>>> @@ -268,7 +268,8 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>>> * cleared *pmd but not decremented compound_mapcount().
>>> */
>>> if ((pvmw->flags & PVMW_SYNC) &&
>>> - transhuge_vma_suitable(vma, pvmw->address) &&
>>> + thp_vma_suitable_orders(vma, pvmw->address,
>>> + BIT(PMD_ORDER)) &&
>>
>> Comment: Similarly, a helper like "thp_vma_suitable_order(vma, PMD_ORDER)" might
>> make this easier to read.
>
> Yep, will do this.
>
>>
>>> (pvmw->nr_pages >= HPAGE_PMD_NR)) {
>>> spinlock_t *ptl = pmd_lock(mm, pvmw->pmd);
>>>
>>
>