linux-kernel - Re: [PATCH v1 2/2] mm: mTHP stats for pagecache folio allocations

Message-ID: <f03deb7c-9a67-4096-9d33-32b357b52152@arm.com>
Date: Tue, 16 Jul 2024 12:14:39 +0100
From: Ryan Roberts <ryan.roberts@....com>
To: David Hildenbrand <david@...hat.com>, Lance Yang <ioworker0@...il.com>,
 Baolin Wang <baolin.wang@...ux.alibaba.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>, Hugh Dickins
 <hughd@...gle.com>, Jonathan Corbet <corbet@....net>,
 "Matthew Wilcox (Oracle)" <willy@...radead.org>,
 Barry Song <baohua@...nel.org>, linux-kernel@...r.kernel.org,
 linux-mm@...ck.org
Subject: Re: [PATCH v1 2/2] mm: mTHP stats for pagecache folio allocations

On 16/07/2024 11:19, David Hildenbrand wrote:
> On 16.07.24 10:31, Ryan Roberts wrote:
>> On 13/07/2024 11:45, Ryan Roberts wrote:
>>> On 13/07/2024 02:08, David Hildenbrand wrote:
>>>> On 12.07.24 14:22, Lance Yang wrote:
>>>>> On Fri, Jul 12, 2024 at 11:00 AM Baolin Wang
>>>>> <baolin.wang@...ux.alibaba.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 2024/7/11 15:29, Ryan Roberts wrote:
>>>>>>> Expose 3 new mTHP stats for file (pagecache) folio allocations:
>>>>>>>
>>>>>>>      /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/file_alloc
>>>>>>>      /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/file_fallback
>>>>>>>      /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/file_fallback_charge
>>>>>>>
>>>>>>> This will provide some insight on the sizes of large folios being
>>>>>>> allocated for file-backed memory, and how often allocation is failing.
>>>>>>>
>>>>>>> All non-order-0 (and most order-0) folio allocations are currently done
>>>>>>> through filemap_alloc_folio(), and folios are charged in a subsequent
>>>>>>> call to filemap_add_folio(). So count file_fallback when allocation
>>>>>>> fails in filemap_alloc_folio() and count file_alloc or
>>>>>>> file_fallback_charge in filemap_add_folio(), based on whether charging
>>>>>>> succeeded or not. There are some users of filemap_add_folio() that
>>>>>>> allocate their own order-0 folio by other means, so we would not count
>>>>>>> an allocation failure in this case, but we also don't care about order-0
>>>>>>> allocations. This approach feels like it should be good enough and
>>>>>>> doesn't require any (impractically large) refactoring.
>>>>>>>
>>>>>>> The existing mTHP stats interface is reused to provide consistency to
>>>>>>> users. And because we are reusing the same interface, we can reuse the
>>>>>>> same infrastructure on the kernel side. The one small wrinkle is that
>>>>>>> the set of folio sizes supported by the pagecache is not identical to
>>>>>>> those supported by anon and shmem; pagecache supports order-1, unlike
>>>>>>> anon and shmem, and the max pagecache order may be less than PMD-size
>>>>>>> (see arm64 with 64K base pages), again unlike anon and shmem. So we now
>>>>>>> create a hugepages-*kB directory for the union of the sizes supported by
>>>>>>> all 3 memory types and populate it with the relevant stats and controls.
>>>>>>
>>>>>> Personally, I like the idea that can help analyze the allocation of
>>>>>> large folios for the page cache.
>>>>>>
>>>>>> However, I have a slight concern about the consistency of the interface.
>>>>>>
>>>>>> For 64K, the fields layout:
>>>>>> ├── hugepages-64kB
>>>>>> │   ├── enabled
>>>>>> │   ├── shmem_enabled
>>>>>> │   └── stats
>>>>>> │       ├── anon_fault_alloc
>>>>>> │       ├── anon_fault_fallback
>>>>>> │       ├── anon_fault_fallback_charge
>>>>>> │       ├── file_alloc
>>>>>> │       ├── file_fallback
>>>>>> │       ├── file_fallback_charge
>>>>>> │       ├── shmem_alloc
>>>>>> │       ├── shmem_fallback
>>>>>> │       ├── shmem_fallback_charge
>>>>>> │       ├── split
>>>>>> │       ├── split_deferred
>>>>>> │       ├── split_failed
>>>>>> │       ├── swpout
>>>>>> │       └── swpout_fallback
>>>>>>
>>>>>> But for 8K (for pagecache), you removed some fields (of course, I
>>>>>> understand why they are not supported).
>>>>>>
>>>>>> ├── hugepages-8kB
>>>>>> │   └── stats
>>>>>> │       ├── file_alloc
>>>>>> │       ├── file_fallback
>>>>>> │       └── file_fallback_charge
>>>>>>
>>>>>> This might not be user-friendly for some user-space parsing tools, as
>>>>>> they lack certain fields for the same pattern interfaces. Of course,
>>>>>> this might not be an issue if we have clear documentation describing the
>>>>>> differences here:)
>>>>>>
>>>>>> Another possible approach is to maintain the same field layout to keep
>>>>>> consistent, but prohibit writing to the fields that are not supported by
>>>>>> the pagecache, and any stats read from them would be 0.
>>>>>
>>>>> I agree that maintaining a uniform field layout, especially at the stats
>>>>> level, might be necessary ;)
>>>>>
>>>>> Keeping a consistent interface could future-proof the design. It allows
>>>>> for the possibility that features not currently supported for 8kB pages
>>>>> might be enabled in the future.
>>>>
>>>> I'll just note that, with shmem/file effectively being disabled for order > 11,
>>>> we'll also have entries there that are effectively unused.
>>>
>>> Indeed, I mentioned that in the commit log :)
> 
> Well, I think it's more extreme than what you mentioned.
> 
> For example, shmem_enabled on arm64 with 64k is now effectively non-functional.
> Just like it will be for other orders in the anon-shmem case when the order
> exceeds MAX_PAGECACHE_ORDER.

Ahh I see what you are saying now; we already have precedent for non-functional
controls.

(Actually, looking at the code, it looks like the shmem stats will be
unconditionally exposed, but the shmem controls will only be exposed when
CONFIG_SHMEM is enabled. I guess that should be fixed - I'll post a patch).

> 
>>>
>>>>
>>>> Good question how we want to deal with that (stats are easy, but what about
>>>> when we enable something? Maybe we should document that "enabled" is only
>>>> effective when supported).
>>>
>>> The documentation already says "If enabling multiple hugepage sizes, the kernel
>>> will select the most appropriate enabled size for a given allocation." for anon
>>> THP (and I've added similar wording for my as-yet-unposted patch to add controls
>>> for page cache folio sizes). So I think we could easily add dummy *enabled
>>> controls for all sizes, that can be written to and read back consistently, but
>>> the kernel just ignores them when deciding what size to use. It would also
>>> simplify the code that populates the controls.
>>>
>>> Personally though, I'm not convinced of the value of trying to make the controls
>>> for every size look identical. What's the real value to the user to pretend that
>>> they can select a size that they cannot? What happens when we inevitably want to
>>> add some new control in future which only applies to select sizes and there is
>>> no good way to fake it for the other sizes? Why can't user space just be
>>> expected to rely on the existence of the files rather than on the existence of
>>> the directories?
>>>
>>> As always, I'll go with the majority, but just wanted to register my opinion.
>>
>> Should I assume from the lack of reply on this that everyone else is in favour
>> of adding dummy controls so that all sizes have the same set of controls? If I
>> don't hear anything further, I'll post v2 with dummy controls today or tomorrow.
> 
> Sorry, busy with other stuff.
> 
> Indicating only what really exists sounds cleaner. But I wonder how we would
> want to handle in general orders that are effectively non-existent?

I'm not following your distinction between orders that don't "really exist" and
orders that are "effectively non-existent".

I guess the real supported orders are:

  anon:
    min order: 2
    max order: PMD_ORDER
  anon-shmem:
    min order: 1
    max order: MAX_PAGECACHE_ORDER
  tmpfs-shmem:
    min order: PMD_ORDER <= 11 ? PMD_ORDER : NONE
    max order: PMD_ORDER <= 11 ? PMD_ORDER : NONE
  file:
    min order: 1
    max order: MAX_PAGECACHE_ORDER

But today, controls and stats are exposed for:

  anon:
    min order: 2
    max order: PMD_ORDER
  anon-shmem:
    min order: 2
    max order: PMD_ORDER
  tmpfs-shmem:
    min order: PMD_ORDER
    max order: PMD_ORDER
  file:
    min order: Nothing yet (this patch proposes 1)
    max order: Nothing yet (this patch proposes MAX_PAGECACHE_ORDER)

So I think there is definitely a bug for shmem where the minimum order control
should be order-1 but it's currently order-2.
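
To make the comparison concrete, here is a rough Python sketch (not kernel
code) that encodes the two tables above and reports where they differ. The
function names are made up for illustration, the limit of 11 is taken straight
from the tmpfs-shmem row above, and whatever values are passed for PMD_ORDER
and MAX_PAGECACHE_ORDER are assumptions that depend on the architecture and
base page size:

```python
from typing import Dict, Optional, Tuple

# (min order, max order), or None when nothing is supported/exposed.
OrderRange = Optional[Tuple[int, int]]


def real_orders(pmd_order: int, max_pagecache_order: int) -> Dict[str, OrderRange]:
    """Orders that are really supported, per the first table."""
    # tmpfs-shmem only supports PMD_ORDER, and only when PMD_ORDER <= 11.
    tmpfs = (pmd_order, pmd_order) if pmd_order <= 11 else None
    return {
        "anon": (2, pmd_order),
        "anon-shmem": (1, max_pagecache_order),
        "tmpfs-shmem": tmpfs,
        "file": (1, max_pagecache_order),
    }


def exposed_orders(pmd_order: int) -> Dict[str, OrderRange]:
    """Orders for which controls and stats are exposed today, per the second table."""
    return {
        "anon": (2, pmd_order),
        "anon-shmem": (2, pmd_order),
        "tmpfs-shmem": (pmd_order, pmd_order),
        "file": None,  # nothing exposed yet
    }


def mismatches(pmd_order: int, max_pagecache_order: int) -> Dict[str, Tuple[OrderRange, OrderRange]]:
    """Return {memory type: (real, exposed)} for every type where the two differ."""
    real = real_orders(pmd_order, max_pagecache_order)
    exposed = exposed_orders(pmd_order)
    return {name: (real[name], exposed[name]) for name in real if real[name] != exposed[name]}
```

For example, mismatches(13, 11) (illustrative values for a config where the
pagecache max order is below PMD order) omits anon, since both tables agree
there, and flags the other three types, with anon-shmem showing real (1, 11)
against exposed (2, 13).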

I also wonder about PUD-order for DAX? We don't currently have a stat/control.
If we wanted to add it in future, if we take the "expose all stats/controls for
all orders" approach, we would end up extending all the way to PUD-order and all
the orders between PMD and PUD would be dummy for all memory types. That really
starts to feel odd, so I still favour only populating what's really supported.

I propose to fix shmem (extend down to 1, stop at MAX_PAGECACHE_ORDER) and
continue with the approach of "indicating only what really exists" for v2.
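
For what it's worth, here is a minimal sketch of how a user-space tool could
cope with that approach by keying off the existence of the stat files rather
than assuming every hugepages-*kB directory has the same contents. This is an
illustration only: the helper names are hypothetical, and it assumes each stat
file holds a single integer:

```python
from pathlib import Path
from typing import Dict, Optional


def read_mthp_stats(root: str = "/sys/kernel/mm/transparent_hugepage") -> Dict[str, Dict[str, int]]:
    """Collect whichever stats exist under each hugepages-*kB directory.

    Nothing is assumed about which stat files a given size provides;
    only the files that are actually present are read.
    """
    stats: Dict[str, Dict[str, int]] = {}
    for stat_file in sorted(Path(root).glob("hugepages-*kB/stats/*")):
        if not stat_file.is_file():
            continue
        size_dir = stat_file.parent.parent.name  # e.g. "hugepages-64kB"
        stats.setdefault(size_dir, {})[stat_file.name] = int(stat_file.read_text().strip())
    return stats


def get_stat(stats: Dict[str, Dict[str, int]], size_dir: str, name: str) -> Optional[int]:
    """Return the counter, or None if that size does not expose that stat."""
    return stats.get(size_dir, {}).get(name)
```

With this shape, a size that only exposes the file_* stats yields None for
anon_fault_alloc rather than an error, so the tool can tell "not supported
for this size" apart from "supported and currently 0".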

Shout if you disagree.

Thanks,
Ryan