linux-kernel - Re: [PATCH 5/8] mm: shmem: add multi-size THP sysfs interface for anonymous shmem

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <6e412cee-be0c-4c94-b576-ebdd897e6e05@redhat.com>
Date: Wed, 8 May 2024 15:07:20 +0200
From: David Hildenbrand <david@...hat.com>
To: Ryan Roberts <ryan.roberts@....com>,
 Baolin Wang <baolin.wang@...ux.alibaba.com>, akpm@...ux-foundation.org,
 hughd@...gle.com
Cc: willy@...radead.org, ioworker0@...il.com, wangkefeng.wang@...wei.com,
 ying.huang@...el.com, 21cnbao@...il.com, shy828301@...il.com,
 ziy@...dia.com, linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 5/8] mm: shmem: add multi-size THP sysfs interface for
 anonymous shmem

On 08.05.24 14:54, Ryan Roberts wrote:
> On 08/05/2024 13:45, David Hildenbrand wrote:
>> On 08.05.24 14:43, Ryan Roberts wrote:
>>> On 08/05/2024 13:10, David Hildenbrand wrote:
>>>> On 08.05.24 14:02, David Hildenbrand wrote:
>>>>> On 08.05.24 11:02, Ryan Roberts wrote:
>>>>>> On 08/05/2024 08:12, David Hildenbrand wrote:
>>>>>>> On 08.05.24 09:08, David Hildenbrand wrote:
>>>>>>>> On 08.05.24 06:45, Baolin Wang wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 2024/5/7 18:52, Ryan Roberts wrote:
>>>>>>>>>> On 06/05/2024 09:46, Baolin Wang wrote:
>>>>>>>>>>> To support the use of mTHP with anonymous shmem, add a new sysfs
>>>>>>>>>>> interface
>>>>>>>>>>> 'shmem_enabled' in the
>>>>>>>>>>> '/sys/kernel/mm/transparent_hugepage/hugepages-kB/'
>>>>>>>>>>> directory for each mTHP to control whether shmem is enabled for that
>>>>>>>>>>> mTHP,
>>>>>>>>>>> with a value similar to the top level 'shmem_enabled', which can be
>>>>>>>>>>> set to:
>>>>>>>>>>> "always", "inherit (to inherit the top level setting)", "within_size",
>>>>>>>>>>> "advise",
>>>>>>>>>>> "never", "deny", "force". These values follow the same semantics as
>>>>>>>>>>> the top
>>>>>>>>>>> level, except the 'deny' is equivalent to 'never', and 'force' is
>>>>>>>>>>> equivalent
>>>>>>>>>>> to 'always' to keep compatibility.
>>>>>>>>>>
>>>>>>>>>> We decided at [1] to not allow 'force' for non-PMD-sizes.
>>>>>>>>>>
>>>>>>>>>> [1]
>>>>>>>>>> https://lore.kernel.org/linux-mm/533f37e9-81bf-4fa2-9b72-12cdcb1edb3f@redhat.com/
>>>>>>>>>>
>>>>>>>>>> However, thinking about this a bit more, I wonder if the decision we
>>>>>>>>>> made to
>>>>>>>>>> allow all hugepages-xxkB/enabled controls to take "inherit" was the wrong
>>>>>>>>>> one.
>>>>>>>>>> Perhaps we should have only allowed the PMD-sized enable=inherit (this is
>>>>>>>>>> just
>>>>>>>>>> for legacy back compat after all, I don't think there is any use case
>>>>>>>>>> where
>>>>>>>>>> changing multiple mTHP size controls atomically is actually useful).
>>>>>>>>>> Applying
>>>>>>>>>
>>>>>>>>> Agree. This is also our usage of 'inherit'.
>>>>>>>
>>>>>>> Missed that one: there might be use cases in the future once we would start
>>>>>>> defaulting to "inherit" for all knobs (a distro might default to that) and
>>>>>>> default-enable THP in the global knob. Then, it would be easy to disable any
>>>>>>> THP
>>>>>>> by disabling the global knob. (I think that's the future we're heading to,
>>>>>>> where
>>>>>>> we'd have an "auto" mode that can be set on the global toggle).
>>>>>>>
>>>>>>> But I am just making up use cases ;) I think it will be valuable and just
>>>>>>> doing
>>>>>>> it consistently now might be cleaner.
>>>>>>
>>>>>> I agree that consistency between enabled and shmem_enabled is top priority.
>>>>>> And
>>>>>> yes, I had forgotten about the glorious "auto" future. So probably continuing
>>>>>> all sizes to select "inherit" is best.
>>>>>>
>>>>>> But for shmem_enabled, that means we need the following error checking:
>>>>>>
>>>>>>      - It is an error to set "force" for any size except PMD-size
>>>>>>
>>>>>>      - It is an error to set "force" for the global control if any size except
>>>>>> PMD-
>>>>>>        size is set to "inherit"
>>>>>>
>>>>>>      - It is an error to set "inherit" for any size except PMD-size if the
>>>>>> global
>>>>>>        control is set to "force".
>>>>>>
>>>>>> Certainly not too difficult to code and prove to be correct, but not the
>>>>>> nicest
>>>>>> UX from the user's point of view when they start seeing errors.
>>>>>>
>>>>>> I think we previously said this would likely be temporary, and if/when tmpfs
>>>>>> gets mTHP support, we could simplify and allow all sizes to be set to "force".
>>>>>> But I wonder if tmpfs would ever need explicit mTHP control? Maybe it would be
>>>>>> more suited to the approach the page cache takes to transparently ramp up the
>>>>>> folio size as it faults more in. (Just saying there is a chance that this
>>>>>> error
>>>>>> checking becomes permanent).
>>>>>
>>>>> Note that with shmem you're inherently facing the same memory waste
>>>>> issues etc as you would with anonymous memory. (sometimes even worse, if
>>>>> you're running shmem that's configured to be unswappable!).
>>>>
>>>> Also noting that memory waste is not really a problem when a write to a shmem
>>>> file allocates a large folio that stays within boundaries of that write; issues
>>>> only pop up if you end up over-allocating, especially, during page faults where
>>>> you have not that much clue about what to do (single address, no real range
>>>> provided).
>>>>
>>>> There is the other issue that wasting large chunks of contiguous memory on stuff
>>>> that barely benefits from it. With memory that maybe never gets evicted, there
>>>> is no automatic "handing back" of that memory to the system to be used by
>>>> something else. With ordinary files, that's a bit different. But I did not look
>>>> closer into that issue yet, it's one of the reasons MADV_HUGEPAGE was added
>>>> IIRC.
>>>
>>> OK understood. Although, with tmpfs you're not going to mmap it then randomly
>>> extend the file through page faults - mmap doesn't permit that, I don't think?
>>> So presumably the user must explicitly set the size of the file first? Are you
>>> suggesting there are a lot of use cases where a large tmpfs file is created,
>>> mmaped then only accessed sparsely?
>>
>> I don't know about "a lot of use cases", but for VMs that's certainly how it's
>> used.
> 

There are more details around that and the sparsity (memory ballooning, 
virtio-mem, free page reporting), but it might distract here :) I'll 
note that shmem+THP is known to be problematic with memory ballooning.

> Gottya, thanks. And out of curiosity, what's the benefit of using tmpfs rather
> than private (or shared) anonymous memory for VMs?

The primary use case I know of is sharing VM memory with other processes 
(usually not child processes): DPDK/SPDK and other vhost-user variants 
(such as virtiofs) mmap() all guest memory to access it directly (some 
sort of multi-process hypervisors). They either use real-file-based 
shmem or memfd (essentially the same without a named file) for that.

Then, there is live-hypervisor upgrade, whereby you start a second 
hypervisor process that will take over. People use shmem for that, so 
you can minimize downtime by migrating guest memory simply by mmap'ing 
the shmem file into the new hypervisor.

Shared anonymous memory is basically never used (I only know of one 
corner case in QEMU).

I would assume that there are also DBs making use of rather sparse 
shmem? But no expert on that.

-- 
Cheers,

David / dhildenb