linux-kernel - Re: [PATCH 0/8] add mTHP support for anonymous shmem

Message-ID: <de9f9d07-6534-419b-86a8-628e13020c1e@redhat.com>
Date: Wed, 8 May 2024 19:03:57 +0200
From: David Hildenbrand <david@...hat.com>
To: Daniel Gomez <da.gomez@...sung.com>
Cc: Baolin Wang <baolin.wang@...ux.alibaba.com>,
 "akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
 "hughd@...gle.com" <hughd@...gle.com>,
 "willy@...radead.org" <willy@...radead.org>,
 "ioworker0@...il.com" <ioworker0@...il.com>,
 "wangkefeng.wang@...wei.com" <wangkefeng.wang@...wei.com>,
 "ying.huang@...el.com" <ying.huang@...el.com>,
 "21cnbao@...il.com" <21cnbao@...il.com>,
 "ryan.roberts@....com" <ryan.roberts@....com>,
 "shy828301@...il.com" <shy828301@...il.com>, "ziy@...dia.com"
 <ziy@...dia.com>, "linux-mm@...ck.org" <linux-mm@...ck.org>,
 "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 0/8] add mTHP support for anonymous shmem

On 08.05.24 16:28, Daniel Gomez wrote:
> On Wed, May 08, 2024 at 01:58:19PM +0200, David Hildenbrand wrote:
>> On 08.05.24 13:39, Daniel Gomez wrote:
>>> On Mon, May 06, 2024 at 04:46:24PM +0800, Baolin Wang wrote:
>>>> Multi-size THP (mTHP) allocation is already supported for anonymous pages
>>>> since commit 19eaf44954df, which allows THP to be configured through the
>>>> sysfs interface located at '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
>>>>
>>>> However, anonymous shared pages ignore the anonymous mTHP rules
>>>> configured through the sysfs interface and can only use PMD-mapped
>>>> THP, which is not reasonable. Many users implement anonymous page sharing
>>>> through mmap(MAP_SHARED | MAP_ANONYMOUS), especially in database usage
>>>> scenarios. Users therefore expect a unified mTHP strategy for anonymous
>>>> pages, including anonymous shared pages, in order to enjoy the benefits of
>>>> mTHP: for example, lower latency and less memory bloat than PMD-mapped
>>>> THP, and contiguous PTEs on ARM architectures to reduce TLB misses.
>>>>
>>>> The primary strategy is similar to supporting anonymous mTHP. Introduce
>>>> a new interface '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled',
>>>> which can have all the same values as the top-level
>>>> '/sys/kernel/mm/transparent_hugepage/shmem_enabled', plus a new
>>>> "inherit" option. By default all sizes will be set to "never"
>>>> except PMD size, which is set to "inherit". This ensures backward compatibility
>>>> with the top-level shmem_enabled setting, while also allowing independent
>>>> control of shmem_enabled for each mTHP size.
>>>
>>> I'm trying to understand the adoption of mTHP and how it fits into the adoption
>>> of (large) folios that the kernel is moving towards. Can you, or anyone involved
>>> here, explain this? How much do they overlap, and can we benefit from having
>>> both? Is there any argument against the adoption of large folios here that I
>>> might have missed?
>>
>> mTHP are implemented using large folios, just like traditional PMD-sized THP
>> are. (you really should explore the history of mTHP and how it all works
>> internally)
> 
> I'll look into the code in more depth. By any chance are any of you going to
> be at LSFMM this year? I have this session [1] scheduled for Wednesday and it
> would be nice to get your feedback on it and whether you see this working
> together with mTHP/THP.
>

I'll be around and will attend that session! But note that I am still 
scratching my head over what to do with "ordinary" shmem, especially 
because of the weird way shmem behaves in contrast to real files (below). 
Some input from Hugh might be very helpful.

Example: you write() to a shmem file and populate a 2M THP. Then, nobody 
touches that file for a long time. There are certainly other mmap() 
users that could benefit more from that THP ... and without swap that 
THP will be trapped there, possibly for a long time (unless I am missing 
an important piece of shmem THP design :) )? Sure, if we only had THPs 
it would be nice; that's just not the reality, unfortunately. IIRC, that's 
one of the reasons why THP for shmem can be enabled/disabled. But again, 
still scratching my head ...
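
A minimal sketch of the write()-only pattern described above, for 
illustration. It assumes a memfd (memfd_create(), glibc >= 2.27) as the 
shmem file; the helper name populate_shmem_by_write() and the "thp-demo" 
label are hypothetical. Whether the kernel actually backs the range with a 
THP depends on the shmem_enabled setting; the code only shows that memory 
gets populated without any mmap().

```c
#define _GNU_SOURCE
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a shmem-backed memfd and fill `len` bytes
 * using write() only (no mmap, so no page faults on a mapping).
 * If `allocated` is non-NULL it receives st_blocks * 512, i.e. the bytes
 * the file reports as allocated. Returns 0 on success, -1 on error.
 */
int populate_shmem_by_write(size_t len, long long *allocated)
{
    struct stat st;
    size_t done = 0;
    int ret = -1;

    int fd = memfd_create("thp-demo", 0);
    if (fd < 0)
        return -1;

    char *buf = calloc(1, len);
    if (!buf)
        goto out;

    /* write() may be short, so loop until everything is written. */
    while (done < len) {
        ssize_t n = write(fd, buf + done, len - done);
        if (n <= 0)
            goto out;
        done += (size_t)n;
    }

    if (fstat(fd, &st) != 0 || (size_t)st.st_size != len)
        goto out;
    if (allocated)
        *allocated = (long long)st.st_blocks * 512;
    ret = 0;
out:
    free(buf);
    close(fd);
    return ret;
}
```

Calling populate_shmem_by_write(2u << 20, NULL) writes 2M; as long as the 
fd stays open and nobody touches it, that memory stays allocated.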


Note that this patch set only tackles anonymous shmem 
(MAP_SHARED|MAP_ANON), which is in 99.999% of all cases only accessed 
via page tables (memory allocated during page faults). I think there are 
ways to grab the fd (/proc/self/fd), but IIRC only corner cases 
read/write that.
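
For reference, a small sketch of the common anonymous shmem access 
pattern: map with MAP_SHARED|MAP_ANONYMOUS and touch the memory, so that 
allocation happens during page faults. The helper name 
touch_anon_shmem() is hypothetical.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: map `len` bytes of anonymous shared memory and
 * touch all of it. There is no fd involved; the pages are allocated by
 * shmem when each one is first touched (page fault).
 * Returns 0 on success, -1 on error.
 */
int touch_anon_shmem(size_t len)
{
    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    /* First touch of each page faults it in. */
    memset(p, 0xab, len);

    int ok = (p[0] == 0xab && p[len - 1] == 0xab);
    munmap(p, len);
    return ok ? 0 : -1;
}
```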

So in that sense, anonymous shmem (this patch set) behaves mostly like 
ordinary anonymous memory, and likely there is not much overlap with 
other "allocate large folios during read/write/fallocate" as in [1]. 
swap might have an overlap.


The real confusion begins when we have ordinary shmem: some users never 
mmap it and only read/write, some users never read/write it and only 
mmap it and some (less common?) users do both.

And shmem really is special: it looks like "just another file", but 
memory-consumption and reclaim wise it behaves just like anonymous 
memory. It might be swappable ("usually very limited backing disk space 
available") or it might not.

In a subthread here we are discussing what to do with that special 
"shmem_enabled = force" mode ... and it's all complicated I think.
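
To check which shmem_enabled mode is active on a system, something like 
the following sketch could be used. It assumes the usual sysfs convention 
of marking the active choice with square brackets; the helper names 
active_choice() and read_shmem_enabled() are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/*
 * Copy the bracketed (active) token of a sysfs multi-choice line into
 * `out`. Returns 0 on success, -1 if no "[token]" is found or it does
 * not fit into `out`.
 */
int active_choice(const char *line, char *out, size_t outsz)
{
    const char *l = strchr(line, '[');
    const char *r = l ? strchr(l, ']') : NULL;

    if (!l || !r || (size_t)(r - l - 1) >= outsz)
        return -1;
    memcpy(out, l + 1, (size_t)(r - l - 1));
    out[r - l - 1] = '\0';
    return 0;
}

/* Read the top-level shmem_enabled knob and extract the active mode. */
int read_shmem_enabled(char *out, size_t outsz)
{
    char line[256];
    FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/shmem_enabled", "r");

    if (!f)
        return -1;
    char *s = fgets(line, sizeof(line), f);
    fclose(f);
    return s ? active_choice(line, out, outsz) : -1;
}
```

read_shmem_enabled() fails on kernels without THP support, where the 
sysfs file does not exist.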

> [1] https://lore.kernel.org/all/4ktpayu66noklllpdpspa3vm5gbmb5boxskcj2q6qn7md3pwwt@kvlu64pqwjzl/
> 
>>
>> The biggest challenge with memory that cannot be evicted and reclaimed
>> under memory pressure (in contrast to your ordinary files in the pagecache)
>> is memory waste and, more generally, the placement of large chunks of
>> memory during page faults.
>>
>> In the worst case (no swap), you allocate a large chunk of memory once and
>> it will stick around until freed: no reclaim of that memory.
> 
> I can see that path being triggered by some fstests but only for THP (where we
> can actually reclaim memory).

Is that when we punch-hole a partial THP and split it? I'd be interested 
in what that test does.
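
A sketch of what "punch-hole a partial THP" could look like from user 
space. This is an assumption for illustration, not the fstest referred to 
above: it preallocates 2M in a memfd and punches a 4K hole in the middle 
with fallocate(FALLOC_FL_PUNCH_HOLE). The helper name 
punch_partial_hole() is hypothetical, and whether a THP backs the range 
(and is split) depends on the shmem_enabled setting.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: preallocate a 2M shmem file, then punch a 4K hole
 * at the 1M offset. The file size must stay unchanged because
 * FALLOC_FL_PUNCH_HOLE has to be combined with FALLOC_FL_KEEP_SIZE.
 * Returns 0 on success, -1 on error.
 */
int punch_partial_hole(void)
{
    const off_t size = 2 << 20;      /* PMD size on x86-64 (assumption) */
    const off_t hole_off = 1 << 20;
    const off_t hole_len = 4096;     /* one base page (assumption) */
    struct stat st;
    int ret = -1;

    int fd = memfd_create("punch-demo", 0);
    if (fd < 0)
        return -1;

    /* Mode 0 allocates backing memory for the whole range. */
    if (fallocate(fd, 0, 0, size) != 0)
        goto out;

    /* Free only a small part of the range. */
    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  hole_off, hole_len) != 0)
        goto out;

    if (fstat(fd, &st) == 0 && st.st_size == size)
        ret = 0;
out:
    close(fd);
    return ret;
}
```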



-- 
Cheers,

David / dhildenb