linux-kernel - Re: [PATCH v3 0/6] add mTHP support for anonymous shmem

Message-ID: <vlkkfcyumveggkddb6d44f55gtx4qonoiijxyofa63mtmkuofv@uf4nbw3r5ysh>
Date: Fri, 31 May 2024 13:19:06 +0000
From: Daniel Gomez <da.gomez@...sung.com>
To: David Hildenbrand <david@...hat.com>
CC: Baolin Wang <baolin.wang@...ux.alibaba.com>, "akpm@...ux-foundation.org"
	<akpm@...ux-foundation.org>, "hughd@...gle.com" <hughd@...gle.com>,
	"willy@...radead.org" <willy@...radead.org>, "wangkefeng.wang@...wei.com"
	<wangkefeng.wang@...wei.com>, "ying.huang@...el.com" <ying.huang@...el.com>,
	"21cnbao@...il.com" <21cnbao@...il.com>, "ryan.roberts@....com"
	<ryan.roberts@....com>, "shy828301@...il.com" <shy828301@...il.com>,
	"ziy@...dia.com" <ziy@...dia.com>, "ioworker0@...il.com"
	<ioworker0@...il.com>, Pankaj Raghav <p.raghav@...sung.com>,
	"linux-mm@...ck.org" <linux-mm@...ck.org>, "linux-kernel@...r.kernel.org"
	<linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v3 0/6] add mTHP support for anonymous shmem

On Fri, May 31, 2024 at 11:35:30AM +0200, David Hildenbrand wrote:
> On 30.05.24 04:04, Baolin Wang wrote:
> > Anonymous pages have already been supported for multi-size (mTHP) allocation
> > through commit 19eaf44954df, that can allow THP to be configured through the
> > sysfs interface located at '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
> > 
> > However, the anonymous shmem will ignore the anonymous mTHP rule configured
> > through the sysfs interface, and can only use the PMD-mapped THP, that is not
> > reasonable. Many implement anonymous page sharing through mmap(MAP_SHARED |
> > MAP_ANONYMOUS), especially in database usage scenarios, therefore, users expect
> > to apply an unified mTHP strategy for anonymous pages, also including the
> > anonymous shared pages, in order to enjoy the benefits of mTHP. For example,
> > lower latency than PMD-mapped THP, smaller memory bloat than PMD-mapped THP,
> > contiguous PTEs on ARM architecture to reduce TLB miss etc.
> > 
> > The primary strategy is similar to supporting anonymous mTHP. Introduce
> > a new interface '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled',
> > which can have all the same values as the top-level
> > '/sys/kernel/mm/transparent_hugepage/shmem_enabled', with adding a new
> > additional "inherit" option. By default all sizes will be set to "never"
> > except PMD size, which is set to "inherit". This ensures backward compatibility
> > with the anonymous shmem enabled of the top level, meanwhile also allows
> > independent control of anonymous shmem enabled for each mTHP.
> > 
> > Use the page fault latency tool to measure the performance of 1G anonymous shmem
> > with 32 threads on my machine environment with: ARM64 Architecture, 32 cores,
> > 125G memory:
> > base: mm-unstable
> > user-time    sys_time    faults_per_sec_per_cpu     faults_per_sec
> > 0.04s        3.10s         83516.416                  2669684.890
> > 
> > mm-unstable + patchset, anon shmem mTHP disabled
> > user-time    sys_time    faults_per_sec_per_cpu     faults_per_sec
> > 0.02s        3.14s         82936.359                  2630746.027
> > 
> > mm-unstable + patchset, anon shmem 64K mTHP enabled
> > user-time    sys_time    faults_per_sec_per_cpu     faults_per_sec
> > 0.08s        0.31s         678630.231                 17082522.495
> > 
> > From the data above, it is observed that the patchset has a minimal impact when
> > mTHP is not enabled (some fluctuations observed during testing). When enabling 64K
> > mTHP, there is a significant improvement of the page fault latency.
> 
> Let me summarize the takeaway from the bi-weekly MM meeting as I understood
> it, that includes Hugh's feedback on per-block tracking vs. mTHP:

Thanks, David, for the summary. Please find some follow-up questions below.

I want to understand whether zeropage scanning overhead is preferred over
per-block tracking complexity, or whether we still need to verify this.

> 
> (1) Per-block tracking
> 
> Per-block tracking is currently considered unwarranted complexity in
> shmem.c. We should try to get it done without that. For any test cases that
> fail, we should consider if they are actually valid for shmem.

I agree it was unwarranted complexity, but only if the goal is just to fix
lseek(), as we can simply make the test pass by checking whether holes are
reported as data. That would be the minimum required for lseek() to be compliant
with the syscall.
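
For reference, a minimal user-space sketch of the kind of check I mean. It is
not the actual test case: data_extents() is a hypothetical helper, and the 1 MiB
file size and 4 KiB write are arbitrary. It walks a file with SEEK_DATA and
SEEK_HOLE and lists the byte ranges lseek() reports as data. Without per-block
tracking, a large folio backing the file may make the reported data range larger
than what was actually written.

```python
import errno
import os
import tempfile


def data_extents(fd):
    """Return [(start, end), ...] byte ranges that lseek() reports as data."""
    size = os.fstat(fd).st_size
    extents = []
    offset = 0
    while offset < size:
        try:
            start = os.lseek(fd, offset, os.SEEK_DATA)
        except OSError as exc:
            if exc.errno == errno.ENXIO:
                break  # no data at or after this offset
            raise
        # There is always an implicit hole at end of file, so this succeeds.
        end = os.lseek(fd, start, os.SEEK_HOLE)
        extents.append((start, end))
        offset = end
    return extents


if __name__ == "__main__":
    # Create the file under /dev/shm (or any tmpfs mount) to exercise shmem.
    with tempfile.TemporaryFile() as f:
        fd = f.fileno()
        os.ftruncate(fd, 1 << 20)      # 1 MiB sparse file
        os.pwrite(fd, b"x" * 4096, 0)  # only the first 4 KiB is written
        print(data_extents(fd))
```

The reported ranges depend on the filesystem and on the folio size backing the
file, so no expected output is given here.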

How can we use per-block tracking for reclaiming memory and what changes would
be needed? Or is per-block really a non-viable option?

Clearly, if per-block is a viable option, the shmem_fault() bug would need to be
fixed first. Any ideas on how to make it reproducible?

The alternatives discussed were sub-page refcounting and zeropage scanning.
The first one is not possible (IIUC) because of a refactor years back that
simplified the code, and it would also require extra complexity. The second
option would add overhead, as it involves scanning.
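
To make the scanning cost concrete, here is a rough user-space illustration,
not kernel code. zero_pages() is a hypothetical helper and the 4 KiB page size
is an assumption. Finding the sub-pages of a large folio that could be freed
means comparing every page-sized block against zero, so the work grows with the
folio size.

```python
PAGE_SIZE = 4096  # assumed base page size


def zero_pages(buf, page_size=PAGE_SIZE):
    """Return indices of page-sized blocks in buf that contain only zero bytes."""
    view = memoryview(buf)
    zero = bytes(page_size)
    found = []
    for index in range(len(view) // page_size):
        page = view[index * page_size:(index + 1) * page_size]
        if page == zero:  # worst case touches every byte of the block
            found.append(index)
    return found


if __name__ == "__main__":
    folio = bytearray(16 * PAGE_SIZE)  # stands in for a 64K folio
    folio[3 * PAGE_SIZE] = 1           # dirty one byte in the fourth page
    print(zero_pages(folio))
```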

> 
> To optimize FALLOC_FL_PUNCH_HOLE for the cases where splitting+freeing
> is not possible at fallocate() time, detecting zeropages later and
> retrying to split+free might be an option, without per-block tracking.

> 
> (2) mTHP controls
> 
> As a default, we should not be using large folios / mTHP for any shmem, just
> like we did with THP via shmem_enabled. This is what this series currently
> does, and is part of the whole mTHP user-space interface design.

That was clear to me too. But what is the reason we want to boot in 'safe
mode'? What are the implications of not respecting that?

> 
> Further, the mTHP controls should control all of shmem, not only "anonymous
> shmem".

As I understood from the call, mTHP with sysfs knobs is preferred over
optimistic falloc/write allocation? It is still unclear to me why the former
is preferred.

Are large folios a non-viable option?
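
For context, this is how I read the knob semantics from the cover letter: each
per-size shmem_enabled value either stands on its own or, when set to "inherit",
defers to the top-level shmem_enabled. A minimal sketch of that reading follows.
effective_shmem_enabled() and the "64kB"/"2048kB" keys are made up for
illustration, and 2048kB as the PMD size is an assumption.

```python
def effective_shmem_enabled(top_level, per_size):
    """Resolve each per-size setting, replacing "inherit" with the top-level value."""
    resolved = {}
    for size, value in per_size.items():
        resolved[size] = top_level if value == "inherit" else value
    return resolved


if __name__ == "__main__":
    # Defaults as described in the cover letter: every size is "never"
    # except the PMD size, which is "inherit".
    defaults = {"64kB": "never", "2048kB": "inherit"}
    print(effective_shmem_enabled("always", defaults))
```

With these defaults only the PMD size follows the top-level setting, which is
what keeps the existing behaviour unchanged.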

> 
> Also, we should properly fallback within the configured sizes, and not jump
> "over" configured sizes. Unless there is a good reason.
> 
> (3) khugepaged
> 
> khugepaged needs to handle larger folios properly as well. Until fixed,
> using smaller THP sizes as fallback might prohibit collapsing a PMD-sized
> THP later. But really, khugepaged needs to be fixed to handle that.
> 
> (4) force/disable
> 
> These settings are rather testing artifacts from the old ages. We should not
> add them to the per-size toggles. We might "inherit" it from the global one,
> though.
> 
> "within_size" might have value, and especially for consistency, we should
> have them per size.
> 
> 
> 
> So, this series only tackles anonymous shmem, which is a good starting
> point. Ideally, we'd get support for other shmem (especially during fault
> time) soon afterwards, because we won't be adding separate toggles for that
> from the interface POV, and having inconsistent behavior between kernel
> versions would be a bit unfortunate.
> 
> 
> @Baolin, this series likely does not consider (4) yet. And I suggest we have
> to take a lot of the "anonymous thp" terminology out of this series,
> especially when it comes to documentation.
> 
> @Daniel, Pankaj, what are your plans regarding that? It would be great if we
> could get an understanding on the next steps on !anon shmem.

I realize I've raised so many questions, but it's essential for us to grasp the
mm concerns and usage scenarios. This understanding will provide clarity on the
direction regarding folios for !anon shmem.

> 
> -- 
> Cheers,
> 
> David / dhildenb
> 

Daniel