Message-ID: <87o715r4vn.fsf@DESKTOP-5N7EMDA>
Date: Sat, 21 Dec 2024 13:18:04 +0800
From: "Huang, Ying" <ying.huang@...ux.alibaba.com>
To: Gregory Price <gourry@...rry.net>
Cc: linux-mm@...ck.org, linux-kernel@...r.kernel.org,
nehagholkar@...a.com, abhishekd@...a.com, kernel-team@...a.com,
david@...hat.com, nphamcs@...il.com, akpm@...ux-foundation.org,
hannes@...xchg.org, kbusch@...a.com, Feng Tang <feng.tang@...el.com>
Subject: Re: [RFC v2 PATCH 0/5] Promotion of Unmapped Page Cache Folios.
Hi, Gregory,
Thanks for working on this!
Gregory Price <gourry@...rry.net> writes:
> Unmapped page cache pages can be demoted to low-tier memory, but
> they can presently only be promoted under two conditions:
> 1) The page is fully swapped out and re-faulted
> 2) The page becomes mapped (and exposed to NUMA hint faults)
>
> This RFC proposes promoting unmapped page cache pages by using
> folio_mark_accessed as a hotness hint for unmapped pages.
>
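> As a rough sketch of the intended flow - the wrapper and the exact
> checks below are illustrative, not lifted from the patches; the
> promotion_candidate() hook is the one added in patch 5:
>
>   /* From folio_mark_accessed(): defer the folio, never migrate inline. */
>   static inline void maybe_queue_promotion(struct folio *folio)
>   {
>           if (!folio_mapped(folio) &&              /* unmapped page cache */
>               !node_is_toptier(folio_nid(folio)))  /* sitting on a low tier */
>                   promotion_candidate(folio);      /* isolate + queue on the
>                                                       task's promotion list */
>   }
>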
> Patches 1-3
> allow NULL as a valid input to the migration prep interfaces
> for vmf/vma - which are not present for unmapped folios.
> Patch 4
> adds NUMA_HINT_PAGE_CACHE to vmstat
> Patch 5
> adds the promotion mechanism, along with a sysfs
> extension which defaults the behavior to off.
> /sys/kernel/mm/numa/pagecache_promotion_enabled
>
> Functional testing showed that we are able to reclaim some performance
> in canned scenarios (a file gets demoted and becomes hot with
> relatively little contention). See the test/overhead sections below.
>
> v2
> - cleanup first commit to be accurate and take Ying's feedback
> - cleanup NUMA_HINT_ define usage
> - add NUMA_HINT_ type selection macro to keep code clean
> - mild comment updates
>
> Open Questions:
> ======
> 1) Should we also add a limit to how much can be forced onto
> a single task's promotion list at any one time? This might
> piggy-back on the existing TPP promotion limit (256MB?) and
> would simply add something like task->promo_count (a sketch
> follows this list).
>
> Technically we are limited by the batch read-rate before a
> TASK_RESUME occurs.
>
> 2) Should we exempt certain forms of folios, or add additional
> knobs/levers in to deal with things like large folios?
>
> 3) We added NUMA_HINT_PAGE_CACHE to differentiate hint faults
> so we could validate the behavior works as intended. Should
> we just call this a NUMA_HINT_FAULT and not add a new hint?
>
> 4) Benchmark suggestions that can pressure 1TB of memory. This is
> not my typical wheelhouse, so if folks know of a useful
> benchmark that can pressure my 1TB (768 DRAM / 256 CXL) setup,
> I'd like to add additional measurements here.
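>
> For question 1, one possible shape for the cap - assuming a new
> task->promo_count field and a TPP-style 256MB budget; this is a
> sketch of the question, not code from the series:
>
>   #define PROMO_LIST_MAX  ((256UL << 20) >> PAGE_SHIFT)  /* 256MB in pages */
>
>   /* In promotion_candidate(): drop new candidates once the task's
>    * pending list is full; the folio simply stays on the low tier. */
>   if (current->promo_count >= PROMO_LIST_MAX)
>           return;
>   current->promo_count += folio_nr_pages(folio);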
>
> Development Notes
> =================
>
> During development, we explored the following proposals:
>
> 1) directly promoting within folio_mark_accessed (FMA)
> Originally suggested by Johannes Weiner
> https://lore.kernel.org/all/20240803094715.23900-1-gourry@gourry.net/
>
> This caused deadlocks because the PTL was held in a variety of
> cases - in particular during task exit. It is also incredibly
> inflexible and forces promotion-on-fault. It was discussed that
> a deferral mechanism was preferred.
>
>
> 2) promoting in filemap.c locations (calls of FMA)
> Originally proposed by Feng Tang and Ying Huang
> https://git.kernel.org/pub/scm/linux/kernel/git/vishal/tiering.git/patch/?id=5f2e64ce75c0322602c2ec8c70b64bb69b1f1329
>
> First, we saw this as less problematic than directly hooking FMA,
> but we realized it has the potential to miss accesses in a variety
> of locations: swap.c, memory.c, gup.c, ksm.c, paddr.c, etc.
>
> Second, we discovered that the lock state of pages is very subtle,
> and that these locations in filemap.c can be called in an atomic
> context. Prototypes led to a variety of stalls and lockups.
>
>
> 3) a new LRU - originally proposed by Keith Busch
> https://git.kernel.org/pub/scm/linux/kernel/git/kbusch/linux.git/patch/?id=6616afe9a722f6ebedbb27ade3848cf07b9a3af7
>
> There are two issues with this approach: PG_promotable and reclaim.
>
> First - PG_promotable has generally been discouraged.
>
> Second - Attaching this mechanism to an LRU is both backwards and
> counter-intuitive. A promotable list is better served by a MOST
> recently used list, and since LRUs are generally only shrunk when
> exposed to pressure, it would require implementing a new promotion
> list shrinker that runs separately from the existing reclaim logic.
>
>
> 4) Adding a separate kthread - suggested by many
>
> This is - to an extent - a more general version of the LRU proposal.
> We still have to track the folios - which likely requires the
> addition of a page flag. Additionally, this method would actually
> contend pretty heavily with LRU behavior - i.e. we'd want to
> throttle addition to the promotion candidate list in some scenarios.
>
>
> 5) Doing it in task work
>
> This seemed to be the most realistic after considering the above.
>
> We observe the following:
> - FMA is an ideal hook for this and isolation is safe here
> - the new promotion_candidate function is an ideal hook for new
> filter logic (throttling, fairness, etc).
> - isolated folios are either promoted or put back on task resume,
> so there are no additional concurrency mechanics to worry about
> - The mechanism can be made optional via a sysfs hook to avoid
> overhead in degenerate scenarios (thrashing).
>
> We also piggy-backed on the numa_hint_fault_latency timestamp to
> further throttle promotions, to help avoid promoting pages that
> are only accessed once or twice.
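>
> A minimal sketch of the resulting task-work drain - the function
> and the per-task list name are illustrative; the
> migrate_misplaced_folio(folio, NULL, nid) call matches the
> instrumentation shown in the Overhead section below:
>
>   /* Runs via task_work on return to userspace: promote each
>    * isolated folio; migrate_misplaced_folio() handles putback
>    * on failure. */
>   static void promote_pagecache_work(struct callback_head *head)
>   {
>           struct folio *folio, *tmp;
>           int nid = numa_node_id();   /* local (toptier) target node */
>
>           list_for_each_entry_safe(folio, tmp, &current->promo_list, lru) {
>                   list_del_init(&folio->lru);
>                   migrate_misplaced_folio(folio, NULL, nid);
>           }
>   }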
>
>
> Test:
> ======
>
> Environment:
> 1.5-3.7GHz CPU, ~4000 BogoMIPS,
> 1TB Machine with 768GB DRAM and 256GB CXL
> A 64GB file being linearly read by 6-7 Python processes
>
> Goal:
> Generate promotions. Demonstrate stability and measure overhead.
>
> System Settings:
> echo 1 > /sys/kernel/mm/numa/demotion_enabled
> echo 1 > /sys/kernel/mm/numa/pagecache_promotion_enabled
> echo 2 > /proc/sys/kernel/numa_balancing
>
> Each process took up ~128GB, with anonymous memory growing and
> shrinking as python filled and released buffers with the 64GB data.
> This causes DRAM pressure to generate demotions, and file pages to
> "become hot" - and therefore be selected for promotion.
>
> First we ran with promotion disabled to show the consistent overhead
> caused by forcing a file out to CXL memory: we ran a single reader
> to see uncontended performance, launched many readers to force
> demotions, then dropped back to a single reader to observe.
>
> Single-reader DRAM: ~16.0-16.4s
> Single-reader CXL (after demotion): ~16.8-17s
The difference is trivial. This makes me wonder why we need this
patchset.
> Next we turned promotion on with only a single reader running.
>
> Before promotions:
> Node 0 MemFree: 636478112 kB
> Node 0 FilePages: 59009156 kB
> Node 1 MemFree: 250336004 kB
> Node 1 FilePages: 14979628 kB
Why are there so many file pages on node 1 even though there are a lot
of free pages on node 0? Did you move some file pages from node 0 to
node 1?
> After promotions:
> Node 0 MemFree: 632267268 kB
> Node 0 FilePages: 72204968 kB
> Node 1 MemFree: 262567056 kB
> Node 1 FilePages: 2918768 kB
>
> Single-reader (after_promotion): ~16.5s
>
> Turning the promotion mechanism on when nothing had been demoted
> produced no appreciable overhead (memory allocation noise overpowers it).
>
> Read time did not change after turning promotion off once promotions
> had occurred, which implies that the additional overhead is not coming
> from the promotion system itself, but likely from other pages still
> trapped on the low tier. Either way, this at least demonstrates that
> the mechanism is not particularly harmful when there are no pages to
> promote - and that it is valuable when a file actually is quite hot.
>
> Notably, it takes some time for the average read loop to come back
> down, and there still remain unpromoted file pages trapped in the
> page cache. This isn't entirely unexpected: many files may have been
> demoted, and they may not be very hot.
>
>
> Overhead
> ======
> When promotion was turned on, we saw a temporary loop-runtime increase:
>
> before: 16.8s
> during:
> 17.606216192245483
> 17.375206470489502
> 17.722095489501953
> 18.230552434921265
> 18.20712447166443
> 18.008254528045654
> 17.008427381515503
> 16.851454257965088
> 16.715774059295654
> stable: ~16.5s
>
> We measured overhead with a separate patch that simply recorded the
> rdtsc value before/after calls in promotion_candidate and task work.
>
> e.g.:
> +       start = rdtsc();
>         list_for_each_entry_safe(folio, tmp, promo_list, lru) {
>                 list_del_init(&folio->lru);
>                 migrate_misplaced_folio(folio, NULL, nid);
> +               count++;
>         }
> +       atomic_long_add(rdtsc() - start, &promo_time);  /* total cycles */
> +       atomic_long_add(count, &promo_count);           /* total folios */
>
> The first number below is the average TSC cycles per call (time/count):
>
> numa_migrate_prep: 93 - time(3969867917) count(42576860)
> migrate_misplaced_folio_prepare: 491 - time(3433174319) count(6985523)
> migrate_misplaced_folio: 1635 - time(11426529980) count(6985523)
>
> Thoughts on a good throttling heuristic would be appreciated here.
We already have a throttle mechanism. For example, you can use
$ echo 100 > /proc/sys/kernel/numa_balancing_promote_rate_limit_MBps
to limit the promotion throughput to under 100 MB/s for each DRAM
node.
> Suggested-by: Huang Ying <ying.huang@...ux.alibaba.com>
> Suggested-by: Johannes Weiner <hannes@...xchg.org>
> Suggested-by: Keith Busch <kbusch@...a.com>
> Suggested-by: Feng Tang <feng.tang@...el.com>
> Signed-off-by: Gregory Price <gourry@...rry.net>
>
> Gregory Price (5):
> migrate: Allow migrate_misplaced_folio_prepare() to accept a NULL VMA.
> memory: move conditionally defined enums use inside ifdef tags
> memory: allow non-fault migration in numa_migrate_check path
> vmstat: add page-cache numa hints
> migrate,sysfs: add pagecache promotion
>
> .../ABI/testing/sysfs-kernel-mm-numa | 20 ++++++
> include/linux/memory-tiers.h | 2 +
> include/linux/migrate.h | 2 +
> include/linux/sched.h | 3 +
> include/linux/sched/numa_balancing.h | 5 ++
> include/linux/vm_event_item.h | 8 +++
> init/init_task.c | 1 +
> kernel/sched/fair.c | 26 +++++++-
> mm/memory-tiers.c | 27 ++++++++
> mm/memory.c | 32 +++++-----
> mm/mempolicy.c | 25 +++++---
> mm/migrate.c | 61 ++++++++++++++++++-
> mm/swap.c | 3 +
> mm/vmstat.c | 2 +
> 14 files changed, 193 insertions(+), 24 deletions(-)
---
Best Regards,
Huang, Ying