Message-ID: <CAMgjq7BpOqgKoeQEPCL9ai3dvVPv7wJe3k_g1hDjAVeCLpZ=7w@mail.gmail.com>
Date: Wed, 4 Sep 2024 02:24:39 +0800
From: Kairui Song <ryncsn@...il.com>
To: hanchuanhua@...o.com, Usama Arif <usamaarif642@...il.com>
Cc: akpm@...ux-foundation.org, linux-mm@...ck.org,
baolin.wang@...ux.alibaba.com, chrisl@...nel.org, david@...hat.com,
hannes@...xchg.org, hughd@...gle.com, kaleshsingh@...gle.com,
linux-kernel@...r.kernel.org, mhocko@...e.com, minchan@...nel.org,
nphamcs@...il.com, ryan.roberts@....com, senozhatsky@...omium.org,
shakeel.butt@...ux.dev, shy828301@...il.com, surenb@...gle.com,
v-songbaohua@...o.com, willy@...radead.org, xiang@...nel.org,
ying.huang@...el.com, yosryahmed@...gle.com, hch@...radead.org
Subject: Re: [PATCH v7 2/2] mm: support large folios swap-in for sync io devices
Hi All,
On Wed, Aug 21, 2024 at 3:46 PM <hanchuanhua@...o.com> wrote:
>
> From: Chuanhua Han <hanchuanhua@...o.com>
>
> Currently we have mTHP features, but unfortunately without large folio
> swap-in support, once these large folios are swapped out they are lost:
> mTHP swap is a one-way process. This lack of mTHP swap-in prevents mTHP
> from being used on devices such as Android that rely heavily on swap.
>
> This patch introduces mTHP swap-in support, starting with sync devices
> such as zRAM. This is probably the simplest and most common use case,
> benefiting billions of Android phones and similar devices with minimal
> implementation cost. In this straightforward scenario, large folios are
> always exclusive, eliminating the need to handle complex rmap and
> swapcache issues.
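>
> As a condensed sketch (taken from the do_swap_page() hunk below, not
> verbatim), the swapcache-bypass path that this patch extends to large
> folios is only taken when the backing device is synchronous and the
> swap entry is not shared, which is what guarantees exclusivity:
>
>         if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
>             __swap_count(entry) == 1) {
>                 /* skip swapcache: alloc_swap_folio() may return a large folio */
>                 folio = alloc_swap_folio(vmf);
>                 ...
>         }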
>
> It offers several benefits:
> 1. Enables bidirectional mTHP swapping, allowing retrieval of mTHP after
> swap-out and swap-in. Large folios in the buddy system are also
> preserved as much as possible, rather than being fragmented due
> to swap-in.
>
> 2. Eliminates fragmentation in swap slots and supports successful
> THP_SWPOUT.
>
> w/o this patch (data collected on top of Chris's and Kairui's latest
> swap allocator optimization [1], running ./thp_swap_allocator_test
> w/o the "-a" option):
>
> ./thp_swap_allocator_test
> Iteration 1: swpout inc: 233, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 2: swpout inc: 131, swpout fallback inc: 101, Fallback percentage: 43.53%
> Iteration 3: swpout inc: 71, swpout fallback inc: 155, Fallback percentage: 68.58%
> Iteration 4: swpout inc: 55, swpout fallback inc: 168, Fallback percentage: 75.34%
> Iteration 5: swpout inc: 35, swpout fallback inc: 191, Fallback percentage: 84.51%
> Iteration 6: swpout inc: 25, swpout fallback inc: 199, Fallback percentage: 88.84%
> Iteration 7: swpout inc: 23, swpout fallback inc: 205, Fallback percentage: 89.91%
> Iteration 8: swpout inc: 9, swpout fallback inc: 219, Fallback percentage: 96.05%
> Iteration 9: swpout inc: 13, swpout fallback inc: 213, Fallback percentage: 94.25%
> Iteration 10: swpout inc: 12, swpout fallback inc: 216, Fallback percentage: 94.74%
> Iteration 11: swpout inc: 16, swpout fallback inc: 213, Fallback percentage: 93.01%
> Iteration 12: swpout inc: 10, swpout fallback inc: 210, Fallback percentage: 95.45%
> Iteration 13: swpout inc: 16, swpout fallback inc: 212, Fallback percentage: 92.98%
> Iteration 14: swpout inc: 12, swpout fallback inc: 212, Fallback percentage: 94.64%
> Iteration 15: swpout inc: 15, swpout fallback inc: 211, Fallback percentage: 93.36%
> Iteration 16: swpout inc: 15, swpout fallback inc: 200, Fallback percentage: 93.02%
> Iteration 17: swpout inc: 9, swpout fallback inc: 220, Fallback percentage: 96.07%
>
> w/ this patch (always 0%):
> Iteration 1: swpout inc: 948, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 2: swpout inc: 953, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 3: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 4: swpout inc: 952, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 5: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 6: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 7: swpout inc: 947, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 8: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 9: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 10: swpout inc: 945, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 11: swpout inc: 947, swpout fallback inc: 0, Fallback percentage: 0.00%
> ...
>
> 3. With both mTHP swap-out and swap-in supported, we can enable zsmalloc
> compression/decompression at a larger granularity [2]. The upcoming
> optimization in zsmalloc will significantly increase swap speed and improve
> compression efficiency. In a test running 100 iterations of swapping 100MiB
> of anon memory, the swap speed improved dramatically:
> time consumption (ms)      swap-in      swap-out
> lz4  4KiB                    45274         90540
> lz4  64KiB                   22942         55667
> zstd 4KiB                    85035        186585
> zstd 64KiB                   46558        118533
>
> The compression ratio also improved, as evaluated with 1 GiB of data:
> granularity     orig_data_size    compr_data_size
> 4KiB-zstd       1048576000        246876055
> 64KiB-zstd      1048576000        199763892
>
> Without mTHP swap-in, the potential optimizations in zsmalloc cannot be
> realized.
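>
> For reference, the numbers above work out to roughly:
>
>   lz4:  swap-in  45274 ->  22942 ms (~2.0x faster), swap-out  90540 ->  55667 ms (~1.6x)
>   zstd: swap-in  85035 ->  46558 ms (~1.8x faster), swap-out 186585 -> 118533 ms (~1.6x)
>   compressed size: 246876055/1048576000 ~= 23.5% vs 199763892/1048576000 ~= 19.1%
>   of the original, i.e. about 19% less compressed data at 64KiB granularity.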
>
> 4. Even mTHP swap-in itself can reduce swap-in page faults by a factor
> of nr_pages. Swapping in content filled with the repeating byte 0x11,
> w/o and w/ the patch for five rounds (since the content is identical,
> decompression is very fast, so this primarily measures the impact of
> the reduced page-fault count):
>
> swp in bandwidth (bytes/ms)      w/o          w/
> round1                        624152     1127501
> round2                        631672     1127501
> round3                        620459     1139756
> round4                        606113     1139756
> round5                        624152     1152281
> avg                           621310     1137359    +83%
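>
> (For reference, 1137359 / 621310 ~= 1.83, hence the +83% above; the
> page-fault reduction itself is a factor of nr_pages, e.g. 16 for a
> 64KiB mTHP with 4KiB base pages.)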
>
> [1] https://lore.kernel.org/all/20240730-swap-allocator-v5-0-cb9c148b9297@kernel.org/
> [2] https://lore.kernel.org/all/20240327214816.31191-1-21cnbao@gmail.com/
>
> Signed-off-by: Chuanhua Han <hanchuanhua@...o.com>
> Co-developed-by: Barry Song <v-songbaohua@...o.com>
> Signed-off-by: Barry Song <v-songbaohua@...o.com>
> ---
> mm/memory.c | 250 ++++++++++++++++++++++++++++++++++++++++++++++------
> 1 file changed, 223 insertions(+), 27 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index b9fe2f354878..7aa0358a4160 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3986,6 +3986,184 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
> return VM_FAULT_SIGBUS;
> }
>
> +static struct folio *__alloc_swap_folio(struct vm_fault *vmf)
> +{
> + struct vm_area_struct *vma = vmf->vma;
> + struct folio *folio;
> + swp_entry_t entry;
> +
> + folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma,
> + vmf->address, false);
> + if (!folio)
> + return NULL;
> +
> + entry = pte_to_swp_entry(vmf->orig_pte);
> + if (mem_cgroup_swapin_charge_folio(folio, vma->vm_mm,
> + GFP_KERNEL, entry)) {
> + folio_put(folio);
> + return NULL;
> + }
> +
> + return folio;
> +}
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +/*
> + * Check if the PTEs within a range are contiguous swap entries and,
> + * when check_no_cache is true, that none of them is in the swapcache.
> + */
> +static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep,
> + int nr_pages, bool check_no_cache)
> +{
> + struct swap_info_struct *si;
> + unsigned long addr;
> + swp_entry_t entry;
> + pgoff_t offset;
> + int idx, i;
> + pte_t pte;
> +
> + addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
> + idx = (vmf->address - addr) / PAGE_SIZE;
> + pte = ptep_get(ptep);
> +
> + if (!pte_same(pte, pte_move_swp_offset(vmf->orig_pte, -idx)))
> + return false;
> + entry = pte_to_swp_entry(pte);
> + offset = swp_offset(entry);
> + if (swap_pte_batch(ptep, nr_pages, pte) != nr_pages)
> + return false;
> +
> + if (!check_no_cache)
> + return true;
> +
> + si = swp_swap_info(entry);
> + /*
> + * When allocating a large folio and reading it via swap_read_folio()
> + * directly (i.e. the faulting pte has no swapcache), we must ensure
> + * none of the other PTEs has a swapcache entry either; otherwise we
> + * might read from the swap device while the content is in swapcache.
> + */
> + for (i = 0; i < nr_pages; i++) {
> + if ((si->swap_map[offset + i] & SWAP_HAS_CACHE))
> + return false;
> + }
> +
> + return true;
> +}
> +
> +static inline unsigned long thp_swap_suitable_orders(pgoff_t swp_offset,
> + unsigned long addr,
> + unsigned long orders)
> +{
> + int order, nr;
> +
> + order = highest_order(orders);
> +
> + /*
> + * To swap in a THP with nr pages, we require that its first swap_offset
> + * is aligned with that number, as it was when the THP was swapped out.
> + * This helps filter out most invalid entries.
> + */
> + while (orders) {
> + nr = 1 << order;
> + if ((addr >> PAGE_SHIFT) % nr == swp_offset % nr)
> + break;
> + order = next_order(&orders, order);
> + }
> +
> + return orders;
> +}
> +
> +static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> +{
> + struct vm_area_struct *vma = vmf->vma;
> + unsigned long orders;
> + struct folio *folio;
> + unsigned long addr;
> + swp_entry_t entry;
> + spinlock_t *ptl;
> + pte_t *pte;
> + gfp_t gfp;
> + int order;
> +
> + /*
> + * If uffd is active for the vma we need per-page fault fidelity to
> + * maintain the uffd semantics.
> + */
> + if (unlikely(userfaultfd_armed(vma)))
> + goto fallback;
> +
> + /*
> + * A large swapped-out folio could be partially or fully in zswap. We
> + * lack handling for such cases, so fall back to swapping in an
> + * order-0 folio.
> + */
> + if (!zswap_never_enabled())
> + goto fallback;
> +
> + entry = pte_to_swp_entry(vmf->orig_pte);
> + /*
> + * Get a list of all the (large) orders below PMD_ORDER that are enabled
> + * and suitable for swapping THP.
> + */
> + orders = thp_vma_allowable_orders(vma, vma->vm_flags,
> + TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1);
> + orders = thp_vma_suitable_orders(vma, vmf->address, orders);
> + orders = thp_swap_suitable_orders(swp_offset(entry),
> + vmf->address, orders);
> +
> + if (!orders)
> + goto fallback;
> +
> + pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
> + vmf->address & PMD_MASK, &ptl);
> + if (unlikely(!pte))
> + goto fallback;
> +
> + /*
> + * For do_swap_page, find the highest order for which the aligned range
> + * consists entirely of swap entries with contiguous swap offsets.
> + */
> + order = highest_order(orders);
> + while (orders) {
> + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> + if (can_swapin_thp(vmf, pte + pte_index(addr), 1 << order, true))
> + break;
> + order = next_order(&orders, order);
> + }
> +
> + pte_unmap_unlock(pte, ptl);
> +
> + /* Try allocating the highest of the remaining orders. */
> + gfp = vma_thp_gfp_mask(vma);
> + while (orders) {
> + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> + folio = vma_alloc_folio(gfp, order, vma, addr, true);
> + if (folio) {
> + if (!mem_cgroup_swapin_charge_folio(folio, vma->vm_mm,
> + gfp, entry))
> + return folio;
> + folio_put(folio);
> + }
> + order = next_order(&orders, order);
> + }
> +
> +fallback:
> + return __alloc_swap_folio(vmf);
> +}
> +#else /* !CONFIG_TRANSPARENT_HUGEPAGE */
> +static inline bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep,
> + int nr_pages, bool check_no_cache)
> +{
> + return false;
> +}
> +
> +static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> +{
> + return __alloc_swap_folio(vmf);
> +}
> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> +
> /*
> * We enter with non-exclusive mmap_lock (to exclude vma changes,
> * but allow concurrent faults), and pte mapped but not yet locked.
> @@ -4074,34 +4252,34 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> if (!folio) {
> if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> __swap_count(entry) == 1) {
> - /*
> - * Prevent parallel swapin from proceeding with
> - * the cache flag. Otherwise, another thread may
> - * finish swapin first, free the entry, and swapout
> - * reusing the same entry. It's undetectable as
> - * pte_same() returns true due to entry reuse.
> - */
> - if (swapcache_prepare(entry, 1)) {
> - /* Relax a bit to prevent rapid repeated page faults */
> - schedule_timeout_uninterruptible(1);
> - goto out;
> - }
> - need_clear_cache = true;
> -
> /* skip swapcache */
> - folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> - vma, vmf->address, false);
> + folio = alloc_swap_folio(vmf);
> if (folio) {
> __folio_set_locked(folio);
> __folio_set_swapbacked(folio);
>
> - if (mem_cgroup_swapin_charge_folio(folio,
> - vma->vm_mm, GFP_KERNEL,
> - entry)) {
> - ret = VM_FAULT_OOM;
> + nr_pages = folio_nr_pages(folio);
> + if (folio_test_large(folio))
> + entry.val = ALIGN_DOWN(entry.val, nr_pages);
> + /*
> + * Prevent parallel swapin from proceeding with
> + * the cache flag. Otherwise, another thread
> + * may finish swapin first, free the entry, and
> + * swapout reusing the same entry. It's
> + * undetectable as pte_same() returns true due
> + * to entry reuse.
> + */
> + if (swapcache_prepare(entry, nr_pages)) {
> + /*
> + * Relax a bit to prevent rapid
> + * repeated page faults.
> + */
> + schedule_timeout_uninterruptible(1);
> goto out_page;
> }
> - mem_cgroup_swapin_uncharge_swap(entry, 1);
> + need_clear_cache = true;
> +
> + mem_cgroup_swapin_uncharge_swap(entry, nr_pages);
>
> shadow = get_shadow_from_swap_cache(entry);
> if (shadow)
> @@ -4207,6 +4385,23 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> goto out_nomap;
> }
>
> + /* large folio allocated for SWP_SYNCHRONOUS_IO (not in swapcache) */
> + if (folio_test_large(folio) && !folio_test_swapcache(folio)) {
> + unsigned long nr = folio_nr_pages(folio);
> + unsigned long folio_start = ALIGN_DOWN(vmf->address,
> + nr * PAGE_SIZE);
> + unsigned long idx = (vmf->address - folio_start) / PAGE_SIZE;
> + pte_t *folio_ptep = vmf->pte - idx;
> +
> + if (!can_swapin_thp(vmf, folio_ptep, nr, false))
> + goto out_nomap;
> +
> + page_idx = idx;
> + address = folio_start;
> + ptep = folio_ptep;
> + goto check_folio;
> + }
> +
> nr_pages = 1;
> page_idx = 0;
> address = vmf->address;
> @@ -4338,11 +4533,12 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> folio_add_lru_vma(folio, vma);
> } else if (!folio_test_anon(folio)) {
> /*
> - * We currently only expect small !anon folios, which are either
> - * fully exclusive or fully shared. If we ever get large folios
> - * here, we have to be careful.
> + * We currently only expect small !anon folios which are either
> + * fully exclusive or fully shared, or newly allocated large
> + * folios which are fully exclusive. If we ever get large
> + * folios within swapcache here, we have to be careful.
> */
> - VM_WARN_ON_ONCE(folio_test_large(folio));
> + VM_WARN_ON_ONCE(folio_test_large(folio) && folio_test_swapcache(folio));
> VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
> folio_add_new_anon_rmap(folio, vma, address, rmap_flags);
> } else {
> @@ -4385,7 +4581,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> out:
> /* Clear the swap cache pin for direct swapin after PTL unlock */
> if (need_clear_cache)
> - swapcache_clear(si, entry, 1);
> + swapcache_clear(si, entry, nr_pages);
> if (si)
> put_swap_device(si);
> return ret;
> @@ -4401,7 +4597,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> folio_put(swapcache);
> }
> if (need_clear_cache)
> - swapcache_clear(si, entry, 1);
> + swapcache_clear(si, entry, nr_pages);
> if (si)
> put_swap_device(si);
> return ret;
> --
> 2.43.0
>
With the latest mm-unstable, I'm seeing the following WARN followed by
userspace segfaults (with multiple mTHP sizes enabled):
[ 39.145686] ------------[ cut here ]------------
[ 39.145969] WARNING: CPU: 24 PID: 11159 at mm/page_io.c:535
swap_read_folio+0x4db/0x520
[ 39.146307] Modules linked in:
[ 39.146507] CPU: 24 UID: 1000 PID: 11159 Comm: sh Kdump: loaded Not
tainted 6.11.0-rc6.orig+ #131
[ 39.146887] Hardware name: Tencent Cloud CVM, BIOS
seabios-1.9.1-qemu-project.org 04/01/2014
[ 39.147206] RIP: 0010:swap_read_folio+0x4db/0x520
[ 39.147430] Code: 00 e0 ff ff 09 c1 83 f8 08 0f 42 d1 e9 c4 fe ff
ff 48 63 85 34 02 00 00 48 03 45 08 49 39 c4 0f 85 63 fe ff ff e9 db
fe ff ff <0f> 0b e9 91 fd ff ff 31 d2 e9 9d fe ff ff 48 c7 c6 38 b6 4e
82 48
[ 39.148079] RSP: 0000:ffffc900045c3ce0 EFLAGS: 00010202
[ 39.148390] RAX: 0017ffffd0020061 RBX: ffffea00064d4c00 RCX: 03ffffffffffffff
[ 39.148737] RDX: ffffea00064d4c00 RSI: 0000000000000000 RDI: ffffea00064d4c00
[ 39.149102] RBP: 0000000000000001 R08: ffffea00064d4c00 R09: 0000000000000078
[ 39.149482] R10: 00000000000000f0 R11: 0000000000000004 R12: 0000000000001000
[ 39.149832] R13: ffff888102df5c00 R14: ffff888102df5c00 R15: 0000000000000003
[ 39.150177] FS: 00007f51a56c9540(0000) GS:ffff888fffc00000(0000)
knlGS:0000000000000000
[ 39.150623] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 39.150950] CR2: 000055627b13fda0 CR3: 00000001083e2000 CR4: 00000000003506b0
[ 39.151317] Call Trace:
[ 39.151565] <TASK>
[ 39.151778] ? __warn+0x84/0x130
[ 39.152044] ? swap_read_folio+0x4db/0x520
[ 39.152345] ? report_bug+0xfc/0x1e0
[ 39.152614] ? handle_bug+0x3f/0x70
[ 39.152891] ? exc_invalid_op+0x17/0x70
[ 39.153178] ? asm_exc_invalid_op+0x1a/0x20
[ 39.153467] ? swap_read_folio+0x4db/0x520
[ 39.153753] do_swap_page+0xc6d/0x14f0
[ 39.154054] ? srso_return_thunk+0x5/0x5f
[ 39.154361] __handle_mm_fault+0x758/0x850
[ 39.154645] handle_mm_fault+0x134/0x340
[ 39.154945] do_user_addr_fault+0x2e5/0x760
[ 39.155245] exc_page_fault+0x6a/0x140
[ 39.155546] asm_exc_page_fault+0x26/0x30
[ 39.155847] RIP: 0033:0x55627b071446
[ 39.156124] Code: f6 7e 19 83 e3 01 74 14 41 83 ee 01 44 89 35 25
72 0c 00 45 85 ed 0f 88 73 02 00 00 8b 05 ea 74 0c 00 85 c0 0f 85 da
03 00 00 <44> 8b 15 53 e9 0c 00 45 85 d2 74 2e 44 8b 0d 37 e3 0c 00 45
85 c9
[ 39.156944] RSP: 002b:00007ffd619d54f0 EFLAGS: 00010246
[ 39.157237] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00007f51a44f968b
[ 39.157594] RDX: 0000000000000000 RSI: 00007ffd619d5518 RDI: 00000000ffffffff
[ 39.157954] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000007
[ 39.158288] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001
[ 39.158634] R13: 0000000000002b9a R14: 0000000000000000 R15: 00007ffd619d5518
[ 39.158998] </TASK>
[ 39.159226] ---[ end trace 0000000000000000 ]---
After reverting either this patch or Usama's "mm: store zero pages to be
swapped out in a bitmap", the problem is gone. I think these two patches
may have a conflict that needs to be resolved.