Message-ID: <CAMgjq7AxYyYTZ9NQ_3eJaxTis0CFAr5Zpi2oz3FPFYoWhDp00g@mail.gmail.com>
Date: Thu, 24 Oct 2024 11:51:48 +0800
From: Kairui Song <ryncsn@...il.com>
To: "Huang, Ying" <ying.huang@...el.com>
Cc: linux-mm@...ck.org, Andrew Morton <akpm@...ux-foundation.org>,
Chris Li <chrisl@...nel.org>, Barry Song <v-songbaohua@...o.com>,
Ryan Roberts <ryan.roberts@....com>, Hugh Dickins <hughd@...gle.com>,
Yosry Ahmed <yosryahmed@...gle.com>, Tim Chen <tim.c.chen@...ux.intel.com>,
Nhat Pham <nphamcs@...il.com>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 00/13] mm, swap: rework of swap allocator locks
On Thu, Oct 24, 2024 at 11:08 AM Huang, Ying <ying.huang@...el.com> wrote:
>
> Kairui Song <ryncsn@...il.com> writes:
>
> > On Wed, Oct 23, 2024 at 10:27 AM Huang, Ying <ying.huang@...el.com> wrote:
> >>
> >> Hi, Kairui,
> >
> > Hi Ying,
> >
> >>
> >> Kairui Song <ryncsn@...il.com> writes:
> >>
> >> > From: Kairui Song <kasong@...cent.com>
> >> >
> >> > This series greatly improves swap allocator performance by reworking
> >> > the locking design and simplifying a lot of code paths.
> >> >
> >> > This is a follow-up to the previous swap cluster allocator series:
> >> > https://lore.kernel.org/linux-mm/20240730-swap-allocator-v5-0-cb9c148b9297@kernel.org/
> >> >
> >> > And this series is based on a follow-up fix for the swap cluster
> >> > allocator:
> >> > https://lore.kernel.org/linux-mm/20241022175512.10398-1-ryncsn@gmail.com/
> >> >
> >> > This is part of the new swap allocator work item discussed in
> >> > Chris's "Swap Abstraction" discussion at LSF/MM 2024, and
> >> > "mTHP and swap allocator" discussion at LPC 2024.
> >> >
> >> > The previous series introduced a fully cluster-based allocation
> >> > algorithm; this series completely gets rid of the old allocation path
> >> > and makes the allocator avoid grabbing si->lock unless needed. This
> >> > brings a huge performance gain and removes the slot cache on the
> >> > freeing path.
> >>
> >> Great!
> >>
> >> > Currently, swap locking is mainly composed of two locks: the cluster
> >> > lock (ci->lock) and the device lock (si->lock). The device lock is
> >> > widely used to protect many things, making it the main bottleneck for
> >> > SWAP.
> >>
> >> The term "device lock" can be confused with the device lock of struct
> >> device. Better to call it the swap device lock?
> >
> > Good idea, I'll use the term swap device lock then.
> >
> >>
> >> > The cluster lock is much more fine-grained, so it is best to use
> >> > ci->lock instead of si->lock as much as possible.
> >> >
> >> > `perf lock` shows this issue clearly. During a Linux kernel build
> >> > using tmpfs and ZRAM with limited memory (make -j64 with a 1G memcg
> >> > and 4K pages), "perf lock contention -ab sleep 3" reports:
> >> >
> >> >  contended   total wait     max wait     avg wait         type   caller
> >> >
> >> >      34948      53.63 s      7.11 ms      1.53 ms     spinlock   free_swap_and_cache_nr+0x350
> >> >      16569      40.05 s      6.45 ms      2.42 ms     spinlock   get_swap_pages+0x231
> >> >      11191      28.41 s      7.03 ms      2.54 ms     spinlock   swapcache_free_entries+0x59
> >> >       4147      22.78 s    122.66 ms      5.49 ms     spinlock   page_vma_mapped_walk+0x6f3
> >> >       4595       7.17 s      6.79 ms      1.56 ms     spinlock   swapcache_free_entries+0x59
> >> >     406027       2.74 s      2.59 ms      6.74 us     spinlock   list_lru_add+0x39
> >> >  ...snip...
> >> >
> >> > The top 5 callers are all users of si->lock; their total wait time
> >> > sums to several minutes within the 3-second window.
> >>
> >> Can you show results of `perf record -g` / `perf report -g` too? I am
> >> interested in checking the hot spot shift as well.
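> >>
> >> For example (just a suggestion; adjust the options as needed):
> >>
> >>   $ perf record -a -g -- sleep 3
> >>   $ perf report -g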
> >
> > Sure. I think the `perf lock` result is already good enough and cleaner.
> > My test environments are mostly VM based, so the spinlock slow path may
> > get offloaded to the host and can't be seen by perf record. I collected
> > the following data after disabling paravirt spinlocks:
> >
> > The time consumption and stack trace of a page fault before:
> > - 78.45% 0.17% cc1 [kernel.kallsyms] [k] asm_exc_page_fault
> >   - 78.28% asm_exc_page_fault
> >     - 78.18% exc_page_fault
> >       - 78.17% do_user_addr_fault
> >         - 78.09% handle_mm_fault
> >           - 78.06% __handle_mm_fault
> >             - 69.69% do_swap_page
> >               - 55.87% alloc_swap_folio
> >                 - 55.60% mem_cgroup_swapin_charge_folio
> >                   - 55.48% charge_memcg
> >                     - 55.45% try_charge_memcg
> >                       - 55.36% try_to_free_mem_cgroup_pages
> >                         - do_try_to_free_pages
> >                           - 55.35% shrink_node
> >                             - 55.27% shrink_lruvec
> >                               - 55.13% try_to_shrink_lruvec
> >                                 - 54.79% evict_folios
> >                                   - 54.35% shrink_folio_list
> >                                     - 30.01% add_to_swap
> >                                       - 29.77% folio_alloc_swap
> >                                         - 29.50% get_swap_pages
> >                                              25.03% queued_spin_lock_slowpath
> >                                           - 2.71% alloc_swap_scan_cluster
> >                                                1.80% queued_spin_lock_slowpath
> >                                             + 0.89% __try_to_reclaim_swap
> >                                           - 1.74% swap_reclaim_full_clusters
> >                                                1.74% queued_spin_lock_slowpath
> >                                     - 10.88% try_to_unmap_flush_dirty
> >                                       - 10.87% arch_tlbbatch_flush
> >                                         - 10.85% on_each_cpu_cond_mask
> >                                              smp_call_function_many_cond
> >                                     + 7.45% pageout
> >                                     + 2.71% try_to_unmap_flush
> >                                     + 1.90% try_to_unmap
> >                                     + 0.78% folio_referenced
> >               - 9.41% cluster_swap_free_nr
> >                 - 9.39% free_swap_slot
> >                   - 9.35% swapcache_free_entries
> >                        8.40% queued_spin_lock_slowpath
> >                        0.93% swap_entry_range_free
> >               - 3.61% swap_read_folio_bdev_sync
> >                 - 3.55% submit_bio_wait
> >                   - 3.51% submit_bio_noacct_nocheck
> >                     + 3.46% __submit_bio
> >             + 7.71% do_pte_missing
> >             + 0.61% wp_page_copy
> >
> > The queued_spin_lock_slowpath above is contention on si->lock, and there
> > are multiple users of it, so the total overhead is higher than shown.
> >
> > After:
> > - 75.05% 0.43% cc1 [kernel.kallsyms] [k] asm_exc_page_fault
> >   - 74.62% asm_exc_page_fault
> >     - 74.36% exc_page_fault
> >       - 74.34% do_user_addr_fault
> >         - 74.10% handle_mm_fault
> >           - 73.96% __handle_mm_fault
> >             - 67.55% do_swap_page
> >               - 45.92% alloc_swap_folio
> >                 - 45.03% mem_cgroup_swapin_charge_folio
> >                   - 44.58% charge_memcg
> >                     - 44.44% try_charge_memcg
> >                       - 44.12% try_to_free_mem_cgroup_pages
> >                         - do_try_to_free_pages
> >                           - 44.10% shrink_node
> >                             - 43.86% shrink_lruvec
> >                               - 41.92% try_to_shrink_lruvec
> >                                 - 40.67% evict_folios
> >                                   - 37.12% shrink_folio_list
> >                                     - 20.88% pageout
> >                                       + 20.02% swap_writepage
> >                                       + 0.72% shmem_writepage
> >                                     - 4.08% add_to_swap
> >                                       - 2.48% folio_alloc_swap
> >                                         - 2.12% __mem_cgroup_try_charge_swap
> >                                           - 1.47% swap_cgroup_record
> >                                             + 1.32% _raw_spin_lock_irqsave
> >                                       - 1.56% add_to_swap_cache
> >                                         - 1.04% xas_store
> >                                           + 1.01% workingset_update_node
> >                                     + 3.97% try_to_unmap_flush_dirty
> >                                     + 3.51% folio_referenced
> >                                     + 2.24% __remove_mapping
> >                                     + 1.16% try_to_unmap
> >                                     + 0.52% try_to_unmap_flush
> >                                        2.50% queued_spin_lock_slowpath
> >                                        0.79% scan_folios
> >                                   + 1.20% try_to_inc_max_seq
> >                                 + 1.92% lru_add_drain
> >                 + 0.73% vma_alloc_folio_noprof
> >               - 9.81% swap_read_folio_bdev_sync
> >                 - 9.61% submit_bio_wait
> >                   + 9.49% submit_bio_noacct_nocheck
> >               - 8.06% cluster_swap_free_nr
> >                 - 8.02% swap_entry_range_free
> >                   + 3.92% __mem_cgroup_uncharge_swap
> >                   + 2.90% zram_slot_free_notify
> >                      0.58% clear_shadow_from_swap_cache
> >               - 1.32% __folio_batch_add_and_move
> >                 - 1.30% folio_batch_move_lru
> >                   + 1.10% folio_lruvec_lock_irqsave
>
> Thanks for data.
>
> It seems that the cycles shift from spinning to memory compression.
> That is expected.
>
> > spin_lock usage is much lower.
> >
> > I prefer the perf lock output as it shows the exact time and users of locks.
>
> perf cycles data is more complete. You can find which part becomes the
> new hot spot.
>
> >>
> >> > Following the new allocator design, many operations don't need to
> >> > touch si->lock at all. We only need to take si->lock when doing
> >> > operations across multiple clusters (e.g. changing the cluster list);
> >> > other operations only need to take ci->lock. So ideally the allocator
> >> > should always take ci->lock first and then, if needed, take si->lock.
> >> > But due to historical reasons, ci->lock is used inside si->lock by
> >> > design, causing lock inversion if we simply try to acquire si->lock
> >> > after acquiring ci->lock.
> >> >
> >> > This series audits all si->lock usage, simplifies legacy code, and
> >> > eliminates si->lock usage as much as possible by introducing new
> >> > designs based on the new cluster allocator.
> >> >
> >> > The old HDD allocation code is removed, the cluster allocator is
> >> > adapted for HDD usage with small changes, and testing looks OK.
> >>
> >> I think that it's a good idea to remove the HDD-specific allocation
> >> code. Can you check the performance of swapping to HDD? However, I
> >> understand that many people have no HDD at hand.
> >
> > It's not hard to make the cluster allocator work well with HDDs in
> > theory; see the commit "mm, swap: use a global swap cluster for
> > non-rotation device".
> > The testing is not very reliable though: I found HDD swap performance is
> > very unstable because of the IO pattern of HDDs, so it's just a
> > best-effort try.
>
> Just to check whether the code change causes something too bad for HDD.
> No measurable difference is good news.
>
> >> > This also removes the slot cache on the freeing path. The performance
> >> > is better without it, and this enables other cleanups and
> >> > optimizations as discussed before:
> >> > https://lore.kernel.org/all/CAMgjq7ACohT_uerSz8E_994ZZCv709Zor+43hdmesW_59W1BWw@mail.gmail.com/
> >> >
> >> > After this series, lock contention on si->lock is nearly unobservable
> >> > with `perf lock` in the same test above:
> >> >
> >> >  contended   total wait     max wait     avg wait         type   caller
> >> >  ... snip ...
> >> >         91    204.62 us      4.51 us      2.25 us     spinlock   cluster_move+0x2e
> >> >  ... snip ...
> >> >         47    125.62 us      4.47 us      2.67 us     spinlock   cluster_move+0x2e
> >> >  ... snip ...
> >> >         23     63.15 us      3.95 us      2.74 us     spinlock   cluster_move+0x2e
> >> >  ... snip ...
> >> >         17     41.26 us      4.58 us      2.43 us     spinlock   cluster_isolate_lock+0x1d
> >> >  ... snip ...
> >> >
> >> > cluster_move and cluster_isolate_lock are basically the only users of
> >> > si->lock now; the performance gain is huge with reduced LOC.
> >> >
> >> > Tests
> >> > ===
> >> >
> >> > Build kernel with defconfig on tmpfs with ZRAM as swap:
> >> > ---
> >> >
> >> > Running a test matrix which is scaled up progressively for an
> >> > intuitive result. The tests are run on top of tmpfs, using a memory
> >> > cgroup for memory limitation, on a 48c96t system.
> >> >
> >> > 12 test runs for each case. It can be seen clearly that the higher
> >> > the concurrent job number, the larger the performance gain, and the
> >> > performance is better even with low concurrency (a rough setup sketch
> >> > follows the table).
> >> >
> >> > make -j<NR> | System Time (seconds) | Total Time (seconds)
> >> > (NR / Mem / ZRAM) | (Before / After / Delta) | (Before / After / Delta)
> >> > With 4k pages only:
> >> > 6 / 192M / 3G | 5258 / 5235 / -0.3% | 1420 / 1414 / -0.3%
> >> > 12 / 256M / 4G | 5518 / 5337 / -3.3% | 758 / 742 / -2.1%
> >> > 24 / 384M / 5G | 7091 / 5766 / -18.7% | 476 / 422 / -11.3%
> >> > 48 / 768M / 7G | 11139 / 5831 / -47.7% | 330 / 221 / -33.0%
> >> > 96 / 1.5G / 10G | 21303 / 11353 / -46.7% | 283 / 180 / -36.4%
> >> > With 64k mTHP:
> >> > 24 / 512M / 5G | 5104 / 4641 / -9.1% | 376 / 358 / -4.8%
> >> > 48 / 1G / 7G | 8693 / 4662 / -46.4% | 257 / 176 / -31.5%
> >> > 96 / 2G / 10G | 17056 / 10263 / -39.8% | 234 / 169 / -27.8%
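> >> >
> >> > The setup is roughly like the following sketch (device paths, the ZRAM
> >> > size and the memcg limit are illustrative; the actual sizes follow the
> >> > matrix above):
> >> >
> >> >   $ modprobe zram && echo 10G > /sys/block/zram0/disksize
> >> >   $ mkswap /dev/zram0 && swapon /dev/zram0
> >> >   $ mount -t tmpfs tmpfs /mnt/build && cp -r linux /mnt/build/
> >> >   $ mkdir /sys/fs/cgroup/build
> >> >   $ echo 1536M > /sys/fs/cgroup/build/memory.max
> >> >   $ echo $$ > /sys/fs/cgroup/build/cgroup.procs
> >> >   $ cd /mnt/build/linux && make defconfig && make -j96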
> >>
> >> How much is the swap in/out throughput before/after the change?
> >
> > This may not be too beneficial for typical throughput measurement:
> > - For example, doing the same test with brd only shows a ~20%
> > performance improvement, still a big gain though. I think the si->lock
> > spinlock wasting CPU cycles may affect CPU-sensitive things like ZRAM
> > even more.
>
> 20% is good data. You don't need to guess. perf cycles profiling can
> show the hot spot.
>
> > - And simple benchmarks which just do multiple sequential swap in/outs
> > in multiple threads hardly stress the allocator.
> >
> > I haven't found a good benchmark to simulate random parallel IO on SWAP
> > yet; I can write one later.
>
> I have used the anon-w-rand test case of vm-scalability to simulate
> random parallel swap-out.
>
> https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-anon-w-rand
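>
> I believe it can be driven via the run script in that repo, roughly like
> the following (invocation from memory, so treat it as an assumption):
>
>   $ ./run case-anon-w-rand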
>
> > Benchmarks closer to real-world workloads, like the kernel build test or
> > mysql/sysbench, all showed great improvement.
>
> Yes. Real workloads are good. We can use micro-benchmarks to find out
> some performance limits, for example, the max possible throughput.
>
> >>
> >> When I worked on swap in/out performance before, the hot spot shifted
> >> from swap-related code to the LRU lock and zone lock. Things may have
> >> changed a lot now.
> >>
> >> If zram is used as the swap device, the hot spot may become
> >> compression/decompression after solving the swap lock contention. To
> >> stress the swap subsystem further, we may use a ram disk as swap.
> >> Previously, we used a simulated pmem device (backed by DRAM). That can
> >> be set up as described in:
> >>
> >> https://pmem.io/blog/2016/02/how-to-emulate-persistent-memory/
> >>
> >> After creating the raw block device: /dev/pmem0, we can do
> >>
> >> $ mkswap /dev/pmem0
> >> $ swapon /dev/pmem0
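> >>
> >> The pmem0 device itself comes from reserving a chunk of RAM with the
> >> memmap= kernel parameter described in that guide, for example (the size
> >> and offset here are only an example):
> >>
> >>   memmap=16G!16G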
> >>
> >> Can you use something similar if necessary?
> >
> > I used to test with brd, as described above,
>
> brd allocates memory while running; pmem can avoid that. perf profiling
> is your friend to root-cause possible issues.
>
> > I think using ZRAM with tests simulating real workloads is more useful.
>
> Yes. And, as I said before, micro-benchmarks have their own value.
Hi Ying,
Thank you very much for the suggestion. I didn't mean I'm against
micro-benchmarks in any way; it's just that a lot of effort was spent on
other tests, so I skipped that part for V1.

As you mentioned vm-scalability, I think it is definitely a good idea to
include that test with the pmem simulation.

There are still some other bottlenecks in SWAP besides compression and
page fault / TLB, mostly the cgroup lock and list_lru locks. I have some
ideas to optimize these too; those could be next steps.
> > And I did include a sequential swap test; the result looks OK (no
> > regression, minor to no improvement).
>
> Good. At least we have no regression here.
>
> --
> Best Regards,
> Huang, Ying