Message-ID: <CAKEwX=PYv4Cn_a7WSnbUyhwSBO61xoDPSexXxS0uwwxu5P6XLw@mail.gmail.com>
Date: Mon, 26 Aug 2024 07:12:28 -0700
From: Nhat Pham <nphamcs@...il.com>
To: "Sridhar, Kanchana P" <kanchana.p.sridhar@...el.com>
Cc: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, "linux-mm@...ck.org" <linux-mm@...ck.org>,
"hannes@...xchg.org" <hannes@...xchg.org>, "yosryahmed@...gle.com" <yosryahmed@...gle.com>,
"ryan.roberts@....com" <ryan.roberts@....com>, "Huang, Ying" <ying.huang@...el.com>,
"21cnbao@...il.com" <21cnbao@...il.com>, "akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
"Zou, Nanhai" <nanhai.zou@...el.com>, "Feghali, Wajdi K" <wajdi.k.feghali@...el.com>,
"Gopal, Vinodh" <vinodh.gopal@...el.com>
Subject: Re: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
On Fri, Aug 23, 2024 at 11:21 PM Sridhar, Kanchana P
<kanchana.p.sridhar@...el.com> wrote:
>
> Hi Nhat,
>
>
> I started out with 2 main hypotheses to explain why zswap incurs more
> reclaim wrt SSD:
>
> 1) The cgroup zswap charge, which hastens the memory.high limit being
> breached and adds to the reclaim being triggered in
> mem_cgroup_handle_over_high().
>
> 2) Does a faster reclaim path somehow cause fewer allocation stalls, thereby
> causing more breaches of memory.high, hence more reclaim -- and does this
> cycle repeat, potentially leading to higher swapout activity with zswap?
By faster reclaim path, do you mean zswap has a lower reclaim latency?
>
> I focused on gathering data with lz4 for this debug, under the reasonable
> assumption that results with deflate-iaa will be better. Once we figure out
> an overall direction on next steps, I will publish results with zswap lz4,
> deflate-iaa, etc.
>
> All experiments except "Exp 1.A" are run with
> usemem --init-time -w -O -n 70 1g.
>
> General settings for all data presented in this patch-series:
>
> vm.swappiness = 100
> zswap shrinker_enabled = N
>
> Experiment 1 - impact of not doing cgroup zswap charge:
> -------------------------------------------------------
>
> I first wanted to understand by how much we improve without the cgroup
> zswap charge. I commented out both the call to obj_cgroup_charge_zswap()
> and the call to obj_cgroup_uncharge_zswap() in zswap.c in my patch-set.
> We improve throughput by quite a bit with this change, and are now better
> than mTHP getting swapped out to SSD. We have also slightly improved on the
> sys time, though this is still a regression as compared to SSD. If you
> recall, we were worse on throughput and sys time with v4.
I'm not 100% sure about the validity of this pair of experiments.
The thing is, you cannot ignore zswap's memory footprint altogether.
That's the whole point of the trade-off. It's probably gigabytes' worth
of unaccounted memory usage - I see that your SSD size is 4G, and
since the compression ratio is less than 2, that's potentially 2G worth
of memory, give or take, that you are not charging to the cgroup, which
can alter the memory pressure and reclaim dynamics altogether.
The zswap charging itself is not the problem - that's fair and
healthy. It might be the overreaction by the memory reclaim subsystem
that seems anomalous?
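
(For anyone following along: the debug change, as I understand it, is
roughly the sketch below against mm/zswap.c - the exact call sites may
differ in Kanchana's patched tree. obj_cgroup_charge_zswap() charges the
compressed object's size to the owning memcg's memory counter, which is
why skipping it delays the memory.high breach:

        /* In zswap_store(), skip the charge (surrounding context abridged): */
        /* obj_cgroup_charge_zswap(objcg, entry->length); */

        /* In zswap_entry_free(), skip the matching uncharge: */
        /* obj_cgroup_uncharge_zswap(entry->objcg, entry->length); */

With both calls commented out, the compressed pool is effectively
invisible to the memcg for the duration of the run.)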
>
> Averages over 3 runs are summarized in each case.
>
> Exp 1.A: usemem -n 1 58g:
> -------------------------
>
> 64KB mTHP (cgroup memory.high set to 60G):
> ==========================================
>
> SSD mTHP zswap mTHP v4 zswap mTHP no_charge
> ----------------------------------------------------------------
> pswpout 586,352 0 0
> zswpout 1,005 1,042,963 587,181
> ----------------------------------------------------------------
> Total swapout 587,357 1,042,963 587,181
> ----------------------------------------------------------------
>
> Without the zswap charge to cgroup, the total swapout activity for
> zswap-mTHP is on par with that of SSD-mTHP for the single process case.
>
>
> Exp 1.B: usemem -n 70 1g:
> -------------------------
> v4 results with cgroup zswap charge:
> ------------------------------------
>
> 64KB mTHP (cgroup memory.high set to 60G):
> ==========================================
> ------------------------------------------------------------------
> | | | | |
> |Kernel | mTHP SWAP-OUT | Throughput | Change |
> | | | KB/s | |
> |--------------------|-------------------|------------|------------|
> |v6.11-rc3 mainline | SSD | 335,346 | Baseline |
> |zswap-mTHP-Store | ZSWAP lz4 | 271,558 | -19% |
> |------------------------------------------------------------------|
> | | | | |
> |Kernel | mTHP SWAP-OUT | Sys time | Change |
> | | | sec | |
> |--------------------|-------------------|------------|------------|
> |v6.11-rc3 mainline | SSD | 91.37 | Baseline |
> |zswap-mTHP-Store | ZSWAP lz4 | 265.43 | -191% |
> ------------------------------------------------------------------
>
> -----------------------------------------------------------------------
> | VMSTATS, mTHP ZSWAP/SSD stats| v6.11-rc3 | zswap-mTHP | zswap-mTHP |
> | | mainline | Store | Store |
> | | | lz4 | deflate-iaa |
> |-----------------------------------------------------------------------|
> | pswpout | 174,432 | 0 | 0 |
> | zswpout | 1,501 | 1,491,654 | 1,398,805 |
> |-----------------------------------------------------------------------|
> | hugepages-64kB/stats/zswpout | | 63,200 | 63,244 |
> |-----------------------------------------------------------------------|
> | hugepages-64kB/stats/swpout | 10,902 | 0 | 0 |
> -----------------------------------------------------------------------
>
> Debug results without the cgroup zswap charge in both "Before" and "After":
> ------------------------------------------------------------------------
>
> 64KB mTHP (cgroup memory.high set to 60G):
> ==========================================
> ------------------------------------------------------------------
> | | | | |
> |Kernel | mTHP SWAP-OUT | Throughput | Change |
> | | | KB/s | |
> |--------------------|-------------------|------------|------------|
> |v6.11-rc3 mainline | SSD | 300,565 | Baseline |
> |zswap-mTHP-Store | ZSWAP lz4 | 420,125 | 40% |
> |------------------------------------------------------------------|
> | | | | |
> |Kernel | mTHP SWAP-OUT | Sys time | Change |
> | | | sec | |
> |--------------------|-------------------|------------|------------|
> |v6.11-rc3 mainline | SSD | 90.76 | Baseline |
> |zswap-mTHP-Store | ZSWAP lz4 | 213.09 | -135% |
> ------------------------------------------------------------------
>
> ---------------------------------------------------------
> | VMSTATS, mTHP ZSWAP/SSD stats| v6.11-rc3 | zswap-mTHP |
> | | mainline | Store |
> | | | lz4 |
> |----------------------------------------------------------
> | pswpout | 330,640 | 0 |
> | zswpout | 1,527 | 1,384,725 |
> |----------------------------------------------------------
> | hugepages-64kB/stats/zswpout | | 63,335 |
> |----------------------------------------------------------
> | hugepages-64kB/stats/swpout | 18,242 | 0 |
> ---------------------------------------------------------
>
Hmm, in the 70-process case, it looks like we're still seeing the
latency regression, and that same pattern of overreclaiming, even
without zswap cgroup charging?
That seems like a hint - concurrency exacerbates the problem?
>
> Based on these results, I kept the cgroup zswap charging commented out in
> subsequent debug steps, so as to not place zswap at a disadvantage when
> trying to determine further causes for hypothesis (1).
>
>
> Experiment 2 - swap latency/reclamation with 64K mTHP:
> ------------------------------------------------------
>
> Number of swap_writepage Total swap_writepage Average swap_writepage
> calls from all cores Latency (millisec) Latency (microsec)
> ---------------------------------------------------------------------------
> SSD 21,373 165,434.9 7,740
> zswap 344,109 55,446.8 161
> ---------------------------------------------------------------------------
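
(A side note for readers: the email doesn't say how these numbers were
collected. One way to get per-call figures like the table above is
debug-only instrumentation inside swap_writepage(), along the lines of
the sketch below - this is an illustration, not necessarily the method
actually used, and the exact placement depends on the kernel tree:

        /* global debug-only counters, summed across all CPUs */
        static atomic64_t swap_wp_calls;
        static atomic64_t swap_wp_total_ns;

        /* at the top of swap_writepage(): */
        u64 t0 = ktime_get_ns();

        /* ... existing swap_writepage() body ... */

        /* just before returning: */
        atomic64_inc(&swap_wp_calls);
        atomic64_add(ktime_get_ns() - t0, &swap_wp_total_ns);

The average-latency column is then total_ns / calls, which is how the
7,740 us vs. 161 us averages above relate to the totals.)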
>
>
> Reclamation analysis: 64k mTHP swapout:
> ---------------------------------------
> "Before":
> Total SSD compressed data size = 1,362,296,832 bytes
> Total SSD write IO latency = 887,861 milliseconds
>
> Average SSD compressed data size = 1,089,837 bytes
> Average SSD write IO latency = 710,289 microseconds
>
> "After":
> Total ZSWAP compressed pool size = 2,610,657,430 bytes
> Total ZSWAP compress latency = 55,984 milliseconds
>
> Average ZSWAP compress length = 2,055 bytes
> Average ZSWAP compress latency = 44 microseconds
>
> zswap-LZ4 mTHP compression ratio = 1.99
> All moderately compressible pages. 0 zswap_store errors.
> 84% of pages compress to 2056 bytes.
Hmm, this ratio isn't very good indeed - it is less than a 2-to-1 memory saving...
Internally, we often see a 1:3 or 1:4 saving ratio (or even more).
It probably does not explain everything, but it is worth double checking -
could you check with zstd (e.g. via /sys/module/zswap/parameters/compressor)
to see if the ratio improves?
>
>
> Experiment 3 - 4K folios swap characteristics SSD vs. ZSWAP:
> ------------------------------------------------------------
>
> I wanted to take a step back and understand how the mainline v6.11-rc3
> handles 4K folios when swapped out to SSD (CONFIG_ZSWAP is off) and when
> swapped out to ZSWAP. Interestingly, higher swapout activity is observed
> with 4K folios and v6.11-rc3 (with the debug change to not charge zswap to
> cgroup).
>
> v6.11-rc3 with no zswap charge, only 4K folios, no (m)THP:
>
> -------------------------------------------------------------
> SSD (CONFIG_ZSWAP is OFF) ZSWAP lz4 lzo-rle
> -------------------------------------------------------------
> cgroup memory.events: cgroup memory.events:
>
> low 0 low 0 0
> high 5,068 high 321,923 375,116
> max 0 max 0 0
> oom 0 oom 0 0
> oom_kill 0 oom_kill 0 0
> oom_group_kill 0 oom_group_kill 0 0
> -------------------------------------------------------------
>
> SSD (CONFIG_ZSWAP is OFF):
> --------------------------
> pswpout 415,709
> sys time (sec) 301.02
> Throughput KB/s 155,970
> memcg_high events 5,068
> --------------------------
>
>
> ZSWAP lz4 lz4 lz4 lzo-rle
> --------------------------------------------------------------
> zswpout 1,598,550 1,515,151 1,449,432 1,493,917
> sys time (sec) 889.36 481.21 581.22 635.75
> Throughput KB/s 35,176 14,765 20,253 21,407
> memcg_high events 321,923 412,733 369,976 375,116
> --------------------------------------------------------------
>
> This shows that there is a performance regression of -60% to -195% with
> zswap as compared to SSD with 4K folios. The higher swapout activity with
> zswap is seen here too (i.e., this doesn't appear to be mTHP-specific).
>
> I verified this to be the case even with the v6.7 kernel, which also
> showed a 2.3X throughput improvement when we don't charge zswap:
>
> ZSWAP lz4 v6.7 v6.7 with no cgroup zswap charge
> --------------------------------------------------------------------
> zswpout 1,419,802 1,398,620
> sys time (sec) 535.4 613.41
sys time increases without zswap cgroup charging? That's strange...
> Throughput KB/s 8,671 20,045
> memcg_high events 574,046 451,859
So, on the 4K folio setup, even without the cgroup charge, we are still seeing:
1. More zswpout (than observed with SSD)
2. 40-50% worse latency - in fact, it is worse without zswap cgroup charging.
3. 100 times the number of memcg_high events? This is perhaps the
*strangest* part to me. You're already removing the zswap cgroup charging, so
where does this come from? How can we have memory.high violations when
zswap does *not* contribute to memory usage?
Is this due to swap limit charging? Do you have a cgroup swap limit?
        mem_high = page_counter_read(&memcg->memory) >
                   READ_ONCE(memcg->memory.high);
        swap_high = page_counter_read(&memcg->swap) >
                    READ_ONCE(memcg->swap.high);
[...]
        if (mem_high || swap_high) {
                /*
                 * The allocating tasks in this cgroup will need to do
                 * reclaim or be throttled to prevent further growth
                 * of the memory or swap footprints.
                 *
                 * Target some best-effort fairness between the tasks,
                 * and distribute reclaim work and delay penalties
                 * based on how much each task is actually allocating.
                 */
                current->memcg_nr_pages_over_high += batch;
                set_notify_resume(current);
                break;
        }
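
(For reference, the deferred work queued by set_notify_resume() above
runs when the task returns to userspace - roughly, paraphrasing
include/linux/resume_user_mode.h with unrelated work elided:

        static inline void resume_user_mode_work(struct pt_regs *regs)
        {
                ...
                mem_cgroup_handle_over_high(GFP_KERNEL);
                ...
        }

so every task that trips memory.high ends up doing its own reclaim pass
in mem_cgroup_handle_over_high() on its way back to userspace. Also,
memory.swap.high defaults to "max", so the swap_high branch should only
matter if a swap limit was set explicitly.)
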
> --------------------------------------------------------------------
>
>
> Summary from the debug:
> -----------------------
> 1) Excess reclaim is exacerbated by the zswap charge to the cgroup. Without
>    the charge, reclaim is on par with SSD for mTHP in the single-process
>    case. The multiple-process excess reclaim most likely results from
>    over-reclaim done by the cores, in their respective calls to
>    mem_cgroup_handle_over_high().
Exacerbate, yes. I'm not 100% sure it's the sole or even the main cause.
You still see a degree of overreclaiming without zswap cgroup charging in:
1. 70 processes, with mTHP
2. 70 processes, with 4K folios.
>
> 2) The higher swapout activity with zswap as compared to SSD does not
> appear to be specific to mTHP. Higher reclaim activity and sys time
> regression with zswap (as compared to a setup where there is only SSD
> configured as swap) exist with 4K pages as far back as v6.7.
Yeah, I can believe that without mTHP, the same-ish workload would
cause the same regression.
>
> 3) The debug indicates that hypothesis (2) is worth more investigation:
> Does a faster reclaim path somehow cause fewer allocation stalls, thereby
> causing more breaches of memory.high, hence more reclaim -- and does this
> cycle repeat, potentially leading to higher swapout activity with zswap?
> Any advice on this being a possibility, and suggestions/pointers to
> verify it, would be greatly appreciated.
Add stalls along the zswap path? :)
>
> 4) Interestingly, the # of memcg_high events is significantly lower with 64K
> mTHP than in the 4K data above, when tested with v4 and no zswap
> charge: 3,069 (SSD-mTHP) and 19,656 (ZSWAP-mTHP). This potentially
> indicates something to do with allocation efficiency countering the
> higher reclaim that seems to be caused by swapout efficiency.
>
> 5) Nhat, Yosry: would it be possible for you to run the 4K folios
> usemem -n 70 1g (with 60G memory.high) experiment with a 4G and some
> higher-value SSD configuration in your setup, on, say, v6.11-rc3? I would
> like to rule out the memory-constrained 4G SSD in my setup somehow skewing
> the behavior of zswap vis-a-vis
> allocation/memcg_handle_over_high/reclaim. I realize your time is
> valuable; however, I think an independent confirmation of what I have
> been observing would be really helpful for us to figure out potential
> root-causes and solutions.
It might take a while for me to set up your benchmark, but yeah, a 4G
swapfile seems small on a 64G host - of course it depends on the
workload, but this one has a lot of memory usage. In fact, the total memory
usage (70G?) is slightly above memory.high + the 4G swapfile - note that
this is exacerbated by, once again, zswap's less-than-100% memory
saving ratio.
>
> 6) I tried a small change in memcontrol.c::mem_cgroup_handle_over_high() to
> break out of the loop if we have reclaimed a total of at least
> "nr_pages":
>
>                 nr_reclaimed = reclaim_high(memcg,
>                                             in_retry ? SWAP_CLUSTER_MAX : nr_pages,
>                                             gfp_mask);
>
> +               nr_reclaimed_total += nr_reclaimed;
> +
> +               if (nr_reclaimed_total >= nr_pages)
> +                       goto out;
>
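For context, here is roughly where that lands in
mem_cgroup_handle_over_high()'s retry loop (my paraphrase of
mm/memcontrol.c, with comments and the throttling math trimmed;
nr_reclaimed_total would be a new local added by the debug change):

        nr_pages = current->memcg_nr_pages_over_high;
        current->memcg_nr_pages_over_high = 0;

retry_reclaim:
        nr_reclaimed = reclaim_high(memcg,
                                    in_retry ? SWAP_CLUSTER_MAX : nr_pages,
                                    gfp_mask);

        /* debug change: stop once the task's own overage has been reclaimed */
        nr_reclaimed_total += nr_reclaimed;
        if (nr_reclaimed_total >= nr_pages)
                goto out;

        penalty_jiffies = ...;          /* throttling math trimmed */
        if (penalty_jiffies <= HZ / 100)
                goto out;

        /* keep reclaiming as long as forward progress is being made */
        if (nr_reclaimed || nr_retries--) {
                in_retry = true;
                goto retry_reclaim;
        }

        schedule_timeout_killable(penalty_jiffies);
out:
        css_put(&memcg->css);

The stock loop keeps a task reclaiming for as long as it makes progress
and the computed penalty stays large, so it can reclaim well past its
own overage; the early exit caps it at nr_pages, which is presumably
where the mitigation described below comes from.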
>
> This was only for debug purposes, and did seem to mitigate the higher
> reclaim behavior for 4K folios:
>
> ZSWAP lz4 lz4 lz4
> ----------------------------------------------------------
> zswpout 1,305,367 1,349,195 1,529,235
> sys time (sec) 472.06 507.76 646.39
> Throughput KB/s 55,144 21,811 88,310
> memcg_high events 257,890 343,213 172,351
> ----------------------------------------------------------
>
> On average, this change results in 17% improvement in sys time, 2.35X
> improvement in throughput and 30% fewer memcg_high events.
>
> I look forward to further inputs on next steps.
>
> Thanks,
> Kanchana
>
>
> >
> > Thanks for this analysis. I will debug this some more, so we can better
> > understand these results.
> >
> > Thanks,
> > Kanchana