Message-ID: <SJ0PR11MB567807116A760D785F9822EBC9952@SJ0PR11MB5678.namprd11.prod.outlook.com>
Date: Wed, 28 Aug 2024 07:24:18 +0000
From: "Sridhar, Kanchana P" <kanchana.p.sridhar@...el.com>
To: Nhat Pham <nphamcs@...il.com>
CC: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-mm@...ck.org" <linux-mm@...ck.org>, "hannes@...xchg.org"
<hannes@...xchg.org>, "yosryahmed@...gle.com" <yosryahmed@...gle.com>,
"ryan.roberts@....com" <ryan.roberts@....com>, "Huang, Ying"
<ying.huang@...el.com>, "21cnbao@...il.com" <21cnbao@...il.com>,
"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>, "Zou, Nanhai"
<nanhai.zou@...el.com>, "Feghali, Wajdi K" <wajdi.k.feghali@...el.com>,
"Gopal, Vinodh" <vinodh.gopal@...el.com>, "Sridhar, Kanchana P"
<kanchana.p.sridhar@...el.com>
Subject: RE: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
> -----Original Message-----
> From: Sridhar, Kanchana P <kanchana.p.sridhar@...el.com>
> Sent: Tuesday, August 27, 2024 11:42 AM
> To: Nhat Pham <nphamcs@...il.com>
> Cc: linux-kernel@...r.kernel.org; linux-mm@...ck.org;
> hannes@...xchg.org; yosryahmed@...gle.com; ryan.roberts@....com;
> Huang, Ying <ying.huang@...el.com>; 21cnbao@...il.com; akpm@...ux-
> foundation.org; Zou, Nanhai <nanhai.zou@...el.com>; Feghali, Wajdi K
> <wajdi.k.feghali@...el.com>; Gopal, Vinodh <vinodh.gopal@...el.com>;
> Sridhar, Kanchana P <kanchana.p.sridhar@...el.com>
> Subject: RE: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
>
>
> > -----Original Message-----
> > From: Nhat Pham <nphamcs@...il.com>
> > Sent: Tuesday, August 27, 2024 8:24 AM
> > To: Sridhar, Kanchana P <kanchana.p.sridhar@...el.com>
> > Cc: linux-kernel@...r.kernel.org; linux-mm@...ck.org;
> > hannes@...xchg.org; yosryahmed@...gle.com; ryan.roberts@....com;
> > Huang, Ying <ying.huang@...el.com>; 21cnbao@...il.com; akpm@...ux-
> > foundation.org; Zou, Nanhai <nanhai.zou@...el.com>; Feghali, Wajdi K
> > <wajdi.k.feghali@...el.com>; Gopal, Vinodh <vinodh.gopal@...el.com>
> > Subject: Re: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
> >
> > On Mon, Aug 26, 2024 at 11:08 PM Sridhar, Kanchana P
> > <kanchana.p.sridhar@...el.com> wrote:
> > >
> > > > Internally, we often see a 1:3 or 1:4 savings ratio (or even more).
> > >
> > > Agree with this as well. In our experiments with other workloads, we
> > > typically see much higher ratios.
> > >
> > > >
> > > > Probably does not explain everything, but worth double checking -
> > > > could you check with zstd to see if the ratio improves.
> > >
> > > Sure. I gathered ratio and compressed memory footprint data today with
> > > 64K mTHP, the 4G SSD swapfile and different zswap compressors.
> > >
> > > This patch-series and no zswap charging, 64K mTHP:
> > > ---------------------------------------------------------------------------
> > >                 Total          Total         Average     Average      Comp
> > >                 compressed     compression   compressed  compression  ratio
> > >                 length         latency       length      latency
> > >                 bytes          milliseconds  bytes       nanoseconds
> > > ---------------------------------------------------------------------------
> > > SSD (no zswap)  1,362,296,832       887,861
> > > lz4             2,610,657,430        55,984       2,055       44,065   1.99
> > > zstd              729,129,528        50,986         565       39,510   7.25
> > > deflate-iaa     1,286,533,438        44,785       1,415       49,252   2.89
> > > ---------------------------------------------------------------------------
> > >
> > > zstd does very well on ratio, as expected.
> >
> > Wait. So zstd is displaying 7-to-1 compression ratio? And has *lower*
> > average latency?
> >
> > Why are we running benchmarks on lz4 again? Sure, there is no free lunch
> > and no compressor that works well on all kinds of data, but lz4's
> > performance here is so bad that it's borderline justifiable to
> > disable/bypass zswap with this kind of compression ratio...
> >
> > Can I ask you to run benchmarking on zstd from now on?
>
> Sure, will do.
>
> >
> > >
> > > >
> > > > >
> > > > >
> > > > > Experiment 3 - 4K folios swap characteristics SSD vs. ZSWAP:
> > > > > ------------------------------------------------------------
> > > > >
> > > > > I wanted to take a step back and understand how the mainline
> > > > > v6.11-rc3 handles 4K folios when swapped out to SSD (CONFIG_ZSWAP
> > > > > is off) and when swapped out to ZSWAP. Interestingly, higher
> > > > > swapout activity is observed with 4K folios and v6.11-rc3 (with
> > > > > the debug change to not charge zswap to cgroup).
> > > > >
> > > > > v6.11-rc3 with no zswap charge, only 4K folios, no (m)THP:
> > > > >
> > > > > -------------------------------------------------------------
> > > > > SSD (CONFIG_ZSWAP is OFF)     ZSWAP          lz4      lzo-rle
> > > > > -------------------------------------------------------------
> > > > > cgroup memory.events:         cgroup memory.events:
> > > > >
> > > > > low                0          low              0            0
> > > > > high           5,068          high       321,923      375,116
> > > > > max                0          max              0            0
> > > > > oom                0          oom              0            0
> > > > > oom_kill           0          oom_kill         0            0
> > > > > oom_group_kill     0          oom_group_kill   0            0
> > > > > -------------------------------------------------------------
> > > > >
> > > > > SSD (CONFIG_ZSWAP is OFF):
> > > > > --------------------------
> > > > > pswpout 415,709
> > > > > sys time (sec) 301.02
> > > > > Throughput KB/s 155,970
> > > > > memcg_high events 5,068
> > > > > --------------------------
> > > > >
> > > > >
> > > > >                    ZSWAP lz4        lz4        lz4    lzo-rle
> > > > > --------------------------------------------------------------
> > > > > zswpout            1,598,550  1,515,151  1,449,432  1,493,917
> > > > > sys time (sec)        889.36     481.21     581.22     635.75
> > > > > Throughput KB/s       35,176     14,765     20,253     21,407
> > > > > memcg_high events    321,923    412,733    369,976    375,116
> > > > > --------------------------------------------------------------
> > > > >
> > > > > This shows that there is a performance regression of -60% to -195%
> > > > > with zswap as compared to SSD with 4K folios. The higher swapout
> > > > > activity with zswap is seen here too (i.e., this doesn't appear to
> > > > > be mTHP-specific).
> > > > >
> > > > > I verified this to be the case even with the v6.7 kernel, which
> > > > > also showed a 2.3X throughput improvement when we don't charge
> > > > > zswap:
> > > > >
> > > > > ZSWAP lz4                 v6.7   v6.7 with no cgroup zswap charge
> > > > > --------------------------------------------------------------------
> > > > > zswpout              1,419,802                          1,398,620
> > > > > sys time (sec)           535.4                             613.41
> > > >
> > > > systime increases without zswap cgroup charging? That's strange...
> > >
> > > Additional data gathered with v6.11-rc3 (listed below) based on your
> > > suggestion to investigate potential swap.high breaches should
> > > hopefully provide some explanation.
> > >
> > > >
> > > > > Throughput KB/s          8,671                             20,045
> > > > > memcg_high events      574,046                            451,859
> > > >
> > > > So, on 4k folio setup, even without cgroup charge, we are still seeing:
> > > >
> > > > 1. More zswpout (than observed in SSD)
> > > > 2. 40-50% worse latency - in fact it is worse without zswap cgroup
> > > > charging.
> > > > 3. 100 times the amount of memcg_high events? This is perhaps the
> > > > *strangest* to me. You're already removing zswap cgroup charging,
> > > > then where does this come from? How can we have a memory.high
> > > > violation when zswap does *not* contribute to memory usage?
> > > >
> > > > Is this due to swap limit charging? Do you have a cgroup swap limit?
> > > >
> > > > mem_high = page_counter_read(&memcg->memory) >
> > > > READ_ONCE(memcg->memory.high);
> > > > swap_high = page_counter_read(&memcg->swap) >
> > > > READ_ONCE(memcg->swap.high);
> > > > [...]
> > > >
> > > > if (mem_high || swap_high) {
> > > > /*
> > > > * The allocating tasks in this cgroup will need to do
> > > > * reclaim or be throttled to prevent further growth
> > > > * of the memory or swap footprints.
> > > > *
> > > > * Target some best-effort fairness between the tasks,
> > > > * and distribute reclaim work and delay penalties
> > > > * based on how much each task is actually allocating.
> > > > */
> > > > current->memcg_nr_pages_over_high += batch;
> > > > set_notify_resume(current);
> > > > break;
> > > > }
> > > >
> > >
> > > I don't have a swap.high limit set on the cgroup; it is set to "max".
> > >
> > > I ran experiments with v6.11-rc3, no zswap charging, 4K folios and
> > > different zswap compressors to verify if swap.high is breached with
> > > the 4G SSD swapfile.
> > >
> > > SSD (CONFIG_ZSWAP is OFF):
> > >
> > >                              SSD         SSD         SSD
> > > ------------------------------------------------------------
> > > pswpout                  415,709   1,032,170     636,582
> > > sys time (sec)            301.02      328.15      306.98
> > > Throughput KB/s          155,970      89,621     122,219
> > > memcg_high events          5,068      15,072       8,344
> > > memcg_swap_high events         0           0           0
> > > memcg_swap_fail events         0           0           0
> > > ------------------------------------------------------------
> > >
> > > ZSWAP                          zstd        zstd        zstd
> > > ----------------------------------------------------------------
> > > zswpout                   1,391,524   1,382,965   1,417,307
> > > sys time (sec)               474.68      568.24      489.80
> > > Throughput KB/s              26,099      23,404     111,115
> > > memcg_high events           335,112     340,335     162,260
> > > memcg_swap_high events            0           0           0
> > > memcg_swap_fail events 1,226,899 5,742,153
> > > (mem_cgroup_try_charge_swap)
> > > memcg_memory_stat_pgactivate 1,259,547
> > > (shrink_folio_list)
> > > ----------------------------------------------------------------
> > >
> > > ZSWAP                       lzo-rle     lzo-rle     lzo-rle
> > > -----------------------------------------------------------
> > > zswpout                   1,493,917   1,363,040   1,428,133
> > > sys time (sec)               635.75      498.63      484.65
> > > Throughput KB/s              21,407      23,827      20,237
> > > memcg_high events           375,116     352,814     373,667
> > > memcg_swap_high events            0           0           0
> > > memcg_swap_fail events 715,211
> > > -----------------------------------------------------------
> > >
> > > ZSWAP                           lz4         lz4         lz4         lz4
> > > ---------------------------------------------------------------------
> > > zswpout                   1,378,781   1,598,550   1,515,151   1,449,432
> > > sys time (sec)               495.45      889.36      481.21      581.22
> > > Throughput KB/s              26,248      35,176      14,765      20,253
> > > memcg_high events           347,209     321,923     412,733     369,976
> > > memcg_swap_high events            0           0           0           0
> > > memcg_swap_fail events 580,103 0
> > > ---------------------------------------------------------------------
> > >
> > > ZSWAP                   deflate-iaa  deflate-iaa  deflate-iaa
> > > ----------------------------------------------------------------
> > > zswpout                     380,471    1,440,902    1,397,965
> > > sys time (sec)               329.06       570.77       467.41
> > > Throughput KB/s             283,867       28,403      190,600
> > > memcg_high events             5,551      422,831       28,154
> > > memcg_swap_high events            0            0            0
> > > memcg_swap_fail events            0    2,686,758      438,562
> > > ----------------------------------------------------------------
> >
> > Why are there 3 columns for each of the compressors? Is this different
> > runs of the same workload?
> >
> > And why do some columns have missing cells?
>
> Yes, these are different runs of the same workload. Since there is some
> amount of variance seen in the data, I figured it is best to publish the
> metrics from the individual runs rather than averaging.
>
> Some of these runs were gathered earlier with the same code base;
> however, I wasn't monitoring/logging the memcg_swap_high/memcg_swap_fail
> events at that time. For those runs, just these two counters have missing
> column entries; the rest of the data is still valid.
>
> >
> > >
> > > There are no swap.high memcg events recorded in any of the SSD/zswap
> > > experiments. However, I do see a significant number of memcg_swap_fail
> > > events in some of the zswap runs, for all 3 compressors. This is not
> > > consistent, because there are some runs with 0 memcg_swap_fail for all
> > > compressors.
> > >
> > > There is a possible correlation between memcg_swap_fail events
> > > (/sys/fs/cgroup/test/memory.swap.events) and the high # of memcg_high
> > > events. The root-cause appears to be that there are no available swap
> > > slots, memcg_swap_fail is incremented, add_to_swap() fails in
> > > shrink_folio_list(), followed by "activate_locked:" for the folio.
> > > The folio re-activation is recorded in cgroup memory.stat pgactivate
> > > events. The failure to swap out folios due to lack of swap slots could
> > > contribute towards memory.high breaches.
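
For reference, a simplified paraphrase of the path described above (not
the verbatim mm/vmscan.c source; condensed here purely to illustrate the
sequence):

	/*
	 * If no swap slot can be allocated, add_to_swap() fails
	 * (folio_alloc_swap() returns no entry, and memcg_swap_fail is
	 * recorded in mem_cgroup_try_charge_swap()); the folio is then
	 * re-activated instead of being reclaimed, which shows up as a
	 * pgactivate event in memory.stat.
	 */
	if (folio_test_anon(folio) && folio_test_swapbacked(folio) &&
	    !folio_test_swapcache(folio)) {
		if (!add_to_swap(folio))
			goto activate_locked;
	}
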
> >
> > Yeah FWIW, that was gonna be my first suggestion. This swapfile size
> > is wayyyy too small...
> >
> > But that said, the link is not clear to me at all. The only thing I
> > can think of is lz4's performance sucks so bad that it's not saving
> > enough memory, leading to regression. And since it's still taking up
> > swap slot, we cannot use swap either?
>
> The occurrence of memcg_swap_fail events establishes that swap slots
> are not available with 4G of swap space. This causes those 4K folios to
> remain in memory, which can worsen an existing problem with memory.high
> breaches.
>
> However, it is worth noting that this is not the only contributor to
> memcg_high events that still occur without zswap charging. The data shows
> 321,923 occurrences of memcg_high in Col 2 of the lz4 table, which also has
> 0 occurrences of memcg_swap_fail reported in the cgroup stats.
>
> >
> > >
> > > However, this is probably not the only cause for either the high # of
> > > memory.high breaches or the over-reclaim with zswap, as seen in the
> > > lz4 data, where the memcg_high count is significant even in cases
> > > where there are no memcg_swap_fails.
> > >
> > > Some observations/questions based on the above 4K folios swapout data:
> > >
> > > 1) There are more memcg_high events as the swapout latency reduces
> > > (i.e. faster swap-write path). This is even without charging zswap
> > > utilization to the cgroup.
> >
> > This is still inexplicable to me. If we are not charging zswap usage,
> > we shouldn't even be triggering the reclaim_high() path, no?
> >
> > I'm curious - can you use bpftrace to track where/when reclaim_high
> > is being called?
Hi Nhat,
Since reclaim_high() is called only in a handful of places, I figured I
would just use debugfs u64 counters to record where it gets called from.
These are the places where I increment the debugfs counters:
include/linux/resume_user_mode.h:
---------------------------------
diff --git a/include/linux/resume_user_mode.h b/include/linux/resume_user_mode.h
index e0135e0adae0..382f5469e9a2 100644
--- a/include/linux/resume_user_mode.h
+++ b/include/linux/resume_user_mode.h
@@ -24,6 +24,7 @@ static inline void set_notify_resume(struct task_struct *task)
kick_process(task);
}
+extern u64 hoh_userland;
/**
* resume_user_mode_work - Perform work before returning to user mode
@@ -56,6 +57,7 @@ static inline void resume_user_mode_work(struct pt_regs *regs)
}
#endif
+ ++hoh_userland;
mem_cgroup_handle_over_high(GFP_KERNEL);
blkcg_maybe_throttle_current();
mm/memcontrol.c:
----------------
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f29157288b7d..6738bb670a78 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1910,9 +1910,12 @@ static unsigned long reclaim_high(struct mem_cgroup *memcg,
return nr_reclaimed;
}
+extern u64 rec_high_hwf;
+
static void high_work_func(struct work_struct *work)
{
struct mem_cgroup *memcg;
+ ++rec_high_hwf;
memcg = container_of(work, struct mem_cgroup, high_work);
reclaim_high(memcg, MEMCG_CHARGE_BATCH, GFP_KERNEL);
@@ -2055,6 +2058,8 @@ static unsigned long calculate_high_delay(struct mem_cgroup *memcg,
return penalty_jiffies * nr_pages / MEMCG_CHARGE_BATCH;
}
+extern u64 rec_high_hoh;
+
/*
* Reclaims memory over the high limit. Called directly from
* try_charge() (context permitting), as well as from the userland
@@ -2097,6 +2102,7 @@ void mem_cgroup_handle_over_high(gfp_t gfp_mask)
* memory.high is currently batched, whereas memory.max and the page
* allocator run every time an allocation is made.
*/
+ ++rec_high_hoh;
nr_reclaimed = reclaim_high(memcg,
in_retry ? SWAP_CLUSTER_MAX : nr_pages,
gfp_mask);
@@ -2153,6 +2159,8 @@ void mem_cgroup_handle_over_high(gfp_t gfp_mask)
css_put(&memcg->css);
}
+extern u64 hoh_trycharge;
+
int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
unsigned int nr_pages)
{
@@ -2344,8 +2352,10 @@ int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
*/
if (current->memcg_nr_pages_over_high > MEMCG_CHARGE_BATCH &&
!(current->flags & PF_MEMALLOC) &&
- gfpflags_allow_blocking(gfp_mask))
+ gfpflags_allow_blocking(gfp_mask)) {
+ ++hoh_trycharge;
mem_cgroup_handle_over_high(gfp_mask);
+ }
return 0;
}
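The counters themselves and their debugfs hookup are not shown in the hunks
above. A minimal sketch of one way they could be defined and exposed
(hypothetical directory/file names; the counters are incremented
non-atomically, which is adequate for a debug aid) would be:

#include <linux/debugfs.h>
#include <linux/init.h>
#include <linux/types.h>

/* Definitions matching the extern declarations added in the hunks above. */
u64 hoh_userland;
u64 hoh_trycharge;
u64 rec_high_hoh;
u64 rec_high_hwf;

static int __init reclaim_high_debug_init(void)
{
	/* Each counter appears as /sys/kernel/debug/reclaim_high_debug/<name>. */
	struct dentry *dir = debugfs_create_dir("reclaim_high_debug", NULL);

	debugfs_create_u64("hoh_userland", 0444, dir, &hoh_userland);
	debugfs_create_u64("hoh_trycharge", 0444, dir, &hoh_trycharge);
	debugfs_create_u64("rec_high_hoh", 0444, dir, &rec_high_hoh);
	debugfs_create_u64("rec_high_hwf", 0444, dir, &rec_high_hwf);
	return 0;
}
late_initcall(reclaim_high_debug_init);

The counters can then be read directly from
/sys/kernel/debug/reclaim_high_debug/ after each run.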
I reverted my debug changes for "zswap to not charge cgroup" when I ran
this next set of experiments, which records the # of times and locations
where reclaim_high() is called.
zstd is the compressor I have configured for both ZSWAP and ZRAM.
6.11-rc3 mainline, 176Gi ZRAM backing for ZSWAP, zstd, 64K mTHP:
----------------------------------------------------------------
/sys/fs/cgroup/iax/memory.events:
high 112,910
hoh_userland 128,835
hoh_trycharge 0
rec_high_hoh 113,079
rec_high_hwf 0
6.11-rc3 mainline, 4G SSD backing for ZSWAP, zstd, 64K mTHP:
------------------------------------------------------------
/sys/fs/cgroup/iax/memory.events:
high 4,693
hoh_userland 14,069
hoh_trycharge 0
rec_high_hoh 4,694
rec_high_hwf 0
ZSWAP-mTHP, 176Gi ZRAM backing for ZSWAP, zstd, 64K mTHP:
---------------------------------------------------------
/sys/fs/cgroup/iax/memory.events:
high 139,495
hoh_userland 156,628
hoh_trycharge 0
rec_high_hoh 140,039
rec_high_hwf 0
ZSWAP-mTHP, 4G SSD backing for ZSWAP, zstd, 64K mTHP:
-----------------------------------------------------
/sys/fs/cgroup/iax/memory.events:
high 20,427
/sys/fs/cgroup/iax/memory.swap.events:
fail 20,856
hoh_userland 31,346
hoh_trycharge 0
rec_high_hoh 20,513
rec_high_hwf 0
This shows that in all cases, reclaim_high() is called only from the return
path to user mode after handling a page-fault.
Thanks,
Kanchana
>
> I had confirmed earlier with counters that all calls to reclaim_high()
> were from include/linux/resume_user_mode.h::resume_user_mode_work().
> I will confirm this with zstd and bpftrace and share.
>
> Thanks,
> Kanchana
>
> >
> > >
> > > 2) There appears to be a direct correlation between a higher # of
> > > memcg_swap_fail events, an increase in memcg_high breaches, and a
> > > reduction in usemem throughput. This, combined with the observation in
> > > (1), suggests that with a faster compressor we need more swap slots,
> > > which increases the probability of running out of swap slots with the
> > > 4G SSD backing device.
> > >
> > > 3) Could the data shared earlier on reduction in memcg_high breaches
> > > with 64K mTHP swapout provide some more clues, if we agree with (1)
> > > and (2):
> > >
> > > "Interestingly, the # of memcg_high events reduces significantly with
> > > 64K mTHP as compared to the above 4K memcg_high events data, when
> > > tested with v4 and no zswap charge: 3,069 (SSD-mTHP) and 19,656
> > > (ZSWAP-mTHP)."
> > >
> > > 4) In the case of each zswap compressor, there are some runs that go
> > > through with 0 memcg_swap_fail events. These runs generally have
> > > fewer memcg_high breaches and better sys time/throughput.
> > >
> > > 5) For a given swap setup, there is some amount of variance in
> > > sys time for this workload.
> > >
> > > 6) All this suggests that the primary root cause is the concurrency setup,
> > > where there could be randomness between runs as to the # of processes
> > > that observe the memory.high breach due to other factors such as
> > > availability of swap slots for alloc.
> > >
> > > To summarize, I believe the root-cause is the 4G SSD swapfile resulting
> > > in running out of swap slots, and anomalous behavior with over-reclaim
> > > when 70 concurrent processes are working with the 60G memory limit
> > > while trying to allocate 1G each; with randomness in processes reacting
> > > to the breach.
> > >
> > > The cgroup zswap charging exacerbates this situation, but is not a problem
> > > in and of itself.
> > >
> > > Nhat, as you pointed out, this is somewhat of an unrealistic scenario
> > > that doesn't seem to indicate any specific problems to be solved,
> > > other than the temporary cgroup zswap double-charging.
> > >
> > > Would it be fair to evaluate this patch-series based on a more
> > > realistic swapfile configuration backed by 176G ZRAM, for which I had
> > > shared the data in v2? There weren't any problems with swap slots
> > > availability or any anomalies that I can think of with this setup,
> > > other than the fact that the "Before" and "After" sys times could not
> > > be directly compared for 2 key reasons:
> > >
> > > - ZRAM compressed data is not charged to the cgroup, similar to SSD.
> > > - ZSWAP compressed data is charged to the cgroup.
> >
> > Yeah that's a bit unfair still. Wild idea, but what if we compare
> > SSD without zswap (or SSD with zswap, but without this patch series so
> > that mTHP are not zswapped) vs. zswap-on-zram (i.e. with a backing
> > swapfile on a zram block device).
> >
> > It is stupid, I know. But let's take advantage of the fact that zram
> > is not charged to cgroup, pretending that its memory foot print is
> > empty?
> >
> > I don't know how zram works though, so my apologies if it's a stupid
> > suggestion :)