linux-kernel - Re: [PATCH v2] mm/page-writeback: Raise wb_thresh to prevent write blocking with strictlimit

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20241120115731.gzxozbnb6eazhil7@quack3>
Date: Wed, 20 Nov 2024 12:57:31 +0100
From: Jan Kara <jack@...e.cz>
To: Jim Zhao <jimzhao.ai@...il.com>
Cc: jack@...e.cz, shikemeng@...weicloud.com, akpm@...ux-foundation.org,
	linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
	linux-mm@...ck.org, willy@...radead.org
Subject: Re: [PATCH v2] mm/page-writeback: Raise wb_thresh to prevent write
 blocking with strictlimit

Hello!

On Tue 19-11-24 20:29:22, Jim Zhao wrote:
> Thanks, Jan, I just sent patch v2, could you please review it ?

Yes, the patch looks good to me.

> 
> And I found the debug info in the bdi stats. 
> The BdiDirtyThresh value may be greater than DirtyThresh, and after
> applying this patch, the value of BdiDirtyThresh could become even
> larger.
> 
> without patch:
> ---
> root@...ntu:/sys/kernel/debug/bdi/8:0# cat stats
> BdiWriteback:                0 kB
> BdiReclaimable:             96 kB
> BdiDirtyThresh:        1346824 kB

But this is odd. The machine appears to have around 3GB of memory, doesn't
it? I suspect this is caused by multiple cgroup-writeback contexts
contributing to BdiDirtyThresh - in fact I think the math in
bdi_collect_stats() is wrong as it is adding wb_thresh() calculated based
on global dirty_thresh for each cgwb whereas it should be adding
wb_thresh() calculated based on per-memcg dirty_thresh... You can have a
look at /sys/kernel/debug/bdi/8:0/wb_stats file which should have correct
limits as far as I'm reading the code.

								Honza

> DirtyThresh:            673412 kB
> BackgroundThresh:       336292 kB
> BdiDirtied:              19872 kB
> BdiWritten:              19776 kB
> BdiWriteBandwidth:           0 kBps
> b_dirty:                     0
> b_io:                        0
> b_more_io:                   0
> b_dirty_time:                0
> bdi_list:                    1
> state:                       1
> 
> with patch:
> ---
> root@...ntu:/sys/kernel/debug/bdi/8:0# cat stats
> BdiWriteback:               96 kB
> BdiReclaimable:            192 kB
> BdiDirtyThresh:        3090736 kB
> DirtyThresh:            650716 kB
> BackgroundThresh:       324960 kB
> BdiDirtied:             472512 kB
> BdiWritten:             470592 kB
> BdiWriteBandwidth:      106268 kBps
> b_dirty:                     2
> b_io:                        0
> b_more_io:                   0
> b_dirty_time:                0
> bdi_list:                    1
> state:                       1
> 
> 
> @kemeng, is this a normal behavior or an issue ?
> 
> Thanks,
> Jim Zhao
> 
> 
> > With the strictlimit flag, wb_thresh acts as a hard limit in
> > balance_dirty_pages() and wb_position_ratio().  When device write
> > operations are inactive, wb_thresh can drop to 0, causing writes to be
> > blocked.  The issue occasionally occurs in fuse fs, particularly with
> > network backends, the write thread is blocked frequently during a period.
> > To address it, this patch raises the minimum wb_thresh to a controllable
> > level, similar to the non-strictlimit case.
> >
> > Signed-off-by: Jim Zhao <jimzhao.ai@...il.com>
> > ---
> > Changes in v2:
> > 1. Consolidate all wb_thresh bumping logic in __wb_calc_thresh for consistency;
> > 2. Replace the limit variable with thresh for calculating the bump value,
> > as __wb_calc_thresh is also used to calculate the background threshold;
> > 3. Add domain_dirty_avail in wb_calc_thresh to get dtc->dirty.
> > ---
> >  mm/page-writeback.c | 48 ++++++++++++++++++++++-----------------------
> >  1 file changed, 23 insertions(+), 25 deletions(-)
> >
> > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > index e5a9eb795f99..8b13bcb42de3 100644
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -917,7 +917,9 @@ static unsigned long __wb_calc_thresh(struct dirty_throttle_control *dtc,
> >                                     unsigned long thresh)
> >  {
> >       struct wb_domain *dom = dtc_dom(dtc);
> > +     struct bdi_writeback *wb = dtc->wb;
> >       u64 wb_thresh;
> > +     u64 wb_max_thresh;
> >       unsigned long numerator, denominator;
> >       unsigned long wb_min_ratio, wb_max_ratio;
> >
> > @@ -931,11 +933,27 @@ static unsigned long __wb_calc_thresh(struct dirty_throttle_control *dtc,
> >       wb_thresh *= numerator;
> >       wb_thresh = div64_ul(wb_thresh, denominator);
> >
> > -     wb_min_max_ratio(dtc->wb, &wb_min_ratio, &wb_max_ratio);
> > +     wb_min_max_ratio(wb, &wb_min_ratio, &wb_max_ratio);
> >
> >       wb_thresh += (thresh * wb_min_ratio) / (100 * BDI_RATIO_SCALE);
> > -     if (wb_thresh > (thresh * wb_max_ratio) / (100 * BDI_RATIO_SCALE))
> > -             wb_thresh = thresh * wb_max_ratio / (100 * BDI_RATIO_SCALE);
> > +
> > +     /*
> > +      * It's very possible that wb_thresh is close to 0 not because the
> > +      * device is slow, but that it has remained inactive for long time.
> > +      * Honour such devices a reasonable good (hopefully IO efficient)
> > +      * threshold, so that the occasional writes won't be blocked and active
> > +      * writes can rampup the threshold quickly.
> > +      */
> > +     if (thresh > dtc->dirty) {
> > +             if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT))
> > +                     wb_thresh = max(wb_thresh, (thresh - dtc->dirty) / 100);
> > +             else
> > +                     wb_thresh = max(wb_thresh, (thresh - dtc->dirty) / 8);
> > +     }
> > +
> > +     wb_max_thresh = thresh * wb_max_ratio / (100 * BDI_RATIO_SCALE);
> > +     if (wb_thresh > wb_max_thresh)
> > +             wb_thresh = wb_max_thresh;
> >
> >       return wb_thresh;
> >  }
> > @@ -944,6 +962,7 @@ unsigned long wb_calc_thresh(struct bdi_writeback *wb, unsigned long thresh)
> >  {
> >       struct dirty_throttle_control gdtc = { GDTC_INIT(wb) };
> >
> > +     domain_dirty_avail(&gdtc, true);
> >       return __wb_calc_thresh(&gdtc, thresh);
> >  }
> >
> > @@ -1120,12 +1139,6 @@ static void wb_position_ratio(struct dirty_throttle_control *dtc)
> >       if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) {
> >               long long wb_pos_ratio;
> >
> > -             if (dtc->wb_dirty < 8) {
> > -                     dtc->pos_ratio = min_t(long long, pos_ratio * 2,
> > -                                        2 << RATELIMIT_CALC_SHIFT);
> > -                     return;
> > -             }
> > -
> >               if (dtc->wb_dirty >= wb_thresh)
> >                       return;
> >
> > @@ -1196,14 +1209,6 @@ static void wb_position_ratio(struct dirty_throttle_control *dtc)
> >        */
> >       if (unlikely(wb_thresh > dtc->thresh))
> >               wb_thresh = dtc->thresh;
> > -     /*
> > -      * It's very possible that wb_thresh is close to 0 not because the
> > -      * device is slow, but that it has remained inactive for long time.
> > -      * Honour such devices a reasonable good (hopefully IO efficient)
> > -      * threshold, so that the occasional writes won't be blocked and active
> > -      * writes can rampup the threshold quickly.
> > -      */
> > -     wb_thresh = max(wb_thresh, (limit - dtc->dirty) / 8);
> >       /*
> >        * scale global setpoint to wb's:
> >        *      wb_setpoint = setpoint * wb_thresh / thresh
> > @@ -1459,17 +1464,10 @@ static void wb_update_dirty_ratelimit(struct dirty_throttle_control *dtc,
> >        * balanced_dirty_ratelimit = task_ratelimit * write_bw / dirty_rate).
> >        * Hence, to calculate "step" properly, we have to use wb_dirty as
> >        * "dirty" and wb_setpoint as "setpoint".
> > -      *
> > -      * We rampup dirty_ratelimit forcibly if wb_dirty is low because
> > -      * it's possible that wb_thresh is close to zero due to inactivity
> > -      * of backing device.
> >        */
> >       if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) {
> >               dirty = dtc->wb_dirty;
> > -             if (dtc->wb_dirty < 8)
> > -                     setpoint = dtc->wb_dirty + 1;
> > -             else
> > -                     setpoint = (dtc->wb_thresh + dtc->wb_bg_thresh) / 2;
> > +             setpoint = (dtc->wb_thresh + dtc->wb_bg_thresh) / 2;
> >       }
> >
> >       if (dirty < setpoint) {
> > --
> > 2.20.1
-- 
Jan Kara <jack@...e.com>
SUSE Labs, CR