linux-kernel - Re: [External] Re: [PATCH 0/3] Suppress undesirable hung task warnings.

Message-ID: <CAHSKhteCMv0fUmDKHdKXhg=D-rz-Jmze5ei-Up16vMsNEy898w@mail.gmail.com>
Date: Thu, 25 Sep 2025 23:07:24 +0800
From: Julian Sun <sunjunchao@...edance.com>
To: Jan Kara <jack@...e.cz>
Cc: Peter Zijlstra <peterz@...radead.org>, Andrew Morton <akpm@...ux-foundation.org>, 
	Christoph Hellwig <hch@...radead.org>, cgroups@...r.kernel.org, linux-fsdevel@...r.kernel.org, 
	linux-kernel@...r.kernel.org, viro@...iv.linux.org.uk, brauner@...nel.org, 
	mingo@...hat.com, juri.lelli@...hat.com, vincent.guittot@...aro.org, 
	dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com, 
	mgorman@...e.de, vschneid@...hat.com, lance.yang@...ux.dev, 
	mhiramat@...nel.org, agruenba@...hat.com, hannes@...xchg.org, 
	mhocko@...nel.org, roman.gushchin@...ux.dev, shakeel.butt@...ux.dev, 
	muchun.song@...ux.dev
Subject: Re: [External] Re: [PATCH 0/3] Suppress undesirable hung task warnings.

Hi,

On Wed, Sep 24, 2025 at 6:34 PM Jan Kara <jack@...e.cz> wrote:
>
> On Tue 23-09-25 09:16:07, Peter Zijlstra wrote:
> > On Mon, Sep 22, 2025 at 02:50:45PM -0700, Andrew Morton wrote:
> > > On Mon, 22 Sep 2025 11:08:32 -0700 Christoph Hellwig <hch@...radead.org> wrote:
> > >
> > > > On Mon, Sep 22, 2025 at 03:27:18PM +0200, Peter Zijlstra wrote:
> > > > > > Julian Sun (3):
> > > > > >   sched: Introduce a new flag PF_DONT_HUNG.
> > > > > >   writeback: Introduce wb_wait_for_completion_no_hung().
> > > > > >   memcg: Don't trigger hung task when memcg is releasing.
> > > > >
> > > > > This is all quite terrible. I'm not at all sure why a task that is
> > > > > genuinely not making progress and isn't killable should not be reported.
> > > >
> > > > > The hung device detector is way too aggressive for very slow I/O.
> > > > See blk_wait_io, which has been around for a long time to work
> > > > around just that.  Given that this series targets writeback I suspect
> > > > it is about an overloaded device as well.
> > >
> > > Yup, it's writeback - the bug report is in
> > > https://lkml.kernel.org/r/20250917212959.355656-1-sunjunchao@bytedance.com
> > >
> > > Memory is big and storage is slow, there's nothing wrong if a task
> > > which is designed to wait for writeback waits for a long time.
> > >
> > > Of course, there's something wrong if some other task which isn't
> > > designed to wait for writeback gets stuck waiting for the task which
> > > *is* designed to wait for writeback, but we'll still warn about that.
> > >
> > >
> > > Regarding an implementation, I'm wondering if we can put a flag in
> > > `struct completion' telling the hung task detector that this one is
> > > expected to wait for long periods sometimes.  Probably messy and it
> > > only works for completions (not semaphores, mutexes, etc).  Just
> > > putting it out there ;)
> >
> > So the problem is that there *is* progress (albeit rather slowly), the
> > watchdog just doesn't see that. Perhaps that is the thing we should look
> > at fixing.
> >
> > How about something like the below? That will 'spuriously' wake up the
> > waiters as long as there is some progress being made. Thereby increasing
> > the context switch counters of the tasks and thus the hung_task watchdog
> > sees progress.
> >
> > This approach should be safer than the blk_wait_io() hack, which has a
> > timer ticking, regardless of actual completions happening or not.
>
> I like the idea. The problem with your patch is that the progress is not
> visible with high enough granularity in wb_writeback_work->done completion.
> That is only incremented by 1, when say a request to writeout 1GB is queued
> and decremented by 1 when that 1GB is written. The progress can be observed
> with higher granularity by wb_writeback_work->nr_pages getting decremented
> as we submit pages for writeback but this counter still gets updated only
> once we are done with a particular inode so if all those 1GB of data are in
> one inode there wouldn't be much to observe. So we might need to observe
> how struct writeback_control member nr_to_write gets updated. That is
> really updated frequently on IO submission but each filesystem updates it
> in their writepages() function so implementing that gets messy pretty
> quickly.
>
> But maybe a good place to hook into for registering progress would be
> wbc_init_bio()? Filesystems call that whenever we create new bio for writeback
> purposes. We do have struct writeback_control available there so through
> that we could propagate information that forward progress is being made.
>
> What do people think?

Sorry for the late reply. Yes, Jan, I agree: your proposal is both
fine-grained and elegant. But do we really need progress tracking at
that level of detail?

In background writeback, for example, even when the bandwidth is very
low (e.g. avg_write_bandwidth=24), writeback_chunk_size() already
splits the work into chunks of MIN_WRITEBACK_PAGES (1024) pages. A
chunk of that size is usually small enough to avoid hung task warnings,
so reporting progress at chunk granularity might be sufficient.

I’m also a bit concerned that reporting progress on every
wbc_init_bio() call could cause excessive wakeups in normal or
high-throughput cases, which might have side effects. Please correct
me if I’m missing something.

>
>                                                                 Honza
>
> > ---
> >
> > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > index a07b8cf73ae2..1326193b4d95 100644
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -174,9 +174,10 @@ static void finish_writeback_work(struct wb_writeback_work *work)
> >               kfree(work);
> >       if (done) {
> >               wait_queue_head_t *waitq = done->waitq;
> > +             bool force_wake = (jiffies - done->stamp) > HZ/2;
> >
> >               /* @done can't be accessed after the following dec */
> > -             if (atomic_dec_and_test(&done->cnt))
> > +             if (atomic_dec_and_test(&done->cnt) || force_wake)
> >                       wake_up_all(waitq);
> >       }
> >  }
> > @@ -213,7 +214,7 @@ static void wb_queue_work(struct bdi_writeback *wb,
> >  void wb_wait_for_completion(struct wb_completion *done)
> >  {
> >       atomic_dec(&done->cnt);         /* put down the initial count */
> > -     wait_event(*done->waitq, !atomic_read(&done->cnt));
> > +     wait_event(*done->waitq, ({ done->stamp = jiffies; !atomic_read(&done->cnt); }));
> >  }
> >
> >  #ifdef CONFIG_CGROUP_WRITEBACK
> > diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
> > index 2ad261082bba..197593193ce3 100644
> > --- a/include/linux/backing-dev-defs.h
> > +++ b/include/linux/backing-dev-defs.h
> > @@ -63,6 +63,7 @@ enum wb_reason {
> >  struct wb_completion {
> >       atomic_t                cnt;
> >       wait_queue_head_t       *waitq;
> > +     unsigned long           stamp;
> >  };
> >
> >  #define __WB_COMPLETION_INIT(_waitq) \
> --
> Jan Kara <jack@...e.com>
> SUSE Labs, CR


Thanks,
-- 
Julian Sun <sunjunchao@...edance.com>