Message-ID: <dndr5xdp3bweqtwlyixtzajxgkhxbt2qb2fzg6o2wy5msrhzi4@h3klek5hff5i>
Date: Wed, 24 Sep 2025 12:34:28 +0200
From: Jan Kara <jack@...e.cz>
To: Peter Zijlstra <peterz@...radead.org>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
Christoph Hellwig <hch@...radead.org>, Julian Sun <sunjunchao@...edance.com>, cgroups@...r.kernel.org,
linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org, viro@...iv.linux.org.uk,
brauner@...nel.org, jack@...e.cz, mingo@...hat.com, juri.lelli@...hat.com,
vincent.guittot@...aro.org, dietmar.eggemann@....com, rostedt@...dmis.org,
bsegall@...gle.com, mgorman@...e.de, vschneid@...hat.com, lance.yang@...ux.dev,
mhiramat@...nel.org, agruenba@...hat.com, hannes@...xchg.org, mhocko@...nel.org,
roman.gushchin@...ux.dev, shakeel.butt@...ux.dev, muchun.song@...ux.dev
Subject: Re: [PATCH 0/3] Suppress undesirable hung task warnings.
On Tue 23-09-25 09:16:07, Peter Zijlstra wrote:
> On Mon, Sep 22, 2025 at 02:50:45PM -0700, Andrew Morton wrote:
> > On Mon, 22 Sep 2025 11:08:32 -0700 Christoph Hellwig <hch@...radead.org> wrote:
> >
> > > On Mon, Sep 22, 2025 at 03:27:18PM +0200, Peter Zijlstra wrote:
> > > > > Julian Sun (3):
> > > > > sched: Introduce a new flag PF_DONT_HUNG.
> > > > > writeback: Introduce wb_wait_for_completion_no_hung().
> > > > > memcg: Don't trigger hung task when memcg is releasing.
> > > >
> > > > This is all quite terrible. I'm not at all sure why a task that is
> > > > genuinely not making progress and isn't killable should not be reported.
> > >
> > > The hung device detector is way too aggressive for very slow I/O.
> > > See blk_wait_io, which has been around for a long time to work
> > > around just that. Given that this series targets writeback I suspect
> > > it is about an overloaded device as well.
> >
> > Yup, it's writeback - the bug report is in
> > https://lkml.kernel.org/r/20250917212959.355656-1-sunjunchao@bytedance.com
> >
> > Memory is big and storage is slow, there's nothing wrong if a task
> > which is designed to wait for writeback waits for a long time.
> >
> > Of course, there's something wrong if some other task which isn't
> > designed to wait for writeback gets stuck waiting for the task which
> > *is* designed to wait for writeback, but we'll still warn about that.
> >
> >
> > Regarding an implementation, I'm wondering if we can put a flag in
> > `struct completion' telling the hung task detector that this one is
> > expected to wait for long periods sometimes. Probably messy and it
> > only works for completions (not semaphores, mutexes, etc). Just
> > putting it out there ;)
>
> So the problem is that there *is* progress (albeit rather slowly), the
> watchdog just doesn't see that. Perhaps that is the thing we should look
> at fixing.
>
> How about something like the below? That will 'spuriously' wake up the
> waiters as long as there is some progress being made. Thereby increasing
> the context switch counters of the tasks and thus the hung_task watchdog
> sees progress.
>
> This approach should be safer than the blk_wait_io() hack, which has a
> timer ticking, regardless of actual completions happening or not.

I like the idea. The problem with your patch is that progress is not
visible with fine enough granularity in the wb_writeback_work->done
completion. Its count is incremented by 1 when, say, a request to write out
1GB is queued, and decremented by 1 only when that whole 1GB has been
written. Progress can be observed at a finer granularity through
wb_writeback_work->nr_pages, which is decremented as we submit pages for
writeback. But that counter is still updated only once we are done with a
particular inode, so if all of that 1GB sits in one inode there won't be
much to observe. So we might need to observe how the nr_to_write member of
struct writeback_control gets updated. That one really is updated
frequently on IO submission, but each filesystem updates it in its own
writepages() function, so implementing that gets messy pretty quickly.

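To make the granularity argument concrete, here is a toy userspace model,
not kernel code. The struct, the function name, and the batch size are all
made up for illustration; it only counts how often each of the three
counters would change while a request is written out.

```c
#include <assert.h>

/* Toy userspace model only: these are NOT the kernel structures. */
struct progress_model {
	long done_updates;        /* stands in for wb_writeback_work->done */
	long nr_pages_updates;    /* stands in for wb_writeback_work->nr_pages */
	long nr_to_write_updates; /* stands in for writeback_control->nr_to_write */
};

/*
 * Model one writeback request covering `inodes` inodes of
 * `pages_per_inode` pages each, with IO submitted in batches of
 * `batch` pages (the batch size is an assumption of this model).
 */
static struct progress_model model_writeback(long inodes,
					     long pages_per_inode,
					     long batch)
{
	struct progress_model m = { 0, 0, 0 };

	assert(inodes > 0 && pages_per_inode > 0 && batch > 0);

	for (long i = 0; i < inodes; i++) {
		/* nr_to_write moves on every IO submission */
		for (long left = pages_per_inode; left > 0; left -= batch)
			m.nr_to_write_updates++;
		/* nr_pages moves only once the inode is finished */
		m.nr_pages_updates++;
	}
	/* the completion is dropped once, when the whole request is done */
	m.done_updates = 1;
	return m;
}
```

With 1GB in a single inode (262144 pages of 4KiB) and an assumed batch of
1024 pages, the model sees one update each for done and nr_pages but 256
updates of nr_to_write, which is why only the last one looks useful for a
watchdog.
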
But maybe a good place to hook into for registering progress would be
wbc_init_bio()? Filesystems call that whenever they create a new bio for
writeback purposes. We do have struct writeback_control available there, so
through it we could propagate the information that forward progress is
being made.
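
Very roughly, and only as a userspace sketch of the idea: every name below
is hypothetical, the progress pointer is not an existing field of struct
writeback_control, and the model function does not mirror the real
wbc_init_bio() signature.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct writeback_control. */
struct wbc_model {
	unsigned long *progress; /* assumed field: shared progress counter */
};

/* Hypothetical stand-in for the hook: called once per new writeback bio. */
static void model_wbc_init_bio(struct wbc_model *wbc)
{
	assert(wbc != 0);
	if (wbc->progress)
		(*wbc->progress)++; /* record that IO is still being submitted */
}

/* Hypothetical watchdog-side state: last counter value it observed. */
struct watchdog_model {
	unsigned long last_seen;
};

/* Returns true if the counter moved since the previous check. */
static bool model_made_progress(struct watchdog_model *wd,
				const unsigned long *progress)
{
	bool moved = (*progress != wd->last_seen);

	wd->last_seen = *progress;
	return moved;
}
```

The point of the sketch is only that the waiter (or the watchdog) would
compare a counter that moves per bio, rather than a completion that moves
once per request.
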
What do people think?

								Honza

> ---
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index a07b8cf73ae2..1326193b4d95 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -174,9 +174,10 @@ static void finish_writeback_work(struct wb_writeback_work *work)
>  		kfree(work);
>  	if (done) {
>  		wait_queue_head_t *waitq = done->waitq;
> +		bool force_wake = (jiffies - done->stamp) > HZ/2;
>  
>  		/* @done can't be accessed after the following dec */
> -		if (atomic_dec_and_test(&done->cnt))
> +		if (atomic_dec_and_test(&done->cnt) || force_wake)
>  			wake_up_all(waitq);
>  	}
>  }
> @@ -213,7 +214,7 @@ static void wb_queue_work(struct bdi_writeback *wb,
>  void wb_wait_for_completion(struct wb_completion *done)
>  {
>  	atomic_dec(&done->cnt);		/* put down the initial count */
> -	wait_event(*done->waitq, !atomic_read(&done->cnt));
> +	wait_event(*done->waitq, ({ done->stamp = jiffies; !atomic_read(&done->cnt); }));
>  }
>
> #ifdef CONFIG_CGROUP_WRITEBACK
> diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
> index 2ad261082bba..197593193ce3 100644
> --- a/include/linux/backing-dev-defs.h
> +++ b/include/linux/backing-dev-defs.h
> @@ -63,6 +63,7 @@ enum wb_reason {
>  struct wb_completion {
>  	atomic_t		cnt;
>  	wait_queue_head_t	*waitq;
> +	unsigned long		stamp;
>  };
>
> #define __WB_COMPLETION_INIT(_waitq) \
--
Jan Kara <jack@...e.com>
SUSE Labs, CR