[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <b9fd9738-f4bd-4b34-88fd-7dfb7ed0c043@bytedance.com>
Date: Wed, 24 Sep 2025 18:36:53 +0800
From: Julian Sun <sunjunchao@...edance.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: cgroups@...r.kernel.org, linux-kernel@...r.kernel.org,
akpm@...ux-foundation.org, lance.yang@...ux.dev, mhiramat@...nel.org,
yangyicong@...ilicon.com, will@...nel.org, dianders@...omium.org,
mingo@...nel.org, lihuafei1@...wei.com, hannes@...xchg.org,
mhocko@...nel.org, roman.gushchin@...ux.dev, shakeel.butt@...ux.dev,
muchun.song@...ux.dev, tj@...nel.org, brauner@...nel.org
Subject: Re: [External] Re: [PATCH v2 2/2] memcg: Don't trigger hung task
warnings when memcg is releasing resources.
On 9/24/25 4:28 PM, Peter Zijlstra wrote:
Hi,
> On Wed, Sep 24, 2025 at 03:50:41PM +0800, Julian Sun wrote:
>> On 9/24/25 2:32 PM, Peter Zijlstra wrote:
>>> On Wed, Sep 24, 2025 at 11:41:00AM +0800, Julian Sun wrote:
>>>> Hung task warning in mem_cgroup_css_free() is undesirable and
>>>> unnecessary since the behavior of waiting for a long time is
>>>> expected.
>>>>
>>>> Use touch_hung_task_detector() to eliminate the possible
>>>> hung task warning.
>>>>
>>>> Signed-off-by: Julian Sun <sunjunchao@...edance.com>
>>>
>>> Still hate this. It is not tied to progress. If progress really stalls,
>>> no warning will be given.
>>
>> Hi, peter
>>
>> Thanks for your review and comments.
>>
>> I did take a look at your solution provided yesterday, and get your point.
>> However AFAICS it can't resolve the unexpected warnings here. Because it
>> only works after we reach the finish_writeback_work(), and the key point
>> here is, it *already* takes a long time before we reach
>> finish_writeback_work(), and there is true progress before finish the
>> writeback work that hung task detector still can not know.
>
> But wb_split_bdi_pages() should already split things into smaller chunks
> if there is low bandwidth, right? And we call finish_writeback_work()
> for each chunk.
AFAICS, wb_split_bdi_pages() will only be invoked in the sync scenarios,
and not in the background writeback scenarios - which is exactly the
case here.
And I noticed that there's something similar in background writeback,
where writeback_chunk_size() will split all pages into several chunks,
the min chunk size is 1024(MIN_WRITEBACK_PAGES) pages. The difference
from wb_split_bdi_pages() is that we don't report progress after
finishing the writeback of a chunk.
>
> If a chunk is still taking too long, surely the solution is to use
> smaller chunks?
Yeah it still takes a long time, I checked the write_bandwidth and
avg_write_bandwidth when warning was triggered:
>>> wb.write_bandwidth
(unsigned long)24
>>> wb.avg_write_bandwidth
(unsigned long)24
>>> wb.write_bandwidth
(unsigned long)13
>>> wb.write_bandwidth
(unsigned long)13
At this bandwidth, it will still takes a lot of seconds to write back
MIN_WRITEBACK_PAGES pages.
So it might be a solution, but given the fact that the current minimum
chunk size (1024) has been in place for over ten years, and that making
it smaller would probably have a negative impact on performance. I'm
afraid the filesystem maintainers will not accept this change.
If we don’t modify this part but can report progress after finishing the
chunk writeback, it should probably eliminate most of the unexpected
warnings.
>
>> If we want to make the hung task detector to known the progress of writeback
>> work, we need to add some code within do_writepages(): after each finish of
>> a_ops->writepages(), we need to make detector to known there's progress.
>> Something like this:
>>
>> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
>> index 3e248d1c3969..49572a83c47b 100644
>> --- a/mm/page-writeback.c
>> +++ b/mm/page-writeback.c
>> @@ -2635,6 +2635,10 @@ int do_writepages(struct address_space *mapping,
>> struct writeback_control *wbc)
>> else
>> /* deal with chardevs and other special files */
>> ret = 0;
>> + /* Make hung task detector to known there's progress. */
>> + if (force_wake)
>> + wake_up_all(waitq);
>> +
>> if (ret != -ENOMEM || wbc->sync_mode != WB_SYNC_ALL)
>> break;
>>
>> which has a big impact on current code - I don't want to introduce this.
>
> You sure? It looks to me like the next level down is wb_writeback() and
> writeback_sb_inodes(), and those already have time based breaks in and
> still have access to wb_writeback_work::done, while do_writepages() no
> longer has that context.
Yeah, exactly. What I mean is report progress within the whole writeback
work, either writeback_sb_inodes() or do_writepages() is ok.
>
>> Yes, the behavior in this patch does have the possibility to paper cover the
>> real warnings, and what I want to argue is that the essence of this patch is
>> the same as the current touch_nmi_watchdog() and touch_softlockup_watchdog()
>> - these functions are used only in specific scenarios we known and only
>> affect a single event. And there seems no report that
>> touch_nmi/softlockup_watchdog() will paper cover the real warnings (do we?).
>>
>> Correct me if there's anything I'm missing or misunderstanding.
>
> The thing with touch_nmi_watchdog() is that you need to keep doing it.
> The moment you stop calling touch_nmi_watchdog(), you will cause it to
> fire.
>
> That is very much in line with the thing I proposed, and rather unlike
> your proposal that blanket kill reporting for the task, irrespective of
> how long it sits there waiting.
>
Thanks for clarification. So how about the following solution?
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index a07b8cf73ae2..e0698fd3f9ab 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -14,6 +14,7 @@
* Additions for address_space-based writeback
*/
+#include <linux/sched/sysctl.h>
#include <linux/kernel.h>
#include <linux/export.h>
#include <linux/spinlock.h>
@@ -213,7 +214,7 @@ static void wb_queue_work(struct bdi_writeback *wb,
void wb_wait_for_completion(struct wb_completion *done)
{
atomic_dec(&done->cnt); /* put down the initial count */
- wait_event(*done->waitq, !atomic_read(&done->cnt));
+ wait_event(*done->waitq, (done->stamp = jiffies;
!atomic_read(&done->cnt)));
}
#ifdef CONFIG_CGROUP_WRITEBACK
@@ -1975,6 +1976,11 @@ static long writeback_sb_inodes(struct
super_block *sb,
*/
__writeback_single_inode(inode, &wbc);
+ /* Report progress to make hung task detector know it. */
+ if (jiffies - work->done->stamp >
+ HZ * sysctl_hung_task_timeout_secs / 2)
+ wake_up_all(work->done->waitq);
+
wbc_detach_inode(&wbc);
work->nr_pages -= write_chunk - wbc.nr_to_write;
wrote = write_chunk - wbc.nr_to_write - wbc.pages_skipped;
Instead of waking up all waiting threads every half second, we only wake
them up if the writeback work lasts for the value of
sysctl_hung_task_timeout_secs / 2 seconds to reduce possible overhead.
Hi, Jan, Christian, how do you think about it?
Please correct me if there's anything I'm missing or misunderstanding.
Thanks,
--
Julian Sun <sunjunchao@...edance.com>
Powered by blists - more mailing lists