linux-kernel - Re: [RFC PATCH v1 0/9] freezer: Introduce freeze priority model to address process dependency issues

Message-ID: <09df0911-9421-40af-8296-de1383be1c58@kylinos.cn>
Date: Mon, 11 Aug 2025 17:13:43 +0800
From: Zihuan Zhang <zhangzihuan@...inos.cn>
To: Michal Hocko <mhocko@...e.com>
Cc: "Rafael J . Wysocki" <rafael@...nel.org>,
 Peter Zijlstra <peterz@...radead.org>, Oleg Nesterov <oleg@...hat.com>,
 David Hildenbrand <david@...hat.com>, Jonathan Corbet <corbet@....net>,
 Ingo Molnar <mingo@...hat.com>, Juri Lelli <juri.lelli@...hat.com>,
 Vincent Guittot <vincent.guittot@...aro.org>,
 Dietmar Eggemann <dietmar.eggemann@....com>,
 Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>,
 Mel Gorman <mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>,
 len brown <len.brown@...el.com>, pavel machek <pavel@...nel.org>,
 Kees Cook <kees@...nel.org>, Andrew Morton <akpm@...ux-foundation.org>,
 Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
 "Liam R . Howlett" <Liam.Howlett@...cle.com>,
 Vlastimil Babka <vbabka@...e.cz>, Mike Rapoport <rppt@...nel.org>,
 Suren Baghdasaryan <surenb@...gle.com>,
 Catalin Marinas <catalin.marinas@....com>, Nico Pache <npache@...hat.com>,
 xu xin <xu.xin16@....com.cn>, wangfushuai <wangfushuai@...du.com>,
 Andrii Nakryiko <andrii@...nel.org>, Christian Brauner <brauner@...nel.org>,
 Thomas Gleixner <tglx@...utronix.de>, Jeff Layton <jlayton@...nel.org>,
 Al Viro <viro@...iv.linux.org.uk>, Adrian Ratiu
 <adrian.ratiu@...labora.com>, linux-pm@...r.kernel.org, linux-mm@...ck.org,
 linux-fsdevel@...r.kernel.org, linux-doc@...r.kernel.org,
 linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH v1 0/9] freezer: Introduce freeze priority model to
 address process dependency issues


On 2025/8/8 16:58, Michal Hocko wrote:
> On Fri 08-08-25 15:52:31, Zihuan Zhang wrote:
>> On 2025/8/8 15:00, Michal Hocko wrote:
>>> On Fri 08-08-25 09:13:30, Zihuan Zhang wrote:
>>> [...]
>>>> However, in practice, we’ve observed cases where tasks appear stuck in
>>>> uninterruptible sleep (D state) during the freeze phase  — and thus cannot
>>>> respond to signals or enter the refrigerator. These tasks are technically
>>>> TASK_FREEZABLE, but due to the nature of their sleep state, they don’t
>>>> freeze promptly, and may require multiple retry rounds, or cause the entire
>>>> suspend to fail.
>>> Right, but that is an inherent problem of the freezer implementation.
>>> It is not really clear to me how priorities or layers improve on that.
>>> Could you please elaborate on that?
>> Thanks for the follow-up.
>>
>> In our observations, processes like Xorg are in a normal state before
>> freezing begins but enter D state during the freeze window. Upon
>> investigation, we found that these processes often depend on other user
>> processes (e.g., I/O helpers or system services), and when those
>> dependencies are frozen first, the dependent process (like Xorg) gets
>> stuck and can’t be frozen itself.
> OK, I see.
>
>> This led us to treat such processes as “hard to freeze” tasks — not because
>> they’re inherently unfreezable, but because they are more likely to become
>> problematic if not frozen early enough.
>>
>> So our model works as follows:
>>      •    By default, freezer tries to freeze all freezable tasks in each
>> round.
>>      •    With our approach, we only attempt to freeze tasks whose
>> freeze_priority is less than or equal to the current round number.
>>      •    This ensures that higher-priority (i.e., harder-to-freeze) tasks
>> are attempted earlier, increasing the chance that they freeze before being
>> blocked by others.
>>
>> Since we cannot know in advance which tasks will be difficult to freeze, we
>> use heuristics:
>>      •    Any task that causes freeze failure or is found in D state during
>> the freeze window is treated as hard-to-freeze in the next attempt and its
>> priority is increased.
>>      •    Additionally, users can manually raise/reduce the freeze priority
>> of known problematic tasks via an exposed sysfs interface, giving them
>> fine-grained control.
> This would have been very useful information for the changelog, so that
> we can understand what you are trying to achieve.
>
Got it, I’ll add that info to the changelog. Thanks!
>> This doesn’t change the fundamental logic of the freezer (it still retries
>> until all tasks are frozen), but by adjusting the traversal order, we’ve
>> observed significantly fewer retries and more reliable success in
>> scenarios where these D state transitions occur.
>   
> OK, I believe I do understand what you are trying to achieve but I am
> not convinced this is a robust way to deal with the problem. This all
> seems highly timing specific and might work in a very specific use case,
> but you are essentially trying to fight tiny race windows with a very
> probabilistic interface.

Our approach does not conflict with fixing the root causes. We plan to
keep the freeze priority mechanism disabled by default and enable it only
when issues arise, so that the existing code flow stays unchanged as far
as possible. It acts as a fallback mechanism.

We acknowledge that the causes of D-state tasks are complex and take
considerable effort to resolve fully, something the current freezer
mechanism cannot achieve. Our solution is low-cost and catches some of the
problematic tasks effectively.
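
As an illustration of the selection rule described earlier in this thread
(only tasks whose freeze_priority is less than or equal to the current
round number are attempted), here is a minimal user-space simulation in
Python. It is not the code from the patch series: the Task class, the
function names, and the example priorities are hypothetical, and it assumes
that a lower freeze_priority value means the task is attempted in an
earlier round.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    # Assumption: a lower value means "attempt in an earlier round".
    freeze_priority: int
    frozen: bool = False


def tasks_for_round(tasks, round_no):
    """Return the not-yet-frozen tasks that are eligible in this round."""
    return [t for t in tasks if not t.frozen and t.freeze_priority <= round_no]


def freeze_in_rounds(tasks, max_rounds=10):
    """Freeze tasks round by round and return the order they were frozen in."""
    order = []
    for round_no in range(max_rounds):
        for task in tasks_for_round(tasks, round_no):
            task.frozen = True  # stand-in for a successful freeze attempt
            order.append(task.name)
        if all(t.frozen for t in tasks):
            break
    return order


# Hypothetical example: Xorg is marked hard-to-freeze, so it goes first.
tasks = [Task("io-helper", 1), Task("Xorg", 0), Task("shell", 1)]
order = freeze_in_rounds(tasks)
print(order)
```

In this toy run Xorg is attempted in round 0, before the helper it depends
on, which is the ordering the priority model aims for.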

> Also the interface seems to be really coarse grained and it can easily
> turn out insufficient for other usecases while it is not entirely clear
> to me how this could be extended for those.

We recognize that the current interface is relatively coarse-grained
and may not be sufficient for all scenarios. The present implementation
is a basic version.

Our plan is to introduce a classification-based mechanism that assigns 
different freeze priorities according to process categories. For 
example, filesystem and graphics-related processes will be given higher 
default freeze priority, as they are critical in the freezing workflow. 
This classification approach helps target important processes more 
precisely.

However, this requires further testing and refinement before full 
deployment. We believe this incremental, category-based design will make 
the mechanism more effective and adaptable over time while keeping it 
manageable.
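
A rough sketch of what such a category-to-priority table might look like,
written in Python for brevity. The category names, the numeric values, and
the function name are all hypothetical. The sketch assumes lower numbers
are attempted in earlier freeze rounds, and the real classification is
still to be designed and tested.

```python
# Hypothetical defaults, for illustration only.
DEFAULT_FREEZE_PRIORITY = 1
CATEGORY_FREEZE_PRIORITY = {
    "filesystem": 0,  # attempted in the first round
    "graphics": 0,    # attempted in the first round
}


def default_freeze_priority(category):
    """Look up the default priority for a process category."""
    return CATEGORY_FREEZE_PRIORITY.get(category, DEFAULT_FREEZE_PRIORITY)


for category in ("graphics", "filesystem", "other"):
    print(category, default_freeze_priority(category))
```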
> I believe it would be more useful to find sources of those freezer
> blockers and try to address those. Making more blocked tasks
> __set_task_frozen compatible sounds like a general improvement in
> itself.

We have already identified some causes of D-state tasks, many of which
are related to the filesystem. On some systems, certain processes 
frequently execute ext4_sync_file, and under contention this can lead to 
D-state tasks.

[ 6616.650482] task:ThreadPoolForeg state:D stack:0     pid:262026 tgid:4065  ppid:2490   task_flags:0x400040 flags:0x00004004
[ 6616.650485] Call Trace:
[ 6616.650486]  <TASK>
[ 6616.650489]  __schedule+0x532/0xea0
[ 6616.650494]  schedule+0x27/0x80
[ 6616.650496]  jbd2_log_wait_commit+0xa6/0x120
[ 6616.650499]  ? __pfx_autoremove_wake_function+0x10/0x10
[ 6616.650502]  ext4_sync_file+0x1ba/0x380
[ 6616.650505]  do_fsync+0x3b/0x80
[ 6616.650507]  __x64_sys_fdatasync+0x17/0x20
[ 6616.650509]  do_syscall_64+0x7d/0x2c0
[ 6616.650512]  ? syscall_exit_work+0x108/0x140
[ 6616.650515]  ? do_syscall_64+0x1f3/0x2c0
[ 6616.650517]  ? syscall_exit_work+0x108/0x140
[ 6616.650519]  ? do_syscall_64+0x1d5/0x2c0
[ 6616.650522]  ? audit_reset_context.part.0+0x284/0x2f0
[ 6616.650524]  ? syscall_exit_work+0x108/0x140
[ 6616.650527]  ? do_syscall_64+0x1f3/0x2c0
[ 6616.650529]  ? futex_unqueue+0x4e/0x80
[ 6616.650531]  ? __futex_wait+0x9b/0x100
[ 6616.650534]  ? __pfx_futex_wake_mark+0x10/0x10
[ 6616.650536]  ? timerqueue_del+0x2e/0x50
[ 6616.650539]  ? __remove_hrtimer+0x39/0x70
[ 6616.650542]  ? hrtimer_try_to_cancel+0x85/0x100
[ 6616.650544]  ? hrtimer_cancel+0x15/0x30
[ 6616.650546]  ? futex_wait+0x7d/0x110
[ 6616.650549]  ? __pfx_hrtimer_wakeup+0x10/0x10
[ 6616.650552]  ? audit_reset_context.part.0+0x284/0x2f0
[ 6616.650554]  ? syscall_exit_work+0x108/0x140
[ 6616.650556]  ? do_syscall_64+0x1d5/0x2c0
[ 6616.650558]  ? switch_fpu_return+0x4f/0xd0
[ 6616.650560]  ? do_syscall_64+0x1d5/0x2c0
[ 6616.650563]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 6616.650565] RIP: 0033:0x7f095ef8f3eb
[ 6616.650567] RSP: 002b:00007f07409fa360 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[ 6616.650569] RAX: ffffffffffffffda RBX: 00000d38021f03a0 RCX: 00007f095ef8f3eb
[ 6616.650570] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000000000009a
[ 6616.650571] RBP: 00007f07409fa410 R08: 0000000000000000 R09: 00007f07409fa570
[ 6616.650572] R10: 00007f0960a60000 R11: 0000000000000293 R12: 00000d38021f0380
[ 6616.650573] R13: 000055c28c70b400 R14: 00007f07409fa3a0 R15: 00007f07409fa380
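
To make the blocking path in the trace above easier to read, the following
Python sketch keeps only the frames that are not marked with a leading "?"
(the marker the kernel uses for unreliable stack entries). The regular
expression and the function name are illustrative assumptions, and the
sample is an excerpt of the trace above.

```python
import re

# Matches "[ time]  [? ]symbol+0xoff/0xsize" lines from a kernel stack dump.
FRAME = re.compile(
    r"^\[\s*\d+\.\d+\]\s+(\?\s+)?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def reliable_frames(trace):
    """Return symbol names of frames that are not prefixed with '?'."""
    frames = []
    for line in trace.splitlines():
        match = FRAME.match(line.strip())
        if match and not match.group(1):
            frames.append(match.group(2))
    return frames


sample = """[ 6616.650489]  __schedule+0x532/0xea0
[ 6616.650494]  schedule+0x27/0x80
[ 6616.650496]  jbd2_log_wait_commit+0xa6/0x120
[ 6616.650499]  ? __pfx_autoremove_wake_function+0x10/0x10
[ 6616.650502]  ext4_sync_file+0x1ba/0x380
[ 6616.650505]  do_fsync+0x3b/0x80
[ 6616.650507]  __x64_sys_fdatasync+0x17/0x20"""

frames = reliable_frames(sample)
print(" <- ".join(frames))
```

For this trace the remaining frames show fdatasync() reaching
ext4_sync_file() and then sleeping in jbd2_log_wait_commit(), which matches
ORIG_RAX 0x4b (75, fdatasync on x86_64).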


The kernel already supports freezing filesystems during suspend, which can
address this problem, but it is quite expensive: enabling it increases the
suspend time by about 3 to 4 seconds in our tests. We are therefore
exploring lower-cost approaches that mitigate the issue without such a
heavy performance impact.

root@...waxy-pc:/sys/power# echo 1 > freeze_filesystems
root@...waxy-pc:/sys/power# sudo dmesg | grep -E 'suspend'
[ 9844.984658] PM: suspend entry (deep)
[ 9850.998197] PM: suspend exit

root@...waxy-pc:/sys/power# echo 0 > freeze_filesystems
root@...waxy-pc:/sys/power# sudo dmesg | grep -E 'suspend'
[ 9893.928486] PM: suspend entry (deep)
[ 9896.239425] PM: suspend exit
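
The cost can be read directly from the two dmesg excerpts above. The
following Python sketch (the function name and regular expression are
illustrative) subtracts the "suspend entry" timestamp from the "suspend
exit" timestamp in each log.

```python
import re

# Matches "[ time] PM: suspend entry ..." and "[ time] PM: suspend exit".
LINE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+PM: suspend (entry|exit)")


def suspend_duration(log):
    """Return seconds between the 'suspend entry' and 'suspend exit' lines."""
    stamps = {}
    for line in log.splitlines():
        match = LINE.match(line.strip())
        if match:
            stamps[match.group(2)] = float(match.group(1))
    return stamps["exit"] - stamps["entry"]


with_fs_freeze = """[ 9844.984658] PM: suspend entry (deep)
[ 9850.998197] PM: suspend exit"""

without_fs_freeze = """[ 9893.928486] PM: suspend entry (deep)
[ 9896.239425] PM: suspend exit"""

slow = suspend_duration(with_fs_freeze)
fast = suspend_duration(without_fs_freeze)
print(f"{slow:.2f} s vs {fast:.2f} s, difference {slow - fast:.2f} s")
```

That is roughly 6.01 s with freeze_filesystems enabled versus 2.31 s
without, a difference of about 3.7 s.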

> Thanks