Message-ID: <6350624c-7293-43de-8788-e52a236d91fb@kylinos.cn>
Date: Sun, 8 Jun 2025 15:22:20 +0800
From: zhangzihuan <zhangzihuan@...inos.cn>
To: David Hildenbrand <david@...hat.com>, rafael@...nel.org,
 len.brown@...el.com, pavel@...nel.org, kees@...nel.org, mingo@...hat.com,
 peterz@...radead.org, juri.lelli@...hat.com, vincent.guittot@...aro.org,
 dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
 mgorman@...e.de, vschneid@...hat.com, akpm@...ux-foundation.org,
 lorenzo.stoakes@...cle.com, Liam.Howlett@...cle.com, vbabka@...e.cz,
 rppt@...nel.org, surenb@...gle.com, mhocko@...e.com
Cc: linux-pm@...r.kernel.org, linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [RFC PATCH] PM: Optionally block user fork during freeze to
 improve performance

Hi David,
Thanks for your feedback!

On 2025/6/6 15:20, David Hildenbrand wrote:
> Hi,
>
> On 06.06.25 08:25, Zihuan Zhang wrote:
>> Currently, the freezer traverses all tasks to freeze them during
>> system suspend or hibernation. If a user process forks during this
>> window, the new child may escape freezing and require a second
>> traversal or retry, adding non-trivial overhead.
>>
>> This patch introduces a CONFIG_PM_DISABLE_USER_FORK_DURING_FREEZE
>
> Not sure if a Kconfig is really the right choice here ...
>
I understand your concern. My initial thinking was to provide an opt-in 
configuration so that platforms sensitive to resume performance (or 
under constrained suspend time budgets) can selectively enable this 
behavior.
However, I agree that a runtime mechanism or a default-on behavior gated 
by suspend state might be cleaner. I'm happy to rework it in that 
direction, e.g., based on pm_freezing or a similar runtime flag.

>> option. When enabled, it prevents user processes from creating new
>> processes (via fork/clone) during the freezing period. This guarantees
>> a stable task list and avoids re-traversing the process list due to
>> late-created user tasks, thereby improving performance.
>
> Any performance numbers to back your claims?
>
We've completed the performance testing. To simulate a process escape
scenario, we created a test environment where a large number of fork
operations are triggered right before the freeze phase begins.

A few details worth mentioning:
     •    We avoided creating too many processes at once to prevent
          resource exhaustion.
     •    To increase the likelihood of hitting the freeze window
          precisely, we skipped past the filesystem freezing time in the
          simulation (the short initial delay in the test program).
     •    We also added a small delay to each process, ensuring they
          don't all complete their fork operations before the system
          enters suspend/hibernate.
     •    Before starting the tests, we also added a debug print in
          try_to_freeze_tasks() to log the number of freeze retry
          attempts.
--- begin test code ---

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>
#include <sys/types.h>
#include <errno.h>
#include <string.h>
#include <time.h>
#define TOTAL_FORKS        1000      // total number
#define BATCH_SIZE          10       // Number of forks per batch
#define FORK_INTERVAL_US   300       // Each fork interval (microseconds)
#define CHILD_LIFETIME_SEC 60        // Subprocess runtime (seconds)
void usleep_precise(int usec) {
     struct timespec ts;
     ts.tv_sec = usec / 1000000;
     ts.tv_nsec = (usec % 1000000) * 1000;
     nanosleep(&ts, NULL);
}
void random_delay_ms(int min_ms, int max_ms) {
     int delay = min_ms + rand() % (max_ms - min_ms + 1);
     usleep_precise(delay * 1000);
}
void random_delay_us(int min_us, int max_us) {
     int delay = min_us + rand() % (max_us - min_us + 1);
     usleep_precise(delay);
}
int main() {
     int count = 0;
     int batch = 0;
     printf("Starting enhanced fork storm test...\n");
     printf("  Total forks:      %d\n", TOTAL_FORKS);
     printf("  Batch size:       %d\n", BATCH_SIZE);
     printf("  Fork interval:    %d us\n", FORK_INTERVAL_US);
     printf("  Child lifetime:   %d sec\n\n", CHILD_LIFETIME_SEC);

     random_delay_ms(2, 10); // Skip Filesystem freeze
     while (count < TOTAL_FORKS) {
         printf("Starting batch %d...\n", ++batch);

         for (int i = 0; i < BATCH_SIZE && count < TOTAL_FORKS; i++, count++) {
             pid_t pid = fork();

             if (pid == 0) {
                 printf("Child #%d (pid=%d) started\n", count, getpid());
                 // sleep(CHILD_LIFETIME_SEC);
                 pause();
                 exit(0);
             } else if (pid < 0) {
                 fprintf(stderr, "fork failed at %d: %s\n", count, strerror(errno));
                 exit(1);
             }

         }

         usleep_precise(50);
         printf("Batch %d completed. Total forked so far: %d\n", batch, count);
     }

     printf("All %d children created. Parent process sleeping...\n", TOTAL_FORKS);
     pause();
     return 0;

}

--- end test code ---

Then compile the code and run the test script.

gcc  -o  slow_fork  slow_fork.c

--- begin test code ---

#!/bin/bash
LOOPS=20
DELAY_BETWEEN_RUNS=1
NUM_FORKS_PER_ROUND=10
FORK_PIDS=()
echo freezer > /sys/power/pm_test
echo 3 > /sys/module/suspend/parameters/pm_test_delay
for ((i=1; i<=LOOPS; i++)); do
echo "===== Test round $i/$LOOPS ====="

FORK_PIDS=()
for ((j=1; j<=NUM_FORKS_PER_ROUND; j++)); do
     ./slow_fork &
     FORK_PIDS+=($!)
     echo "  Launched slow_fork #$j (pid=${FORK_PIDS[-1]})"
done

echo mem > /sys/power/state

for pid in "${FORK_PIDS[@]}"; do
     kill "$pid" 2>/dev/null
done

for pid in "${FORK_PIDS[@]}"; do
     wait "$pid" 2>/dev/null
done

echo "Round $i complete. Waiting ${DELAY_BETWEEN_RUNS}s..."
sleep $DELAY_BETWEEN_RUNS

done
pkill slow_fork
echo "==== All $LOOPS rounds complete ===="

--- end test code ---

The results look like this:

dmesg | grep -E 'elap|Files|retry'
[  585.255784] Filesystems sync: 0.010 seconds
[  585.261620] Freezing user space processes completed (elapsed 0.005 seconds)
[  585.263530] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
[  589.323691] Filesystems sync: 0.012 seconds
[  589.336983] Freeing user space processes todo:0 retry:2
[  589.336996] Freezing user space processes completed (elapsed 0.013 seconds)
[  589.342628] Freezing remaining freezable tasks completed (elapsed 0.005 seconds)
[  593.424317] Filesystems sync: 0.011 seconds
[  593.446210] Freeing user space processes todo:0 retry:2
[  593.446227] Freezing user space processes completed (elapsed 0.021 seconds)
[  593.454303] Freezing remaining freezable tasks completed (elapsed 0.008 seconds)
[  597.528491] Filesystems sync: 0.012 seconds
[  597.561179] Freeing user space processes todo:0 retry:2
[  597.561200] Freezing user space processes completed (elapsed 0.032 seconds)
[  597.570157] Freezing remaining freezable tasks completed (elapsed 0.008 seconds)
[  601.645391] Filesystems sync: 0.010 seconds
[  601.682653] Freeing user space processes todo:0 retry:2
[  601.682671] Freezing user space processes completed (elapsed 0.037 seconds)
[  601.694401] Freezing remaining freezable tasks completed (elapsed 0.011 seconds)
[  605.789844] Filesystems sync: 0.011 seconds
[  605.830030] Freezing user space processes completed (elapsed 0.039 seconds)
[  605.843602] Freezing remaining freezable tasks completed (elapsed 0.013 seconds)
[  609.942143] Filesystems sync: 0.017 seconds
[  609.997859] Freeing user space processes todo:0 retry:2
[  609.997875] Freezing user space processes completed (elapsed 0.055 seconds)
[  610.016413] Freezing remaining freezable tasks completed (elapsed 0.018 seconds)
[  614.123700] Filesystems sync: 0.016 seconds
[  614.187743] Freeing user space processes todo:0 retry:2
[  614.187764] Freezing user space processes completed (elapsed 0.063 seconds)
[  614.205004] Freezing remaining freezable tasks completed (elapsed 0.017 seconds)
[  618.323268] Filesystems sync: 0.013 seconds
[  618.393868] Freeing user space processes todo:0 retry:2
[  618.393886] Freezing user space processes completed (elapsed 0.070 seconds)
[  618.413420] Freezing remaining freezable tasks completed (elapsed 0.019 seconds)
[  622.584589] Filesystems sync: 0.009 seconds
[  622.676274] Freeing user space processes todo:0 retry:2
[  622.676294] Freezing user space processes completed (elapsed 0.091 seconds)
[  622.702762] Freezing remaining freezable tasks completed (elapsed 0.026 seconds)
[  626.836610] Filesystems sync: 0.009 seconds
[  626.935583] Freeing user space processes todo:0 retry:2
[  626.935603] Freezing user space processes completed (elapsed 0.098 seconds)
[  626.966460] Freezing remaining freezable tasks completed (elapsed 0.030 seconds)
[  631.131669] Filesystems sync: 0.010 seconds
[  631.249412] Freeing user space processes todo:0 retry:2
[  631.249432] Freezing user space processes completed (elapsed 0.117 seconds)
[  631.283333] Freezing remaining freezable tasks completed (elapsed 0.033 seconds)
[  635.459169] Filesystems sync: 0.014 seconds
[  635.574913] Freeing user space processes todo:0 retry:2
[  635.574928] Freezing user space processes completed (elapsed 0.115 seconds)
[  635.613557] Freezing remaining freezable tasks completed (elapsed 0.038 seconds)
[  639.801842] Filesystems sync: 0.014 seconds
[  639.949023] Freeing user space processes todo:0 retry:2
[  639.949047] Freezing user space processes completed (elapsed 0.146 seconds)
[  639.998032] Freezing remaining freezable tasks completed (elapsed 0.048 seconds)
[  644.151229] Filesystems sync: 0.011 seconds
[  644.303744] Freeing user space processes todo:0 retry:2
[  644.303765] Freezing user space processes completed (elapsed 0.152 seconds)
[  644.347925] Freezing remaining freezable tasks completed (elapsed 0.043 seconds)
[  648.506472] Filesystems sync: 0.010 seconds
[  648.647752] Freeing user space processes todo:192 retry:2
[  648.670978] Freeing user space processes todo:0 retry:3
[  648.670997] Freezing user space processes completed (elapsed 0.164 seconds)
[  648.724734] Freezing remaining freezable tasks completed (elapsed 0.053 seconds)
[  652.947466] Filesystems sync: 0.021 seconds
[  653.112034] Freeing user space processes todo:0 retry:2
[  653.112055] Freezing user space processes completed (elapsed 0.164 seconds)
[  653.163845] Freezing remaining freezable tasks completed (elapsed 0.051 seconds)
[  657.364792] Filesystems sync: 0.012 seconds
[  657.510491] Freezing user space processes completed (elapsed 0.145 seconds)
[  657.570268] Freezing remaining freezable tasks completed (elapsed 0.059 seconds)
[  661.779728] Filesystems sync: 0.011 seconds
[  661.975654] Freeing user space processes todo:0 retry:2
[  661.975686] Freezing user space processes completed (elapsed 0.195 seconds)
[  662.050074] Freezing remaining freezable tasks completed (elapsed 0.074 seconds)
[  666.273377] Filesystems sync: 0.010 seconds
[  666.481081] Freeing user space processes todo:0 retry:2
[  666.481117] Freezing user space processes completed (elapsed 0.207 seconds)
[  666.564340] Freezing remaining freezable tasks completed (elapsed 0.083 seconds)

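In most rounds the freezer finished on its second pass (todo:0 retry:2).
In one of the 20 rounds, 192 tasks were still pending after the second
pass and a third pass was needed:

[  648.647752] Freeing user space processes todo:192 retry:2
[  648.670978] Freeing user space processes todo:0 retry:3

Since the reproduction rate is currently this low, it's still difficult
to quantify exactly how much performance improvement the patch brings.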
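
One way to put numbers on a log like the one above is to summarize it
mechanically. The following is only a sketch, not part of the test setup
described here: the struct and function names are made up for
illustration, and the "todo:N retry:M" pattern assumes the local debug
print has exactly the format shown in the log.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Running summary of the freezer log; all names here are illustrative. */
struct freeze_stats {
    int    rounds;            /* "Freezing user space processes completed" lines */
    int    unfinished_passes; /* debug lines that reported todo > 0 */
    int    max_retry;         /* highest retry value seen */
    double total_elapsed;     /* sum of user-space freeze times, in seconds */
    double max_elapsed;       /* slowest user-space freeze, in seconds */
};

/* Feed one dmesg line; lines matching neither pattern are ignored. */
void freeze_stats_feed(struct freeze_stats *s, const char *line)
{
    const char *p;
    int todo, retry;
    double secs;

    assert(s != NULL && line != NULL);

    /* Assumed format of the local debug print: "... todo:N retry:M" */
    p = strstr(line, "user space processes todo:");
    if (p && sscanf(p, "user space processes todo:%d retry:%d",
                    &todo, &retry) == 2) {
        if (todo > 0)
            s->unfinished_passes++;
        if (retry > s->max_retry)
            s->max_retry = retry;
        return;
    }

    /* Completion message as it appears in the log above. */
    p = strstr(line, "Freezing user space processes completed (elapsed ");
    if (p && sscanf(p, "Freezing user space processes completed (elapsed %lf",
                    &secs) == 1) {
        s->rounds++;
        s->total_elapsed += secs;
        if (secs > s->max_elapsed)
            s->max_elapsed = secs;
    }
}
```

A small driver could zero-initialize a struct freeze_stats, read dmesg
output line by line with fgets(), call freeze_stats_feed() on each line,
and compare total_elapsed / rounds between kernels with and without the
patch.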

>>
>> The restriction is only active during the window when the system is
>> freezing user tasks. Once all tasks are frozen, or if the system aborts
>> the suspend/hibernate process, the restriction is lifted.
>> No kernel threads are affected, and kernel_create_* functions remain
>> unrestricted.
>>
>> Signed-off-by: Zihuan Zhang <zhangzihuan@...inos.cn>
>> ---
>>   include/linux/suspend.h |  8 ++++++++
>>   kernel/fork.c           |  6 ++++++
>>   kernel/power/Kconfig    | 10 ++++++++++
>>   kernel/power/main.c     | 44 +++++++++++++++++++++++++++++++++++++++++
>>   kernel/power/power.h    |  4 ++++
>>   kernel/power/process.c  |  7 +++++++
>>   6 files changed, 79 insertions(+)
>>
>> diff --git a/include/linux/suspend.h b/include/linux/suspend.h
>> index b1c76c8f2c82..2dd8b3eb50f0 100644
>> --- a/include/linux/suspend.h
>> +++ b/include/linux/suspend.h
>> @@ -591,4 +591,12 @@ enum suspend_stat_step {
>>   void dpm_save_failed_dev(const char *name);
>>   void dpm_save_failed_step(enum suspend_stat_step step);
>>
>> +#ifdef CONFIG_PM_DISABLE_USER_FORK_DURING_FREEZE
>> +extern bool pm_block_user_fork;
>> +bool pm_should_block_fork(void);
>> +bool pm_freeze_process_in_progress(void);
>> +#else
>> +static inline bool pm_should_block_fork(void) { return false; };
>> +static inline bool pm_freeze_process_in_progress(void) { return false; };
>> +#endif /* CONFIG_PM_DISABLE_USER_FORK_DURING_FREEZE */
>>   #endif /* _LINUX_SUSPEND_H */
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index 1ee8eb11f38b..b0bd0206b644 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -105,6 +105,7 @@
>>   #include <uapi/linux/pidfd.h>
>>   #include <linux/pidfs.h>
>>   #include <linux/tick.h>
>> +#include <linux/suspend.h>
>>
>>   #include <asm/pgalloc.h>
>>   #include <linux/uaccess.h>
>> @@ -2596,6 +2597,11 @@ pid_t kernel_clone(struct kernel_clone_args *args)
>>               trace = 0;
>>       }
>>
>> +#ifdef CONFIG_PM_DISABLE_USER_FORK_DURING_FREEZE
>> +    if (pm_should_block_fork() && !(current->flags & PF_KTHREAD))
>> +        return -EBUSY;
>> +#endif
>

You're absolutely right: returning -EBUSY is not part of the documented 
interface for fork/clone3, and user space libraries like glibc are 
likely not prepared to handle that gracefully.
One alternative could be to block in kernel_clone() until freezing ends, 
instead of returning an error. That way, fork() would not fail, just 
potentially block briefly (similar to memory pressure or cgroup limits). 
Do you think that's more acceptable?
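
To make the blocking idea concrete, here is a rough user-space model
using pthreads. It is not kernel code and not the planned patch: the
fork_gate type and all function names are hypothetical, and a real
implementation in kernel_clone() would presumably also have to keep the
waiting task freezable and killable, which this sketch does not model.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical "fork gate": callers wait while a freeze is in progress. */
struct fork_gate {
    pthread_mutex_t lock;
    pthread_cond_t  thawed;
    bool            freezing;
};

void fork_gate_init(struct fork_gate *g)
{
    assert(g != NULL);
    pthread_mutex_init(&g->lock, NULL);
    pthread_cond_init(&g->thawed, NULL);
    g->freezing = false;
}

/* Called by the (simulated) freezer when freezing starts or ends. */
void fork_gate_set_freezing(struct fork_gate *g, bool freezing)
{
    pthread_mutex_lock(&g->lock);
    g->freezing = freezing;
    if (!freezing)
        pthread_cond_broadcast(&g->thawed); /* wake every blocked caller */
    pthread_mutex_unlock(&g->lock);
}

/* Called on the (simulated) fork path: block instead of failing. */
void fork_gate_wait(struct fork_gate *g)
{
    pthread_mutex_lock(&g->lock);
    while (g->freezing)
        pthread_cond_wait(&g->thawed, &g->lock);
    pthread_mutex_unlock(&g->lock);
}

/* A simulated fork() caller, usable as a pthread start routine. */
struct forker {
    struct fork_gate *gate;
    bool              forked;
};

void *simulated_fork(void *arg)
{
    struct forker *f = arg;

    fork_gate_wait(f->gate); /* sleeps for as long as the gate is closed */
    f->forked = true;        /* the real fork work would happen here */
    return NULL;
}
```

In this model the caller never sees an error: a thread started with
simulated_fork() while the gate is closed simply sleeps until
fork_gate_set_freezing(g, false) is called, then proceeds.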
I’ll draft an updated version reflecting your suggestions. Really 
appreciate your time and review!
Best regards,
Zihuan Zhang

> fork() is not documented to return EBUSY and for clone3() it's 
> documented to only happen in specific cases.
>
> So user space is not prepared for that.
>
