Message-ID: <1998a069-50a0-46a2-8420-ebdce7725720@redhat.com>
Date: Tue, 29 Oct 2024 23:59:43 +0100
From: Paolo Bonzini <pbonzini@...hat.com>
To: Tejun Heo <tj@...nel.org>, Luca Boccassi <bluca@...ian.org>,
 Roman Gushchin <roman.gushchin@...ux.dev>
Cc: kvm@...r.kernel.org, cgroups@...r.kernel.org,
 Michal Koutný <mkoutny@...e.com>,
 linux-kernel@...r.kernel.org
Subject: Re: cgroup2 freezer and kvm_vm_worker_thread()

On 10/29/24 01:07, Tejun Heo wrote:
> Hello,
> 
> Luca is reporting that cgroups which have kvm instances inside never
> complete freezing. This can be trivially reproduced:
> 
>    root@...t ~# mkdir /sys/fs/cgroup/test
>    root@...t ~# echo $fish_pid > /sys/fs/cgroup/test/cgroup.procs
>    root@...t ~# qemu-system-x86_64 --nographic -enable-kvm
> 
> and in another terminal:
> 
>    root@...t ~# echo 1 > /sys/fs/cgroup/test/cgroup.freeze
>    root@...t ~# cat /sys/fs/cgroup/test/cgroup.events
>    populated 1
>    frozen 0
>    root@...t ~# for i in (cat /sys/fs/cgroup/test/cgroup.threads); echo $i; cat /proc/$i/stack; end
>    2070
>    [<0>] do_freezer_trap+0x42/0x70
>    [<0>] get_signal+0x4da/0x870
>    [<0>] arch_do_signal_or_restart+0x1a/0x1c0
>    [<0>] syscall_exit_to_user_mode+0x73/0x120
>    [<0>] do_syscall_64+0x87/0x140
>    [<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
>    2159
>    [<0>] do_freezer_trap+0x42/0x70
>    [<0>] get_signal+0x4da/0x870
>    [<0>] arch_do_signal_or_restart+0x1a/0x1c0
>    [<0>] syscall_exit_to_user_mode+0x73/0x120
>    [<0>] do_syscall_64+0x87/0x140
>    [<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
>    2160
>    [<0>] do_freezer_trap+0x42/0x70
>    [<0>] get_signal+0x4da/0x870
>    [<0>] arch_do_signal_or_restart+0x1a/0x1c0
>    [<0>] syscall_exit_to_user_mode+0x73/0x120
>    [<0>] do_syscall_64+0x87/0x140
>    [<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
>    2161
>    [<0>] kvm_nx_huge_page_recovery_worker+0xea/0x680
>    [<0>] kvm_vm_worker_thread+0x8f/0x2b0
>    [<0>] kthread+0xe8/0x110
>    [<0>] ret_from_fork+0x33/0x40
>    [<0>] ret_from_fork_asm+0x1a/0x30
>    2164
>    [<0>] do_freezer_trap+0x42/0x70
>    [<0>] get_signal+0x4da/0x870
>    [<0>] arch_do_signal_or_restart+0x1a/0x1c0
>    [<0>] syscall_exit_to_user_mode+0x73/0x120
>    [<0>] do_syscall_64+0x87/0x140
>    [<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> 
> The cgroup freezing happens in the signal delivery path, but
> kvm_vm_worker_thread() threads never call into the signal delivery path
> even though they join non-root cgroups, so they never get frozen. Because
> the cgroup freezer determines whether a given cgroup is frozen by comparing
> the number of frozen threads to the total number of threads in the cgroup,
> the cgroup never becomes frozen and users waiting for the state transition
> may hang indefinitely.
> 
> There are two paths that we can take:
> 
> 1. Make kvm_vm_worker_thread() call into the signal delivery path.
>     io_wq_worker() is in a similar boat: it handles signal delivery and can
>     be frozen and trapped like regular threads.
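
For reference, the count comparison described above is the entire test: the
freezer considers a cgroup frozen only when every task in it is frozen, so a
single kthread that never enters the freezer pins the cgroup at "frozen 0".
Simplified (not verbatim) from cgroup_update_frozen() in
kernel/cgroup/freezer.c:

	/* A cgroup counts as frozen only if freezing was requested *and*
	 * every task in it has actually frozen; one unfreezable kthread
	 * keeps this false forever. */
	frozen = test_bit(CGRP_FREEZE, &cgrp->flags) &&
		 cgrp->freezer.nr_frozen_tasks == __cgroup_task_count(cgrp);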

For the freezing part, would this be anything more than

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index d16ce8174ed6..b7b6a1c1b6a4 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -47,6 +47,7 @@
  #include <linux/kern_levels.h>
  #include <linux/kstrtox.h>
  #include <linux/kthread.h>
+#include <linux/freezer.h>
  #include <linux/wordpart.h>
  
  #include <asm/page.h>
@@ -7429,22 +7430,27 @@ static long get_nx_huge_page_recovery_timeout(u64 start_time)
  static int kvm_nx_huge_page_recovery_worker(struct kvm *kvm, uintptr_t data)
  {
  	u64 start_time;
-	long remaining_time;
+	u64 end_time;
+
+	set_freezable();
  
  	while (true) {
  		start_time = get_jiffies_64();
-		remaining_time = get_nx_huge_page_recovery_timeout(start_time);
+		end_time = start_time + get_nx_huge_page_recovery_timeout(start_time);
  
-		set_current_state(TASK_INTERRUPTIBLE);
-		while (!kthread_should_stop() && remaining_time > 0) {
-			schedule_timeout(remaining_time);
-			remaining_time = get_nx_huge_page_recovery_timeout(start_time);
+		for (;;) {
  			set_current_state(TASK_INTERRUPTIBLE);
+			if (kthread_freezable_should_stop(NULL))
+				break;
+			start_time = get_jiffies_64();
+			if ((s64)(end_time - start_time) <= 0)
+				break;
+			schedule_timeout(end_time - start_time);
  		}
  
  		set_current_state(TASK_RUNNING);
  
-		if (kthread_should_stop())
+		if (kthread_freezable_should_stop(NULL))
  			return 0;
  
  		kvm_recover_nx_huge_pages(kvm);

(untested beyond compilation).
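
The key piece is kthread_freezable_should_stop(): unlike plain
kthread_should_stop(), it also enters the refrigerator when freezing is
requested, and set_freezable() is needed first because kthreads are created
with PF_NOFREEZE set. Roughly (paraphrased, not verbatim, from
kernel/kthread.c):

bool kthread_freezable_should_stop(bool *was_frozen)
{
	bool frozen = false;

	might_sleep();

	/* Park in the refrigerator while the cgroup (or system)
	 * freezer wants us frozen; return once we are thawed. */
	if (unlikely(freezing(current)))
		frozen = __refrigerator(true);

	if (was_frozen)
		*was_frozen = frozen;

	return kthread_should_stop();
}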

I'm not sure whether the KVM worker thread should process signals.  We want
the CPU time it uses to be charged to the guest, but otherwise it's not
running on behalf of userspace in the way that io_wq_worker() is.
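
For comparison, what makes io_wq_worker() freezable is that its idle loop
runs the normal signal delivery path, so the freezer can trap it like a
userspace thread. Paraphrased from io_wq_worker() in io_uring/io-wq.c:

	/* Handling pending signals here is what lets the cgroup
	 * freezer catch the worker via get_signal() ->
	 * do_freezer_trap(). */
	if (signal_pending(current)) {
		struct ksignal ksig;

		if (!get_signal(&ksig))
			continue;	/* signal handled, keep going */
		break;			/* fatal signal, worker exits */
	}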

Paolo

