Message-ID: <ae250076-7d55-c407-1066-86b37014c69c@oracle.com>
Date: Sat, 3 Jun 2023 22:28:23 -0500
From: michael.christie@...cle.com
To: "Eric W. Biederman" <ebiederm@...ssion.com>,
Oleg Nesterov <oleg@...hat.com>
Cc: linux@...mhuis.info, nicolas.dichtel@...nd.com, axboe@...nel.dk,
torvalds@...ux-foundation.org, linux-kernel@...r.kernel.org,
virtualization@...ts.linux-foundation.org, mst@...hat.com,
sgarzare@...hat.com, jasowang@...hat.com, stefanha@...hat.com,
brauner@...nel.org
Subject: Re: [CFT][PATCH v3] fork, vhost: Use CLONE_THREAD to fix freezer/ps
regression
On 6/2/23 11:15 PM, Eric W. Biederman wrote:
>
> This fixes the ordering issue in vhost_task_fn so that get_signal
> should not work.
>
> This patch is a gamble that during process exit or de_thread in exec
> work will not be commonly queued from other threads.
>
> If this gamble turns out to be false the existing WARN_ON in
> vhost_worker_free will fire.
>
> Can folks test this and let us know if the WARN_ON fires?
I don't hit the WARN_ONs, but probably not for the reason you are thinking
of. We hang like this:
Jun 03 22:25:23 ol4 kernel: Call Trace:
Jun 03 22:25:23 ol4 kernel: <TASK>
Jun 03 22:25:23 ol4 kernel: __schedule+0x334/0xac0
Jun 03 22:25:23 ol4 kernel: ? wait_for_completion+0x86/0x150
Jun 03 22:25:23 ol4 kernel: schedule+0x5a/0xd0
Jun 03 22:25:23 ol4 kernel: schedule_timeout+0x240/0x2a0
Jun 03 22:25:23 ol4 kernel: ? __wake_up_klogd.part.0+0x3c/0x60
Jun 03 22:25:23 ol4 kernel: ? vprintk_emit+0x104/0x270
Jun 03 22:25:23 ol4 kernel: ? wait_for_completion+0x86/0x150
Jun 03 22:25:23 ol4 kernel: wait_for_completion+0xb0/0x150
Jun 03 22:25:23 ol4 kernel: vhost_scsi_flush+0xc2/0xf0 [vhost_scsi]
Jun 03 22:25:23 ol4 kernel: vhost_scsi_clear_endpoint+0x16f/0x240 [vhost_scsi]
Jun 03 22:25:23 ol4 kernel: vhost_scsi_release+0x7d/0xf0 [vhost_scsi]
Jun 03 22:25:23 ol4 kernel: __fput+0xa2/0x270
Jun 03 22:25:23 ol4 kernel: task_work_run+0x56/0xa0
Jun 03 22:25:23 ol4 kernel: do_exit+0x337/0xb40
Jun 03 22:25:23 ol4 kernel: ? __remove_hrtimer+0x39/0x70
Jun 03 22:25:23 ol4 kernel: do_group_exit+0x30/0x90
Jun 03 22:25:23 ol4 kernel: get_signal+0x9cd/0x9f0
Jun 03 22:25:23 ol4 kernel: ? kvm_arch_vcpu_put+0x12b/0x170 [kvm]
Jun 03 22:25:23 ol4 kernel: ? vcpu_put+0x1e/0x50 [kvm]
Jun 03 22:25:23 ol4 kernel: ? kvm_arch_vcpu_ioctl_run+0x193/0x4e0 [kvm]
Jun 03 22:25:23 ol4 kernel: arch_do_signal_or_restart+0x2a/0x260
Jun 03 22:25:23 ol4 kernel: exit_to_user_mode_prepare+0xdd/0x120
Jun 03 22:25:23 ol4 kernel: syscall_exit_to_user_mode+0x1d/0x40
Jun 03 22:25:23 ol4 kernel: do_syscall_64+0x48/0x90
Jun 03 22:25:23 ol4 kernel: entry_SYSCALL_64_after_hwframe+0x72/0xdc
Jun 03 22:25:23 ol4 kernel: RIP: 0033:0x7f2d004df50b
The problem is that, as part of the flush, the drivers/vhost/scsi.c code
waits for outstanding commands, because we can't free the device and
its resources before the commands complete or we will hit a
use-after-free.
We got hung because the patch now has us do:

vhost_dev_flush() -> vhost_task_flush()

which saw that VHOST_TASK_FLAGS_STOP was set and that the exited completion
had completed. However, the scsi code is still waiting on commands in
vhost_scsi_flush(). Those commands wanted to use the vhost_task to complete,
but couldn't, since the task had already exited.
Handling those types of issues takes a lot more code. We would add
some RCU in vhost_work_queue() to handle the worker being freed from under
us, then add a callback, similar to what I did in one of the past patchsets,
that stops the drivers. Then we'd modify scsi so that, in the callback, it
also sets some bits so the completion paths just fail fast and don't try to
queue the completion to the vhost_task.
If we want to go that route, I can get it done in more like a 6.6 time frame.