Message-ID: <683d0c30-4aeb-3547-49a9-90ce6eb1a7d0@oracle.com>
Date:   Tue, 16 May 2023 11:24:48 -0500
From:   Mike Christie <michael.christie@...cle.com>
To:     Christian Brauner <brauner@...nel.org>,
        Linus Torvalds <torvalds@...ux-foundation.org>
Cc:     Thorsten Leemhuis <linux@...mhuis.info>, nicolas.dichtel@...nd.com,
        Linux kernel regressions list <regressions@...ts.linux.dev>,
        hch@...radead.org, stefanha@...hat.com, jasowang@...hat.com,
        mst@...hat.com, sgarzare@...hat.com,
        virtualization@...ts.linux-foundation.org, ebiederm@...ssion.com,
        konrad.wilk@...cle.com, linux-kernel@...r.kernel.org,
        Jens Axboe <axboe@...nel.dk>
Subject: Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads

On 5/16/23 3:39 AM, Christian Brauner wrote:
> On Mon, May 15, 2023 at 05:23:12PM -0500, Mike Christie wrote:
>> On 5/15/23 10:44 AM, Linus Torvalds wrote:
>>> On Mon, May 15, 2023 at 7:23 AM Christian Brauner <brauner@...nel.org> wrote:
>>>>
>>>> So I think we will be able to address (1) and (2) by making vhost tasks
>>>> proper threads and blocking every signal except for SIGKILL and SIGSTOP
>>>> and then having vhost handle get_signal() - as you mentioned - the same
>>>> way io uring already does. We should also remove the ignore_signals
>>>> thing completely imho. I don't think we ever want to do this with user
>>>> workers.
>>>
>>> Right. That's what IO_URING does:
>>>
>>>         if (args->io_thread) {
>>>                 /*
>>>                  * Mark us an IO worker, and block any signal that isn't
>>>                  * fatal or STOP
>>>                  */
>>>                 p->flags |= PF_IO_WORKER;
>>>                 siginitsetinv(&p->blocked, sigmask(SIGKILL)|sigmask(SIGSTOP));
>>>         }
>>>
>>> and I really think that vhost should basically do exactly what io_uring does.
>>>
>>> Not because io_uring fundamentally got this right - but simply because
>>> io_uring had almost all the same bugs (and then some), and what the
>>> io_uring worker threads ended up doing was to basically zoom in on
>>> "this works".
>>>
>>> And it zoomed in on it largely by just going for "make it look as much
>>> as possible as a real user thread", because every time the kernel
>>> thread did something different, it just caused problems.
>>>
>>> So I think the patch should just look something like the attached.
>>> Mike, can you test this on whatever vhost test-suite?
>>
>> I tried that approach already and it doesn't work because io_uring and vhost
>> differ in that vhost drivers implement a device where each device has a vhost_task
>> and the drivers have a file_operations for the device. When the vhost_task's
>> parent gets a signal like SIGKILL, then it will exit and call into the vhost
>> driver's file_operations->release function. At this time, we need to do cleanup
> 
> But that's no reason why the vhost worker couldn't just be allowed to
> exit on SIGKILL cleanly similar to io_uring. That's just describing the
> current architecture which isn't a necessity afaict. And the helper
> thread could e.g., crash.
> 
>> like flush the device which uses the vhost_task. There is also the case where if
>> the vhost_task gets a SIGKILL, we can just exit from under the vhost layer.
> 
> In a way I really don't like the patch below. Because this should be
> solvable by adapting vhost workers. Right now, vhost is coming from a
> kthread model and we ported it to a user worker model and the whole
> point of this exercise has been that the workers behave more like
> regular userspace processes. So my tendency is to not massage kernel
> signal handling to now also include a special case for user workers in
> addition to kthreads. That's just the wrong way around and then vhost
> could've just stuck with kthreads in the first place.

I would have preferred that :) Maybe let's take a step back and revisit
that decision to make sure it was right. The vhost layer wants:

1. inherit cgroups.
2. share mm.
3. no signals.
4. to not show up as an extra process like in Nicolas's bug.
5. have its worker threads counted under its parent nproc limit.

We can do 1 - 4 today with kthreads. Can we do #5 with kthreads? My first
attempt which passed around the creds to use for kthreads or exported a
helper for the nproc accounting was not liked and we eventually ended up
here.

Is this hybrid user/kernel thread/task still the right way to go or is
it better to use kthreads and add some way to handle #5?


> 
> So I'm fine with skipping over the freezing case for now but SIGKILL
> should be handled imho. Only init and kthreads should get the luxury of
> ignoring SIGKILL.
> 
> So, I'm afraid I'm asking some work here of you but how feasible would a
> model be where vhost_worker() similar to io_wq_worker() gracefully
> handles SIGKILL. Yes, I see there's
> 
> net.c:   .release = vhost_net_release
> scsi.c:  .release = vhost_scsi_release
> test.c:  .release = vhost_test_release
> vdpa.c:  .release = vhost_vdpa_release
> vsock.c: .release = virtio_transport_release
> vsock.c: .release = vhost_vsock_dev_release
> 
> but that means you have all the basic logic in place and all of those
> drivers also support the VHOST_RESET_OWNER ioctl which also stops the
> vhost worker. I'm confident that a lot of this can be leveraged to just
> cleanup on SIGKILL.

We can do this, but the issue I'm worried about is that right now if there
is queued/running IO and userspace escalates to SIGKILL, then the vhost layer
will still complete those IOs. If we now allow SIGKILL on the vhost thread,
then those IOs might fail.

If we get a SIGKILL, I can modify vhost_worker() so that it temporarily
ignores the signal and allows IO/flushes/whatever-operations to complete
at that level. However, we could hit issues where, when vhost_worker()
calls into the drivers listed above and those drivers call into whatever
kernel layer they use, that layer might do

if (signal_pending(current))
	return failure;

and we now fail.

If we agree that failing is acceptable behavior once we have received a
SIGKILL, I can code what you are requesting.
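
The failure mode described above can be sketched in userspace. This is only
an analogy under stated assumptions, not vhost or kernel code: SIGUSR1 stands
in for SIGKILL (which userspace cannot block), sigpending() stands in for
signal_pending(current), and the names sig_is_pending(), lower_layer_flush()
and worker_flush_with_deferred_signal() are hypothetical. The worker defers
the signal so it can finish its flush, but the layer it calls into still sees
the signal as pending and fails.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Userspace stand-in for the kernel's signal_pending(current): reports
 * whether signo has been raised but not yet delivered (i.e. is blocked).
 */
static bool sig_is_pending(int signo)
{
	sigset_t pending;

	sigemptyset(&pending);
	if (sigpending(&pending) != 0)
		return false;
	return sigismember(&pending, signo) == 1;
}

/* Hypothetical lower layer that gives up as soon as a signal is pending. */
static int lower_layer_flush(void)
{
	if (sig_is_pending(SIGUSR1))
		return -EINTR;
	return 0;
}

/*
 * Hypothetical worker: defers SIGUSR1 by blocking it, then tries to flush.
 * Deferring the signal does not hide it from the lower layer.
 */
static int worker_flush_with_deferred_signal(void)
{
	sigset_t block, old;
	int rc, ret, signo;

	sigemptyset(&block);
	sigaddset(&block, SIGUSR1);
	rc = sigprocmask(SIG_BLOCK, &block, &old);
	assert(rc == 0);

	raise(SIGUSR1);			/* blocked, so it stays pending */
	ret = lower_layer_flush();	/* lower layer sees it and fails */

	/* Consume the pending signal before restoring the old mask. */
	rc = sigwait(&block, &signo);
	assert(rc == 0 && signo == SIGUSR1);
	rc = sigprocmask(SIG_SETMASK, &old, NULL);
	assert(rc == 0);
	(void)rc;
	(void)signo;

	return ret;
}
```

With no signal pending, lower_layer_flush() succeeds. Called from
worker_flush_with_deferred_signal(), it returns -EINTR even though the worker
itself chose to keep going, which is the case where the IO would now fail.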