Message-ID: <20230516-summieren-bankintern-7def19d6ec45@brauner>
Date:   Tue, 16 May 2023 18:44:27 +0200
From:   Christian Brauner <brauner@...nel.org>
To:     Mike Christie <michael.christie@...cle.com>
Cc:     Linus Torvalds <torvalds@...ux-foundation.org>,
        Thorsten Leemhuis <linux@...mhuis.info>,
        nicolas.dichtel@...nd.com,
        Linux kernel regressions list <regressions@...ts.linux.dev>,
        hch@...radead.org, stefanha@...hat.com, jasowang@...hat.com,
        mst@...hat.com, sgarzare@...hat.com,
        virtualization@...ts.linux-foundation.org, ebiederm@...ssion.com,
        konrad.wilk@...cle.com, linux-kernel@...r.kernel.org,
        Jens Axboe <axboe@...nel.dk>
Subject: Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads

On Tue, May 16, 2023 at 11:24:48AM -0500, Mike Christie wrote:
> On 5/16/23 3:39 AM, Christian Brauner wrote:
> > On Mon, May 15, 2023 at 05:23:12PM -0500, Mike Christie wrote:
> >> On 5/15/23 10:44 AM, Linus Torvalds wrote:
> >>> On Mon, May 15, 2023 at 7:23 AM Christian Brauner <brauner@...nel.org> wrote:
> >>>>
> >>>> So I think we will be able to address (1) and (2) by making vhost tasks
> >>>> proper threads and blocking every signal except for SIGKILL and SIGSTOP
> >>>> and then having vhost handle get_signal() - as you mentioned - the same
> >>>> way io_uring already does. We should also remove the ignore_signals
> >>>> thing completely imho. I don't think we ever want to do this with user
> >>>> workers.
> >>>
> >>> Right. That's what IO_URING does:
> >>>
> >>>         if (args->io_thread) {
> >>>                 /*
> >>>                  * Mark us an IO worker, and block any signal that isn't
> >>>                  * fatal or STOP
> >>>                  */
> >>>                 p->flags |= PF_IO_WORKER;
> >>>                 siginitsetinv(&p->blocked, sigmask(SIGKILL)|sigmask(SIGSTOP));
> >>>         }
> >>>
> >>> and I really think that vhost should basically do exactly what io_uring does.
> >>>
> >>> Not because io_uring fundamentally got this right - but simply because
> >>> io_uring had almost all the same bugs (and then some), and what the
> >>> io_uring worker threads ended up doing was to basically zoom in on
> >>> "this works".
> >>>
> >>> And it zoomed in on it largely by just going for "make it look as much
> >>> as possible as a real user thread", because every time the kernel
> >>> thread did something different, it just caused problems.
> >>>
> >>> So I think the patch should just look something like the attached.
> >>> Mike, can you test this on whatever vhost test-suite?
> >>
> >> I tried that approach already and it doesn't work because io_uring and vhost
> >> differ in that vhost drivers implement a device where each device has a vhost_task
> >> and the drivers have a file_operations for the device. When the vhost_task's
> >> parent gets a signal like SIGKILL, then it will exit and call into the vhost
> >> driver's file_operations->release function. At this time, we need to do cleanup
> > 
> > But that's no reason why the vhost worker couldn't just be allowed to
> > exit cleanly on SIGKILL, similar to io_uring. That's just describing the
> > current architecture, which isn't a necessity afaict. And the helper
> > thread could, e.g., crash.
> > 
> >> like flushing the device which uses the vhost_task. There is also the case where, if
> >> the vhost_task gets a SIGKILL, we can just exit from under the vhost layer.
> > 
> > In a way I really don't like the patch below. Because this should be
> > solvable by adapting vhost workers. Right now, vhost is coming from a
> > kthread model and we ported it to a user worker model, and the whole
> > point of this exercise has been that the workers behave more like
> > regular userspace processes. So my tendency is to not massage kernel
> > signal handling to now also include a special case for user workers in
> > addition to kthreads. That's just the wrong way around and then vhost
> > could've just stuck with kthreads in the first place.
> 
> I would have preferred that :) Maybe let's take a step back and revisit
> that decision to make sure it was right. The vhost layer wants:
> 
> 1. inherit cgroups.
> 2. share mm.
> 3. no signals
> 4. to not show up as an extra process like in Nicolas's bug.
> 5. have its worker threads counted under its parent's nproc limit.
> 
> We can do 1-4 today with kthreads. Can we do #5 with kthreads? My first
> attempt, which passed around the creds to use for kthreads or exported a
> helper for the nproc accounting, was not liked, and we eventually ended up
> here.
> 
> Is this hybrid user/kernel thread/task still the right way to go, or is it
> better to use kthreads and add some way to handle #5?

I think the io_uring model makes a lot more sense for vhost than the
current approach.

> 
> 
> > 
> > So I'm fine with skipping over the freezing case for now, but SIGKILL
> > should be handled imho. Only init and kthreads should get the luxury of
> > ignoring SIGKILL.
> > 
> > So, I'm afraid I'm asking some work of you here, but how feasible would a
> > model be where vhost_worker(), similar to io_wq_worker(), gracefully
> > handles SIGKILL? Yes, I see there's
> > 
> > net.c:   .release = vhost_net_release
> > scsi.c:  .release = vhost_scsi_release
> > test.c:  .release = vhost_test_release
> > vdpa.c:  .release = vhost_vdpa_release
> > vsock.c: .release = virtio_transport_release
> > vsock.c: .release = vhost_vsock_dev_release
> > 
> > but that means you have all the basic logic in place and all of those
> > drivers also support the VHOST_RESET_OWNER ioctl which also stops the
> > vhost worker. I'm confident that a lot of this can be leveraged to just
> > clean up on SIGKILL.
> 
> We can do this, but the issue I'm worried about is that right now if there
> is queued/running IO and userspace escalates to SIGKILL, then the vhost layer
> will still complete those IOs. If we now allow SIGKILL on the vhost thread,
> then those IOs might fail.
> 
> If we get a SIGKILL, I can modify vhost_worker() so that it temporarily
> ignores the signal and allows IO/flushes/whatever-operations to complete
> at that level. However, we could hit issues where, when vhost_worker()

It's really not that different from io_uring, though, which also flushes
out remaining io, no? This seems to basically line up with what
io_wq_worker() does.
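
For illustration, a very rough sketch of that io_wq_worker()-style shape,
using placeholder helpers (vhost_run_one_work(), vhost_flush_queued()) that
stand in for whatever vhost actually uses, not real vhost functions:

	/*
	 * Sketch only: break out of the work loop on a fatal signal
	 * (SIGKILL) while still draining work that was already queued.
	 * A real loop would also sleep when there is no pending work.
	 */
	static int vhost_worker(void *data)
	{
		struct vhost_dev *dev = data;

		for (;;) {
			if (signal_pending(current)) {
				struct ksignal ksig;

				/* get_signal() returns true for a fatal signal */
				if (get_signal(&ksig))
					break;
			}
			vhost_run_one_work(dev);	/* placeholder */
		}

		/* fatal signal seen: flush queued work, then let the task exit */
		vhost_flush_queued(dev);		/* placeholder */
		return 0;
	}

The point is only the shape: work keeps running until get_signal() reports a
fatal signal, and already-queued work is flushed before the task exits.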

> calls into the drivers listed above, and those drivers call into whatever
> kernel layer they use, that layer might do
> 
> if (signal_pending(current))
> 	return failure;
> 
> and we now fail.
> 
> If we say that, since we got a SIGKILL, failing is now acceptable behavior,
> then I can code what you are requesting.

I think this is fine, but I don't maintain vhost and we'd need the
maintainers' opinion.
