linux-kernel - Re: signal: break out of wait loops on kthread

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Y1BKf4ol3YtKvLiG@sultan-box.localdomain>
Date:   Wed, 19 Oct 2022 12:05:35 -0700
From:   Sultan Alsawaf <sultan@...neltoast.com>
To:     "Jason A. Donenfeld" <Jason@...c4.com>
Cc:     Tvrtko Ursulin <tvrtko.ursulin@...ux.intel.com>,
        "Eric W. Biederman" <ebiederm@...ssion.com>,
        linux-kernel@...r.kernel.org,
        "Intel-gfx@...ts.freedesktop.org" <Intel-gfx@...ts.freedesktop.org>,
        Ville Syrjälä 
        <ville.syrjala@...ux.intel.com>
Subject: Re: signal: break out of wait loops on kthread_stop()

On Wed, Oct 19, 2022 at 12:16:22PM -0600, Jason A. Donenfeld wrote:
> On Wed, Oct 19, 2022 at 06:57:38PM +0100, Tvrtko Ursulin wrote:
> > 
> > On 19/10/2022 17:00, Jason A. Donenfeld wrote:
> > > On Wed, Oct 19, 2022 at 7:31 AM Tvrtko Ursulin
> > > <tvrtko.ursulin@...ux.intel.com> wrote:
> > >>
> > >>
> > >> Hi,
> > >>
> > >> A question regarding a7c01fa93aeb ("signal: break out of wait loops on
> > >> kthread_stop()") if I may.
> > >>
> > >> We have a bunch code in i915, possibly limited to self tests (ie debug
> > >> builds) but still important for our flows, which spawn kernel threads
> > >> and exercises parts of the driver.
> > >>
> > >> Problem we are hitting with this patch is that code did not really need
> > >> to be signal aware until now. Well to say that more accurately - we were
> > >> able to test the code which is normally executed from userspace, so is
> > >> signal aware, but not worry about -ERESTARTSYS or -EINTR within the test
> > >> cases itself.
> > >>
> > >> For example threads which exercise an internal API for a while until the
> > >> parent calls kthread_stop. Now those tests can hit unexpected errors.
> > >>
> > >> Question is how to best approach working around this change. It is of
> > >> course technically possible to rework our code in more than one way,
> > >> although with some cost and impact already felt due reduced pass rates
> > >> in our automated test suites.
> > >>
> > >> Maybe an opt out kthread flag from this new behavior? Would that be
> > >> acceptable as a quick fix? Or any other comments?
> > > 
> > > You can opt out by running `clear_tsk_thread_flag(current,
> > > TIF_NOTIFY_SIGNAL);` at the top of your kthread. But you should really
> > > fix your code instead. Were I your reviewer, I wouldn't merge code
> > > that took the lazy path like that. However, that should work, if you
> > > do opt for the quick fix.
> >
> > Also, are you confident that the change will not catch anyone else by 
> > surprise? In the original thread I did not spot any concerns about the 
> > kthreads being generally unprepared to start receiving EINTR/ERESTARTSYS 
> > from random call chains.
> 
> Pretty sure, yea. i915 is unique in its abuse of the API. Keep in mind
> that kthread_stop() also sets KTHREAD_SHOULD_STOP and such. Your code is
> abusing the API by calling kthread_run() followed by kthread_stop().
> 
> As evidence of how broken your code actually is, the kthread_stop()
> function has a comment that makes it clear, "This can also be called
> after kthread_create() instead of calling wake_up_process(): the thread
> will exit without calling threadfn()," yet i915 has attempted to hack
> around it with ridiculous yields and msleeps. For example:
> 
>                 threads[n] = kthread_run(__igt_breadcrumbs_smoketest,
>                                          &t, "igt/%d", n);
> ...
> 
>         yield(); /* start all threads before we begin */
>         msleep(jiffies_to_msecs(i915_selftest.timeout_jiffies));
> ...
>                 err = kthread_stop(threads[n]);
> 
> 
> Or here's another one:
> 
>                 tsk = kthread_run(fn, &thread[i], "igt-%d", i);
> ...
>         msleep(10); /* start all threads before we kthread_stop() */
> ...
>                 status = kthread_stop(tsk);
> 
> I mean come on.
> 
> This is brittle and bad and kind of ridiculous that it shipped this way.
> Now you're asking to extend your brittleness, so that you can avoid the
> work of cleaning up 5 call sites. Just clean up those 5 call sites. It's
> only 5, as far as I can tell.
> 
> 
> > Right, but our hand is a bit forced at the moment. Since 6.1-rc1 has 
> > propagated to our development tree on Monday, our automated testing 
> > started failing significantly, which prevents us merging new work until 
> > resolved. So a quick fix trumps the ideal road in the short term. Just 
> > because it is quick.
> 
> "Short term" -- somehow I can imagine the short term hack will turn into
> a long term one.
> 
> Anyway, what I suspect you might actually want as a bandaid is a
> "kthread_wait()"-like function, that doesn't try to stop the thread with
> KTHREAD_SHOULD_STOP and such, but just waits for the completion:
> 
> diff --git a/include/linux/kthread.h b/include/linux/kthread.h
> index 30e5bec81d2b..2699cc45ad15 100644
> --- a/include/linux/kthread.h
> +++ b/include/linux/kthread.h
> @@ -86,6 +86,7 @@ void free_kthread_struct(struct task_struct *k);
>  void kthread_bind(struct task_struct *k, unsigned int cpu);
>  void kthread_bind_mask(struct task_struct *k, const struct cpumask *mask);
>  int kthread_stop(struct task_struct *k);
> +int kthread_wait(struct task_struct *k);
>  bool kthread_should_stop(void);
>  bool kthread_should_park(void);
>  bool __kthread_should_park(struct task_struct *k);
> diff --git a/kernel/kthread.c b/kernel/kthread.c
> index f97fd01a2932..d581d78a3a26 100644
> --- a/kernel/kthread.c
> +++ b/kernel/kthread.c
> @@ -715,6 +715,22 @@ int kthread_stop(struct task_struct *k)
>  }
>  EXPORT_SYMBOL(kthread_stop);
> 
> +int kthread_wait(struct task_struct *k)
> +{
> +	struct kthread *kthread;
> +	int ret;
> +
> +	get_task_struct(k);
> +	kthread = to_kthread(k);
> +	wake_up_process(k);
> +	wait_for_completion(&kthread->exited);
> +	ret = kthread->result;
> +	put_task_struct(k);
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL(kthread_stop);
> +
>  int kthreadd(void *unused)
>  {
>  	struct task_struct *tsk = current;
> 

Hi,

Given the need to ensure the kthreads are started and then synchronously
flushed, this seems like a good fit for the kthread_work API, which provides all
of the necessary plumbing for this usecase. No need for the eldritch kthread API
abuse and flimsy yield+msleep.

Sultan