Message-Id: <20170615173452.GD3721@linux.vnet.ibm.com>
Date:   Thu, 15 Jun 2017 10:34:52 -0700
From:   "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
To:     "Luis R. Rodriguez" <mcgrof@...nel.org>
Cc:     Josh Triplett <josh@...htriplett.org>,
        Steven Rostedt <rostedt@...dmis.org>,
        mathieu.desnoyers@...icios.com, jiangshanlai@...il.com,
        Paul Gortmaker <paul.gortmaker@...driver.com>,
        "Eric W. Biederman" <ebiederm@...ssion.com>,
        Dmitry Torokhov <dmitry.torokhov@...il.com>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        Oleg Nesterov <oleg@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>
Subject: Re: [RFC] rcu: use killable versions of swait

On Thu, Jun 15, 2017 at 09:35:45AM -0700, Luis R. Rodriguez wrote:
> On Thu, Jun 15, 2017 at 9:22 AM, Paul E. McKenney
> <paulmck@...ux.vnet.ibm.com> wrote:
> > On Thu, Jun 15, 2017 at 05:50:39PM +0200, Luis R. Rodriguez wrote:
> >> On Wed, Jun 14, 2017 at 04:43:45PM -0700, Paul E. McKenney wrote:
> >> > On Wed, Jun 14, 2017 at 04:06:39PM -0700, Luis R. Rodriguez wrote:
> >> > > These waits don't even check the return value for interruption;
> >> > > using the non-killable variants means we could be killed by signals
> >> > > other than SIGKILL, which is fragile.
> >> >
> >> > A number of people asserted that kthreads could never catch signals.
> >> > I checked this at the time, and it seemed to be the case; the call
> >> > to ignore_signals() seems to confirm it.
> >> >
> >> > So it looks to me like RCU doesn't care about this change (give or
> >> > take any effect on the load average, anyway), but there is much that I
> >> > don't know about Linux-kernel signal handling, so please feel free to
> >> > educate me.
> >>
> >> Thanks, I had seen the kthread but figured it best to ask; I just got
> >> into paranoia mode. If we were to do a sanity check for usage, we'd then
> >> have to whitelist when kthreads are used. However, since we don't care
> >> to be interrupted, why not use a wait which is also explicit about our
> >> current uninterruptible state?
> >
> > I do appreciate any and all inspection, actually, so thank you!
> >
> > I used to have it uninterruptible, but got complaints about the effect on
> > the load average -- sysadmins didn't like the "D" state and the fact that
> > the load average was always greater than 2 (or 3 on PREEMPT=y kernels)
> > when the system was completely idle.
> 
> Ah wow, never would I have imagined this to be the reason.

Let's just say that it came as a bit of a surprise to me as well.  ;-)

And yes, this does deserve a comment.  Please see the patch at the
end of this email.

>                                                            Are you
> aware of other cases in the kernel that use an interruptible wait even
> though kthreads are used (and are therefore effectively uninterruptible)
> as a workaround for this? The above information is very helpful for folks
> doing random code introspection; having this annotated somehow via an API
> seems fair to me. Would we be concerned about abuse? Would it be fair for
> other kthreads to use a similar strategy? If not, why not?

The per-CPU smp_hotplug_thread kthreads also use TASK_INTERRUPTIBLE
when they are idle (or TASK_PARKED when their CPU is offline).  However,
their use of TASK_INTERRUPTIBLE is buried inside smpboot_thread_fn(),
which seems less likely to cause confusion.
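
To make that pattern concrete, here is a minimal sketch of such an idle
loop -- not the actual smpboot_thread_fn() code; kthread parking is
omitted, and the have_work()/do_work() helpers are made up for
illustration:

static int idle_loop_sketch(void *data)
{
	for (;;) {
		/* Mark ourselves sleeping before checking for work so
		 * that a concurrent wakeup cannot be lost. */
		set_current_state(TASK_INTERRUPTIBLE);
		if (kthread_should_stop()) {
			__set_current_state(TASK_RUNNING);
			return 0;
		}
		if (!have_work(data)) {
			/* Sleep in "S" state, which unlike "D" state
			 * does not count toward the load average. */
			schedule();
			continue;
		}
		__set_current_state(TASK_RUNNING);
		do_work(data);
	}
}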

Taking a quick look at a few other kthreads...

o	event_test_thread() manually sets TASK_INTERRUPTIBLE and
	calls schedule() directly.

o	ecard_task() uses wait_event_interruptible(), just like
	rcu_gp_kthread() used to do before it switched to swait.

o	eeh_event_handler() gets the same effect via down_interruptible(),
	but silently terminates if a signal is received.

o	pika_dtm_thread() manually sets TASK_INTERRUPTIBLE and then
	calls schedule_timeout().

o	xen_blkif_schedule() uses wait_event_interruptible().

So waiting interruptibly seems to be a common kthread implementation
pattern, though there does appear to be some variety in the actual
means of interruptibly waiting.  I could argue either way on having
a separate API for this purpose, though my default position is to add
comments rather than new APIs -- it is not like the Linux kernel has
any sort of API shortage.  But it would be good to get others' opinions.
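
For concreteness, here is a sketch of the two most common forms as they
might appear in a kthread's main loop (my_cond and my_wq are made-up
names):

/* Form 1: open-coded, as in event_test_thread() and pika_dtm_thread(). */
set_current_state(TASK_INTERRUPTIBLE);
if (!my_cond)
	schedule();
__set_current_state(TASK_RUNNING);

/* Form 2: the wait-queue helper, as in ecard_task() and
 * xen_blkif_schedule().  This returns -ERESTARTSYS if a signal arrives,
 * which a kthread that has called ignore_signals() should never see. */
wait_event_interruptible(my_wq, my_cond);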

							Thanx, Paul

------------------------------------------------------------------------

commit 2ef081931b9be6abdda4b6355edcd970913bed4c
Author: Paul E. McKenney <paulmck@...ux.vnet.ibm.com>
Date:   Thu Jun 15 10:06:57 2017 -0700

    rcu: Add comment relating _interruptible to load average
    
    The rcu_gp_kthread() function uses swait_event_interruptible() and
    swait_event_interruptible_timeout(), which at first glance seems quite
    strange, especially given the WARN_ON(signal_pending(current)) following
    both calls.  After all, this function is invoked only from a kthread,
    which ignores all signals, so why bother with either the _interruptible
    or the WARN_ON()?
    
    Let's start with the swait_event_interruptible().  If the system is idle,
    the RCU grace-period kthread will spend all its time blocked inside this
    call.  If the _interruptible() form were not used, then this kthread would
    contribute to the load average.  This means that an idle system would
    have a load average of 2 (or 3 if PREEMPT=y), rather than the load average
    of 0 that almost fifty years of UNIX has conditioned sysadmins to expect.
    Ergo, swait_event_interruptible() is used to avoid surprising load-average
    values.
    
    This same argument applies to the swait_event_interruptible_timeout()
    invocation.  The RCU grace-period kthread spends its time blocked inside
    this call while waiting for grace periods to complete.  In particular,
    if there were only one busy CPU, but that CPU was frequently invoking
    call_rcu(), then the RCU grace-period kthread would spend almost all its
    time blocked inside the swait_event_interruptible_timeout().  This would
    mean that the load average would be 2 rather than the expected 1 for
    the single busy CPU.
    
    This is not an intuitively obvious situation; in fact, it came as quite
    a surprise to me when I first implemented the grace-period kthreads.
    It therefore deserves comments, which this commit supplies.
    
    Reported-by: "Luis R. Rodriguez" <mcgrof@...nel.org>
    Signed-off-by: Paul E. McKenney <paulmck@...ux.vnet.ibm.com>

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 695fee7cafe0..2d5cc9315cfb 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2052,7 +2052,7 @@ static bool rcu_gp_init(struct rcu_state *rsp)
 }
 
 /*
- * Helper function for wait_event_interruptible_timeout() wakeup
+ * Helper function for swait_event_interruptible_timeout() wakeup
  * at force-quiescent-state time.
  */
 static bool rcu_gp_fqs_check_wake(struct rcu_state *rsp, int *gfp)
@@ -2191,6 +2191,7 @@ static int __noreturn rcu_gp_kthread(void *arg)
 					       READ_ONCE(rsp->gpnum),
 					       TPS("reqwait"));
 			rsp->gp_state = RCU_GP_WAIT_GPS;
+			/* _interruptible() to avoid messing up load average. */
 			swait_event_interruptible(rsp->gp_wq,
 						 READ_ONCE(rsp->gp_flags) &
 						 RCU_GP_FLAG_INIT);
@@ -2224,6 +2225,7 @@ static int __noreturn rcu_gp_kthread(void *arg)
 					       READ_ONCE(rsp->gpnum),
 					       TPS("fqswait"));
 			rsp->gp_state = RCU_GP_WAIT_FQS;
+			/* _interruptible() to avoid messing up load average. */
 			ret = swait_event_interruptible_timeout(rsp->gp_wq,
 					rcu_gp_fqs_check_wake(rsp, &gf), j);
 			rsp->gp_state = RCU_GP_DOING_FQS;

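For reference, the load-average accounting behind the argument in the
commit message above: the per-CPU count that feeds the load average
includes uninterruptible sleepers as well as runnable tasks.  A condensed
sketch of calc_load_fold_active() from kernel/sched/loadavg.c follows
(simplified; the exact code varies by kernel version):

static long calc_load_fold_active_sketch(struct rq *this_rq)
{
	long nr_active, delta = 0;

	nr_active = this_rq->nr_running;
	nr_active += (long)this_rq->nr_uninterruptible; /* "D" state counts */
	/* Tasks sleeping in TASK_INTERRUPTIBLE are not counted, which
	 * is why the grace-period kthread sleeps in that state. */

	if (nr_active != this_rq->calc_load_active) {
		delta = nr_active - this_rq->calc_load_active;
		this_rq->calc_load_active = nr_active;
	}
	return delta;
}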