linux-kernel - RFC: rt + uncontrollable kthreads/workqueues = generic evil

Message-ID: <1387794220.15103.79.camel@marge.simpson.net>
Date:	Mon, 23 Dec 2013 11:23:40 +0100
From:	Mike Galbraith <bitbucket@...ine.de>
To:	LKML <linux-kernel@...r.kernel.org>
Cc:	Tejun Heo <tj@...nel.org>, Peter Zijlstra <peterz@...radead.org>,
	Ingo Molnar <mingo@...e.hu>,
	Thomas Gleixner <tglx@...utronix.de>,
	Steven Rostedt <rostedt@...dmis.org>
Subject: RFC: rt + uncontrollable kthreads/workqueues = generic evil


1. rt tasks can kill the whole box or jam up random applications via
kthreadd and/or kworker starvation, even when the user is being careful.
2. uncontrollable kthreads create unfixable rt priority inversions in
the workqueue case, and even if workqueues could be prioritized, dynamic
worker pools can insert huge memory allocation latencies into any rt
task that depends upon a workqueue.

A couple samples:

CPU2,3 are "completely" isolated via cpusets, CPU3 is running a "super
critical" rt hog (while(1);) at FIFO:1.  Joe User fires up firefox on a
system cpuset CPU, firefox hangs, lots of things do.


marge:~ # cat /proc/5840/stack
[<ffffffff81101d0e>] sleep_on_page+0xe/0x20
[<ffffffff81101f00>] wait_on_page_bit+0x80/0x90
[<ffffffff81102004>] filemap_fdatawait_range+0xf4/0x180
[<ffffffff811035ad>] filemap_write_and_wait_range+0x4d/0x80
[<ffffffff811cab8a>] ext4_sync_file+0xca/0x290
[<ffffffff81186e38>] do_fsync+0x58/0x80
[<ffffffff81187230>] SyS_fsync+0x10/0x20
[<ffffffff81559ed2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

rt_rq[3]:
  .rt_nr_running                 : 1
  .rt_throttled                  : 0
  .rt_time                       : 0.000000
  .rt_runtime                    : 0.000001

runnable tasks:
            task   PID         tree-key  switches  prio     exec-runtime         sum-exec        sum-sleep
----------------------------------------------------------------------------------------------------------
        kthreadd     2     32390.390741        89   120     32390.390741         1.103162    417146.885135
    kworker/u8:0     6     32390.390741        50   120     32390.390741         0.679007    303904.697089
     kworker/3:1    37     32391.042046      4971   120     32391.042046        77.975683    197424.475026
    kworker/3:1H   269     32390.390741      2542   100     32390.390741        15.520425    193210.559919
R         cpuhog  5625         0.000000        13    98         0.000000    382385.886326        89.825704

Well now, kthreadd waking to an isolated and 100% rt consumed CPU
doesn't bode well for the future of this box, that's a killer.
kworker/3:1H is what was blocking firefox and more though, bumping it to
FIFO:10 freed firefox and friends.
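
For readers decoding the "prio" column in the runnable-task tables: a
minimal sketch of how the values shown appear to line up with
user-visible priorities, assuming the usual kernel convention
(MAX_RT_PRIO of 100, nice 0 at 120). The function names are
hypothetical, chosen for illustration only.

```c
#include <stdio.h>

/* Assumed kernel scheduler constants: rt tasks occupy 0..99,
 * normal (SCHED_OTHER) tasks occupy 100..139. */
#define MAX_RT_PRIO   100
#define DEFAULT_PRIO  120   /* nice 0 */

/* SCHED_FIFO/SCHED_RR: user rt priority 1..99 -> kernel prio 98..0 */
int rt_to_kernel_prio(int rt_prio)
{
    return MAX_RT_PRIO - 1 - rt_prio;
}

/* SCHED_OTHER: nice level -20..19 -> kernel prio 100..139 */
int nice_to_kernel_prio(int nice_level)
{
    return DEFAULT_PRIO + nice_level;
}

/* Print the values relevant to the tables in this mail. */
void print_examples(void)
{
    printf("FIFO:1   -> prio %d\n", rt_to_kernel_prio(1));
    printf("FIFO:10  -> prio %d\n", rt_to_kernel_prio(10));
    printf("nice 0   -> prio %d\n", nice_to_kernel_prio(0));
    printf("nice -20 -> prio %d\n", nice_to_kernel_prio(-20));
}
```

Under that assumption, cpuhog at FIFO:1 shows up as 98, the ordinary
kworkers and kthreadd at 120 are nice 0, and kworker/3:1H at 100 is
nice -20. None of the non-rt entries can ever preempt the hog.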

Try again with kthread prioritized.. evolution hangs at startup.

rt_rq[3]:
  .rt_nr_running                 : 1
  .rt_throttled                  : 0
  .rt_time                       : 0.000000
  .rt_runtime                    : 0.000001

runnable tasks:
            task   PID         tree-key  switches  prio     exec-runtime         sum-exec        sum-sleep
----------------------------------------------------------------------------------------------------------
     kworker/3:1    37     32392.189438      5092   120     32392.189438        79.811331    318171.326151
R         cpuhog 15101         0.000000         4    98         0.000000     48118.123331         0.043160

marge:~ # pidof evolution
15103
marge:~ # cat /proc/15103/stack
[<ffffffff81064359>] flush_work+0x29/0x40
[<ffffffff81110113>] lru_add_drain_all+0x163/0x1a0
[<ffffffff8112df48>] SyS_mlock+0x38/0x130
[<ffffffff81559ed2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

cpuhog 15101 [003]  5027.777502: irq:softirq_entry: vec=1 [action=TIMER]
cpuhog 15101 [003]  5027.777504: workqueue:workqueue_queue_work: work struct=0xffff88022fd8f060 function=vmstat_update workqueue=0xffff880226c5aa00 req_cpu=64 cpu=3
cpuhog 15101 [003]  5027.777505: workqueue:workqueue_activate_work: work struct 0xffff88022fd8f060
cpuhog 15101 [003]  5027.777507: sched:sched_wakeup: comm=kworker/3:1 pid=37 prio=120 success=1 target_cpu=003
cpuhog 15101 [003]  5027.777508: irq:softirq_exit: vec=1 [action=TIMER]
cpuhog 15101 [003]  5027.777508: irq:softirq_entry: vec=9 [action=RCU]
cpuhog 15101 [003]  5027.777509: irq:softirq_exit: vec=9 [action=RCU]
cpuhog 15101 [003]  5027.781500: irq:softirq_raise: vec=1 [action=TIMER]

flush_work is gonna take a while.  Bump pid 37 to FIFO:10, evolution can
finally run.
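
The "bump to FIFO:10" steps above can be done from userspace with the
standard sched_setscheduler() call. What follows is only a sketch of
that manual workaround, not the enterprise hack mentioned below;
set_fifo() is a hypothetical helper name, and the caller needs
sufficient privilege (typically root or CAP_SYS_NICE).

```c
#include <errno.h>
#include <sched.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical helper: move task `pid` to SCHED_FIFO at `rt_prio`.
 * pid 0 means the calling thread. Returns 0 on success, -1 with errno
 * set on failure (EINVAL for a bad priority, EPERM without privilege,
 * ESRCH if the pid does not exist). */
int set_fifo(pid_t pid, int rt_prio)
{
    struct sched_param sp;

    /* Reject out-of-range priorities before touching the task. */
    if (rt_prio < sched_get_priority_min(SCHED_FIFO) ||
        rt_prio > sched_get_priority_max(SCHED_FIFO)) {
        errno = EINVAL;
        return -1;
    }

    memset(&sp, 0, sizeof(sp));
    sp.sched_priority = rt_prio;
    return sched_setscheduler(pid, SCHED_FIFO, &sp);
}
```

Calling set_fifo(37, 10) should be equivalent to "chrt -f -p 10 37".
Note that this only fixes the one worker that exists at that moment:
the pid has to be found by hand, and with dynamic worker pools a
different kworker may be servicing the same work later.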

I created an ugly hack in enterprise to let the user prioritize kthreads
and/or workqueues.  It works in the sense that it empowers the user to
do whatever he wants without the box just falling over, and without the
stuff he thinks is super critical starving its own dependencies (or
innocent bystanders, as above) and ergo itself, no matter how "clever"
that "critical stuff" may seem to me.

Most of the time, when I see this kind of issue, it's stuff that I'd
call rt abuse.  But I've also recently seen some image processing stuff
that looked much more legit, and which used to be able to get away with
using a workqueue, fall flat.  I had to tell the user that the workqueue
should be removed from their driver, as the things are not the least bit
rt friendly.  The dynamic worker pool constituted a regression for that
user.

	-Mike
