Date:	Thu,  7 Jan 2016 16:28:11 +0100
From:	Jan Kara <jack@...e.cz>
To:	axboe@...nel.dk
Cc:	Tejun Heo <tj@...nel.org>, linux-kernel@...r.kernel.org,
	Jeff Moyer <jmoyer@...hat.com>, Jan Kara <jack@...e.cz>
Subject: [PATCH 0/5 v2] SYNC_NOIDLE preemption for ancestor cgroups

Hello Jens,

This is v2 of the patch series. Since v1 I have added the missing export of
cgroup_is_descendant(). I thought the series was sitting in the linux-block
tree, but checking now I see that you probably dropped it due to the
compilation issue.

If you find the series too late for the current merge window, just postpone it
to the next one.

---

Recently we have been debugging a regression affecting basically any IO
workload that appeared when systemd started enabling the blkio controller for
user sessions (due to the delegation feature). Using the blkio controller
certainly has its costs, but some of the hits seemed just too heavy - e.g.
dbench4 throughput dropped from ~150 MB/s to ~26 MB/s for ext4 with the
barrier=0 mount option on an ordinary SATA drive. The reason for the drop is
visible in the following blktrace:

0.000383426  5122  A  WS 27691328 + 8 <- (259,851968) 21473600
0.000384039  5122  Q  WS 27691328 + 8 [jbd2/sdb3-8]
0.000385944  5122  G  WS 27691328 + 8 [jbd2/sdb3-8]
0.000386315  5122  P   N [jbd2/sdb3-8]
...
0.000394031  5122  A  WS 27691384 + 8 <- (259,851968) 21473656
0.000394210  5122  Q  WS 27691384 + 8 [jbd2/sdb3-8]
0.000394569  5122  M  WS 27691384 + 8 [jbd2/sdb3-8]
0.000395239  5122  I  WS 27691328 + 64 [jbd2/sdb3-8]
0.000396572     0  m   N cfq5122SN / insert_request
0.000397389     0  m   N cfq5122SN / add_to_rr
0.000398458  5122  U   N [jbd2/sdb3-8] 1

<<< Here we wait for 7.5 ms for idle timer on dbench sync-noidle queue to fire

0.008001111     0  m   N cfq idle timer fired
0.008003152     0  m   N cfq5174SN /dbench slice expired t=0
0.008004871     0  m   N /dbench served: vt=24796020 min_vt=24771438
0.008006508     0  m   N cfq5174SN /dbench sl_used=2 disp=1 charge=2 iops=0 sect=24
0.008007509     0  m   N cfq5174SN /dbench del_from_rr
0.008008197     0  m   N /dbench del_from_rr group
0.008008771     0  m   N cfq schedule dispatch
0.008013506     0  m   N cfq workload slice:16
0.008014979     0  m   N cfq5122SN / set_active wl_class:0 wl_type:1
0.008017229     0  m   N cfq5122SN / fifo=          (null)
0.008018149     0  m   N cfq5122SN / dispatch_insert
0.008019863     0  m   N cfq5122SN / dispatched a request
0.008020829     0  m   N cfq5122SN / activate rq, drv=1
0.008021578   389  D  WS 27691328 + 64 [kworker/5:1H]
0.008491262     0  C  WS 27691328 + 64 [0]
0.008498654     0  m   N cfq5122SN / complete rqnoidle 1
0.008500202     0  m   N cfq5122SN / set_slice=19
0.008501797     0  m   N cfq5122SN / arm_idle: 2 group_idle: 0
0.008502073     0  m   N cfq schedule dispatch
0.008517281  5122  A  WS 27691392 + 8 <- (259,851968) 21473664
0.008517627  5122  Q  WS 27691392 + 8 [jbd2/sdb3-8]
0.008519126  5122  G  WS 27691392 + 8 [jbd2/sdb3-8]
0.008519534  5122  I  WS 27691392 + 8 [jbd2/sdb3-8]
0.008520560     0  m   N cfq5122SN / insert_request
0.008521908     0  m   N cfq5122SN / dispatch_insert
0.008522798     0  m   N cfq5122SN / dispatched a request
0.008523558     0  m   N cfq5122SN / activate rq, drv=1
0.008523841  5122  D  WS 27691392 + 8 [jbd2/sdb3-8]
0.008718527     0  C  WS 27691392 + 8 [0]
0.008721911     0  m   N cfq5122SN / complete rqnoidle 1
0.008723186     0  m   N cfq5122SN / arm_idle: 2 group_idle: 0
0.008723578     0  m   N cfq schedule dispatch
0.009062333  5174  A  WS 23276680 + 24 <- (259,851968) 17058952
0.009062950  5174  Q  WS 23276680 + 24 [dbench4]
0.009065427  5174  G  WS 23276680 + 24 [dbench4]
0.009065717  5174  P   N [dbench4]
0.009067472  5174  I  WS 23276680 + 24 [dbench4]
0.009069038     0  m   N cfq5174SN /dbench insert_request
0.009069913     0  m   N cfq5174SN /dbench add_to_rr
0.009071190  5174  U   N [dbench4] 1

<<<< Here we wait another 7 ms for idle timer on jbd2 sync-noidle queue to fire

0.016001504     0  m   N cfq idle timer fired
0.016002924     0  m   N cfq5122SN / slice expired t=0
0.016004424     0  m   N / served: vt=24783779 min_vt=24771488
0.016005888     0  m   N cfq5122SN / sl_used=2 disp=2 charge=2 iops=0 sect=72
0.016006635     0  m   N cfq5122SN / del_from_rr
0.016007152     0  m   N / del_from_rr group
0.016007613     0  m   N cfq schedule dispatch
0.016014571     0  m   N cfq workload slice:24
0.016015679     0  m   N cfq5174SN /dbench set_active wl_class:0 wl_type:1
0.016016794     0  m   N cfq5174SN /dbench fifo=          (null)
0.016017652     0  m   N cfq5174SN /dbench dispatch_insert
0.016018883     0  m   N cfq5174SN /dbench dispatched a request
0.016019714     0  m   N cfq5174SN /dbench activate rq, drv=1
0.016019973   382  D  WS 23276680 + 24 [kworker/6:1H]
0.016347056     0  C  WS 23276680 + 24 [0]
0.016357022     0  m   N cfq5174SN /dbench complete rqnoidle 1
0.016358509     0  m   N cfq5174SN /dbench set_slice=24
0.016360127     0  m   N cfq5174SN /dbench arm_idle: 2 group_idle: 0
0.016360508     0  m   N cfq schedule dispatch
...

When dbench isn't in a separate cgroup, the dbench and jbd2 sync-noidle queues
freely preempt each other. When dbench is contained in a dedicated blkio
cgroup, preemption is not allowed and throughput drops.

The idling happens because we want to provide separation of IO between
different blkio cgroups, so we idle to avoid starving a cgroup whose process
submits only dependent IO. I am of the opinion that when an ancestor cgroup
would like to preempt a descendant cgroup, there is no strong reason to
provide that separation, and we can save at least one of the idle periods
(when switching from dbench to the jbd2 thread). Hence the following patch
set, which improves dbench4 throughput from ~26 MB/s to ~48 MB/s.

The first patch in the set is an unrelated improvement where I've spotted some
asymmetry in how slice_idle and group_idle are handled. Patches 2 and 3
prepare cfq_should_preempt() to work on service trees of different cgroups;
patch 4 then adds the logic in cfq_should_preempt() to allow preemption by an
ancestor cgroup.

Comments welcome!

								Honza
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
