Date:	Mon, 16 Nov 2015 16:11:59 +0100
From:	Jan Kara <jack@...e.cz>
To:	LKML <linux-kernel@...r.kernel.org>
Cc:	axboe@...nel.dk, Jeff Moyer <jmoyer@...hat.com>
Subject: CFQ timer precision

Hello,

lately I was looking into a big performance hit we take when the blkio
controller is enabled and the jbd2 thread ends up in a different cgroup
than the user process. E.g. dbench4 throughput drops from ~140 MB/s to
~20 MB/s. However artificial dbench4 may be, a drop of this magnitude
will likely be clearly visible in real-life workloads as well. With the
unified cgroup hierarchy the above split between jbd2 and user processes
is unavoidable once you enable the blkio controller, so IMO we should
accommodate that better.

I have a couple of CFQ idling improvements / fixes which I'll post later
this week once I complete a round of benchmarking. They improve the
throughput to ~40 MB/s, which helps, but clearly there's still plenty of
room for improvement. The reason for the performance drop is essentially
the idling we do to avoid starvation of CFQ queues. Now when idling in
this context, the current default idle window of 8 ms is far too large -
we start the timer after the final request is completed, so we
effectively give the process 8 ms of CPU time to submit the next IO
request, which I think is usually far too much. The problem is that
finer-grained idling is hard to do with jiffy-based timers because e.g.
SUSE distro kernels have HZ=250 and thus 1 jiffy is 4 ms. Hence my
proposal: do you think it would be OK to convert CFQ to use highres
timers and do all the accounting in microseconds? Then we could tune the
idle time to, say, 1 ms, or even autotune it based on the process' think
time, both of which I expect would get us much closer to the original
throughput (a 4 ms idle window gets us to ~70 MB/s with my patches, and
disabling idling gets us back to the original throughput as expected).
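
Just to illustrate what I mean (a rough sketch only, not an actual CFQ
patch - the structure and function names below are made up), the idle
timer would move from a jiffies-based timer to an hrtimer armed with a
microsecond value:

/*
 * Illustrative sketch only: hypothetical names, not CFQ internals.
 * Shows an idle window expressed in microseconds via an hrtimer
 * instead of being rounded up to whole jiffies.
 */
#include <linux/hrtimer.h>
#include <linux/ktime.h>
#include <linux/types.h>

struct idle_sketch {
	struct hrtimer idle_timer;
	u64 idle_slice_us;		/* tunable, e.g. 1000 us */
};

static enum hrtimer_restart idle_sketch_expired(struct hrtimer *t)
{
	/* idle window elapsed without a new request from the process */
	return HRTIMER_NORESTART;
}

static void idle_sketch_init(struct idle_sketch *s)
{
	hrtimer_init(&s->idle_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
	s->idle_timer.function = idle_sketch_expired;
	s->idle_slice_us = 1000;	/* 1 ms instead of 8 ms */
}

static void idle_sketch_arm(struct idle_sketch *s)
{
	/* relative expiry in ns, so sub-jiffy idle windows are possible */
	hrtimer_start(&s->idle_timer,
		      ns_to_ktime(s->idle_slice_us * NSEC_PER_USEC),
		      HRTIMER_MODE_REL);
}

With the accounting done in microseconds throughout, autotuning the
window from the measured think time then just becomes a matter of
picking the slice in us rather than rounding up to whole jiffies.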

Thoughts?

								Honza

-- 
Jan Kara <jack@...e.com>
SUSE Labs, CR
