Date: Sat, 6 Apr 2024 17:23:25 +0800
From: Chen Yu <yu.c.chen@...el.com>
To: Peter Zijlstra <peterz@...radead.org>
CC: <mingo@...hat.com>, <juri.lelli@...hat.com>, <vincent.guittot@...aro.org>,
	<dietmar.eggemann@....com>, <rostedt@...dmis.org>, <bsegall@...gle.com>,
	<mgorman@...e.de>, <bristot@...hat.com>, <vschneid@...hat.com>,
	<linux-kernel@...r.kernel.org>, <kprateek.nayak@....com>,
	<wuyun.abel@...edance.com>, <tglx@...utronix.de>, <efault@....de>,
	<yu.chen.surf@...il.com>
Subject: Re: [RFC][PATCH 08/10] sched/fair: Implement delayed dequeue

On 2024-04-05 at 12:28:02 +0200, Peter Zijlstra wrote:
> Extend / fix 86bfbb7ce4f6 ("sched/fair: Add lag based placement") by
> noting that lag is fundamentally a temporal measure. It should not be
> carried around indefinitely.
> 
> OTOH it should also not be instantly discarded, doing so will allow a
> task to game the system by purposefully (micro) sleeping at the end of
> its time quantum.
> 
> Since lag is intimately tied to the virtual time base, a wall-time
> based decay is also insufficient, notably competition is required for
> any of this to make sense.
> 
> Instead, delay the dequeue and keep the 'tasks' on the runqueue,
> competing until they are eligible.
> 
> Strictly speaking, we only care about keeping them until the 0-lag
> point, but that is a difficult proposition, instead carry them around
> until they get picked again, and dequeue them at that point.
> 
> Since we should have dequeued them at the 0-lag point, truncate lag
> (eg. don't let them earn positive lag).
> 
> XXX test the cfs-throttle stuff
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@...radead.org>
> ---

Tested schbench on a Xeon server with 240 CPUs / 2 sockets:
schbench -m 2 -r 100
The result looks OK to me.

baseline:
NO_DELAY_DEQUEUE
NO_DELAY_ZERO
Wakeup Latencies percentiles (usec) runtime 100 (s) (1658446 total samples)
	  50.0th: 5          (361126 samples)
	  90.0th: 11         (654121 samples)
	* 99.0th: 25         (123032 samples)
	  99.9th: 673        (13845 samples)
	  min=1, max=8337
Request Latencies percentiles (usec) runtime 100 (s) (1662381 total samples)
	  50.0th: 14992      (524771 samples)
	  90.0th: 15344      (657370 samples)
	* 99.0th: 15568      (129769 samples)
	  99.9th: 15888      (10017 samples)
	  min=3529, max=43841
RPS percentiles (requests) runtime 100 (s) (101 total samples)
	  20.0th: 16544      (37 samples)
	* 50.0th: 16608      (30 samples)
	  90.0th: 16736      (31 samples)
	  min=16403, max=17698
average rps: 16623.81


DELAY_DEQUEUE
DELAY_ZERO
Wakeup Latencies percentiles (usec) runtime 100 (s) (1668161 total samples)
	  50.0th: 6          (394867 samples)
	  90.0th: 12         (653021 samples)
	* 99.0th: 31         (142636 samples)
	  99.9th: 755        (14547 samples)
	  min=1, max=5226
Request Latencies percentiles (usec) runtime 100 (s) (1671859 total samples)
	  50.0th: 14384      (511809 samples)
	  90.0th: 14992      (653508 samples)
	* 99.0th: 15408      (149257 samples)
	  99.9th: 15984      (12090 samples)
	  min=3546, max=38360
RPS percentiles (requests) runtime 100 (s) (101 total samples)
	  20.0th: 16672      (45 samples)
	* 50.0th: 16736      (52 samples)
	  90.0th: 16736      (0 samples)
	  min=16629, max=16800
average rps: 16718.59


The 99th percentile wakeup latency increases a little, and should be in the
acceptable range (25 -> 31 us). Meanwhile the throughput increases accordingly.
Here are the possible reasons I can think of:

1. Wakeup latency: the time to find an eligible entity in the tree
   during wakeup might take longer if there are more delayed-dequeue
   tasks left in the tree.
2. Throughput: inhibiting the task dequeue reduces how often we touch the
   task group's load_avg via dequeue_entity() -> { update_load_avg(), update_cfs_group() },
   which reduces the cache contention among CPUs and improves throughput
   (see the sketch below).
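
For point 2, a simplified sketch of what I mean (my reading of the dequeue
path, not the exact fair.c code; the _sketch names are only for illustration):

	/*
	 * Simplified illustration: on every non-delayed dequeue we update
	 * the entity's PELT averages and, if the cfs_rq's contribution
	 * changed enough, propagate it into the shared per-task-group sum.
	 */
	static void dequeue_entity_sketch(struct cfs_rq *cfs_rq, struct sched_entity *se)
	{
		update_curr(cfs_rq);
		update_load_avg(cfs_rq, se, UPDATE_TG);	/* may end up writing tg->load_avg */
		update_cfs_group(se);			/* re-reads tg->load_avg */
	}

	static void update_tg_load_avg_sketch(struct cfs_rq *cfs_rq)
	{
		long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;

		if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
			/*
			 * tg->load_avg is shared by every CPU running tasks of
			 * this task group, so this atomic add is the cross-CPU
			 * cacheline bounce that fewer dequeues help avoid.
			 */
			atomic_long_add(delta, &cfs_rq->tg->load_avg);
			cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
		}
	}

With DELAY_DEQUEUE, as far as I understand, a task that only (micro) sleeps can
stay on the runqueue, and if it wakes up before it gets picked again this path
never runs for that sleep, so the shared cacheline is written less often.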


> +	} else {
> +		bool sleep = flags & DEQUEUE_SLEEP;
> +
> +		SCHED_WARN_ON(sleep && se->sched_delayed);
> +		update_curr(cfs_rq);
> +
> +		if (sched_feat(DELAY_DEQUEUE) && sleep &&
> +		    !entity_eligible(cfs_rq, se)) {

Regarding the eligible check, it was found that there could be an overflow
issue that causes false negatives from entity_eligible(), which was described here:
https://lore.kernel.org/lkml/20240226082349.302363-1-yu.c.chen@intel.com/
and was also reported on another machine:
https://lore.kernel.org/lkml/ZeCo7STWxq+oyN2U@gmail.com/
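
For reference, this is roughly the check I am talking about (a simplified
sketch of my understanding of entity_eligible()/avg_vruntime(), not the exact
upstream code; the _sketch name is only for illustration):

	/*
	 * A task is eligible when its vruntime is not ahead of the
	 * load-weighted average vruntime V = v0 + avg_vruntime / avg_load,
	 * rewritten without the division as:
	 *
	 *	avg_vruntime >= (v_se - v0) * avg_load
	 */
	static int entity_eligible_sketch(struct cfs_rq *cfs_rq, struct sched_entity *se)
	{
		s64 avg = cfs_rq->avg_vruntime;		/* \Sum w_i * (v_i - v0) */
		long load = cfs_rq->avg_load;		/* \Sum w_i */
		s64 key = (s64)(se->vruntime - cfs_rq->min_vruntime);

		/*
		 * Suspected problem spot: 'key * load' is a plain s64
		 * multiply. If an entity's vruntime drifts far away from
		 * min_vruntime while the rq carries a lot of weight, the
		 * product can overflow and the comparison flips, so an
		 * actually-eligible entity is reported as not eligible.
		 */
		return avg >= key * load;
	}
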
I don't have a good idea for avoiding that overflow properly. While I'm trying
to reproduce it locally, do you have any guidance on how to address it?

thanks,
Chenyu
