Message-ID: <C0E39DE3-EEEB-4A08-850F-A4B7EC809E3A@amazon.com>
Date: Sat, 19 Oct 2024 02:30:28 +0000
From: "Prundeanu, Cristian" <cpru@...zon.com>
To: K Prateek Nayak <kprateek.nayak@....com>, Peter Zijlstra
	<peterz@...radead.org>
CC: "linux-tip-commits@...r.kernel.org" <linux-tip-commits@...r.kernel.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, Ingo Molnar
	<mingo@...hat.com>, "x86@...nel.org" <x86@...nel.org>,
	"linux-arm-kernel@...ts.infradead.org"
	<linux-arm-kernel@...ts.infradead.org>, "Doebel, Bjoern" <doebel@...zon.de>,
	"Mohamed Abuelfotoh, Hazem" <abuehaze@...zon.com>, "Blake, Geoff"
	<blakgeof@...zon.com>, "Saidi, Ali" <alisaidi@...zon.com>, "Csoma, Csaba"
	<csabac@...zon.com>, "gautham.shenoy@....com" <gautham.shenoy@....com>
Subject: Re: [PATCH 0/2] [tip: sched/core] sched: Disable PLACE_LAG and
 RUN_TO_PARITY and move them to sysctl

On 2024-10-18, 02:08, "K Prateek Nayak" <kprateek.nayak@....com> wrote:

> Most of our testing used sysbench as the benchmark driver. How does
> mysql+hammerdb work specifically? Are the tasks driving the requests
> located on a separate server, or are they co-located with the benchmark
> threads on the same server?

The hammerdb test is a bit more complex than sysbench. It uses two
independent physical machines to perform a TPC-C derived test [1], aiming
to simulate a real-world database workload. The machines are allocated as
an AWS EC2 instance pair on the same cluster placement group [2], to avoid
measuring network bottlenecks instead of server performance. The SUT
instance runs mysql configured to use 2 worker threads per vCPU (32
total); the load generator instance runs hammerdb configured with 64
virtual users and 24 warehouses [3]. Each test consists of multiple
20-minute rounds, run consecutively on multiple independent instance
pairs.

[1] https://www.tpc.org/tpcc/default5.asp
[2] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/placement-strategies.html
[3] https://hammerdb.com/docs/ch03s05.html
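
For reference, a HammerDB CLI script along these lines approximates the
load generator side (run via hammerdbcli; connection parameters and
build options are omitted here, so this is an illustrative sketch, not
our exact scripts):

  dbset db mysql
  diset tpcc mysql_count_ware 24   ;# 24 warehouses
  buildschema
  vuset vu 64                      ;# 64 virtual users
  vucreate
  vurun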

> Did you see any glaring changes in scheduler statistics with the
> introduction of EEVDF in v6.6? EEVDF commits up till v6.9 were easy to
> revert from my experience but I've not tried it on v6.12-rcX with the
> EEVDF complete series. Is all the regression seen purely
> attributable to EEVDF alone on the more recent kernels?

Yes, the regression is attributable to EEVDF:
After seeing indications that there was a performance degradation
somewhere after kernel 6.5, bisect testing narrowed the same degradation
to merge commit b41bbb33cf75 (Merge branch 'sched/eevdf' into sched/core).
Next, expanding testing to all stable kernel versions (6.6 through 6.11)
showed very similar performance data, confirming that non-EEVDF changes
introduced along the way do not have any significant impact.
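
For reference, the bisect followed the standard pattern, along these
lines (sketch only; the test step at each point was a full set of
hammerdb rounds as described above):

  git bisect start
  git bisect bad v6.6
  git bisect good v6.5
  # build, boot, run the hammerdb rounds, then mark the result:
  git bisect good    # or: git bisect bad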

Testing kernel 6.12 at various stages, starting with commit 2004cef11ea0
(Merge tag 'sched-core-2024-09-19') and continuing with the v6.12-rcX tags
as they became available, shows a different performance profile than
previous kernels: the degradation is smaller than in 6.6 through 6.11, but
the positive impact from disabling PLACE_LAG and RUN_TO_PARITY is also
smaller. However, after testing a fractional factorial set of combinations
of all EEVDF-specific features, the only configuration that yielded better
performance than NO_PLACE_LAG+NO_RUN_TO_PARITY was with all 7 features
disabled (NO_PLACE_LAG, NO_RUN_TO_PARITY, NO_DELAY_DEQUEUE, NO_DELAY_ZERO,
NO_PLACE_DEADLINE_INITIAL, NO_PLACE_REL_DEADLINE, NO_PREEMPT_SHORT). After
considering the potential impact on other workloads and the ease of
backporting for the two best options, NO_PLACE_LAG+NO_RUN_TO_PARITY seemed
like the better choice for 6.12 as well.
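
For concreteness, the all-disabled configuration can be set along these
lines (assuming debugfs is mounted at /sys/kernel/debug; the features
file takes one toggle per write):

  for f in PLACE_LAG RUN_TO_PARITY DELAY_DEQUEUE DELAY_ZERO \
           PLACE_DEADLINE_INITIAL PLACE_REL_DEADLINE PREEMPT_SHORT; do
      echo "NO_$f" > /sys/kernel/debug/sched/features
  done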

The comparative aperf [4] reports showed no divergent configuration
issues and no noticeable differences in the PMU stats, which confirms
that no unrelated system differences are affecting the results.

[4] https://github.com/aws/aperf

>> I haven't tested with SCHED_BATCH yet, will update the thread with results
>> as they accumulate

Testing with SCHED_BATCH (and default scheduler settings) resulted in no
significant performance change.

As an additional data point, using SCHED_FIFO or SCHED_RR further degraded
the mysql performance (but improved postgresql).
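(A policy switch of this sort can be applied to a running server with
chrt; note that on Linux the policy is per-thread, so covering every
worker means iterating over /proc/<pid>/task/*. Values are illustrative:

  chrt --batch -p 0  "$(pidof mysqld)"   # SCHED_BATCH
  chrt --fifo  -p 50 "$(pidof mysqld)"   # SCHED_FIFO, static priority 50
)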

> Could you also test running with:
> echo NO_WAKEUP_PREEMPTION > /sys/kernel/debug/sched/features

Certainly; will update the thread when the results are available.   

> On a side note, what is the CONFIG_HZ and the
> preemption model on your test kernel (most of my testing was with
> CONFIG_HZ=250, voluntary preemption)

CONFIG_HZ was 250. Testing with other values did not reveal anything
relevant to this regression either: both CFS and EEVDF had a slight
improvement with CONFIG_HZ=100, and no change otherwise.

Preemption was the default (voluntary) for all tests.
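
(Both settings can be checked on a running system, e.g.:

  grep -E 'CONFIG_HZ=|CONFIG_PREEMPT' /boot/config-"$(uname -r)"
  cat /sys/kernel/debug/sched/preempt   # on PREEMPT_DYNAMIC kernels

where the config file location is distro-dependent.)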

> The data in the latter link helped root-cause the actual issue with the
> algorithm that the benchmark disliked. Similar information for the
> database benchmarks you are running can help narrow down the issue.

Thank you for the links! I'll gladly continue gathering data and help 
diagnose this issue. I am concerned, however, with keeping the default 
configuration the way it currently is while the investigation continues.

Do you happen to know how the reported blogbench performance compares to 
the pre-EEVDF (v6.5) results?

> From what I can tell, your benchmark has a set of threads that like to
> get cpu time as fast as possible. With EEVDF Complete (I would recommend
> using current tip:sched/urgent branch to test them out) setting a more
> aggressive nice value for these threads should enable them to negate the
> effect of RUN_TO_PARITY thanks to PREEMPT_SHORT.
>
> As for NO_PLACE_LAG, the DELAY_DEQUEUE feature should help a task shed
> any lag it has built up and should very likely start from the zero-lag
> point unless it is a very short sleeper.

Agree with the thread assessment. It seems that the best outcome is when
the threads run as fast as possible, with overhead as small as possible. 

I'll test with EEVDF complete as well. Note that both DELAY_DEQUEUE and
PREEMPT_SHORT were part of the combinations in the test suite on commit
2004cef11ea0, as mentioned above, and flipping them did not produce
dramatic performance changes.
At that time (and this is no longer true in v6.12-rc2) NO_DELAY_ZERO was
also needed along with NO_PLACE_LAG and NO_RUN_TO_PARITY.
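
(If we test the nice suggestion: nice is per-thread on Linux, so it
would be applied to each worker TID rather than just the main PID; the
value -5 is illustrative:

  for tid in /proc/"$(pidof mysqld)"/task/*; do
      renice -n -5 -p "$(basename "$tid")"
  done
)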

> Is there any reason to flip it very early into the boot? Have you seen
> anything go awry with system processes during boot with EEVDF?

I haven't, as this benchmarking is specifically measuring the stable state
of a system.
The boot order argument was only in the context of discussing the
suitability of rc.local as compared to sysctl for persisting scheduler
options. It's conceivable that options which differ between steady state
and startup could lead to process management outcomes that affect
performance (and produce scenarios that are harder to reproduce).
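
(For illustration, persisting via rc.local amounts to something like
the following, assuming a distro where rc.local still runs at boot and
debugfs is mounted:

  #!/bin/sh
  # /etc/rc.local -- applied once, late in boot
  echo NO_PLACE_LAG     > /sys/kernel/debug/sched/features
  echo NO_RUN_TO_PARITY > /sys/kernel/debug/sched/features
  exit 0
)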
