linux-kernel - Re: [PATCH 0/2] [tip: sched/core] sched: Disable PLACE_LAG and RUN_TO

Message-ID: <d3306655-c4e7-20ab-9656-b1b01417983c@amd.com>
Date: Mon, 4 Nov 2024 16:04:56 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Cristian Prundeanu <cpru@...zon.com>, "Gautham R. Shenoy"
	<gautham.shenoy@....com>
CC: <linux-tip-commits@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
	"Peter Zijlstra" <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>,
	<x86@...nel.org>, <linux-arm-kernel@...ts.infradead.org>, Bjoern Doebel
	<doebel@...zon.com>, Hazem Mohamed Abuelfotoh <abuehaze@...zon.com>, "Geoff
 Blake" <blakgeof@...zon.com>, Ali Saidi <alisaidi@...zon.com>, Csaba Csoma
	<csabac@...zon.com>, Benjamin Herrenschmidt <benh@...nel.crashing.org>
Subject: Re: [PATCH 0/2] [tip: sched/core] sched: Disable PLACE_LAG and
 RUN_TO_PARITY and move them to sysctl

Hello Cristian, Gautham,

On 11/4/2024 3:49 PM, Gautham R. Shenoy wrote:
> On Mon, Oct 28, 2024 at 11:57:49PM -0500, Cristian Prundeanu wrote:
>> Hi Gautham,
>>
>> On 2024-10-25, 09:44, "Gautham R. Shenoy" <gautham.shenoy@....com <mailto:gautham.shenoy@....com>> wrote:
>>
>>> On Thu, Oct 24, 2024 at 07:12:49PM +1100, Benjamin Herrenschmidt wrote:
>>>> On Sat, 2024-10-19 at 02:30 +0000, Prundeanu, Cristian wrote:
>>>>>
>>>>> The hammerdb test is a bit more complex than sysbench. It uses two
>>>>> independent physical machines to perform a TPC-C derived test [1], aiming
>>>>> to simulate a real-world database workload. The machines are allocated as
>>>>> an AWS EC2 instance pair on the same cluster placement group [2], to avoid
>>>>> measuring network bottlenecks instead of server performance. The SUT
>>>>> instance runs mysql configured to use 2 worker threads per vCPU (32
>>>>> total); the load generator instance runs hammerdb configured with 64
>>>>> virtual users and 24 warehouses [3]. Each test consists of multiple
>>>>> 20-minute rounds, run consecutively on multiple independent instance
>>>>> pairs.
>>>>
>>>> Would it be possible to produce something that Prateek and Gautham
>>>> (Hi Gautham btw !) can easily consume to reproduce ?
>>>>
>>>> Maybe a container image or a pair of container images hammering each
>>>> other ? (the simpler the better).
>>>
>>> Yes, that would be useful. Please share your recipe. We will try and
>>> reproduce it at our end. In our testing from a few months ago (some of
>>> which was presented at OSPM 2024), most of the database related
>>> regressions that we observed with EEVDF went away after running these
>>> the server threads under SCHED_BATCH.
>>
>> I am working on a repro package that is self contained and as simple to
>> share as possible.
> 
> Sorry for the delay in response. I was away for the Diwali festival.
> Thank you for working on the repro package.
> 
> 
>>
>> My testing with SCHED_BATCH is meanwhile concluded. It did reduce the
>> regression to less than half - but only with WAKEUP_PREEMPTION enabled.
>> When using NO_WAKEUP_PREEMPTION, there was no performance change compared
>> to SCHED_OTHER.
>>
>> (At the risk of stating the obvious, using SCHED_BATCH only to get back to
>> the default CFS performance is still only a workaround, just as disabling
>> PLACE_LAG+RUN_TO_PARITY is; these give us more room to investigate the
>> root cause in EEVDF, but shouldn't be seen as viable alternate solutions.)
>>
>> Do you have more detail on the database regressions you saw a few months
>> ago? What was the magnitude, and which workloads did it manifest on?
> 
> 
> There were three variants of sysbench + MySQL which showed regression
> with EEVDF.
> 
> 1. 1 Table, 10M Rows, read-only queries.
> 2. 3 Tables, 10M Rows each, read-only queries.
> 3. 1 Segmented Table, 10M Rows, read-only queries.
> 
> These saw regressions in the range of 9-12%.
> 
> The other database workload which showed regression was MongoDB + YCSB
> workload c. There the magnitude of the regression was around 17%.
> 
> As mentioned by Dietmar, we observed these regressions to go away with
> the original EEVDF complete patches which had a feature called
> RESPECT_SLICE which allowed a running task to run till its slice gets
> over without being preempted by a newly woken up task.
> 
> However, Peter suggested exploring SCHED_BATCH which fixed the
> regression even without EEVDF complete patchset.

Adding to that, since we had to test a variety of workloads, often ones
where the number of threads autoscales, we used the following methodology
to check whether SCHED_BATCH solves the observed regressions:

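     # echo 1 > /sys/kernel/tracing/events/task/enable
     # cat dump_python.py
     import sys
     
     # Print the pid of every task seen in the task trace events.
     with open("/sys/kernel/tracing/trace_pipe") as tf:
       for l in tf:
         # Skip header lines and events for bash tasks (e.g. the forks
         # made by the shell loop that consumes this output).
         if not l.startswith("#") and "comm=bash" not in l:
           pid_start = l.index("pid=") + 4
           pid = int(l[pid_start: l.index(" ", pid_start)])
           print(pid)
           sys.stdout.flush()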

     # watch 'python3 dump_python.py | while read i; do chrt -v -b --pid 0 $i; done'


Once the above is running, we launch the benchmark. It is not pretty, but
it has worked for the various kinds of benchmarks we've tested.
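
The same idea can probably be folded into a single script that calls
sched_setscheduler() directly instead of piping pids to chrt. What follows
is only a rough sketch, not what we actually ran: the helper names
(parse_pid, set_batch, main) are made up for illustration, and it assumes
the task events print a "pid=<n>" field and that the script runs as root
with the task events already enabled.

```python
import os
import re

TRACE_PIPE = "/sys/kernel/tracing/trace_pipe"
# Assumption: task events carry a "pid=<n>" field, as task_newtask does.
PID_RE = re.compile(r"\bpid=(\d+)")


def parse_pid(line, skip_comm="bash"):
    """Return the pid from a task event line, or None if it is skipped."""
    if line.startswith("#") or f"comm={skip_comm}" in line:
        return None
    m = PID_RE.search(line)
    return int(m.group(1)) if m else None


def set_batch(pid):
    """Switch pid to SCHED_BATCH; return False if it has already exited."""
    try:
        # SCHED_BATCH only accepts a static priority of 0.
        os.sched_setscheduler(pid, os.SCHED_BATCH, os.sched_param(0))
        return True
    except ProcessLookupError:
        return False


def main():
    """Follow trace_pipe forever and move every new task to SCHED_BATCH."""
    with open(TRACE_PIPE) as tf:
        for line in tf:
            pid = parse_pid(line)
            if pid is not None:
                state = "batch" if set_batch(pid) else "gone"
                print(pid, state, flush=True)
```

Calling main() as root before launching the benchmark should then behave
like the watch + chrt loop, minus the fork per task.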

On an additional note, since EEVDF got rid of both "wakeup_granularity_ns"
and "latency_ns", and SCHED_BATCH helps with the absence of the former,
have you tested using a larger value of "base_slice_ns" in tandem with
SCHED_BATCH / NO_WAKEUP_PREEMPTION?
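
In case it helps with trying that, here is a small sketch for scaling the
slice between runs. It assumes debugfs is mounted at /sys/kernel/debug and
that the kernel exposes the knob as sched/base_slice_ns there; the helper
names and the factor of 2 in the usage note are arbitrary examples, not
recommended values.

```python
from pathlib import Path

# Assumed debugfs location of the EEVDF base slice knob.
BASE_SLICE = Path("/sys/kernel/debug/sched/base_slice_ns")


def read_ns(path=BASE_SLICE):
    """Return the current slice length in nanoseconds."""
    return int(Path(path).read_text().strip())


def scale_slice(factor, path=BASE_SLICE):
    """Multiply the current slice by factor; return (old, new) in ns."""
    old = read_ns(path)
    new = int(old * factor)
    Path(path).write_text(f"{new}\n")
    return old, new
```

For example, scale_slice(2) run as root would double the current value and
return both numbers, so the original can be restored after the run.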

> 
>>
>> -Cristian
> 
> --
> Thanks and Regards
> gautham.

-- 
Thanks and Regards,
Prateek