Message-ID: <6fef325c-4415-c3c8-8254-531b2883b8e3@amd.com>
Date: Thu, 5 Oct 2023 09:20:05 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: David Vernet <void@...ifault.com>
Cc: linux-kernel@...r.kernel.org, peterz@...radead.org,
mingo@...hat.com, juri.lelli@...hat.com,
vincent.guittot@...aro.org, dietmar.eggemann@....com,
rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
bristot@...hat.com, vschneid@...hat.com, tj@...nel.org,
roman.gushchin@...ux.dev, gautham.shenoy@....com,
aaron.lu@...el.com, wuyun.abel@...edance.com, kernel-team@...a.com
Subject: Re: [RFC PATCH 3/3] sched/fair: Add a per-shard overload flag

Hello David,

On 10/4/2023 10:50 PM, David Vernet wrote:
> On Wed, Oct 04, 2023 at 09:51:18AM +0530, K Prateek Nayak wrote:
>> Hello David,
>
> Hello Prateek,
>
>>
>> Thank you for answering my queries, I'll leave some data below to
>> answer yours.
>>
>> On 9/29/2023 10:31 PM, David Vernet wrote:
>>> On Fri, Sep 01, 2023 at 01:53:12AM +0530, K Prateek Nayak wrote:
>>>> Hello David,
>>>>
>>>> On 9/1/2023 12:41 AM, David Vernet wrote:
>>>>> On Thu, Aug 31, 2023 at 04:15:08PM +0530, K Prateek Nayak wrote:
>>>>>
>>>>> Hi Prateek,
>>>>>
>>>>>> Even with the two patches, I still observe the following lock
>>>>>> contention when profiling the tbench 128-clients run with IBS:
>>>>>>
>>>>>>   -   12.61%  swapper  [kernel.vmlinux]  [k] native_queued_spin_lock_slowpath
>>>>>>      - 10.94% native_queued_spin_lock_slowpath
>>>>>>         - 10.73% _raw_spin_lock
>>>>>>            - 9.57% __schedule
>>>>>>                 schedule_idle
>>>>>>                 do_idle
>>>>>>               + cpu_startup_entry
>>>>>>            - 0.82% task_rq_lock
>>>>>>                 newidle_balance
>>>>>>                 pick_next_task_fair
>>>>>>                 __schedule
>>>>>>                 schedule_idle
>>>>>>                 do_idle
>>>>>>               + cpu_startup_entry
>>>>>>
>>>>>> Since David mentioned the rq->avg_idle check is probably not the right
>>>>>> step towards the solution, this experiment introduces a per-shard
>>>>>> "overload" flag. Similar to "rq->rd->overload", the per-shard overload
>>>>>> flag signals that one or more of the rqs covered by the shard's domain
>>>>>> may have a queued task. The shard's overload flag is set at the same
>>>>>> time as "rq->rd->overload", and is cleared when the shard's list is
>>>>>> found to be empty.
>>>>>
>>>>> I think this is an interesting idea, but I feel that it's still working
>>>>> against the core proposition of SHARED_RUNQ, which is to enable work
>>>>> conservation.
>>>>
>>>> I don't think so! Work conservation is possible if there is an
>>>> imbalance. Consider the case where we have 15 tasks in the shared_runq
>>>> but we have 16 CPUs, 15 of which are running these 15 tasks, and one going
>>>
>>> I'm not sure I'm fully following. Those 15 tasks would not be enqueued
>>> in the shared runq if they were being run. They would be dequeued from
>>> the shared_runq in __dequeue_entity(), which would be called from
>>> set_next_entity() before they were run. In this case, the
>>> shard->overload check should be equivalent to the
>>> !list_empty(&shard->list) check.
>>>
>>> Oh, or is the idea that we're not bothering to pull them from the
>>> shared_runq if they're being woken up and enqueued on an idle core that
>>> will immediately run them on the next resched path? If so, I wonder if
>>> we would instead just want to not enqueue the task in the shared_runq at
>>> all? Consider that if another task comes in on an rq with
>>> rq->nr_running >= 2, we still wouldn't want to pull the tasks that
>>> were being woken up on idle cores (nor take the overhead of inserting
>>> and then immediately removing them from the shared_runq).
>
> Friendly ping on this point. This is the only scenario where I could see
> the overload check helping, so I want to make sure I'm understanding it
> and am correct in that just avoiding enqueueing the task in the shard in
> this scenario would give us the same benefit.

Whoops! I missed answering this. The original motivation for
'shard->overload' was the rq lock contention, very likely a result of
shared_runq_pick_next_task() trying to grab a remote rq's lock. Looking
at shared_runq_pick_next_task(), the "!task_on_cpu(src_rq, p)" check led
me to believe we might end up enqueuing a task that is currently running
on a CPU, but now that I take a look at shared_runq_enqueue_task() being
called from __enqueue_entity(), that should be a very rare scenario.

However, even if a running task were enqueued into the shared_runq every
so often, a cleared 'shard->overload' flag would still indicate that none
of the runqueues covered by the shard is overloaded, and hence peeking
into the shard can be skipped altogether. Let me see if I can grab some
more stats to verify what exactly is happening.
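
To make the intent concrete, the pick side of the experiment is roughly
along the following lines (a simplified sketch rather than the actual
diff; the list-link member and helper names here are purely
illustrative):

struct shared_runq_shard {
        struct list_head        list;
        raw_spinlock_t          lock;
        /*
         * Set when an rq covered by this shard may have a queued task
         * (mirrors the points where rq->rd->overload is set); cleared
         * once the shard's list is observed to be empty.
         */
        int                     overload;
};

static struct task_struct *
shared_runq_try_pick(struct shared_runq_shard *shard)
{
        struct task_struct *p = NULL;

        /* Cheap read: skip taking shard->lock when nothing can be queued. */
        if (!READ_ONCE(shard->overload))
                return NULL;

        raw_spin_lock(&shard->lock);
        if (list_empty(&shard->list))
                WRITE_ONCE(shard->overload, 0);
        else
                p = list_first_entry(&shard->list, struct task_struct,
                                     shared_runq_node); /* illustrative member */
        raw_spin_unlock(&shard->lock);

        return p;
}

The hope is that the READ_ONCE() of 'shard->overload' lets the newidle
path bail out without touching the (possibly contended) shard and remote
rq locks at all. The actual pick path of course still needs the
task_on_cpu()/affinity checks under the remote rq lock; the sketch only
shows the flag handling.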
>
>> So this is the breakdown of outcomes after peeking into the shared_runq
>> during newidle_balance:
>>
>>                                         SHARED_RUNQ                  SHARED_RUNQ
>>                                  + correct cost accounting    + correct cost accounting
>>                                                               + rq->avg_idle early bail
>>
>> tbench throughput (normalized)     :       1.00                    2.47           (146.84%)
>>
>> attempts                           :  6,560,413               2,273,334           (-65.35%)
>> shared_runq was empty              :  2,276,307 [34.70%]      1,379,071 [60.66%]  (-39.42%)
>> successful at pulling task         :  2,557,158 [38.98%]        342,839 [15.08%]  (-86.59%)
>> unsuccessful despite fetching task :  1,726,948 [26.32%]        551,424 [24.26%]  (-68.06%)
>>
>> As you can see, there are more attempts and a greater chance of success
>> in the case without the upfront rq->avg_idle check. Where I believe the
>> problem lies is that a task is waiting to be enqueued / has already been
>> enqueued while we are busy migrating a task fetched from the shared_runq.
>> Thus, instead of the CPU being idle for a short duration and then running
>> that task, we now make the task wait until we have fetched another task
>> onto the CPU.
>>
>> I think the scenario changes as follows with shared_runq:
>>
>> - Current
>>
>>
>>  [Short Idling]        [2 tasks]                 [1 task]       [2 tasks]
>>     +-------+          +-------+                 +-------+      +-------+
>>     |       |          |       |     wakeup      |       |      |       |
>>     | CPU 0 |          | CPU 1 |     on CPU0     | CPU 0 |      | CPU 1 |
>>     |       |          |       |    -------->    |       |      |       |
>>     +-------+          +-------+                 +-------+      +-------+
>>
>> - With shared_runq
>>
>> [pull from CPU1]       [2 tasks]                 [2 tasks]      [1 task]
>>     +-------+          +-------+                 +-------+      +-------+
>>     |       |          |       |     wakeup      |       |      |       |
>>     | CPU 0 |          | CPU 1 |     on CPU0     | CPU 0 |      | CPU 1 |
>>     |       |          |       |    -------->    |       |      |       |
>>     +-------+          +-------+                 +-------+      +-------+
>>
>> We reach a similar final state but with shared_runq we've paid a price
>> for task migration. Worst case, the following timeline can happen:
>>
>>        |
>>  CPU0  |  [T0 R, T1 Q]  [        T0 R        ]  [newidle_balance]  [T4 R ...
>>        |
>>        |     pull T1 \                        pull T4 /
>>        |
>>  CPU1  |  [T3 R]  [newidle_balance]  [T1 R, T4 Q]       [      T1 R      ]
>>        |                               [T4 TTWU]
>>        |
>>
>> With the rq->avg_idle bailout, it might end up looking like:
>>
>>        |
>>  CPU0  |  [     T0 R, T1 Q     ]  [T1 R ...
>>        |
>>        |
>>  CPU1  |  [T3 R]  [   I   ]  [T4 R ...
>>        |
>>        |
>
> This certainly seems possible, and wouldn't be terribly surprising or
> unexpected. Taking a step back here, I want to be clear that I do
> understand the motivation for including the rq->avg_idle check for
> SHARED_RUNQ; even just conceptually, and regardless of the numbers you
> and others have observed for workloads that do these short sleeps. The
> whole idea behind that check is that we want to avoid doing
> newidle_balance() if the overhead of doing newidle_balance() would
> exceed the amount of time that a task was blocked. Makes sense. Why
> would you take the overhead of balancing if you have reason to believe
> that a task is likely to be idle for less time than it takes to do a
> migration?
>
> There's certainly a reasonable argument for why that should also apply
> to SHARED_RUNQ. If the overhead of doing a SHARED_RUNQ migration is
> greater than the amount of time that an sd is expected to be idle, then
> it's not worth bothering with SHARED_RUNQ either. On the other hand, the
> claim of SHARED_RUNQ is that it's faster than doing a regular balance
> pass, because we're doing an O(# shards) iteration to find tasks (before
> sharding it was O(1)), rather than O(# CPUs). So if we also do the
> rq->avg_idle check, that basically means that SHARED_RUNQ becomes a
> cache for a full load_balance() call.
>
> Maybe that makes sense and is ultimately the correct design /
> implementation for the feature. I'm not fundamentally opposed to that,
> but I think we should be cognizant of the tradeoff we're making. If we
> don't include this rq->avg_idle check, then some workloads will regress
> because we're doing excessive migrations, but if we do check it, then
> others will also regress because we're doing insufficient migrations due
> to incorrectly assuming that an rq won't be idle for long. On yet
> another hand, maybe it's fine to allow users to work around that by
> setting sysctl_sched_migration_cost_ns = 0? That only sort of works,
> because we ignore that and set rq->max_idle_balance_cost = curr_cost in
> newidle_balance() if we end up doing a balance pass. I also know that
> Peter and others discourage the use of these debugfs knobs, so I'm not
> sure it's even applicable to point that out as a workaround.
>
> And so hopefully the problem starts to become clear. It doesn't take
> long for us to get mired in heuristics that make it difficult to
> reason about the expected behavior of the feature, and also difficult to
> reason about future changes as these heuristics have now all crossed
> streams. Maybe that's OK, and is preferable to the alternative. My
> personal opinion, however, is that it's preferable to provide users with
> knobs that do straightforward things that are independent from existing
> heuristics and knobs which were added for other circumstances. I'd
> rather have confidence that I understand how a feature is supposed to
> work, and can easily reason about when it's stupid (or not) to use it,
> vs. have an expectation for it to not regress workloads in any scenario.
>
> Note that this doesn't mean we can't make my patches less dumb. I think
> your suggestions to e.g. check the overload flag (or possibly even
> better to just not enqueue in a shard if the rq isn't overloaded),
> re-check ttwu->pending after failing to find a task in the shard, etc
> make complete sense. There's no downside -- we're just avoiding
> pointless work. It's the heuristics like checking rq->avg_idle that
> really worry me.

I agree, since avg_idle is merely a prediction that may or may not be
true.
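
For reference, the bail-out we keep referring to is essentially of the
following shape at the top of newidle_balance() (paraphrased; the exact
condition and the cost threshold it compares against differ across
kernel versions, and the wrapper below is purely illustrative):

static inline bool newidle_should_bail(struct rq *this_rq)
{
        /*
         * Bail out of the newidle pass (and, with the check being
         * debated, skip the shared_runq peek as well) when nothing in
         * the root domain is overloaded, or when this rq is not
         * expected to stay idle long enough to amortize a migration.
         * avg_idle is only a prediction of how long the idle spell
         * will last, which is exactly the part that can go wrong.
         */
        return !READ_ONCE(this_rq->rd->overload) ||
               this_rq->avg_idle < sysctl_sched_migration_cost;
}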
>
> Peter -- I think it would be helpful if you could weigh in here just to
> provide your thoughts on this more "philosophical" question.
>
>> If possible, can you check how long the avg_idle is while running your
>> workload? Meanwhile, I believe there are a few workloads that exhibit
>> the same behavior as tbench (large-scale idling for short durations).
>> Let me go check if I can see a tbench-like issue there.
>
> Sure thing, in the meantime I'll test this out on HHVM. I've actually
> been working on getting a build + testbed ready for a few days, so
> hopefully it won't take much longer to get some results. Even if it
> turns out that this works great for HHVM, I'd ideally like to get
> Peter's and others' thoughts on the above.

I'll gather some more data too in the meantime :)
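
For the enqueue-side idea you mention above (not adding a task to the
shard when its rq isn't overloaded), the variant I plan to test looks
roughly like the below. The shard-lookup helper and the list-link member
are spelled out only for illustration and need not match the RFC:

/*
 * Sketch of the enqueue-side gate: only publish @p to the shard if its
 * rq already has other work queued. Note that, depending on where in
 * the enqueue path this runs, rq->nr_running may or may not already
 * account for @p, so the exact threshold needs care in a real patch.
 */
static void shared_runq_enqueue_task(struct rq *rq, struct task_struct *p)
{
        struct shared_runq_shard *shard = rq_to_shard(rq); /* illustrative */

        if (rq->nr_running < 2)
                return;

        raw_spin_lock(&shard->lock);
        list_add_tail(&p->shared_runq_node, &shard->list); /* illustrative */
        WRITE_ONCE(shard->overload, 1);
        raw_spin_unlock(&shard->lock);
}

That should also keep 'shard->overload' naturally in sync with whether
any rq covered by the shard actually has a spare task worth pulling.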
>
> Thanks,
> David
--
Thanks and Regards,
Prateek