Message-ID: <925f13cd-b020-a799-3505-b3df46a51ffe@amd.com>
Date:   Wed, 4 Oct 2023 09:51:18 +0530
From:   K Prateek Nayak <kprateek.nayak@....com>
To:     David Vernet <void@...ifault.com>
Cc:     linux-kernel@...r.kernel.org, peterz@...radead.org,
        mingo@...hat.com, juri.lelli@...hat.com,
        vincent.guittot@...aro.org, dietmar.eggemann@....com,
        rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
        bristot@...hat.com, vschneid@...hat.com, tj@...nel.org,
        roman.gushchin@...ux.dev, gautham.shenoy@....com,
        aaron.lu@...el.com, wuyun.abel@...edance.com, kernel-team@...a.com
Subject: Re: [RFC PATCH 3/3] sched/fair: Add a per-shard overload flag

Hello David,

Thank you for answering my queries; I'll leave some data below to
answer yours.

On 9/29/2023 10:31 PM, David Vernet wrote:
> On Fri, Sep 01, 2023 at 01:53:12AM +0530, K Prateek Nayak wrote:
>> Hello David,
>>
>> On 9/1/2023 12:41 AM, David Vernet wrote:
>>> On Thu, Aug 31, 2023 at 04:15:08PM +0530, K Prateek Nayak wrote:
>>>
>>> Hi Prateek,
>>>
>>>> Even with the two patches, I still observe the following lock
>>>> contention when profiling the tbench 128-clients run with IBS:
>>>>
>>>>   -   12.61%  swapper          [kernel.vmlinux]         [k] native_queued_spin_lock_slowpath
>>>>      - 10.94% native_queued_spin_lock_slowpath
>>>>         - 10.73% _raw_spin_lock
>>>>            - 9.57% __schedule
>>>>                 schedule_idle
>>>>                 do_idle
>>>>               + cpu_startup_entry
>>>>            - 0.82% task_rq_lock
>>>>                 newidle_balance
>>>>                 pick_next_task_fair
>>>>                 __schedule
>>>>                 schedule_idle
>>>>                 do_idle
>>>>               + cpu_startup_entry
>>>>
>>>> Since David mentioned the rq->avg_idle check is probably not the right
>>>> step towards the solution, this experiment introduces a per-shard
>>>> "overload" flag. Similar to "rq->rd->overload", the per-shard overload
>>>> flag notifies of the possibility of one or more rqs covered by the
>>>> shard's domain having a queued task. The shard's overload flag is set
>>>> at the same time as "rq->rd->overload", and is cleared when the shard's
>>>> list is found to be empty.
>>>
>>> I think this is an interesting idea, but I feel that it's still working
>>> against the core proposition of SHARED_RUNQ, which is to enable work
>>> conservation.
>>
>> I don't think so! Work conservation is possible if there is an
>> imbalance. Consider the case where we have 15 tasks in the shared_runq
>> but we have 16 CPUs, 15 of which are running these 15 tasks, and one going
> 
> I'm not sure I'm fully following. Those 15 tasks would not be enqueued
> in the shared runq if they were being run. They would be dequeued from
> the shared_runq in __dequeue_entity(), which would be called from
> set_next_entity() before they were run. In this case, the
> shard->overload check should be equivalent to the
> !list_empty(&shard->list) check.
> 
> Oh, or is the idea that we're not bothering to pull them from the
> shared_runq if they're being woken up and enqueued on an idle core that
> will immediately run them on the next resched path? If so, I wonder if
> we would instead just want to not enqueue the task in the shared_runq at
> all? Consider that if another task comes in on an rq with
> rq->nr_running >= 2, that we still wouldn't want to pull the tasks that
> were being woken up on idle cores (nor take the overhead of inserting
> and then immediately removing them from the shared_runq).

So this is the breakdown of outcomes after peeking into the shared_runq
during newidle_balance:

                                                SHARED_RUNQ                     SHARED_RUNQ
                                        + correct cost accounting       + correct cost accounting
                                                                        + rq->avg_idle early bail

tbench throughput (normalized)          :            1.00                       2.47          (+146.84%)

attempts                                :       6,560,413                  2,273,334           (-65.35%)
shared_runq was empty                   :       2,276,307 [34.70%]         1,379,071 [60.66%]  (-39.42%)
successful at pulling task              :       2,557,158 [38.98%]           342,839 [15.08%]  (-86.59%)
unsuccessful despite fetching task      :       1,726,948 [26.32%]           551,424 [24.26%]  (-68.06%)
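
For reference, the deltas in the last column are simply the relative
change between the two columns. A small standalone userspace program
along the following lines can recompute all the percentages from the raw
counts in the table (the file and struct names here are made up purely
for illustration):

/* check_deltas.c - recompute the percentages in the table above.
 * Build: gcc -O2 -o check_deltas check_deltas.c
 */
#include <stdio.h>

struct row {
	const char *name;
	double no_bail;   /* SHARED_RUNQ + correct cost accounting */
	double with_bail; /* ... + rq->avg_idle early bail         */
};

int main(void)
{
	const struct row rows[] = {
		{ "attempts",                           6560413, 2273334 },
		{ "shared_runq was empty",              2276307, 1379071 },
		{ "successful at pulling task",         2557158,  342839 },
		{ "unsuccessful despite fetching task", 1726948,  551424 },
	};

	for (unsigned int i = 0; i < sizeof(rows) / sizeof(rows[0]); i++) {
		const struct row *r = &rows[i];
		/* Share of all attempts (trivially 100% for the first row). */
		double share_a = 100.0 * r->no_bail / rows[0].no_bail;
		double share_b = 100.0 * r->with_bail / rows[0].with_bail;
		/* Relative change: with the early bail vs. without it. */
		double delta = 100.0 * (r->with_bail - r->no_bail) / r->no_bail;

		printf("%-38s %10.0f [%6.2f%%]  %10.0f [%6.2f%%]  (%+.2f%%)\n",
		       r->name, r->no_bail, share_a, r->with_bail, share_b,
		       delta);
	}
	return 0;
}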

As you can see, there are more attempts and a greater chance of success
in the case without the rq->avg_idle check upfront. Where I believe the
problem lies is that a task is waiting to be enqueued / has been
enqueued while we are trying to migrate a task fetched from the
shared_runq. Thus, instead of the CPU being idle for a short duration
and then running that task, we now make the task wait until we have
fetched another task onto the CPU.

I think the scenario changes as follows with shared_runq:

- Current


      [Short Idling]	[2 tasks]                        [1 task]	[2 tasks]
	+-------+	+-------+                       +-------+	+-------+
	|	|	|	|        wakeup         |	|	|	|
	| CPU 0 |	| CPU 1 |	 on CPU0        | CPU 0 |	| CPU 1 |
	|	|	|	|       -------->       |	|	|	|
	+-------+	+-------+                       +-------+	+-------+

- With shared_runq

      [pull from CPU1]	[2 tasks]                       [2 tasks]	[1 task]
	+-------+	+-------+                       +-------+	+-------+
	|	|	|	|        wakeup         |	|	|	|
	| CPU 0 |	| CPU 1 |	 on CPU0        | CPU 0 |	| CPU 1 |
	|	|	|	|       -------->       |	|	|	|
	+-------+	+-------+                       +-------+	+-------+

We reach a similar final state, but with shared_runq we have paid the
price of a task migration. In the worst case, the following timeline
can happen:

        |
  CPU0  | [T0 R, T1 Q] [       T0 R      ] [newidle_balance] [T4 R ...
        |
        |                  pull T1 \             pull T4 /
        |
  CPU1  | [T3 R] [newidle_balance] [T1 R, T4 Q] [       T1 R      ]
        |            [T4 TTWU]
        |

With the rq->avg_idle bailout, it might end up looking like:

        |
  CPU0  | [          T0 R, T1 Q          ] [T1 R ...
        |
        |
  CPU1  | [T3 R] [ I ] [T4 R ...
        |            
        |

If possible, can you check how long avg_idle is when running your
workload? Meanwhile, I believe there are a few workloads that exhibit
the same behavior as tbench (large-scale idling for short durations).
Let me go check if I can see a tbench-like issue there.
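
To make the "rq->avg_idle early bail" above a bit more concrete: the
idea is to skip peeking into the shared_runq when the CPU's expected
idle window is shorter than roughly what a task migration costs, in the
same spirit as the existing bailout in newidle_balance() that compares
rq->avg_idle against the migration cost. Below is only a rough
userspace sketch of that decision; the struct, the function names and
the threshold standing in for sysctl_sched_migration_cost are all made
up for illustration and are not the actual kernel code:

/* avg_idle_bail.c - toy model of "bail out when the expected idle
 * window is too short to pay for a pull", as discussed above.
 * Build: gcc -O2 -o avg_idle_bail avg_idle_bail.c
 */
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the bits of struct rq we care about. */
struct toy_rq {
	unsigned long long avg_idle_ns;	/* estimated idle window */
};

/* Stand-in for the migration cost threshold (500us by default upstream). */
#define TOY_MIGRATION_COST_NS	500000ULL

/* A newidle pull from the shared runqueue only looks worthwhile if the
 * CPU expects to stay idle for longer than a migration costs. */
static bool toy_should_pull(const struct toy_rq *rq)
{
	return rq->avg_idle_ns >= TOY_MIGRATION_COST_NS;
}

int main(void)
{
	/* tbench-like: lots of very short idle periods -> skip the pull. */
	struct toy_rq short_idler = { .avg_idle_ns =   50000 };	/* 50us */
	/* batch-like: long idle periods -> pulling pays off. */
	struct toy_rq long_idler  = { .avg_idle_ns = 2000000 };	/* 2ms  */

	printf("short idler pulls? %s\n", toy_should_pull(&short_idler) ? "yes" : "no");
	printf("long idler pulls?  %s\n", toy_should_pull(&long_idler) ? "yes" : "no");
	return 0;
}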

> 
>> idle. Work is conserved. What we need to worry about is tasks being in
>> the shared_runq that are queued on their respective CPU. This can only
>> happen if any one of the rqs has nr_running >= 2, which is also the
>> point where we set "shard->overload".
> 
> Assuming this is about the "wakeup / enqueue to idle core" case, ok,
> this makes sense. I still think it probably makes more sense to just not
> enqueue in the shared_runq for this case though, which would allow us to
> instead just rely on list_empty(&shard->list).
> 
>> Now the situation can change later and all tasks in the shared_runq
>> might be running on their respective CPUs, but "shard->overload" is
>> only cleared when
>> the shared_runq becomes empty. If this is too late, maybe we can clear
>> it if periodic load balancing finds no queuing (somewhere around the
>> time we update nr_idle_scan).
>>
>> So the window where we do not go ahead with popping a task from the
>> shared_runq_shard->list is between the list being empty and at least one
>> of the CPUs associated with the shard reporting nr_running >= 2, which is
>> when work conservation is needed.
> 
> So, I misread your patch the first time I reviewed it, and for some
> reason thought you were only setting shard->overload on the
> load_balance(). That's obviously not the case, and I now understand it
> better, modulo my points above being clarified.
> 
>>>
>>>> With these changes, following are the results for tbench 128-clients:
>>>
>>> Just to make sure I understand, this is to address the contention we're
>>> observing on tbench with 64 - 256 clients, right?  That's my
>>> understanding from Gautham's reply in [0].
>>>
>>> [0]: https://lore.kernel.org/all/ZOc7i7wM0x4hF4vL@BLR-5CG11610CF.amd.com/
>>
>> I'm not sure if Gautham saw contention with IBS, but he did see an
>> abnormal blowup in newidle_balance() counts, which he suspected was the
>> cause of the regression. I noticed the rq lock contention after doing a
>> fair bit of surgery. Let me go check if that was the case with vanilla
>> v3 too.
>>
>>>
>>> If so, are we sure this change won't regress other workloads that would
>>> have benefited from the work conservation?
>>
>> I don't think we'll regress any workloads, as I explained above, because
>> the "overload" flag being 0 almost always (since the update/access is not
>> atomic) indicates a case where the tasks cannot be pulled. However, that
>> needs to be tested since there is a small behavior change in
>> shared_runq_pick_next_task(): previously, if the task was running on a
>> CPU, we would have popped it from the shared_runq and done some lock
>> fiddling before finding out it was running, plus some more lock fiddling
>> before the function returned "-1"; now, with the changes here, it'll
>> simply return "0", and although that is correct, we have seen some
>> interesting cases in the past [1] where a random lock contention actually
>> helps certain benchmarks ¯\_(ツ)_/¯
> 
> I don't think we need to worry about less lock contention possibly
> hurting benchmarks :-)

Yup :)

> 
>> [1] https://lore.kernel.org/all/44428f1e-ca2c-466f-952f-d5ad33f12073@amd.com/ 
>>
>>>
>>> Also, I assume that you don't see the improved contention without this,
>>> even if you include your fix to the newidle_balance() that has us skip
>>> over the <= LLC domain?
>>
>> No improvements! The lock is still very much contended for. I wonder if
>> it could be because of the unlocking and locking the rq again in
>> shared_runq_pick_next_task() even when the task is on a CPU. Also, since
>> it returns -1 for this case, pick_next_task_fair() will return
>> RETRY_TASK, which can have further implications.
> 
> Yeah, I could see it being an issue if we're essentially thrashing tasks
> in the shared_runq that are just temporarily enqueued in the shared_runq
> between activate and doing a resched pass on an idle core.
> 
> Unfortunately, I don't think we have any choice but to drop and
> reacquire the rq lock. It's not safe to look at task_cpu(p) without
> pi_lock, and we can't safely acquire that without dropping the rq lock.

True that. We wouldn't want to run into a deadlock scenario or cause
more lock contention with double locking :(
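
For what it's worth, when both rq locks really are needed, the usual way
to avoid the deadlock is to always take them in a fixed order, which is
what double_rq_lock() does. Below is a toy userspace illustration of
that pattern with pthreads; all of the names here are made up and it is
only meant to show the ordering idea, not the scheduler code:

/* double_lock.c - deadlock-free "double locking" by always acquiring
 * the two locks in a fixed (address) order, similar in spirit to the
 * kernel's double_rq_lock().
 * Build: gcc -O2 -pthread -o double_lock double_lock.c
 */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

struct toy_rq {
	pthread_mutex_t lock;
	int nr_running;
};

/* Always lock the lower-addressed rq first so two threads that try to
 * double-lock the same pair can never deadlock against each other. */
static void toy_double_lock(struct toy_rq *a, struct toy_rq *b)
{
	if (a == b) {
		pthread_mutex_lock(&a->lock);
		return;
	}
	if ((uintptr_t)a > (uintptr_t)b) {
		struct toy_rq *tmp = a;
		a = b;
		b = tmp;
	}
	pthread_mutex_lock(&a->lock);
	pthread_mutex_lock(&b->lock);
}

static void toy_double_unlock(struct toy_rq *a, struct toy_rq *b)
{
	pthread_mutex_unlock(&a->lock);
	if (a != b)
		pthread_mutex_unlock(&b->lock);
}

int main(void)
{
	struct toy_rq rq0 = { PTHREAD_MUTEX_INITIALIZER, 2 };
	struct toy_rq rq1 = { PTHREAD_MUTEX_INITIALIZER, 0 };

	/* "Migrate" one task from rq0 to rq1 under both locks; the
	 * argument order does not matter for correctness. */
	toy_double_lock(&rq1, &rq0);
	rq0.nr_running--;
	rq1.nr_running++;
	toy_double_unlock(&rq1, &rq0);

	printf("rq0=%d rq1=%d\n", rq0.nr_running, rq1.nr_running);
	return 0;
}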

> 
> Thanks,
> David

--
Thanks and Regards,
Prateek
