linux-kernel - Re: [PATCH 00/10] steal tasks to improve CPU utilization


Message-ID: <09b10abc-8357-2db3-3d30-8aa9e95e8655@arm.com>
Date:   Thu, 25 Oct 2018 12:31:12 +0100
From:   Valentin Schneider <valentin.schneider@....com>
To:     Steven Sistare <steven.sistare@...cle.com>,
        Peter Zijlstra <peterz@...radead.org>
Cc:     mingo@...hat.com, subhra.mazumdar@...cle.com,
        dhaval.giani@...cle.com, daniel.m.jordan@...cle.com,
        pavel.tatashin@...rosoft.com, matt@...eblueprint.co.uk,
        umgwanakikbuti@...il.com, riel@...hat.com, jbacik@...com,
        juri.lelli@...hat.com, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 00/10] steal tasks to improve CPU utilization


On 24/10/2018 20:27, Steven Sistare wrote:
[...]
> Hi Valentin,
> 
> Asymmetric systems could maintain a separate bitmap for misfits; set a bit 
> when a misfit task goes on CPU, clear it going off.  When a fast CPU goes new idle,
> it would first search the misfits mask, then search cfs_overload_cpus.
> The misfits logic would be conditionalized with CONFIG or sched feat static 
> branches so symmetric systems do not incur extra overhead.
> 

That sounds reasonable - besides, misfit already introduces a
sched_asym_cpucapacity static key. I'll try to play around with that.
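
To make the proposal above concrete, here is a minimal userspace sketch of the two-mask search, not kernel code. The struct, the helper names and the 64-CPU limit are all made up for illustration, and a plain bool stands in for the sched_asym_cpucapacity static key:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins: one bit per CPU, at most 64 CPUs. */
struct steal_masks {
	uint64_t misfit_cpus;	/* CPUs currently running a misfit task */
	uint64_t overload_cpus;	/* stand-in for cfs_overload_cpus */
};

/* Stand-in for the sched_asym_cpucapacity static key. */
static bool asym_cpucapacity;

/* Set the bit when a misfit task goes on CPU... */
static void misfit_on_cpu(struct steal_masks *m, int cpu)
{
	m->misfit_cpus |= UINT64_C(1) << cpu;
}

/* ...and clear it when the task goes off CPU. */
static void misfit_off_cpu(struct steal_masks *m, int cpu)
{
	m->misfit_cpus &= ~(UINT64_C(1) << cpu);
}

/* Lowest set bit in the mask, or -1 if the mask is empty. */
static int first_cpu(uint64_t mask)
{
	for (int cpu = 0; cpu < 64; cpu++)
		if (mask & (UINT64_C(1) << cpu))
			return cpu;
	return -1;
}

/*
 * A fast CPU going new idle looks at the misfit mask first and then
 * falls back to the overload mask. Symmetric systems skip the misfit
 * search entirely.
 */
static int pick_steal_src(const struct steal_masks *m, bool dst_is_fast)
{
	if (asym_cpucapacity && dst_is_fast) {
		int cpu = first_cpu(m->misfit_cpus);

		if (cpu >= 0)
			return cpu;
	}
	return first_cpu(m->overload_cpus);
}
```

The real thing would presumably use cpumasks and per-LLC sharing like the rest of the series; this only shows the search order.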

>> We'd also lose the NOHZ update done in idle_balance(), though I think it's
>> not such a big deal - we were piggy-backing this on idle_balance() just
>> because it happened to be convenient, and we still have NOHZ_STATS_KICK
>> anyway.
> 
> Agreed.
>  
>> Another thing - in your test cases, what is the most prevalent cause of
>> failure to pull a task in idle_balance()? Is it the load_balance() itself
>> that fails to find a task (e.g. because the imbalance is not deemed big
>> enough), or is it the idle migration cost logic that prevents
>> load_balance() from running to completion?
> 
> The latter.  Eg, for the test "X6-2, 40 CPUs, hackbench 3 process 50000",
> CPU avg_idle is 355566 nsec, and sched_migration_cost_ns = 500000,
> so idle_balance bails at the top:
>           if (this_rq->avg_idle < sysctl_sched_migration_cost ||
>             ...
>             goto out
> 
> For other tests, we get past that clause but bail from a domain:
>       if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
>            ...
>            break;
> 
>> In the first case, try_steal() makes perfect sense to me. In the second
>> case, I'm not sure if we really want to pull something if we know (well,
>> we *think*) we're about to resume the execution of some other task.
> 
> 355.566 microsec is enough time to steal, go on CPU, do useful work, and go 
> off CPU, particularly for chatty workloads like hackbench.  The performance
> data bear this out.  For the higher loads, the average timeslice for 
> hackbench 
> 

Thanks for the explanation. AIUI the big difference here is that try_steal()
is considerably cheaper than load_balance(), so the rq->avg_idle concerns
matter less (or at least, on a considerably smaller scale).
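
For reference, the two bail-out points quoted above boil down to the comparisons below. This is a simplified standalone sketch, assuming a single domain and leaving out the conditions elided in the quoted snippets; the enum and function names are made up:

```c
#include <stdint.h>

/* Hypothetical labels for where the new-idle path stops. */
enum newidle_outcome {
	BAIL_AT_TOP,	/* avg_idle below the migration cost */
	BAIL_IN_DOMAIN,	/* avg_idle below the accumulated balance cost */
	RAN_BALANCE,	/* load_balance() would get to run */
};

/* All values are in nanoseconds. */
static enum newidle_outcome classify_newidle(uint64_t avg_idle,
					     uint64_t migration_cost,
					     uint64_t curr_cost,
					     uint64_t max_newidle_lb_cost)
{
	if (avg_idle < migration_cost)
		return BAIL_AT_TOP;
	if (avg_idle < curr_cost + max_newidle_lb_cost)
		return BAIL_IN_DOMAIN;
	return RAN_BALANCE;
}
```

With the hackbench numbers above (avg_idle of 355566 against a migration cost of 500000), the first comparison is true, so we bail at the top before looking at any domain.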

> Perhaps I could skip try_steal() if avg_idle is very small, although with
> hackbench I have seen average time slice as small as 10 microsec under 
> high load and preemptions.  I'll run some experiments.
> 

That might be a safe thing to do. In the same department, maybe we could
skip try_steal() if we bail out of idle_balance() because
!(this_rq->rd->overload). Although rq->rd->overload and cfs_overload_cpus
are decoupled, they should express the same thing here.
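
Roughly, the two skip conditions discussed here could be combined as in the sketch below. Both the helper and the min_idle_ns threshold are hypothetical; nothing like that threshold exists in the series, and its value would have to come out of the experiments:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical gate in front of try_steal(): skip stealing when the
 * root domain is not overloaded, or when the expected idle time is
 * below some (made-up) minimum.
 */
static bool should_try_steal(uint64_t avg_idle_ns, uint64_t min_idle_ns,
			     bool rd_overload)
{
	if (!rd_overload)
		return false;
	if (avg_idle_ns < min_idle_ns)
		return false;
	return true;
}
```

Given the 10 microsec time slices mentioned above, min_idle_ns would need to be quite small for this not to hurt hackbench-like workloads.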

>>> We could merge the stealing code into the idle_balance() code to get a
>>> union of the two, but IMO that would be less readable.
>>>
>>> We could remove the core and socket levels from idle_balance()
>>
>> I understand that as only doing load_balance() at DIE level in
>> idle_balance(), as that is what makes most sense to me (with big.LITTLE
>> those misfit migrations are done at DIE level), is that correct?
> 
> Correct.
>
>> Also, with DynamIQ (next gen big.LITTLE) we could have asymmetry at MC
>> level, which could cause issues there.
> 
> We could keep idle_balance for this level and fall back to stealing as in
> my patch, or you could extend the misfits bitmap to also include CPUs 
> with reduced memory bandwidth and active tasks. (if I understand the asymmetry 
> correctly).
> 

It's mostly µarch asymmetry, so by "asymmetry at MC level" I meant "we'll
see the SD_ASYM_CPUCAPACITY flag at MC level". But if we tweak stealing
to take misfit tasks into account (so we'd rely on SD_ASYM_CPUCAPACITY
in some way or another), that could work.

>>> and let
>>> stealing handle those levels.  I think that makes sense after stealing
>>> performance is validated on more architectures, but we would still have
>>> two different mechanisms.
>>>
>>> - Steve
>>
>> I'll try out those patches on top of the misfit series to see how the
>> whole thing behaves.
> 
> Very good, thanks.
> 
> - Steve
>