linux-kernel - Re: [PATCH 00/10] steal tasks to improve CPU utilization

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <495866f6-6ab8-55fa-1743-1b6910f94733@oracle.com>
Date:   Thu, 25 Oct 2018 08:21:58 -0400
From:   Steven Sistare <steven.sistare@...cle.com>
To:     Valentin Schneider <valentin.schneider@....com>,
        Peter Zijlstra <peterz@...radead.org>
Cc:     mingo@...hat.com, subhra.mazumdar@...cle.com,
        dhaval.giani@...cle.com, daniel.m.jordan@...cle.com,
        pavel.tatashin@...rosoft.com, matt@...eblueprint.co.uk,
        umgwanakikbuti@...il.com, riel@...hat.com, jbacik@...com,
        juri.lelli@...hat.com, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 00/10] steal tasks to improve CPU utilization

On 10/25/2018 7:31 AM, Valentin Schneider wrote:
> 
> On 24/10/2018 20:27, Steven Sistare wrote:
> [...]
>> Hi Valentin,
>>
>> Asymmetric systems could maintain a separate bitmap for misfits; set a bit 
>> when a CPU goes on CPU, clear it going off.  When a fast CPU goes new idle,
>> it would first search the misfits mask, then search cfs_overload_cpus.
>> The misfits logic would be conditionalized with CONFIG or sched feat static 
>> branches so symmetric systems do not incur extra overhead.
> 
> That sounds reasonable - besides, misfit already introduces a
> sched_asym_cpucapacity static key. I'll try to play around with that.
> 
>>> We'd also lose the NOHZ update done in idle_balance(), though I think it's
>>> not such a big deal - were were piggy-backing this on idle_balance() just
>>> because it happened to be convenient, and we still have NOHZ_STATS_KICK
>>> anyway.
>>
>> Agreed.
>>  
>>> Another thing - in your test cases, what is the most prevalent cause of
>>> failure to pull a task in idle_balance()? Is it the load_balance() itself
>>> that fails to find a task (e.g. because the imbalance is not deemed big
>>> enough), or is it the idle migration cost logic that prevents
>>> load_balance() from running to completion?
>>
>> The latter.  Eg, for the test "X6-2, 40 CPUs, hackbench 3 process 50000",
>> CPU avg_idle is 355566 nsec, and sched_migration_cost_ns = 500000,
>> so idle_balance bails at the top:
>>           if (this_rq->avg_idle < sysctl_sched_migration_cost ||
>>             ...
>>             goto out
>>
>> For other tests, we get past that clause but bail from a domain:
>>       if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
>>            ...
>>            break;
>>
>>> In the first case, try_steal() makes perfect sense to me. In the second
>>> case, I'm not sure if we really want to pull something if we know (well,
>>> we *think*) we're about to resume the execution of some other task.
>>
>> 355.566 microsec is enough time to steal, go on CPU, do useful work, and go 
>> off CPU, particularly for chatty workloads like hackbench.  The performance
>> data bear this out.  For the higher loads, the average timeslice for 
>> hackbench 
>>
> 
> Thanks for the explanation. AIUI the big difference here is that try_steal()
> is considerably cheaper than load_balance(), so the rq->avg_idle concerns
> matter less (or at least, on a considerably smaller scale).

Right.

>> Perhaps I could skip try_steal() if avg_idle is very small, although with
>> hackbench I have seen average time slice as small as 10 microsec under 
>> high load and preemptions.  I'll run some experiments.
> 
> That might be a safe thing to do. In the same department, maybe we could
> skip try_steal() if we bail out of idle_balance() because
> !(this_rq->rd->overload). Although rq->rd->overload and cfs_overload_cpus
> are decoupled, they should express the same thing here.

I tried that in an earlier version of my code:

    new_tasks = idle_balance(rq, rf);
    if (new_tasks == 0 && rq->rd->overload)
        new_tasks = try_steal(rq, rf);

but I did not see any performance improvement vs without the overload check,
so I omitted it for simplicity.

- Steve

>>>> We could merge the stealing code into the idle_balance() code to get a
>>>> union of the two, but IMO that would be less readable.
>>>>
>>>> We could remove the core and socket levels from idle_balance()
>>>
>>> I understand that as only doing load_balance() at DIE level in
>>> idle_balance(), as that is what makes most sense to me (with big.LITTLE
>>> those misfit migrations are done at DIE level), is that correct?
>>
>> Correct. 
>>> Also, with DynamIQ (next gen big.LITTLE) we could have asymmetry at MC
>>> level, which could cause issues there.
>>
>> We could keep idle_balance for this level and fall back to stealing as in
>> my patch, or you could extend the misfits bitmap to also include CPUs 
>> with reduced memory bandwidth and active tasks. (if I understand the asymmetry 
>> correctly).
>>
> 
> It's mostly µarch asymmetry, so by "asymmetry at MC level" I meant "we'll
> see the SD_ASYM_CPUCAPACITY flag at MC level". But if we tweak stealing
> to take misfit tasks into account (so we'd rely on SD_ASYM_CPUCAPACITY
> in some way or another), that could work.
> 
>>>> and let
>>>> stealing handle those levels.  I think that makes sense after stealing
>>>> performance is validated on more architectures, but we would still have
>>>> two different mechanisms.
>>>>
>>>> - Steve
>>>
>>> I'll try out those patches on top of the misfit series to see how the
>>> whole thing behaves.
>>
>> Very good, thanks.
>>
>> - Steve
>>