linux-kernel - Re: [PATCH 07/10] sched/fair: Provide can_migrate_task

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <d018a2d4-815a-9731-63e4-ba61c8259435@arm.com>
Date:   Wed, 31 Oct 2018 18:48:20 +0000
From:   Valentin Schneider <valentin.schneider@....com>
To:     Steven Sistare <steven.sistare@...cle.com>, mingo@...hat.com,
        peterz@...radead.org
Cc:     subhra.mazumdar@...cle.com, dhaval.giani@...cle.com,
        rohit.k.jain@...cle.com, daniel.m.jordan@...cle.com,
        pavel.tatashin@...rosoft.com, matt@...eblueprint.co.uk,
        umgwanakikbuti@...il.com, riel@...hat.com, jbacik@...com,
        juri.lelli@...hat.com, linux-kernel@...r.kernel.org,
        Quentin Perret <quentin.perret@....com>
Subject: Re: [PATCH 07/10] sched/fair: Provide can_migrate_task_llc



On 31/10/2018 15:43, Steven Sistare wrote:
> On 10/29/2018 3:34 PM, Valentin Schneider wrote:
[...]
>> Suppose you have 2 rq's sharing a workload of 3 tasks. You get one rq with
>> nr_running == 1 (r_1) and one rq with nr_running == 2 (r_2).
>>
>> As soon as the task on r_1 ends/blocks, we'll go through idle balancing and
>> can potentially steal the non-running task from r_2. Sometime later the task
>> that was running on r_1 wakes up, and we end up with r_1->nr_running == 2
>> and r_2->nr_running == 1.
>>
>> IOW we've swapped their role in that example, and the whole thing can
>> repeat.
>>
>> The shorter the period of those tasks, the more we'll migrate them
>> between rq's, hence why I wonder if we shouldn't have some sort of
>> throttling.
> 
> Stealing is still the right move in this scenario.  Idle cycles become useful
> cycles.  The only cost is the CPU time to dequeue from a remote rq and
> enqueue on the local rq.  Earlier we discussed skipping try_steal() if avg_idle
> is very small, on the order of 10 usec.  I think that type of throttling would 
> cover your scenario.  I will add it in my next version.
> 

Sounds good to me. Out of curiosity, how did you establish this 10 µsec
threshold? I guess we just want something very tiny but still big enough
to broadly cover try_steal() worst case exec times.

[...]
>> Mmm so task_hot() mainly implements two mechanisms - the CACHE_HOT_BUDDY
>> sched feature and the exec_start threshold.
>>
>> The first one should be sidestepped in the stealing case since we won't
>> pass (if env->dst_rq->nr_running), that leaves us with the threshold.
>>
>> We might want to sidestep it when we are doing balancing within an LLC
>> domain (env->sd->flags & SD_SHARE_PKG_RESOURCES) - or use a lower threshold
>> in such cases.
>>
>> In any case, I think it would make sense to add some LLC conditions to
>> task_hot() so that
>> - regular load_balance() can also benefit from them
> 
> This is probably a good idea (lower threshold for task_hot within LLC).
> I would rather see it done as a separate patch, with a separate performance
> evaluation, as it will affect all workloads, even those that do not steal.

Agreed.

> A load balancing migration when !task_hot() may be performed even when
> the dst CPU already has a task to run, so the migration may or may not
> improve utilization.  By contrast, a newly idle CPU that does not find
> work goes idle and definitely wastes cycles.  Note how
> migrate_degrades_locality() chooses migration regardless of preferred node
> when the dst is idle:
> 
>         /* Leaving a core idle is often worse than degrading locality. */
>         if (env->idle != CPU_NOT_IDLE)
>                 return -1;
> 
> I apply the same principle in can_migrate_task_llc().
> 
[...]
>> Right, so my line of thinking was that by not doing a load_balance() and
>> taking a shortcut (stealing a task), we may end up just postponing a
>> load_balance() to after we've stolen a task. I guess in those cases
>> there's no magic trick to be found and we just have to deal with it.
> 
> In the current code I call idle_balance/load_balance first and then try_steal.
> If idle_balance fails because of cost, then it has effectively postponed itself,
> independently of stealing.  The next successful call to load_balance will
> correct any imbalance caused by stealing.
> 
>> And then there's some of the logic like we have in update_sd_pick_busiest()
>> where we e.g. try to prevent misfit tasks from running on LITTLEs, but
>> then if such tasks are waiting to be run and a LITTLE frees itself up,
>> I *think* it's okay to steal it.
> 
> Should be OK to steal.  If a BIG subsequently goes idle, load_balance will move
> the task to the BIG, or the BIG may steal it when we support misfit stealing.
> 
> Questions for you, Valentin:
> 
>   - Should misfit stealing be a separate patch, after my series?  I prefer that,
>     so we get stealing in peoples hands as soon as possible.  I think separating 
>     it is OK because stealing should not cause any regression for misfits, as 
>     my code still calls idle_balance/load_balance, which handles misfits.
> 

I don't have any objections, Quentin (added on cc) might want to disable
stealing when !overutilized for Energy Aware Scheduling, but then that
also depends on what gets in first :)

>   - Who should implement misfit stealing -- you, me, someone else?  I have no 
>     preference.

It would make more sense if I did that, as it's easy for me to test it -
unless you're really curious about Arm stuff and want to have some fun :)

> 
> - Steve
>