Message-ID: <7e6ee88e-8462-b1ab-a7bf-536a2c576c7d@oracle.com>
Date: Thu, 15 Feb 2018 10:07:43 -0800
From: Rohit Jain <rohit.k.jain@...cle.com>
To: Steven Sistare <steven.sistare@...cle.com>,
Mike Galbraith <efault@....de>, linux-kernel@...r.kernel.org
Cc: peterz@...radead.org, mingo@...hat.com, joelaf@...gle.com,
jbacik@...com, riel@...hat.com, juri.lelli@...hat.com,
dhaval.giani@...cle.com
Subject: Re: [RFC 1/2] sched: reduce migration cost between faster caches for idle_balance

On 02/15/2018 08:35 AM, Steven Sistare wrote:
> On 2/10/2018 1:37 AM, Mike Galbraith wrote:
>> On Fri, 2018-02-09 at 11:08 -0500, Steven Sistare wrote:
>>>>> @@ -8804,7 +8803,8 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
>>>>> if (!(sd->flags & SD_LOAD_BALANCE))
>>>>> continue;
>>>>>
>>>>> - if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
>>>>> + if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost +
>>>>> + sd->sched_migration_cost) {
>>>>> update_next_balance(sd, &next_balance);
>>>>> break;
>>>>> }
>>>> Ditto.
>>> The old code did not migrate if the expected costs exceeded the expected idle
>>> time. The new code just adds the sd-specific penalty (essentially loss of cache
>>> footprint) to the costs. The for_each_domain loop visits smallest to largest
>>> sd's, hence visiting smallest to largest migration costs (though the tunables do
>>> not enforce an ordering), and bails at the first sd where the total cost is a loss.
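
To make the loop structure concrete, here is a simplified sketch of the
gate with the RFC applied, paraphrased from the 4.15-era idle_balance()
in kernel/sched/fair.c. Locking, stats updates and the rq re-checks are
omitted, so treat it as illustrative rather than the exact patched code:

	/* Illustrative sketch only, not the literal patched function. */
	static int idle_balance_sketch(struct rq *this_rq, int this_cpu)
	{
		unsigned long next_balance = jiffies + HZ;
		int continue_balancing = 1;
		struct sched_domain *sd;
		int pulled_task = 0;
		u64 curr_cost = 0;

		/* Domains are visited smallest (SMT/MC) to largest (NUMA). */
		for_each_domain(this_cpu, sd) {
			u64 t0;

			if (!(sd->flags & SD_LOAD_BALANCE))
				continue;

			/*
			 * Bail at the first level where the measured balancing
			 * cost plus this domain's cache-refill penalty exceeds
			 * the expected idle time; larger domains only cost
			 * more, so there is no point in looking further.
			 */
			if (this_rq->avg_idle < curr_cost +
			    sd->max_newidle_lb_cost + sd->sched_migration_cost) {
				update_next_balance(sd, &next_balance);
				break;
			}

			if (sd->flags & SD_BALANCE_NEWIDLE) {
				t0 = sched_clock_cpu(this_cpu);
				pulled_task = load_balance(this_cpu, this_rq, sd,
							   CPU_NEWLY_IDLE,
							   &continue_balancing);
				curr_cost += sched_clock_cpu(this_cpu) - t0;
			}

			if (pulled_task || !continue_balancing)
				break;
		}

		return pulled_task;
	}
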
>> Hrm..
>>
>> You're now adding a hypothetical cost to the measured cost of running
>> the LB machinery, which implies that the measurement is insufficient,
>> but you still don't say why it is insufficient. What happens if you
>> don't do that? I ask, because when I removed the...
>>
>> this_rq->avg_idle < sysctl_sched_migration_cost
>>
>> ...bits to check removal effect for Peter, the original reason for it
>> being added did not re-materialize, making me wonder why you need to
>> make this cutoff more aggressive.
> The current code with sysctl_sched_migration_cost discourages migration
> too much, per our test results. Deleting it entirely from idle_balance()
> may be the right solution, or it may allow too much migration and
> cause regressions due to loss of cache warmth on some workloads.
> Rohit's patch deletes it and adds the sd->sched_migration_cost term
> to allow a migration rate that is somewhere in the middle, and is
> logically sound. It discourages but does not prevent migration between
> nodes, and encourages but does not always allow migration between cores.
> By contrast, setting relax_domain_level to disable SD_BALANCE_NEWIDLE
> at the SD_NUMA level is a big hammer.
>
> I would be perfectly happy if deleting sysctl_sched_migration_cost from
> idle_balance does the trick. Last week in a different thread you mentioned
> it did not hurt tbench:
>
>>> Mike, do you remember what comes apart when we take
>>> out the sysctl_sched_migration_cost test in idle_balance()?
>> Used to be anything scheduling cross-core heftily suffered, ie pretty
>> much any localhost communication heavy load. I just tried disabling it
>> in 4.13 though (pre pti cliff), tried tbench, and it made zip squat
>> difference. I presume that's due to the meanwhile added
>> this_rq->rd->overload and/or curr_cost checks.
> Can you provide more details on the sysbench oltp test that motivated you
> to add sysctl_sched_migration_cost to idle_balance, so Rohit can re-test it?
> 1b9508f6 sched: Rate-limit newidle
> Rate limit newidle to migration_cost. It's a win for all stages of
> sysbench oltp tests.
>
> Rohit is running more tests with a patch that deletes
> sysctl_sched_migration_cost from idle_balance, and with his original
> patch with the 5000 usec mistake corrected back to 500 usec. So far both
> give improvements over the baseline, but for different cases, so we
> need to try more workloads before we draw any conclusions.
>
> Rohit, can you share your data so far?

Results:

In the following results, the "Domain based" approach is the one from the
RFC, with the values fixed (as Mike pointed out). "No check" is the patch
that simply removes the check against sysctl_sched_migration_cost.
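
For reference, the "No check" variant just drops the avg_idle cutoff from
the early-bail test at the top of idle_balance(); against the 4.15-era
code the change is roughly (context elided, so a sketch rather than the
literal patch):

-	if (this_rq->avg_idle < sysctl_sched_migration_cost ||
-	    !this_rq->rd->overload) {
+	if (!this_rq->rd->overload) {
 		...
 		goto out;
 	}
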
1) Hackbench results on a 2-socket, 44-core, 88-thread Intel x86 machine
(lower is better):
+--------------+-----------------+--------------------------+-------------------------+
|              | Without Patch   | Domain Based             | No Check                |
+------+-------+--------+--------+-----------------+--------+----------------+--------+
|Loops | Groups|Average |%Std Dev|Average          |%Std Dev|Average         |%Std Dev|
+------+-------+--------+--------+-----------------+--------+----------------+--------+
|100000|      4| 9.701  | 0.78   | 7.971 (+17.84%) | 1.34   | 8.919 (+8.07%) | 1.07   |
|100000|      8| 17.186 | 0.77   | 16.712 (+2.76%) | 0.87   | 17.043 (+0.83%)| 0.83   |
|100000|     16| 30.378 | 0.55   | 29.780 (+1.97%) | 0.38   | 29.565 (+2.67%)| 0.29   |
|100000|     32| 54.712 | 0.54   | 53.001 (+3.13%) | 0.19   | 52.158 (+4.67%)| 0.22   |
+------+-------+--------+--------+-----------------+--------+----------------+--------+

2) Sysbench MySQL results on a 2-socket, 44-core, 88-thread Intel x86
machine (higher is better):
+-------+--------------------+----------------------------+----------------------------+
|       | Without Patch      | Domain based               | No check                   |
+-------+-----------+--------+-------------------+--------+-------------------+--------+
|Num    | Average   |        | Average           |        | Average           |        |
|Threads| throughput|%Std Dev| throughput        |%Std Dev| throughput        |%Std Dev|
+-------+-----------+--------+-------------------+--------+-------------------+--------+
|     8 | 133658.2  | 0.66   | 134909.4 (+0.94%) | 0.94   | 134232.2 (+0.43%) | 1.29   |
|    16 | 266540    | 0.48   | 268253.4 (+0.64%) | 0.64   | 268584.6 (+0.77%) | 0.37   |
|    32 | 466315.6  | 0.15   | 465903.6 (-0.09%) | 0.28   | 468594.2 (+0.49%) | 0.23   |
|    64 | 720039.4  | 0.23   | 725663.8 (+0.78%) | 0.42   | 717253.8 (-0.39%) | 0.36   |
|    72 | 757284.4  | 0.25   | 770693.4 (+1.77%) | 0.29   | 764984.0 (+1.02%) | 0.38   |
|    80 | 807955.6  | 0.22   | 818446.0 (+1.30%) | 0.24   | 831372.2 (+2.90%) | 0.10   |
|    88 | 863173.8  | 0.25   | 870520.4 (+0.85%) | 0.23   | 887049.0 (+2.77%) | 0.56   |
|    96 | 882950.8  | 0.32   | 890775.4 (+0.89%) | 0.40   | 892913.8 (+1.13%) | 0.41   |
|   128 | 895112.6  | 0.13   | 898524.2 (+0.38%) | 0.16   | 901195.0 (+0.68%) | 0.28   |
+-------+-----------+--------+-------------------+--------+-------------------+--------+
Thanks,
Rohit