linux-kernel - Re: [PATCH v2] sched: reduce contention on tg's load_avg & runnable

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <52DFF0DC.3050303@hp.com>
Date:	Wed, 22 Jan 2014 11:25:00 -0500
From:	Waiman Long <waiman.long@...com>
To:	Peter Zijlstra <peterz@...radead.org>
CC:	Ingo Molnar <mingo@...hat.com>, linux-kernel@...r.kernel.org,
	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
	Frederic Weisbecker <fweisbec@...il.com>,
	"Eric W. Biederman" <ebiederm@...ssion.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Serge Hallyn <serge.hallyn@...onical.com>,
	Aswin Chandramouleeswaran <aswin@...com>,
	Scott J Norton <scott.norton@...com>,
	Paul Turner <pjt@...gle.com>, Ben Segall <bsegall@...gle.com>
Subject: Re: [PATCH v2] sched: reduce contention on tg's load_avg & runnable_avg

On 01/16/2014 07:44 AM, Peter Zijlstra wrote:
> First of.. WTF is v1?
>
> Secondly, please always CC the authors of the code you're changing.

The v1 patch was sent quite a while ago on 9/21/2013. See 
https://lkml.org/lkml/2013/9/23/551

There was no feedback at that time. As this was not a high priority 
patch for me, I didn't follow up at that time. I will include the change 
log in the next version.

> On Wed, Jan 15, 2014 at 09:22:36PM -0500, Waiman Long wrote:
>> It was found that with a perf profile of a compute workload (at 1500
>> users) of the AIM7 benchmark running on a glueless 4-socket 40-core
>> Westmere-EX system (HT on) on a 3.13-rc8 kernel that the scheduling
>> tick related functions account for quite a significant portion of
>> the total kernel cpu cycles.
>>
>>    0.62%  reaim  [kernel.kallsyms]  [k] update_cfs_rq_blocked_load
>>    0.47%  reaim  [kernel.kallsyms]  [k] entity_tick
>>    0.10%  reaim  [kernel.kallsyms]  [k] update_cfs_shares
>>    0.03%  reaim  [kernel.kallsyms]  [k] update_curr
>>
>> The scheduling tick functions account for about 1.22% of the total
>> CPU cycles. Of the top 2 function in the above list, the reading
>> and writing of the tg->load_avg variable account for over 90% of the
>> CPU cycles:
>>
>>    atomic_long_add(tg_contrib,&tg->load_avg);
>>    atomic_long_read(&tg->load_avg) + 1);
>>
>> This patch reduces the contention on the load_avg variable (and
>> secondarily on the runnable_avg variable) by the following 2 measures:
>>
>> 1. Make the load_avg and runnable_avg fields of the task_group
>>     structure sit in their own cacheline without sharing it with others.
>>     This only applies if the kernel is built for NUMA systems with
>>     multiple sockets.
> So why not for SMP?

The cache coherency traffic is generally not a problem for single-socket 
multi-core system, that is why I currently increase the data structure 
size only for kernel built for multi-socket systems. Of course, I can 
also enable it for SMP system in general.

> Also, what's the difference between having both of them in the same
> cacheline as opposed to a cacheline each?
> They're both touched from the same tick, so it makes sense to have them
> in one cacheline. Now you get to move two lines into exclusive state,
> instead of just the one.

Below is the performance data for different cacheline placements:

    Cacheline Placement | %CPU  |  JPM   |
   ---------------------+-------+--------+
   2 separate cachelines| 0.55% | 405803 |
   1 common cacheline   | 1.01% | 403462 |
   2nd change only      | 1.06% | 403820 |
   Original code        | 1.22% | 398509 |

It seems like forcing the 2 fields to be in the same cacheline actually 
make it perform a little bit worse. It is likely that the 2 fields just 
happen to be in 2 different cachelines in x86.

>> 2. Use atomic_long_add_return() to update the fields and save the
>>     returned value in a temporary location in the cfs structure to
>>     be used later instead of reading the fields directly.
> Then why aren't this two patches?

I will break it into 2 patches.

> Furthermore, I completely hate the way you implemented this; the stuff
> like in the first hunk below makes the entire code flow horrid. Its
> already difficult code, using conditional variables makes it even worse.

I can try to encapsulate the change in macros to not disrupt the current 
flow.

> Who's to say your 'cached' value is recent? You didn't put in a call
> chain analysis to show you always first pass through the add_return()
> before using the cached value.

Will provide a more detailed call chain analysis to show when and how 
the cache value is used.

>> The second change does require some changes in the ordering of how
>> some of the average counts are being computed and hence may have a
>> slight effect on their behavior.
> Might have is no good, either you work through it and make damn sure its
> solid or you walk.

I will do a more detailed analysis and provide that in the change log.

> Preserved the rest for the added Cc's.
>

Will do.

-Longman

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/