Message-ID: <cf13647f-6bb6-34d5-c6b6-41a7500a9612@bytedance.com>
Date: Mon, 20 Jun 2022 21:37:59 +0800
From: Abel Wu <wuyun.abel@...edance.com>
To: K Prateek Nayak <kprateek.nayak@....com>,
Tim Chen <tim.c.chen@...ux.intel.com>,
Yicong Yang <yangyicong@...wei.com>,
Yicong Yang <yangyicong@...ilicon.com>, peterz@...radead.org,
mingo@...hat.com, juri.lelli@...hat.com,
vincent.guittot@...aro.org, gautham.shenoy@....com,
linux-kernel@...r.kernel.org, linux-arm-kernel@...ts.infradead.org
Cc: dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
bristot@...hat.com, prime.zeng@...wei.com,
jonathan.cameron@...wei.com, ego@...ux.vnet.ibm.com,
srikar@...ux.vnet.ibm.com, linuxarm@...wei.com, 21cnbao@...il.com,
guodong.xu@...aro.org, hesham.almatary@...wei.com,
john.garry@...wei.com, shenyang39@...wei.com, feng.tang@...el.com
Subject: Re: [PATCH v4 1/2] sched: Add per_cpu cluster domain info and
cpus_share_resources API
On 6/20/22 7:20 PM, K Prateek Nayak Wrote:
> Hello Tim,
>
> Thank you for looking into this.
>
> On 6/17/2022 10:20 PM, Tim Chen wrote:
>> On Fri, 2022-06-17 at 17:50 +0530, K Prateek Nayak wrote:
>>>
>>>
>>> --
>>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>>> index e9f3dc6dcbf4..97a3895416ab 100644
>>> --- a/kernel/sched/sched.h
>>> +++ b/kernel/sched/sched.h
>>> @@ -1750,12 +1750,12 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag)
>>> return sd;
>>> }
>>>
>>> +DECLARE_PER_CPU(struct sched_domain __rcu *, sd_cluster);
>>> +DECLARE_PER_CPU(int, sd_share_id);
>>> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc);
>>> DECLARE_PER_CPU(int, sd_llc_size);
>>> DECLARE_PER_CPU(int, sd_llc_id);
>>> -DECLARE_PER_CPU(int, sd_share_id);
>>> DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
>>> -DECLARE_PER_CPU(struct sched_domain __rcu *, sd_cluster);
>>> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
>>> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
>>> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
>>> --
>>>
>>> The System-map of each kernel is as follows:
>>>
>>> - On "tip"
>>>
>>> 0000000000020518 D sd_asym_cpucapacity
>>> 0000000000020520 D sd_asym_packing
>>> 0000000000020528 D sd_numa
>>> 0000000000020530 D sd_llc_shared
>>> 0000000000020538 D sd_llc_id
>>> 000000000002053c D sd_llc_size
>>> -------------------------------------------- 64B Cacheline Boundary
>>> 0000000000020540 D sd_llc
>>>
>>> - On "tip + Patch 1 only" and "tip + both patches"
>>>
>>> 0000000000020518 D sd_asym_cpucapacity
>>> 0000000000020520 D sd_asym_packing
>>> 0000000000020528 D sd_numa
>>> 0000000000020530 D sd_cluster <-----
>>> 0000000000020538 D sd_llc_shared
>>> -------------------------------------------- 64B Cacheline Boundary
>>> 0000000000020540 D sd_share_id <-----
>>> 0000000000020544 D sd_llc_id
>>> 0000000000020548 D sd_llc_size
>>> 0000000000020550 D sd_llc
>>>
>>>
>>> - On "tip + both patches (Move declaration to top)"
>>>
>>> 0000000000020518 D sd_asym_cpucapacity
>>> 0000000000020520 D sd_asym_packing
>>> 0000000000020528 D sd_numa
>>> 0000000000020530 D sd_llc_shared
>>> 0000000000020538 D sd_llc_id
>>> 000000000002053c D sd_llc_size
>>> -------------------------------------------- 64B Cacheline Boundary
>>> 0000000000020540 D sd_llc
>>
>> Wonder if it will help to try keep sd_llc and sd_llc_size into the same
>> cache line. They are both used in the wake up path.
>
> We are still evaluating which set of variables to keep on the same
> cache line to get the best results.
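
Just as an illustration of one direction (the struct below and its name
are made up, not from this series), the data touched together in the
wakeup path could also be packed into a single cacheline-aligned per-cpu
struct, so the grouping no longer depends on where the linker places
each variable:

struct sd_llc_info {
	struct sched_domain __rcu		*llc;
	int					llc_size;
	int					llc_id;
	int					share_id;
	struct sched_domain_shared __rcu	*llc_shared;
	struct sched_domain __rcu		*cluster;
};

/* Declared the same way as the runqueues, so it gets its own line. */
DECLARE_PER_CPU_SHARED_ALIGNED(struct sd_llc_info, sd_llc_info);

The accesses would then become per_cpu(sd_llc_info, cpu).llc and so on,
so it is a much bigger churn than just reordering the declarations.
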
>
> I would have expected the two kernel variants - "tip" and
> "tip + both patches (Move declaration to top)" - to give similar results,
> as the System map of all the old variables remains the same and the
> added "sd_share_id" and "sd_cluster" fit in the gap after "sd_llc".
> However, we now see a regression for a higher number of clients.
>
> This probably hints that the access to the "sd_cluster" variable in Patch 2,
> which brings in the extra cache line, could be responsible for the regression
> we see with "tip + both patches (Move declaration to top)".
>
>>
>>
>>> 0000000000020548 D sd_share_id <-----
>>> 0000000000020550 D sd_cluster <-----
>>>
>>>> Or change the layout a bit to see if there's any difference,
>>>> like:
>>>>
>>>> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
>>>> DEFINE_PER_CPU(int, sd_llc_size);
>>>> DEFINE_PER_CPU(int, sd_llc_id);
>>>> DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
>>>> +DEFINE_PER_CPU(int, sd_share_id);
>>>> +DEFINE_PER_CPU(struct sched_domain __rcu *, sd_cluster);
>>>> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
>>>> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
>>>>
>>>> I need to look into it further and run some tests on an SMT machine. Would you mind sharing
>>>> the kernel config as well? I'd like to compare the configs too.
>>>
>>> I've attached the kernel config used to build the test kernel
>>> to this mail.
>>>
>>>> Thanks,
>>>> Yicong
>>>
>>> We are trying to debug the issue using perf and find an optimal
>>> arrangement of the per-CPU declarations to get the relevant data
>>> used in the wakeup path on the same 64B cache line.
>>
>> A check of the perf c2c profile difference between tip and the "move new declarations
>> to the top" case could be useful. It may give some additional clues about possible
>> false sharing issues.
>
> Thank you for the suggestion. We are currently looking at perf counter
> data to see how the cache efficiency has changed between the two kernels.
> We suspect that the wakeup path now needing data from the other cache line
> as well is resulting in higher cache misses in the levels closer to
> the core.
>
> I don't think it is a false sharing problem, as these per-CPU data are
> set when the system first boots up and will only be changed again during
> a CPU hotplug operation. However, it might be beneficial to look at the c2c
> profile if the perf counters don't indicate anything out of the ordinary.
>
Could it be that some other frequently-written variables share the
same cacheline with these per-cpu data, causing false sharing? What
about making all of these sd_* data DEFINE_PER_CPU_READ_MOSTLY?
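
A rough sketch of what I mean, untested and assuming the definitions
still live in kernel/sched/topology.c (the DECLARE_PER_CPU() lines in
sched.h would switch to DECLARE_PER_CPU_READ_MOSTLY() in the same way,
and similarly for sd_numa, sd_asym_packing and sd_asym_cpucapacity):

-DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
-DEFINE_PER_CPU(int, sd_llc_size);
-DEFINE_PER_CPU(int, sd_llc_id);
-DEFINE_PER_CPU(int, sd_share_id);
-DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
-DEFINE_PER_CPU(struct sched_domain __rcu *, sd_cluster);
+DEFINE_PER_CPU_READ_MOSTLY(struct sched_domain __rcu *, sd_llc);
+DEFINE_PER_CPU_READ_MOSTLY(int, sd_llc_size);
+DEFINE_PER_CPU_READ_MOSTLY(int, sd_llc_id);
+DEFINE_PER_CPU_READ_MOSTLY(int, sd_share_id);
+DEFINE_PER_CPU_READ_MOSTLY(struct sched_domain_shared __rcu *, sd_llc_shared);
+DEFINE_PER_CPU_READ_MOSTLY(struct sched_domain __rcu *, sd_cluster);

That would move these variables into the per-cpu ..read_mostly section,
away from frequently written per-cpu data, which should help confirm or
rule out the false-sharing theory.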