[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <d4279615-4122-d7a1-5397-c570e2f09fa8@amd.com>
Date: Tue, 9 May 2023 20:40:00 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Sudeep Holla <sudeep.holla@....com>, gregkh@...uxfoundation.org
Cc: linux-kernel@...r.kernel.org, rafael@...nel.org,
yongxuan.wang@...ive.com, pierre.gondois@....com,
vincent.chen@...ive.com, greentime.hu@...ive.com,
yangyicong@...wei.com, prime.zeng@...ilicon.com,
palmer@...osinc.com, puwen@...on.cn
Subject: Re: [PATCH 0/2] drivers: base: cacheinfo: Fix shared_cpu_list
inconsistency in event of CPU hotplug
Hello Sudeep,
On 5/9/2023 8:31 PM, Sudeep Holla wrote:
> On Mon, May 08, 2023 at 02:11:13PM +0530, K Prateek Nayak wrote:
>> Since v6.3-rc1, the shared_cpu_list in per-cpu cacheinfo breaks in case
>> of hotplug activity on x86. This can be tracked back to two commits:
>>
>> o commit 198102c9103f ("cacheinfo: Fix shared_cpu_map to handle shared
>> caches at different levels") that matches cache instance IDs without
>> considering if the instance IDs belong to same cache level or not.
>>
>> o commit 5c2712387d48 ("cacheinfo: Fix LLC is not exported through
>> sysfs") which skips calling populate_cache_leaves() if
>> last_level_cache_is_valid(cpu) returns true. populate_cache_leaves()
>> on x86 would have populated the shared_cpu_map when CPU comes online,
>> which is now skipped, and the alternate path has an early bailout
>> before setting the CPU in the shared_cpu_map is even attempted.
>>
>> On x86, populate_cache_leaves() also sets the
>> cpu_cacheinfo->cpu_map_populated flag when the cacheinfo is first
>> populated, the cache_shared_cpu_map_setup() in the driver is bypassed
>> when a thread comes back online during the hotplug activity. This leads
>> to the shared_cpu_list displaying abnormal values for the CPU that was
>> offlined and then onlined since the shared_cpu_maps are never
>> revaluated.
>>
>> Following is the output from a dual socket 3rd Generation AMD EPYC
>> processor (2 x 64C/128T) for cachinfo when offlining and then onlining
>> CPU8:
>>
>> o v6.3-rc5 with no changes:
>>
>> # for i in /sys/devices/system/cpu/cpu8/cache/index*/shared_cpu_list; do echo -n "$i: "; cat $i; done
>> /sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list: 8,136
>> /sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list: 8,136
>> /sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list: 8,136
>> /sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list: 8-15,136-143
>>
>> # echo 0 > /sys/devices/system/cpu/cpu8/online
>> # echo 1 > /sys/devices/system/cpu/cpu8/online
>>
>> # for i in /sys/devices/system/cpu/cpu8/cache/index*/shared_cpu_list; do echo -n "$i: "; cat $i; done
>> /sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list: 8
>> /sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list: 8
>> /sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list: 8
>> /sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list: 8
>>
>> # cat /sys/devices/system/cpu/cpu136/cache/index0/shared_cpu_list
>> 136
>>
>> # cat /sys/devices/system/cpu/cpu136/cache/index3/shared_cpu_list
>> 9-15,136-143
>>
>> o v6.3-rc5 with commit 5c2712387d48 ("cacheinfo: Fix LLC is not exported
>> through sysfs") reverted (Behavior consistent with v6.2):
>>
>> # for i in /sys/devices/system/cpu/cpu8/cache/index*/shared_cpu_list; do echo -n "$i: "; cat $i; done
>> /sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list: 8,136
>> /sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list: 8,136
>> /sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list: 8,136
>> /sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list: 8-15,136-143
>>
>> # echo 0 > /sys/devices/system/cpu/cpu8/online
>> # echo 1 > /sys/devices/system/cpu/cpu8/online
>>
>> # for i in /sys/devices/system/cpu/cpu8/cache/index*/shared_cpu_list; do echo -n "$i: "; cat $i; done
>> /sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list: 8,136
>> /sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list: 8,136
>> /sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list: 8,136
>> /sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list: 8-15,136-143
>>
>> # cat /sys/devices/system/cpu/cpu136/cache/index0/shared_cpu_list
>> 8,136
>>
>> # cat /sys/devices/system/cpu/cpu136/cache/index3/shared_cpu_list
>> 8-15,136-143
>>
>> This is not only limited to AMD processors but affects Intel processors
>> too. Following is the output from same experiment on a dual socket Intel
>> Ice Lake server (2 x 32C/64T) running kernel v6.3-rc5:
>>
>> # for i in /sys/devices/system/cpu/cpu8/cache/index*/shared_cpu_list; do echo -n "$i: "; cat $i; done
>> /sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list: 8,72
>> /sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list: 8,72
>> /sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list: 8,72
>> /sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list: 0,2,4,6,8,10,12,14,16,18,20,22,24,
>> 26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86,
>> 88,90,92,94,96,98,100,102,104,106,108,110,112,114,116,118,120,122,124,126
>>
>> # echo 0 > /sys/devices/system/cpu/cpu8/online
>> # echo 1 > /sys/devices/system/cpu/cpu8/online
>>
>> # for i in /sys/devices/system/cpu/cpu8/cache/index*/shared_cpu_list; do echo -n "$i: "; cat $i; done
>> /sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list: 8
>> /sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list: 8
>> /sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list: 8
>> /sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list: 8
>>
>> # cat /sys/devices/system/cpu/cpu72/cache/index0/shared_cpu_list
>> 72
>>
>> # cat /sys/devices/system/cpu/cpu72/cache/index3/shared_cpu_list
>> 0,2,4,6,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,
>> 66,68,70,72,74,76,78,80,82,84,86,88,90,92,94,96,98,100,102,104,106,108,110,112,114,116,118,
>> 120,122,124,126
>>
>> This patch addresses two issues associated with building
>> shared_cpu_list:
>>
>> o Patch 1 fixes an ID matching issue that can lead to cacheinfo
>> associating CPUs from different cache levels in case IDs are not
>> unique across all the different cache levels.
>>
>> o Patch 2 clears the cpu_cacheinfo->cpu_map_populated flag when CPU goes
>> offline and is removed from the shared_cpu_map.
>>
>> Following are the results after applying the series on v6.3-rc5 on
>> respective x86 platforms:
>>
>> o 3rd Generation AMD EPYC Processor (2 x 64C/128T)
>>
>> # for i in /sys/devices/system/cpu/cpu8/cache/index*/shared_cpu_list; do echo -n "$i: "; cat $i; done
>> /sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list: 8,136
>> /sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list: 8,136
>> /sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list: 8,136
>> /sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list: 8-15,136-143
>>
>> # echo 0 > /sys/devices/system/cpu/cpu8/online
>> # echo 1 > /sys/devices/system/cpu/cpu8/online
>>
>> # for i in /sys/devices/system/cpu/cpu8/cache/index*/shared_cpu_list; do echo -n "$i: "; cat $i; done
>> /sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list: 8,136
>> /sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list: 8,136
>> /sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list: 8,136
>> /sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list: 8-15,136-143
>>
>> # cat /sys/devices/system/cpu/cpu136/cache/index0/shared_cpu_list
>> 8,136
>>
>> # cat /sys/devices/system/cpu/cpu136/cache/index3/shared_cpu_list
>> 8-15,136-143
>>
>> o Intel Ice Lake Xeon (2 x 32C/128T)
>>
>> # for i in /sys/devices/system/cpu/cpu8/cache/index*/shared_cpu_list; do echo -n "$i: "; cat $i; done
>> /sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list: 8,72
>> /sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list: 8,72
>> /sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list: 8,72
>> /sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list: 0,2,4,6,8,10,12,14,16,18,20,22,24,26,
>> 28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86,88,90,
>> 92,94,96,98,100,102,104,106,108,110,112,114,116,118,120,122,124,126
>>
>> # echo 0 > /sys/devices/system/cpu/cpu8/online
>> # echo 1 > /sys/devices/system/cpu/cpu8/online
>>
>> # for i in /sys/devices/system/cpu/cpu8/cache/index*/shared_cpu_list; do echo -n "$i: "; cat $i; done
>> /sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list: 8,72
>> /sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list: 8,72
>> /sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list: 8,72
>> /sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list: 0,2,4,6,8,10,12,14,16,18,20,22,24,26,
>> 28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86,88,90,
>> 92,94,96,98,100,102,104,106,108,110,112,114,116,118,120,122,124,126
>>
>> # cat /sys/devices/system/cpu/cpu72/cache/index0/shared_cpu_list
>> 8,72
>>
>> # cat /sys/devices/system/cpu/cpu72/cache/index3/shared_cpu_list
>> 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,
>> 68,70,72,74,76,78,80,82,84,86,88,90,92,94,96,98,100,102,104,106,108,110,112,114,116,118,120,122,
>> 124,126
>>
>> Running "grep -r 'cpu_map_populated' arch/" shows MIPS and loongarch too
>> set the cpu_cacheinfo->cpu_map_populated who might also be affected by
>> the changes in commit 5c2712387d48 ("cacheinfo: Fix LLC is not exported
>> through sysfs") and this series. Changes from Patch 1 might also affect
>> RISC-V since Yong-Xuan Wang <yongxuan.wang@...ive.com> from SiFive last
>> made changes to cache_shared_cpu_map_setup() and
>> cache_shared_cpu_map_remove() in commit 198102c9103f ("cacheinfo: Fix
>> shared_cpu_map to handle shared caches at different levels").
>
> I think they may be affected as well, it is just that it is not caught
> in the testing.
Yes, you are right, since those platforms support CPU hotplug as well.
>
> Thanks for the detailed explanation and output logs. Not sure how much
> time it took you to write but saved lot of time by making it so simple
> to understand the exact issue. The changes look good.
Glad to know it helped during the review!
>
> Reviewed-by: Sudeep Holla <sudeep.holla@....com>
Thank you for reviewing the changes :)
>
> Hi Greg,
>
> Can you please pick up these fixes in your next cycle of fixes for v6.4 ?
>
--
Thanks and Regards,
Prateek
Powered by blists - more mailing lists