[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <d5bd6275-ab86-439a-887f-17c04a586716@intel.com>
Date: Wed, 3 Jul 2024 13:51:03 -0700
From: Reinette Chatre <reinette.chatre@...el.com>
To: Maciej Wieczór-Retman <maciej.wieczor-retman@...el.com>,
<shuah@...nel.org>, <fenghua.yu@...el.com>
CC: <linux-kernel@...r.kernel.org>, <linux-kselftest@...r.kernel.org>,
<ilpo.jarvinen@...ux.intel.com>, <tony.luck@...el.com>
Subject: Re: [PATCH v3 2/2] selftests/resctrl: Adjust SNC support messages
Hi Maciej,
On 7/3/24 12:43 AM, Maciej Wieczór-Retman wrote:
> On 3.07.2024 00:21, Reinette Chatre wrote:
>> On 7/1/24 7:18 AM, Maciej Wieczor-Retman wrote:
>>> diff --git a/tools/testing/selftests/resctrl/cache.c b/tools/testing/selftests/resctrl/cache.c
>>> index 1ff1104e6575..9885d64b8a21 100644
>>> --- a/tools/testing/selftests/resctrl/cache.c
>>> +++ b/tools/testing/selftests/resctrl/cache.c
>>> @@ -186,4 +186,7 @@ void show_cache_info(int no_of_bits, __u64 avg_llc_val, size_t cache_span, bool
>>> ksft_print_msg("Average LLC val: %llu\n", avg_llc_val);
>>> ksft_print_msg("Cache span (%s): %zu\n", lines ? "lines" : "bytes",
>>> cache_span);
>>> + if (snc_unreliable)
>>> + ksft_print_msg("SNC detection unreliable due to offline CPUs!\n");
>>
>> The message abour SNC detection being unreliable is already printed at beginning of every
>> test so I do not think it is necessary to print it again at this point.
>>
>
> The "SNC detection was unreliable" only gets printed on the first execution of run_single_test().
There is more about this later, but this can be printed at start of each test.
> That's what the global snc_mode was for mostly, it starts initialized to 0, and then on the first
> run of run_single_test() it is set so other tests don't call get_snc_mode(). And then the local static
> variable inside get_snc_mode() prevents the detection from running more than once when other places
> call get_snc_mode() (like when the cache size is adjusted).
The shadowing of variables can get confusing. I think the global snc_mode is not necessary, having the
local static variable within snc_nodes_per_l3_cache() should be sufficient and run_single_test()
can just do a:
int snc_mode; /* new name welcome */
snc_mode = snc_nodes_per_l3_cache();
if (snc_mode > 1)
ksft_print_msg("SNC-%d mode discovered\n", snc_mode);
else if (snc_unreliable)
ksft_print_msg("SNC detection unreliable due to offline CPUs. Test results may not be accurate if SNC enabled.\n");
>
> And as we discussed last time it's beneficial to put error messages at the end of the test in case the
> user misses the initial warning at the very beginning.
Right. What I found unexpected was that it is done "at the end" but from two places, from the show*info()
as well as from run*test(). I expect "the end" to be a single place.
>>> }
>>> diff --git a/tools/testing/selftests/resctrl/cmt_test.c b/tools/testing/selftests/resctrl/cmt_test.c
>>> index 0c045080d808..588543ae2654 100644
>>> --- a/tools/testing/selftests/resctrl/cmt_test.c
>>> +++ b/tools/testing/selftests/resctrl/cmt_test.c
>>> @@ -175,8 +175,8 @@ static int cmt_run_test(const struct resctrl_test *test, const struct user_param
>>> goto out;
>>> ret = check_results(¶m, span, n);
>>> - if (ret && (get_vendor() == ARCH_INTEL))
>>> - ksft_print_msg("Intel CMT may be inaccurate when Sub-NUMA Clustering is enabled. Check BIOS configuration.\n");
>>
>> This message does seem to still be applicable if snc_unreliable == 1.
>
> I was going for this one error message to specifically catch the kernel
> not having snc support for resctrl while snc is enabled. While the
> above message could be true when snc_unreliable == 1, it doesn't have to.
If a test fails when snc_unreliable == 1 then nothing is certain and some generic message
is needed.
> SNC might not be enabled at all so there would be no reason to send the user
> to check their BIOS - instead they can learn they have offline CPUs and they can
> work on fixing that. In my opinion it could be beneficial to have more specialized
> messages in the selftests to help users diagnose problems quicker.
My goal is indeed to have specialized messages. There cannot be a specialized message
if snc_reliable == 1. In this case it needs to be generic since SNC may or may not be
enabled and it is up to the user to investigate further.
>
> Having only this one message wihtout the "if snc unreliable" messages would
> mean nothing would get printed at the end on success with unreliable SNC detection.
Having a pass/fail is what user will focus on. If the test passes then SNC detection
should not matter. The messages are just there to help user root cause where a failure
may be.
...
>>
>>> volatile int *value_sink = &sink_target;
>>> static struct resctrl_test *resctrl_tests[] = {
>>> @@ -123,6 +124,12 @@ static void run_single_test(const struct resctrl_test *test, const struct user_p
>>> if (test->disabled)
>>> return;
>>> + if (!snc_mode) {
>>> + snc_mode = get_snc_mode();
>>> + if (snc_mode > 1)
>>
>>
>> From what I can tell this is the only place the global is used and this can just be:
>> if (get_snc_mode() > 1)
>
> I wanted to print the message below only on the first call to run_single_test() and then
> print relevant warnings at the very end of each test. I thought that was your intention
> when we discussed what messages are supposed to be printed and when in v2 of this series.
>
> Do you think it would be better to just print this message at the start of each test?
Yes. If there is a problem with a test the user could be expected to start tracing back
messages printed from beginning of failing test.
> Or should I make "snc_mode" into local static inside run_single_test()? Or maybe add
> a second local static variable into get_snc_mode() that would control whether or not
> the message should be printed?
I do not see where more local static variables may be needed.
Reinette
Powered by blists - more mailing lists