linux-kernel - Re: [PATCH v3 2/2] selftests/resctrl: Adjust SNC support messages

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <d5bd6275-ab86-439a-887f-17c04a586716@intel.com>
Date: Wed, 3 Jul 2024 13:51:03 -0700
From: Reinette Chatre <reinette.chatre@...el.com>
To: Maciej Wieczór-Retman <maciej.wieczor-retman@...el.com>,
	<shuah@...nel.org>, <fenghua.yu@...el.com>
CC: <linux-kernel@...r.kernel.org>, <linux-kselftest@...r.kernel.org>,
	<ilpo.jarvinen@...ux.intel.com>, <tony.luck@...el.com>
Subject: Re: [PATCH v3 2/2] selftests/resctrl: Adjust SNC support messages

Hi Maciej,

On 7/3/24 12:43 AM, Maciej Wieczór-Retman wrote:
> On 3.07.2024 00:21, Reinette Chatre wrote:
>> On 7/1/24 7:18 AM, Maciej Wieczor-Retman wrote:

>>> diff --git a/tools/testing/selftests/resctrl/cache.c b/tools/testing/selftests/resctrl/cache.c
>>> index 1ff1104e6575..9885d64b8a21 100644
>>> --- a/tools/testing/selftests/resctrl/cache.c
>>> +++ b/tools/testing/selftests/resctrl/cache.c
>>> @@ -186,4 +186,7 @@ void show_cache_info(int no_of_bits, __u64 avg_llc_val, size_t cache_span, bool
>>>        ksft_print_msg("Average LLC val: %llu\n", avg_llc_val);
>>>        ksft_print_msg("Cache span (%s): %zu\n", lines ? "lines" : "bytes",
>>>                   cache_span);
>>> +    if (snc_unreliable)
>>> +        ksft_print_msg("SNC detection unreliable due to offline CPUs!\n");
>>
>> The message abour SNC detection being unreliable is already printed at beginning of every
>> test so I do not think it is necessary to print it again at this point.
>>
> 
> The "SNC detection was unreliable" only gets printed on the first execution of run_single_test().

There is more about this later, but this can be printed at start of each test.

> That's what the global snc_mode was for mostly, it starts initialized to 0, and then on the first
> run of run_single_test() it is set so other tests don't call get_snc_mode(). And then the local static
> variable inside get_snc_mode() prevents the detection from running more than once when other places
> call get_snc_mode() (like when the cache size is adjusted).

The shadowing of variables can get confusing. I think the global snc_mode is not necessary, having the
local static variable within snc_nodes_per_l3_cache() should be sufficient and run_single_test()
can just do a:

	int snc_mode; /* new name welcome */

	snc_mode = snc_nodes_per_l3_cache();
	if (snc_mode > 1)
		ksft_print_msg("SNC-%d mode discovered\n", snc_mode);
	else if (snc_unreliable)
		ksft_print_msg("SNC detection unreliable due to offline CPUs. Test results may not be accurate if SNC enabled.\n");

> 
> And as we discussed last time it's beneficial to put error messages at the end of the test in case the
> user misses the initial warning at the very beginning.

Right. What I found unexpected was that it is done "at the end" but from two places, from the show*info()
as well as from run*test(). I expect "the end" to be a single place.

>>>    }
>>> diff --git a/tools/testing/selftests/resctrl/cmt_test.c b/tools/testing/selftests/resctrl/cmt_test.c
>>> index 0c045080d808..588543ae2654 100644
>>> --- a/tools/testing/selftests/resctrl/cmt_test.c
>>> +++ b/tools/testing/selftests/resctrl/cmt_test.c
>>> @@ -175,8 +175,8 @@ static int cmt_run_test(const struct resctrl_test *test, const struct user_param
>>>            goto out;
>>>          ret = check_results(&param, span, n);
>>> -    if (ret && (get_vendor() == ARCH_INTEL))
>>> -        ksft_print_msg("Intel CMT may be inaccurate when Sub-NUMA Clustering is enabled. Check BIOS configuration.\n");
>>
>> This message does seem to still be applicable if snc_unreliable == 1.
> 
> I was going for this one error message to specifically catch the kernel
> not having snc support for resctrl while snc is enabled. While the
> above message could be true when snc_unreliable == 1, it doesn't have to.

If a test fails when snc_unreliable == 1 then nothing is certain and some generic message
is needed.

> SNC might not be enabled at all so there would be no reason to send the user
> to check their BIOS - instead they can learn they have offline CPUs and they can
> work on fixing that. In my opinion it could be beneficial to have more specialized
> messages in the selftests to help users diagnose problems quicker.

My goal is indeed to have specialized messages. There cannot be a specialized message
if snc_reliable == 1. In this case it needs to be generic since SNC may or may not be
enabled and it is up to the user to investigate further.

> 
> Having only this one message wihtout the "if snc unreliable" messages would
> mean nothing would get printed at the end on success with unreliable SNC detection.

Having a pass/fail is what user will focus on. If the test passes then SNC detection
should not matter. The messages are just there to help user root cause where a failure
may be.

...
>>
>>>    volatile int *value_sink = &sink_target;
>>>      static struct resctrl_test *resctrl_tests[] = {
>>> @@ -123,6 +124,12 @@ static void run_single_test(const struct resctrl_test *test, const struct user_p
>>>        if (test->disabled)
>>>            return;
>>>    +    if (!snc_mode) {
>>> +        snc_mode = get_snc_mode();
>>> +        if (snc_mode > 1)
>>
>>
>>  From what I can tell this is the only place the global is used and this can just be:
>>          if (get_snc_mode() > 1)
> 
> I wanted to print the message below only on the first call to run_single_test() and then
> print relevant warnings at the very end of each test. I thought that was your intention
> when we discussed what messages are supposed to be printed and when in v2 of this series.
> 
> Do you think it would be better to just print this message at the start of each test?

Yes. If there is a problem with a test the user could be expected to start tracing back
messages printed from beginning of failing test.

> Or should I make "snc_mode" into local static inside run_single_test()? Or maybe add
> a second local static variable into get_snc_mode() that would control whether or not
> the message should be printed?

I do not see where more local static variables may be needed.

Reinette