lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <xfhv744t2ol274w6lddon77rn5dkf7dzbwqcoknok4kk4guehz@hwjsusvhoerb>
Date: Thu, 4 Jul 2024 09:23:37 +0200
From: Maciej Wieczor-Retman <maciej.wieczor-retman@...el.com>
To: Reinette Chatre <reinette.chatre@...el.com>
CC: <shuah@...nel.org>, <fenghua.yu@...el.com>,
	<linux-kernel@...r.kernel.org>, <linux-kselftest@...r.kernel.org>,
	<ilpo.jarvinen@...ux.intel.com>, <tony.luck@...el.com>
Subject: Re: [PATCH v3 2/2] selftests/resctrl: Adjust SNC support messages

Hi Reinette,

On 2024-07-03 at 13:51:03 -0700, Reinette Chatre wrote:
>Hi Maciej,
>
>On 7/3/24 12:43 AM, Maciej Wieczór-Retman wrote:
>> On 3.07.2024 00:21, Reinette Chatre wrote:
>> > On 7/1/24 7:18 AM, Maciej Wieczor-Retman wrote:
>
>> > > diff --git a/tools/testing/selftests/resctrl/cache.c b/tools/testing/selftests/resctrl/cache.c
>> > > index 1ff1104e6575..9885d64b8a21 100644
>> > > --- a/tools/testing/selftests/resctrl/cache.c
>> > > +++ b/tools/testing/selftests/resctrl/cache.c
>> > > @@ -186,4 +186,7 @@ void show_cache_info(int no_of_bits, __u64 avg_llc_val, size_t cache_span, bool
>> > >        ksft_print_msg("Average LLC val: %llu\n", avg_llc_val);
>> > >        ksft_print_msg("Cache span (%s): %zu\n", lines ? "lines" : "bytes",
>> > >                   cache_span);
>> > > +    if (snc_unreliable)
>> > > +        ksft_print_msg("SNC detection unreliable due to offline CPUs!\n");
>> > 
>> > The message abour SNC detection being unreliable is already printed at beginning of every
>> > test so I do not think it is necessary to print it again at this point.
>> > 
>> 
>> The "SNC detection was unreliable" only gets printed on the first execution of run_single_test().
>
>There is more about this later, but this can be printed at start of each test.

Okay, thanks.

>
>> That's what the global snc_mode was for mostly, it starts initialized to 0, and then on the first
>> run of run_single_test() it is set so other tests don't call get_snc_mode(). And then the local static
>> variable inside get_snc_mode() prevents the detection from running more than once when other places
>> call get_snc_mode() (like when the cache size is adjusted).
>
>The shadowing of variables can get confusing. I think the global snc_mode is not necessary, having the
>local static variable within snc_nodes_per_l3_cache() should be sufficient and run_single_test()
>can just do a:
>
>	int snc_mode; /* new name welcome */
>
>	snc_mode = snc_nodes_per_l3_cache();
>	if (snc_mode > 1)
>		ksft_print_msg("SNC-%d mode discovered\n", snc_mode);
>	else if (snc_unreliable)
>		ksft_print_msg("SNC detection unreliable due to offline CPUs. Test results may not be accurate if SNC enabled.\n");

This sounds good, thanks.

>
>> 
>> And as we discussed last time it's beneficial to put error messages at the end of the test in case the
>> user misses the initial warning at the very beginning.
>
>Right. What I found unexpected was that it is done "at the end" but from two places, from the show*info()
>as well as from run*test(). I expect "the end" to be a single place.

Okay, I'll remove messages from show*info().

>
>> > >    }
>> > > diff --git a/tools/testing/selftests/resctrl/cmt_test.c b/tools/testing/selftests/resctrl/cmt_test.c
>> > > index 0c045080d808..588543ae2654 100644
>> > > --- a/tools/testing/selftests/resctrl/cmt_test.c
>> > > +++ b/tools/testing/selftests/resctrl/cmt_test.c
>> > > @@ -175,8 +175,8 @@ static int cmt_run_test(const struct resctrl_test *test, const struct user_param
>> > >            goto out;
>> > >          ret = check_results(&param, span, n);
>> > > -    if (ret && (get_vendor() == ARCH_INTEL))
>> > > -        ksft_print_msg("Intel CMT may be inaccurate when Sub-NUMA Clustering is enabled. Check BIOS configuration.\n");
>> > 
>> > This message does seem to still be applicable if snc_unreliable == 1.
>> 
>> I was going for this one error message to specifically catch the kernel
>> not having snc support for resctrl while snc is enabled. While the
>> above message could be true when snc_unreliable == 1, it doesn't have to.
>
>If a test fails when snc_unreliable == 1 then nothing is certain and some generic message
>is needed.

Right

>
>> SNC might not be enabled at all so there would be no reason to send the user
>> to check their BIOS - instead they can learn they have offline CPUs and they can
>> work on fixing that. In my opinion it could be beneficial to have more specialized
>> messages in the selftests to help users diagnose problems quicker.
>
>My goal is indeed to have specialized messages. There cannot be a specialized message
>if snc_reliable == 1. In this case it needs to be generic since SNC may or may not be
>enabled and it is up to the user to investigate further.

How about this in cmt_run_test() for example:

	if (snc_unreliable)
		ksft_print_msg("Intel CMT may be inaccurate or inefficient when Sub-NUMA Clustering is enabled and not properly detected.\n");
	else if (ret && (get_vendor() == ARCH_INTEL) && !snc_kernel_support())
		ksft_print_msg("Kernel doesn't support Sub-NUMA Clustering but it is enabled. Check BIOS configuration.\n");

This way there is a generic message when snc_unreliable == 1.

And as you mentioned at the end of this email, the user can be expected to
backtrack to the beginning of the test if there are any problems so they can
discover the exact source of the issue - offline cpus.

>
>> 
>> Having only this one message wihtout the "if snc unreliable" messages would
>> mean nothing would get printed at the end on success with unreliable SNC detection.
>
>Having a pass/fail is what user will focus on. If the test passes then SNC detection
>should not matter. The messages are just there to help user root cause where a failure
>may be.

My train of thought was that if test passes with broken SNC detection it means
SNC was used inefficiently right? (either the portion of L3 used was bigger or
smaller than that allocated for one cluster)

It's not exactly a failure but I thought it deserves a warning at the very end
to alert the user.

If you don't think the warning should be printed on success I guess the
condition can be:
	if (ret && snc_unreliable)
and the user can look at the start of the test if they care about more
information. And the message can lose the "inefficient" word since it would only
happen on error.

>
>...
>> > 
>> > >    volatile int *value_sink = &sink_target;
>> > >      static struct resctrl_test *resctrl_tests[] = {
>> > > @@ -123,6 +124,12 @@ static void run_single_test(const struct resctrl_test *test, const struct user_p
>> > >        if (test->disabled)
>> > >            return;
>> > >    +    if (!snc_mode) {
>> > > +        snc_mode = get_snc_mode();
>> > > +        if (snc_mode > 1)
>> > 
>> > 
>> >  From what I can tell this is the only place the global is used and this can just be:
>> >          if (get_snc_mode() > 1)
>> 
>> I wanted to print the message below only on the first call to run_single_test() and then
>> print relevant warnings at the very end of each test. I thought that was your intention
>> when we discussed what messages are supposed to be printed and when in v2 of this series.
>> 
>> Do you think it would be better to just print this message at the start of each test?
>
>Yes. If there is a problem with a test the user could be expected to start tracing back
>messages printed from beginning of failing test.

I tried to reply to this comment with the suggestion above.

>
>> Or should I make "snc_mode" into local static inside run_single_test()? Or maybe add
>> a second local static variable into get_snc_mode() that would control whether or not
>> the message should be printed?
>
>I do not see where more local static variables may be needed.

Since you agreed with the previous paragraph this one doesn't matter.

>
>Reinette

-- 
Kind regards
Maciej Wieczór-Retman

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ