lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <2048036f-332b-49d1-a753-3653136d728c@intel.com>
Date: Mon, 12 Aug 2024 16:40:10 -0700
From: Reinette Chatre <reinette.chatre@...el.com>
To: Maciej Wieczor-Retman <maciej.wieczor-retman@...el.com>,
	<fenghua.yu@...el.com>, <shuah@...nel.org>
CC: <linux-kselftest@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
	<ilpo.jarvinen@...ux.intel.com>, <tony.luck@...el.com>
Subject: Re: [PATCH v4 2/2] selftests/resctrl: Adjust SNC support messages

Hi Maciej,

On 7/12/24 2:04 AM, Maciej Wieczor-Retman wrote:
> Resctrl selftest prints a message on test failure that Sub-Numa
> Clustering (SNC) could be enabled and points the user to check their BIOS
> settings. No actual check is performed before printing that message so
> it is not very accurate in pinpointing a problem.
> 
> Figuring out if SNC is enabled is only one part of the problem, the
> others being whether the detected SNC mode is reliable and whether the
> kernel supports SNC in resctrl.
> 
> When there is SNC support for kernel's resctrl subsystem and SNC is
> enabled then sub node files are created for each node in the resctrlfs.
> The sub node files exist in each regular node's L3 monitoring directory.
> The reliable path to check for existence of sub node files is
> /sys/fs/resctrl/mon_data/mon_L3_00/mon_sub_L3_00.
> 
> To check if SNC detection is reliable one can check the
> /sys/devices/system/cpu/offline file. If it's empty, it means all cores
> are operational and the ratio should be calculated correctly. If it has
> any contents, it means the detected SNC mode can't be trusted and should
> be disabled.
>

This belongs in previous patch.

  
> Add helpers for all operations mentioned above.
> 
> Detect SNC mode once and let other tests inherit that information.

This belongs to previous patch.

> 
> Add messages to alert the user when SNC detection could return incorrect
> results. Correct old messages to account for kernel support of SNC in
> resctrl.
>
> Signed-off-by: Maciej Wieczor-Retman <maciej.wieczor-retman@...el.com>
> ---
> Changelog v4:
> - Change messages at the end of tests and at the start of
>    run_single_test. (Reinette)
> - Add messages at the end of CAT since it can also fail due to enabled
>    SNC + lack of kernel support.
> - Remove snc_mode global variable. (Reinette)
> - Fix wrong description of snc_kernel_support(). (Reinette)
> - Move call to cpus_offline_empty() into snc_nodes_per_l3_cache() so the
>    whole detection flow is in one place as discussed. (Reinette)
> 
> Changelog v3:
> - Change snc_ways() to snc_nodes_per_l3_cache(). (Reinette)
> - Add printing the discovered SNC mode. (Reinette)
> - Change method of kernel support discovery from cache sizes to
>    existance of sub node files.
> - Check if SNC detection is unreliable.
> - Move SNC detection to only the first run_single_test() instead on
>    error at the end of test runs.
> - Add global value to remind user at the end of relevant tests if SNC
>    detection was found to be unreliable.
> - Redo the patch message after the changes.
> 
> Changelog v2:
> - Move snc_ways() checks from individual tests into
>    snc_kernel_support().
> - Write better comment for snc_kernel_support().
> 
>   tools/testing/selftests/resctrl/cat_test.c    |  8 +++
>   tools/testing/selftests/resctrl/cmt_test.c    | 10 +++-
>   tools/testing/selftests/resctrl/mba_test.c    |  7 +++
>   tools/testing/selftests/resctrl/mbm_test.c    |  9 ++-
>   tools/testing/selftests/resctrl/resctrl.h     |  3 +
>   .../testing/selftests/resctrl/resctrl_tests.c |  8 ++-
>   tools/testing/selftests/resctrl/resctrlfs.c   | 57 +++++++++++++++++++
>   7 files changed, 97 insertions(+), 5 deletions(-)
> 
> diff --git a/tools/testing/selftests/resctrl/cat_test.c b/tools/testing/selftests/resctrl/cat_test.c
> index d4dffc934bc3..a8bb49f56755 100644
> --- a/tools/testing/selftests/resctrl/cat_test.c
> +++ b/tools/testing/selftests/resctrl/cat_test.c
> @@ -285,6 +285,14 @@ static int cat_run_test(const struct resctrl_test *test, const struct user_param
>   
>   	ret = check_results(&param, test->resource,
>   			    cache_total_size, full_cache_mask, start_mask);
> +	if (ret && (get_vendor() == ARCH_INTEL) && !snc_kernel_support())
> +		ksft_print_msg("Kernel doesn't support Sub-NUMA Clustering but it is enabled on the system.\n");

The kernel support only applies to monitoring, there is no kernel support/changes related to CAT when SNC
is enabled.

> +
> +	if ((get_vendor() == ARCH_INTEL) && snc_unreliable) {
> +		ksft_print_msg("Sub-NUMA Clustering could not be detected properly (see earlier messages for details).\n");
> +		ksft_print_msg("Intel CAT may be inaccurate.\n");
> +	}

This is still relevant but unclear why previous message checked "ret" but above does not.

> +
>   	return ret;
>   }
>   
> diff --git a/tools/testing/selftests/resctrl/cmt_test.c b/tools/testing/selftests/resctrl/cmt_test.c
> index 0c045080d808..471e134face0 100644
> --- a/tools/testing/selftests/resctrl/cmt_test.c
> +++ b/tools/testing/selftests/resctrl/cmt_test.c
> @@ -175,8 +175,14 @@ static int cmt_run_test(const struct resctrl_test *test, const struct user_param
>   		goto out;
>   
>   	ret = check_results(&param, span, n);
> -	if (ret && (get_vendor() == ARCH_INTEL))
> -		ksft_print_msg("Intel CMT may be inaccurate when Sub-NUMA Clustering is enabled. Check BIOS configuration.\n");
> +	if (ret && (get_vendor() == ARCH_INTEL) && !snc_kernel_support())
> +		ksft_print_msg("Kernel doesn't support Sub-NUMA Clustering but it is enabled on the system.\n");
> +
> +	if ((get_vendor() == ARCH_INTEL) && snc_unreliable) {
> +		ksft_print_msg("Sub-NUMA Clustering could not be detected properly (see earlier messages for details).\n");
> +		ksft_print_msg("Intel CMT may be inaccurate.\n");
> +	}
> +

CMT may be inaccurate in both scenarios (no kernel support or unreliable detection). Why only
check "ret" in case there is no kernel support?

>   
>   out:
>   	free(span_str);
> diff --git a/tools/testing/selftests/resctrl/mba_test.c b/tools/testing/selftests/resctrl/mba_test.c
> index ab8496a4925b..a805c14fe04b 100644
> --- a/tools/testing/selftests/resctrl/mba_test.c
> +++ b/tools/testing/selftests/resctrl/mba_test.c
> @@ -179,6 +179,13 @@ static int mba_run_test(const struct resctrl_test *test, const struct user_param
>   		return ret;
>   
>   	ret = check_results();
> +	if (ret && (get_vendor() == ARCH_INTEL) && !snc_kernel_support())
> +		ksft_print_msg("Kernel doesn't support Sub-NUMA Clustering but it is enabled on the system.\n");
> +
> +	if ((get_vendor() == ARCH_INTEL) && snc_unreliable) {
> +		ksft_print_msg("Sub-NUMA Clustering could not be detected properly (see earlier messages for details).\n");
> +		ksft_print_msg("Intel MBA may be inaccurate.\n");
> +	}

As I understand there is no change to MBA when SNC is enabled. These additions thus seem unnecessary.

>   
>   	return ret;
>   }
> diff --git a/tools/testing/selftests/resctrl/mbm_test.c b/tools/testing/selftests/resctrl/mbm_test.c
> index 6b5a3b52d861..ce3c86989f8b 100644
> --- a/tools/testing/selftests/resctrl/mbm_test.c
> +++ b/tools/testing/selftests/resctrl/mbm_test.c
> @@ -147,8 +147,13 @@ static int mbm_run_test(const struct resctrl_test *test, const struct user_param
>   		return ret;
>   
>   	ret = check_results(DEFAULT_SPAN);
> -	if (ret && (get_vendor() == ARCH_INTEL))
> -		ksft_print_msg("Intel MBM may be inaccurate when Sub-NUMA Clustering is enabled. Check BIOS configuration.\n");
> +	if (ret && (get_vendor() == ARCH_INTEL) && !snc_kernel_support())
> +		ksft_print_msg("Kernel doesn't support Sub-NUMA Clustering but it is enabled on the system.\n");
> +
> +	if ((get_vendor() == ARCH_INTEL) && snc_unreliable) {
> +		ksft_print_msg("Sub-NUMA Clustering could not be detected properly (see earlier messages for details).\n");
> +		ksft_print_msg("Intel MBM may be inaccurate.\n");
> +	}
>   
>   	return ret;
>   }
> diff --git a/tools/testing/selftests/resctrl/resctrl.h b/tools/testing/selftests/resctrl/resctrl.h
> index 851b37c9c38a..488bdca01e4f 100644
> --- a/tools/testing/selftests/resctrl/resctrl.h
> +++ b/tools/testing/selftests/resctrl/resctrl.h
> @@ -121,6 +121,8 @@ struct perf_event_read {
>    */
>   extern volatile int *value_sink;
>   
> +extern int snc_unreliable;
> +
>   extern char llc_occup_path[1024];
>   
>   int snc_nodes_per_l3_cache(void);
> @@ -167,6 +169,7 @@ void ctrlc_handler(int signum, siginfo_t *info, void *ptr);
>   int signal_handler_register(const struct resctrl_test *test);
>   void signal_handler_unregister(void);
>   unsigned int count_bits(unsigned long n);
> +int snc_kernel_support(void);
>   
>   void perf_event_attr_initialize(struct perf_event_attr *pea, __u64 config);
>   void perf_event_initialize_read_format(struct perf_event_read *pe_read);
> diff --git a/tools/testing/selftests/resctrl/resctrl_tests.c b/tools/testing/selftests/resctrl/resctrl_tests.c
> index ecbb7605a981..4b84d6199a36 100644
> --- a/tools/testing/selftests/resctrl/resctrl_tests.c
> +++ b/tools/testing/selftests/resctrl/resctrl_tests.c
> @@ -118,11 +118,17 @@ static bool test_vendor_specific_check(const struct resctrl_test *test)
>   
>   static void run_single_test(const struct resctrl_test *test, const struct user_params *uparams)
>   {
> -	int ret;
> +	int ret, snc_mode;
>   
>   	if (test->disabled)
>   		return;
>   
> +	snc_mode = snc_nodes_per_l3_cache();
> +	if (snc_mode > 1)
> +		ksft_print_msg("SNC-%d mode discovered.\n", snc_mode);
> +	else if (snc_unreliable)
> +		ksft_print_msg("SNC detection unreliable due to offline CPUs. Test results may not be accurate if SNC enabled.\n");
> +
>   	if (!test_vendor_specific_check(test)) {
>   		ksft_test_result_skip("Hardware does not support %s\n", test->name);
>   		return;
> diff --git a/tools/testing/selftests/resctrl/resctrlfs.c b/tools/testing/selftests/resctrl/resctrlfs.c
> index 803dd415984c..4d0dbb332b8f 100644
> --- a/tools/testing/selftests/resctrl/resctrlfs.c
> +++ b/tools/testing/selftests/resctrl/resctrlfs.c
> @@ -13,6 +13,8 @@
>   
>   #include "resctrl.h"
>   
> +int snc_unreliable;
> +
>   static int find_resctrl_mount(char *buffer)
>   {
>   	FILE *mounts;
> @@ -186,6 +188,25 @@ static unsigned int count_sys_bitmap_bits(char *name)
>   	return count;
>   }
>   
> +static bool cpus_offline_empty(void)
> +{
> +	char offline_cpus_str[64];
> +	FILE *fp;
> +
> +	fp = fopen("/sys/devices/system/cpu/offline", "r");
> +	if (fscanf(fp, "%s", offline_cpus_str) < 0) {
> +		if (!errno) {
> +			fclose(fp);
> +			return 1;
> +		}
> +		ksft_perror("Could not read offline CPUs file!");

No need to scream.

I think it will be more useful to replace "offline CPUs file" with
specific "/sys/devices/system/cpu/offline".

> +	}
> +
> +	fclose(fp);
> +
> +	return 0;
> +}
> +
>   /*
>    * Detect SNC by comparing #CPUs in node0 with #CPUs sharing LLC with CPU0.
>    * If some CPUs are offline the numbers may not be exact multiples of each
> @@ -199,6 +220,13 @@ int snc_nodes_per_l3_cache(void)
>   	static int snc_mode;
>   
>   	if (!snc_mode) {
> +		if (!cpus_offline_empty()) {
> +			ksft_print_msg("Runtime SNC detection unreliable due to offline CPUs.\n");
> +			ksft_print_msg("Setting SNC mode to disabled.\n");
> +			snc_mode = 1;
> +			snc_unreliable = 1;
> +			return snc_mode;
> +		}

This can be moved to previous patch and that for loop simplified/removed.

>   		node_cpus = count_sys_bitmap_bits("/sys/devices/system/node/node0/cpumap");
>   		cache_cpus = count_sys_bitmap_bits("/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_map");
>   
> @@ -942,3 +970,32 @@ unsigned int count_bits(unsigned long n)
>   
>   	return count;
>   }
> +
> +/**
> + * snc_kernel_support - Check for existence of mon_sub_L3_00 file that indicates
> + * SNC resctrl support on the kernel side.
> + *
> + * Return: 0 if not supported, 1 if SNC is disabled or SNC is both enabled and
> + * supported.
> + */
> +int snc_kernel_support(void)
> +{
> +	char node_path[PATH_MAX];
> +	struct stat statbuf;
> +	int ret;
> +
> +	ret = snc_nodes_per_l3_cache();
> +	/*
> +	 * If SNC is disabled then its kernel support isn't important.

Please expand comment that this does not take unreliable SNC detection into account and
needs to be done separately.

> +	 */
> +	if (ret == 1)
> +		return ret;
> +
> +	snprintf(node_path, sizeof(node_path), "%s/%s/%s", RESCTRL_PATH, "mon_data",
> +		 "mon_L3_00/mon_sub_L3_00");
> +
> +	if (!stat(node_path, &statbuf))
> +		return 1;
> +
> +	return 0;
> +}

Reinette

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ