Message-ID: <3fb83b18-c9cc-42f6-813b-c5cfa526e91c@intel.com>
Date:   Mon, 20 Nov 2023 14:23:29 -0800
From:   Reinette Chatre <reinette.chatre@...el.com>
To:     Tony Luck <tony.luck@...el.com>, Fenghua Yu <fenghua.yu@...el.com>,
        "Peter Newman" <peternewman@...gle.com>,
        Jonathan Corbet <corbet@....net>,
        "Shuah Khan" <skhan@...uxfoundation.org>, <x86@...nel.org>
CC:     Shaopeng Tan <tan.shaopeng@...itsu.com>,
        James Morse <james.morse@....com>,
        Jamie Iles <quic_jiles@...cinc.com>,
        Babu Moger <babu.moger@....com>,
        Randy Dunlap <rdunlap@...radead.org>,
        <linux-kernel@...r.kernel.org>, <linux-doc@...r.kernel.org>,
        <patches@...ts.linux.dev>
Subject: Re: [PATCH v11 6/8] x86/resctrl: Introduce snc_nodes_per_l3_cache

Hi Tony,

On 11/9/2023 3:09 PM, Tony Luck wrote:
> Intel Sub-NUMA Cluster (SNC) is a feature that subdivides the CPU cores
> and memory controllers on a socket into two or more groups. These are
> presented to the operating system as NUMA nodes.
> 
> This may enable some workloads to have slightly lower latency to memory
> as the memory controller(s) in an SNC node are electrically closer to the
> CPU cores on that SNC node. This cost may be offset by lower bandwidth
> since the memory accesses for each core can only be interleaved between
> the memory controllers on the same SNC node.
> 
> Resctrl monitoring on an Intel system depends upon attaching RMIDs to tasks
> to track L3 cache occupancy and memory bandwidth. There is an MSR that
> controls how the RMIDs are shared between SNC nodes.
> 
> The default mode divides them numerically. E.g. when there are two SNC
> nodes on a socket the lower number half of the RMIDs are given to the
> first node, the remainder to the second node. This would be difficult
> to use with the Linux resctrl interface as specific RMID values assigned
> to resctrl groups are not visible to users.
> 
> The other mode divides the RMIDs and renumbers the ones on the second
> SNC node to start from zero.
> 
> Even with this renumbering SNC mode requires several changes in resctrl
> behavior for correct operation.
> 
> Add a global integer "snc_nodes_per_l3_cache" that will show how many
> SNC nodes share each L3 cache. When this is "1", SNC mode is either
> not implemented, or not enabled, but all places that need to check
> it are updated to take appropriate action when SNC mode is enabled.
> 
> Code that needs to take action when SNC is enabled is:
> 1) The number of logical RMIDs per L3 cache available for use is the
>    number of physical RMIDs divided by the number of SNC nodes.
> 2) Likewise the "mon_scale" value must be divided by the number of SNC
>    nodes.
> 3) The RMID renumbering operates when using the value from the
>    IA32_PQR_ASSOC MSR to count accesses by a task. When reading an RMID
>    counter, code must adjust from the logical RMID used to the physical
>    RMID value for the SNC node that it wishes to read and load the
>    adjusted value into the IA32_QM_EVTSEL MSR.
> 4) The L3 cache is divided between the SNC nodes. So the value
>    reported in the resctrl "size" file is divided by the number of SNC
>    nodes because the effective amount of cache that can be allocated
>    is reduced by that factor.
> 5) The "-o mba_MBps" mount option must be disabled in SNC mode
>    because the monitoring is being done per SNC node, while the
>    bandwidth allocation is still done at the L3 cache scope.
>    Trying to use this feedback loop might result in contradictory
>    changes to the throttling level coming from each of the SNC
>    node bandwidth measurements.
> 

The latter part of this changelog drops the imperative mood. To reduce
confusion, I slightly reworked it below to address the parts I noticed.
I often get this wrong myself, so please check again.

	Add a global integer "snc_nodes_per_l3_cache" that shows how many
	SNC nodes share each L3 cache. When "snc_nodes_per_l3_cache" is "1",
	SNC mode is either not implemented or not enabled.

	Update all places to take appropriate action when SNC mode is enabled:
	1) The number of logical RMIDs per L3 cache available for use is the
	   number of physical RMIDs divided by the number of SNC nodes.
	2) Likewise the "mon_scale" value must be divided by the number of SNC
	   nodes.
	3) The RMID renumbering operates when using the value from the
	   IA32_PQR_ASSOC MSR to count accesses by a task. When reading an RMID
	   counter, adjust from the logical RMID to the physical
	   RMID value for the SNC node being read and load the
	   adjusted value into the IA32_QM_EVTSEL MSR.
	4) Divide the L3 cache between the SNC nodes. Divide the value
	   reported in the resctrl "size" file by the number of SNC
	   nodes because the effective amount of cache that can be allocated
	   is reduced by that factor.
	5) Disable the "-o mba_MBps" mount option in SNC mode
	   because the monitoring is being done per SNC node, while the
	   bandwidth allocation is still done at the L3 cache scope.
	   Trying to use this feedback loop might result in contradictory
	   changes to the throttling level coming from each of the SNC
	   node bandwidth measurements.
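
As an aside, the arithmetic in items 1) and 3) can be sanity-checked with
a small userspace sketch. Everything below is illustrative only and not
taken from the patch: the RMID count and the helper names
(num_logical_rmids(), logical_to_physical_rmid()) are assumptions.

/*
 * Minimal sketch, assuming 256 physical RMIDs per L3 cache and the
 * renumbered SNC mode described above.  Values and helper names are
 * hypothetical, not from the patch.
 */
#include <stdio.h>

#define NUM_PHYSICAL_RMIDS	256	/* assumed hardware value */

static int snc_nodes_per_l3_cache = 2;	/* "1" means SNC off or absent */

/* Item 1): logical RMIDs available to resctrl per SNC node. */
static int num_logical_rmids(void)
{
	return NUM_PHYSICAL_RMIDS / snc_nodes_per_l3_cache;
}

/*
 * Item 3): the value loaded into IA32_QM_EVTSEL must be the physical
 * RMID for the SNC node being read, not the logical RMID.
 */
static int logical_to_physical_rmid(int lrmid, int snc_node)
{
	return lrmid + snc_node * num_logical_rmids();
}

int main(void)
{
	int lrmid = 5;
	int node;

	for (node = 0; node < snc_nodes_per_l3_cache; node++)
		printf("logical RMID %d on SNC node %d -> physical RMID %d\n",
		       lrmid, node, logical_to_physical_rmid(lrmid, node));
	return 0;
}

With two SNC nodes this prints physical RMID 5 for node 0 and 133 for
node 1, i.e. 128 logical RMIDs per node out of the assumed 256 physical
RMIDs.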

> Reviewed-by: Peter Newman <peternewman@...gle.com>
> Signed-off-by: Tony Luck <tony.luck@...el.com>
> ---

(same comment as previous patch about commit tag ordering)

Reviewed-by: Reinette Chatre <reinette.chatre@...el.com>

Reinette
