linux-kernel - Re: About add an A64FX cache control function into resctrl

Message-ID: <26ffe50f-7ff4-2c4e-534c-edf23cb88df1@intel.com>
Date:   Mon, 19 Jul 2021 16:25:23 -0700
From:   Reinette Chatre <reinette.chatre@...el.com>
To:     "tan.shaopeng@...itsu.com" <tan.shaopeng@...itsu.com>,
        "'fenghua.yu@...el.com'" <fenghua.yu@...el.com>
CC:     "'linux-kernel@...r.kernel.org'" <linux-kernel@...r.kernel.org>,
        "'linux-arm-kernel@...ts.infradead.org'" 
        <linux-arm-kernel@...ts.infradead.org>,
        'James Morse' <james.morse@....com>,
        "misono.tomohiro@...itsu.com" <misono.tomohiro@...itsu.com>,
        "Luck, Tony" <tony.luck@...el.com>
Subject: Re: About add an A64FX cache control function into resctrl

Hi Tan Shaopeng,

On 7/7/2021 4:26 AM, tan.shaopeng@...itsu.com wrote:
>>> Sorry, I have not explained A64FX's sector cache function well yet.
>>> I think I need explain this function from different perspective.
>>
>> You have explained the A64FX's sector cache function well. I have also read
>> both specs to understand it better. It appears to me that you are not considering
>> the resctrl architecture as part of your solution but instead just forcing your
>> architecture onto the resctrl filesystem. For example, in resctrl the resource
>> groups are not just a directory structure but has significance in what is being
>> represented within the directory (a class of service). The files within a resource
>> group's directory build on that. From your side I have not seen any effort in
>> aligning the sector cache function with the resctrl architecture but instead you
>> are just changing resctrl interface to match the A64FX architecture.
>>
>> Could you please take a moment to understand what resctrl is and how it could
>> be mapped to A64FX in a coherent way?
> 
> Previously, my idea is based on how to make instructions use different
> sectors in one task. After I studied resctrl, to utilize resctrl
> architecture on A64FX, I think it’s better to assign one sector to
> one task. Thanks for your idea that "sectors" could be considered the
> same as the resctrl "classes of service".
> 
> Based on your idea, I am considering the implementation details.
> In this email, I will explain the outline of new proposal, and then
> please allow me to confirm a few technologies about resctrl.
> 
> The outline of my proposal is as follows.
> - Add a sector function equivalent to Intel's CAT function into resctrl.
>    (divide shared L2 cache into multiple partitions for multiple cores use)
> - Allocate one sector to one resource group (one CLOSID). Since one
>    core can only be assigned to one resource group, on A64FX each core
>    only uses one sector at a time.

OK, so a sector is a portion of cache and matches what can be
represented with a resource group.

The second part of your comment is not clear to me. First you mention
"one core can only be assigned to one resource group". This seems to
indicate a static assignment between cores and sectors. If that is the
case it needs more thought, since the current implementation assumes
that any core that can access a cache can access all resource groups
associated with that cache. On the other hand, you mention "on A64FX
each core only uses one sector at a time". This sounds dynamic and is
how resctrl works, since a CPU is assigned a single class of service
that indicates all resources accessible to it.

> - Disable A64FX's HPC tag address override function. We only set each
>    core's default sector value according to closid(default sector ID=CLOSID).
> - No L1 cache control since L1 cache is not shared for cores. It is not
>    necessary to add L1 cache interface for schemata file.
> - No need to update schemata interface. Resctrl's L2 cache interface
>    (L2: <cache_id0> = <cbm>; <cache_id1> = <cbm>; ...)
>    will be used as it is. However, on A64FX, <cbm> does not indicate
>    the position of cache partition, only indicate the number of
>    cache ways (size).

From what I understand, the upcoming MPAM support would make this easier
to do.
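
To illustrate the "size only" reading of <cbm> proposed above, here is a
minimal Python sketch. It assumes, purely for illustration, that the
number of ways is simply the number of set bits in the mask; the helper
name cbm_way_count is hypothetical and is not part of resctrl.

```python
def cbm_way_count(cbm: int) -> int:
    """Return the number of cache ways requested by a bitmask (its set bits)."""
    if cbm < 0:
        raise ValueError("CBM must be non-negative")
    return bin(cbm).count("1")


# Under a position-independent interpretation these two masks request
# the same amount of cache, even though they name different bit positions.
low_mask_ways = cbm_way_count(0x00F)
high_mask_ways = cbm_way_count(0x0F0)
print(low_mask_ways, high_mask_ways)
```

On Intel CAT the bit positions do matter (overlapping masks share ways),
so this equivalence would be specific to the A64FX mapping.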

> 
> This is the smallest start of incorporating sector cache function into
> resctrl. I will consider if we could add more sector cache features
> into resctrl (e.g. selecting different sectors from one task) after
> finishing this.
> 
> (some questions are below)
> 
>>>
>>>> On 5/17/2021 1:31 AM, tan.shaopeng@...itsu.com wrote:
>>
>>> --------
>>> A64FX NUMA-PE-Cache Architecture:
>>> NUMA0:
>>>     PE0:
>>>       L1sector0,L1sector1,L1sector2,L1sector3
>>>     PE1:
>>>       L1sector0,L1sector1,L1sector2,L1sector3
>>>     ...
>>>     PE11:
>>>       L1sector0,L1sector1,L1sector2,L1sector3
>>>
>>>     L2sector0,1/L2sector2,3
>>> NUMA1:
>>>     PE0:
>>>       L1sector0,L1sector1,L1sector2,L1sector3
>>>     ...
>>>     PE11:
>>>       L1sector0,L1sector1,L1sector2,L1sector3
>>>
>>>     L2sector0,1/L2sector2,3
>>> NUMA2:
>>>     ...
>>> NUMA3:
>>>     ...
>>> --------
>>> In A64FX processor, one L1 sector cache capacity setting register is
>>> only for one PE and not shared among PEs. L2 sector cache maximum
>>> capacity setting registers are shared among PEs in same NUMA, and it
>>> is to be noted that changing these registers in one PE influences other PE.
>>
>> Understood. cache affinity is familiar to resctrl. When a CPU becomes online it
>> is discovered which caches/resources it has affinity to.
>> Resources then have CPU mask associated with them to indicate on which
>> CPU a register could be changed to configure the resource/cache. See
>> domain_add_cpu() and struct rdt_domain.
> 
> Is the following understanding correct?
> Struct rdt_domain is a group of online CPUs that share a same cache
> instance. When a CPU is online(resctrl initialization),
> the domain_add_cpu() function add the online cpu to corresponding
> rdt_domain (in rdt_resource:domains list). For example, if there are
> 4 L2 cache instances, then there will be 4 rdt_domain in the list and
> each CPU is assigned to corresponding rdt_domain.

Correct.
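
As a rough model of that grouping (Python, not kernel code), the sketch
below builds one "domain" per cache instance from a CPU-to-cache-id map.
The topology and the helper name build_domains are made up for the
example; the real logic lives in domain_add_cpu().

```python
from collections import defaultdict


def build_domains(cpu_to_cache_id):
    """Group online CPUs by the id of the cache instance they share."""
    domains = defaultdict(set)
    for cpu, cache_id in cpu_to_cache_id.items():
        domains[cache_id].add(cpu)
    return dict(domains)


# Hypothetical topology: 8 CPUs, 4 L2 cache instances, 2 CPUs per instance.
topology = {cpu: cpu // 2 for cpu in range(8)}
domains = build_domains(topology)
for cache_id, cpus in sorted(domains.items()):
    print(cache_id, sorted(cpus))
```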

> 
> The set values of cache/memory are stored in the *ctrl_val array
> (indexed by CLOSID) of struct rdt_domain. For example, in CAT function,
> the CBM value of CLOSID=x is stored in ctrl_val [x].
> When we create a resource group and write set values of cache into
> the schemata file, the update_domains() function updates the CBM value
> to ctrl_val [CLOSID = resource group ID] in rdt_domain and updates the
> CBM value to CBM register(MSR_IA32_Lx_CBM_BASE).

For the most part, yes. The only part I would like to clarify is that
each CLOSID is represented by a different register; which register is
updated depends on which CLOSID is changed. This could be written as
MSR_IA32_L2_CBM_CLOSID/MSR_IA32_L3_CBM_CLOSID. The "BASE" register is
for CLOSID 0, the default, and the other registers are determined as an
offset from it.
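
A small sketch of that offset rule, in Python for brevity. The base
addresses are the values I recall from the x86 resctrl headers, so please
verify them against the source. The helper cbm_msr is hypothetical and
ignores CDP, where the index is scaled differently.

```python
# Base MSR addresses (CLOSID 0); treat these values as an assumption.
MSR_IA32_L3_CBM_BASE = 0xC90
MSR_IA32_L2_CBM_BASE = 0xD10


def cbm_msr(base: int, closid: int) -> int:
    """Return the MSR address holding the CBM for the given CLOSID."""
    if closid < 0:
        raise ValueError("CLOSID must be non-negative")
    return base + closid


# CLOSID 0 maps to the BASE register itself, CLOSID 3 to BASE + 3.
print(hex(cbm_msr(MSR_IA32_L2_CBM_BASE, 0)))
print(hex(cbm_msr(MSR_IA32_L3_CBM_BASE, 3)))
```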

Also, the registers have the scope of the resource/cache. So, for
example, if CPU 0 and CPU 1 share an L2 cache then it is only necessary
to update the register on one of these CPUs.
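
In other words, one CPU per cache instance is enough. The Python sketch
below picks one representative CPU per domain. The domain layout is
hypothetical, and choosing the lowest-numbered CPU is an arbitrary choice
for the example, not a statement about what the kernel does.

```python
def cpus_to_program(domains):
    """Pick one representative CPU for each cache domain that has CPUs."""
    return {cache_id: min(cpus) for cache_id, cpus in domains.items() if cpus}


# Hypothetical layout: two L2 instances, each shared by two CPUs.
example_domains = {0: {0, 1}, 1: {2, 3}}
targets = cpus_to_program(example_domains)
print(targets)
```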

> 
>>> The number of ways for L2 Sector ID (0,1 or 2,3) can be set through
>>> any PEs in same NUMA. The sector ID 0,1 and 2,3 are not available at
>>> the same time in same NUMA.
>>>
>>>
>>> I think, in your idea, a resource group will be created for each sector ID.
>>> (> "sectors" could be considered the same as the resctrl "classes of
>>> service") Then, an example of resource group is created as follows.
>>> ・ L1: NUMAX-PEY-L1sector0 (X = 0,1,2,3.Y = 0,1,2 ... 11),
>>> ・ L2: NUMAX-L2sector0 (X = 0,1,2,3)
>>>
>>> In this example, sector with same ID(0) of all PEs is allocated to
>>> resource group. The L1D caches are numbered from
>>> NUMA0_PE0-L1sector0(0) to NUMA4_PE11-L1sector0(47) and the L2
>> caches
>>> numbered from
>>> NUMA0-L2sector0(0) to NUMA4-L2sector0(3).
>>> (NUMA number X is from 0-4, PE number Y is from 0-11)
>>> (1) The number of ways of NUMAX-PEY-L1sector0 can be set independently
>>>       for each PEs (0-47). When run a task on this resource group,
>>>       we cannot control on which PE the task is running on and how many
>>>       cache ways the task is using.
>>
>> resctrl does not control the affinity on which PE/CPU a task is run.
>> resctrl is an interface with which to configure how resources are allocated on
>> the system. resctrl could thus provide interface with which each sector of each
>> cache instance is assigned a number of cache ways.
>> resctrl also provides an interface to assign a task with a class of service (sector
>> id?). Through this the task obtains access to all resources that is allocated to
>> the particular class of service (sector id?). Depending on which CPU the task is
>> running it may indeed experience different performance if the sector id it is
>> running with does not have the same allocations on all cache instances. The
>> affinity of the task needs to be managed separately using for example taskset.
>> Please see Documentation/x86/resctrl.rst "Examples for RDT allocation usage"
> 
> In resctrl_sched_in(), there are comments as follow:
>    /*
>    * If this task has a closid/rmid assigned, use it.
>    * Else use the closid/rmid assigned to this cpu.
>    */
> I thought when we write PID to tasks file, this task (PID) will only
> run on the CPUs which are specified in cpus file in the same resource
> group. So, the task_struct's closid and cpu's closid is the same.
> When task's closid is different from cpu's closid?

resctrl does not manage the affinity of tasks.

Tony recently summarized the cpus file very well to me: The actual 
semantics of the CPUs file is to associate a CLOSid for a task that is 
in the default resctrl group – while it is running on one of the listed 
CPUs.

To answer your question: the task's closid could be different from the
CPU's closid if the task's closid is 0 while it is running on a CPU that
is in the cpus file of a non-default resource group.

You can see a summary of the decision flow in section "Resource 
allocation rules" in Documentation/x86/resctrl.rst
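
The closid part of that decision can be modeled in a few lines of Python.
This is a simplified sketch of the rule quoted from resctrl_sched_in()
above, not the kernel code itself, and effective_closid is a name made up
for the example.

```python
DEFAULT_CLOSID = 0


def effective_closid(task_closid: int, cpu_closid: int) -> int:
    """Return the CLOSID in effect while a task runs on a given CPU.

    A task assigned to a non-default group keeps its own CLOSID. A task in
    the default group (CLOSID 0) picks up the CLOSID of the CPU it runs on.
    """
    if task_closid != DEFAULT_CLOSID:
        return task_closid
    return cpu_closid


# Default-group task on a CPU listed in the cpus file of group 2.
print(effective_closid(0, 2))
# Task explicitly assigned to group 3, running on that same CPU.
print(effective_closid(3, 2))
```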

The "cpus" file was created in support of the real-time use cases. In
these use cases a group of CPUs can be designated as supporting the
real-time work, given their own resource group, and assigned the
resources needed to do that work. A real-time task can then be started
with affinity to those CPUs, and any kernel threads (that will be
started on the same CPU) doing work on behalf of this task would
dynamically be able to use the resources set aside for the real-time
work.

Reinette