Message-ID: <e9dc4d2a-e404-990a-3efc-c359799bc986@linux.alibaba.com>
Date:   Wed, 28 Sep 2022 21:37:49 +0800
From:   Shawn Wang <shawnwang@...ux.alibaba.com>
To:     Reinette Chatre <reinette.chatre@...el.com>,
        James Morse <james.morse@....com>, fenghua.yu@...el.com
Cc:     tglx@...utronix.de, mingo@...hat.com, bp@...en8.de,
        dave.hansen@...ux.intel.com, x86@...nel.org, hpa@...or.com,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH] x86/resctrl: Clear the staged configs when destroying
 schemata list

Hi Reinette and James,

Sorry, there were some mistakes in my last reply, so I am sending it again.

On 9/28/22 5:21 AM, Reinette Chatre wrote:
> Hi James and Shawn,
> 
> On 9/27/2022 6:06 AM, James Morse wrote:
>> On 27/09/2022 03:54, Shawn Wang wrote:
>>> The staged_config array in struct rdt_domain still retains its old values when
>>> resctrl is unmounted. If resctrl is mounted with the cdp option and then remounted
>>> without the cdp option, the have_new_ctrl field in staged_config[CDP_CODE] and
>>> staged_config[CDP_DATA] will still be true.
>>
>> staged_config[CDP_DATA] is an array - it's always 'true'. I think you mean
>> staged_config[CDP_DATA].have_new_ctrl, which will still be true because it is only
>> memset() when the schemata file is written to.
>>
>>
>>> Since resctrl_arch_update_domains()
>>> traverses all resctrl_conf_type values, it will continue to update the CDP_CODE
>>> and CDP_DATA configurations, which can cause an overflow problem.
>>
>> Only if it's called with a stale staged config, and it should only be called when the
>> schemata file is written to, which would memset() the staged config first.
>>
>>
>>> The problem can be reproduced by the following commands:
>>>      # A system with 16 usable closids and mba disabled
>>>      mount -t resctrl resctrl -o cdp /sys/fs/resctrl
>>>      mkdir /sys/fs/resctrl/p{1..7}
>>>      umount /sys/fs/resctrl/
>>>      mount -t resctrl resctrl /sys/fs/resctrl
>>>      mkdir /sys/fs/resctrl/p{1..8}
>>
>> Thanks for the reproducer - but I don't see what could set have_new_ctrl in this sequence.
>> You can't call apply_config() to set CPUs in the mask without that being set.
>>
>> Creating a new control group (your mkdir step) shouldn't touch the hardware at all, as it
>> should be left in its reset state from the last umount() or setup.
> 
> There is an attempt to configure the hardware in the mkdir path:
> rdtgroup_mkdir()->rdtgroup_mkdir_ctrl_mon()->rdtgroup_init_alloc()->resctrl_arch_update_domains()
> 
> This is required in support of the different resource group modes. When a new
> resource group is created via mkdir, the configuration should respect any
> exclusive resource groups that exist at that point.
> 
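Right. For reference, rdtgroup_init_alloc() looks roughly like this (a
paraphrased sketch of the v6.0 code, not an exact quote; the MBA branch
is omitted since MBA is disabled in the reproducer):

	list_for_each_entry(s, &resctrl_schema_all, list) {
		r = s->res;

		/* Stages a config only for s->conf_type ... */
		ret = rdtgroup_init_cat(s, rdtgrp->closid);
		if (ret < 0)
			return ret;

		/*
		 * ... but this walks every entry in staged_config[],
		 * including anything left over from the previous mount.
		 */
		ret = resctrl_arch_update_domains(r, rdtgrp->closid);
		if (ret < 0)
			return ret;
	}

So whatever is still staged from the CDP mount gets applied on the next
mkdir.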
>>
>> I can't reproduce this on v6.0-rc7.
>> Even if I dirty the configuration by writing to the schemata file, I can't reproduce this.
>>
> 
>  From what I can tell, the reproducer depends on (a) whether the hardware
> supports CDP, and (b) the number of CLOSIDs supported by the system. The reproducer
> works for hardware that has 16 CLOSIDs (see later).
> 
>> (I have mba enabled, but all that should affect is the number of closids available)
>>

I reproduced this on v6.0-rc6. The key to reproducing it is to ensure
that the number of usable groups differs between the CDP-enabled and
CDP-disabled cases.

The system I used has 16 CLOSIDs for L3 CAT and 8 CLOSIDs for MBA. MBA
limits the maximum number of groups to 8 even when CDP is disabled,
which is why I disabled MBA in the reproducer.
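A rough sketch of the arithmetic (my own illustration in plain C, not
kernel code): the usable CLOSID count is the minimum across the enabled
alloc-capable resources, and enabling CDP halves the L3 count.

#include <stdio.h>

static unsigned int min_u(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

int main(void)
{
	unsigned int l3 = 16, mba = 8;	/* CLOSIDs on the system I tested */

	/* MBA enabled: the 8 MBA CLOSIDs cap the group count either way */
	printf("MBA on,  CDP off: %u\n", min_u(l3, mba));	/* 8 */
	printf("MBA on,  CDP on : %u\n", min_u(l3 / 2, mba));	/* 8 */

	/* MBA disabled: 16 vs 8 -> the mismatch the reproducer needs */
	printf("MBA off, CDP off: %u\n", l3);			/* 16 */
	printf("MBA off, CDP on : %u\n", l3 / 2);		/* 8 */
	return 0;
}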

>>
>>> dmesg will generate the following error:
>>
>> Which kernel version is this?
>>
>>>      [ 6180.939345] unchecked MSR access error: WRMSR to 0xca0 (tried to write
>>>      0x00000000000007ff) at rIP: 0xffffffff82249142 (cat_wrmsr+0x32/0x60)
>>
>> Is 0x7ff the default CBM bitmap for this CPU? Or was it written in a step missing from the
>> reproducer above?
> 
> The value of interest here is the register it tries to write to ... 0xca0.
> On a system with 16 CLOSIDs, the range of registers available to set the CBM
> would be 0xc90 to 0xc9f, which correspond to CLOSID 0 to CLOSID 15. The error is
> an attempt to write to an unsupported register - there appears to have been an
> attempt to configure non-existent CLOSID 16.
> 
> As Shawn already root-caused, this is because the staged_config contains data from
> the previous run when CDP was enabled and it is never cleared before the resource group
> creation flow (triggered by mkdir).
> 
> 
> CDP enabled:
> mkdir <resource group>
> 	domainX.staged_config[CDP_NONE].have_new_ctrl = false
> 	domainX.staged_config[CDP_NONE].new_ctrl = 0
> 	domainX.staged_config[CDP_CODE].have_new_ctrl = true
> 	domainX.staged_config[CDP_CODE].new_ctrl = 0x7ff
> 	domainX.staged_config[CDP_DATA].have_new_ctrl = true
> 	domainX.staged_config[CDP_DATA].new_ctrl = 0x7ff
> 
> unmount/remount resctrl (CDP disabled):
> mkdir <resource group>
> 	domainX.staged_config[CDP_NONE].have_new_ctrl = true
> 	domainX.staged_config[CDP_NONE].new_ctrl = 0x7ff
> 	domainX.staged_config[CDP_CODE].have_new_ctrl = true /* stale */
> 	domainX.staged_config[CDP_CODE].new_ctrl = 0x7ff     /* stale */
> 	domainX.staged_config[CDP_DATA].have_new_ctrl = true /* stale */
> 	domainX.staged_config[CDP_DATA].new_ctrl = 0x7ff     /* stale */
> 
> The above becomes an issue when the resource group being created is
> for a CLOSID number that is at least half of the number of CLOSIDs supported.
> In the reproducer the issue was encountered when creating the resource
> group for CLOSID 8 on a system that supports 16 CLOSIDs.
> 
> In this case get_config_index(), called from
> resctrl_arch_update_domains(), will return 16 and 17 when processing
> this resource group, and that translates to an invalid register - 0xca0 in this
> scenario.
> 

Thanks for the detailed explanation. That's exactly what I meant.
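To make the register arithmetic concrete, here is a small standalone
illustration (the mapping is paraphrased from get_config_index() as I
read it, so treat the exact form as an assumption rather than a quote):

#include <stdio.h>

/*
 * With CDP, each CLOSID is mapped to a data/code pair of hardware
 * indices, so a stale CDP entry for CLOSID 8 reaches past the last
 * valid L3 CBM MSR (0xc9f) on a 16-CLOSID system.
 */
enum resctrl_conf_type { CDP_NONE, CDP_CODE, CDP_DATA };

static unsigned int get_config_index(unsigned int closid,
				     enum resctrl_conf_type type)
{
	switch (type) {
	default:
	case CDP_NONE:
		return closid;
	case CDP_CODE:
		return closid * 2 + 1;
	case CDP_DATA:
		return closid * 2;
	}
}

int main(void)
{
	unsigned int closid = 8;	/* the p8 group from the reproducer */
	unsigned int msr_base = 0xc90;	/* first L3 CBM MSR */

	printf("CDP_NONE -> MSR 0x%x\n", msr_base + get_config_index(closid, CDP_NONE));
	printf("CDP_DATA -> MSR 0x%x\n", msr_base + get_config_index(closid, CDP_DATA));
	printf("CDP_CODE -> MSR 0x%x\n", msr_base + get_config_index(closid, CDP_CODE));
	return 0;
}

The CDP_DATA entry lands on 0xca0, which is exactly the WRMSR target in
the splat above.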

> 
>> The rest of this splat isn't helpful as it's the result of an IPI...
>>
>>>      [ 6180.951983] Call Trace:
>>>      [ 6180.954516]  <IRQ>
>>>      [ 6180.956619]  __flush_smp_call_function_queue+0x11d/0x170
>>>      [ 6180.962028]  __sysvec_call_function+0x24/0xd0
>>>      [ 6180.966485]  sysvec_call_function+0x89/0xc0
>>>      [ 6180.970760]  </IRQ>
>>>      [ 6180.972947]  <TASK>
>>>      [ 6180.975131]  asm_sysvec_call_function+0x16/0x20
>>>      [ 6180.979757] RIP: 0010:cpuidle_enter_state+0xcd/0x400
>>>      [ 6180.984821] Code: 49 89 c5 0f 1f 44 00 00 31 ff e8 1e e5 77 ff 45 84
>>>      ff 74 12 9c 58 f6 c4 02 0f 85 13 03 00 00 31 ff e8 67 70 7d ff fb 45 85
>>>      f6 <0f> 88 75 01 00 00 49 63 c6 4c 2b 2c 24 48 8d 14 40 48 8d 14 90 49
>>>      [ 6181.003710] RSP: 0018:ffffffff83a03e48 EFLAGS: 00000202
>>>      [ 6181.009028] RAX: ffff943400800000 RBX: 0000000000000001 RCX: 000000000000001f
>>>      [ 6181.016261] RDX: 0000000000000000 RSI: ffffffff83795059 RDI: ffffffff837c101e
>>>      [ 6181.023490] RBP: ffff9434c9352000 R08: 0000059f1cb1a05e R09: 0000000000000008
>>>      [ 6181.030717] R10: 0000000000000001 R11: 0000000000005c66 R12: ffffffff83bbf3a0
>>>      [ 6181.037944] R13: 0000059f1cb1a05e R14: 0000000000000001 R15: 0000000000000000
>>>      [ 6181.045202]  ? cpuidle_enter_state+0xb2/0x400
>>>      [ 6181.049678]  cpuidle_enter+0x24/0x40
>>>      [ 6181.053370]  do_idle+0x1dd/0x260
>>>      [ 6181.056713]  cpu_startup_entry+0x14/0x20
>>>      [ 6181.060753]  rest_init+0xbb/0xc0
>>>      [ 6181.064097]  arch_call_rest_init+0x5/0xa
>>>      [ 6181.068137]  start_kernel+0x668/0x691
>>>      [ 6181.071914]  secondary_startup_64_no_verify+0xe0/0xeb
>>>      [ 6181.077086]  </TASK>
>>
>> It would be good to know what triggered this IPI. It may not have been
>> resctrl_arch_update_domains(). This pattern also happens from reset_all_ctrls() which
> 
> I believe this is indeed from resctrl_arch_update_domains() since it calls
> smp_call_function_many() to configure all the domains.
> 
>> happens during umount(). (and that would write the default CBM bitmap)
>>
>> If you can reproduce this easily, could you add dump_stack() to update_config() to see if
>> any path is setting have_new_ctrl? You aren't writing to the schemata file in your reproducer.
>>
>>
>>> We fix this issue by clearing the staged configs when destroying the schemata list.
>>
>>
>>> Signed-off-by: Shawn Wang <shawnwang@...ux.alibaba.com>
>>> Suggested-by: Xin Hao <xhao@...ux.alibaba.com>
>>
>> If we can work out why you are seeing this, it would need a Fixes tag.
>>
>> Otherwise I agree it makes sense to make this more robust, but it would need a different
>> commit message.
> 
> What do you think about clearing the staged config within resctrl_arch_update_domains()
> after the configuration is complete and there is no more need for it? That may reduce
> complexity since each caller then no longer needs to remember to do so.
> I see "staged_config" as temporary storage and it may help to understand the code better
> if it is treated as such.
> 

I think this is better. I have tested this approach and will send a new version.
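Roughly, what I have in mind for v2 is something like the following at
the end of the per-domain loop in resctrl_arch_update_domains(), after
the staged values have been applied (just a sketch; the actual patch
will follow):

	list_for_each_entry(d, &r->domains, list) {
		for (t = 0; t < CDP_NUM_TYPES; t++) {
			cfg = &d->staged_config[t];
			if (!cfg->have_new_ctrl)
				continue;
			/* ... existing apply_config() path unchanged ... */
		}
		/* Done with the staging area for this domain; clear it. */
		memset(d->staged_config, 0, sizeof(d->staged_config));
	}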

>>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>> index f276aff521e8..b4a817ae83ab 100644
>>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>> @@ -2127,8 +2127,15 @@ static int schemata_list_create(void)
>>>   static void schemata_list_destroy(void)
>>>   {
>>>   	struct resctrl_schema *s, *tmp;
>>> +	struct rdt_domain *dom;
>>>   
>>>   	list_for_each_entry_safe(s, tmp, &resctrl_schema_all, list) {
>>> +		/*
>>> +		 * Clear staged_config on each domain before schemata list is
>>> +		 * destroyed.
>>> +		 */
>>> +		list_for_each_entry(dom, &s->res->domains, list)
>>> +			memset(dom->staged_config, 0, sizeof(dom->staged_config));
>>>   		list_del(&s->list);
>>>   		kfree(s);
>>>   	}
>>
> 
> Reinette

Thanks,

Shawn
