linux-kernel - Re: [PATCH 2/2] x86/resctrl: Add tracepoint for llc

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <668d1f61-9111-4239-9766-f43bef3a9828@shopee.com>
Date: Mon, 4 Mar 2024 11:07:57 +0800
From: Haifeng Xu <haifeng.xu@...pee.com>
To: James Morse <james.morse@....com>, reinette.chatre@...el.com
Cc: fenghua.yu@...el.com, babu.moger@....com, peternewman@...gle.com,
 x86@...nel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 2/2] x86/resctrl: Add tracepoint for llc_occupancy
 tracking



On 2024/3/2 01:47, James Morse wrote:
> Hello!
> 
> On 29/02/2024 07:11, Haifeng Xu wrote:
>> In our production environment, after removing monitor groups, those unused
>> RMIDs get stuck in the limbo list forever because their llc_occupancy are
>> always larger than the threshold. But the unused RMIDs can be successfully
>> freed by turning up the threshold.
>>
>> In order to know how much the threshold should be, perf can be used to acquire
>> the llc_occupancy of RMIDs in each rdt domain.
>>
>> Instead of using perf tool to track llc_occupancy and filter the log manually,
>> it is more convenient for users to use tracepoint to do this work. So add a new
>> tracepoint that shows the llc_occupancy of busy RMIDs when scanning the limbo
>> list.
> 
>> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
>> index c34a35ec0f03..ada392ca75b2 100644
>> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
>> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
>> @@ -362,6 +363,13 @@ void __check_limbo(struct rdt_domain *d, bool force_free)
> 
> [expanded the diff hunk for better context]
>>               entry = __rmid_entry(idx);
>>               if (resctrl_arch_rmid_read(r, d, entry->closid, entry->rmid,
>>                                          QOS_L3_OCCUP_EVENT_ID, &val,
>>                                          arch_mon_ctx)) {
>>                       rmid_dirty = true;
>>               } else {
>>                       rmid_dirty = (val >= resctrl_rmid_realloc_threshold);
>>               }
>>
>>               if (force_free || !rmid_dirty) {
>>                       clear_bit(idx, d->rmid_busy_llc);
>>                       if (!--entry->busy)
>>  				limbo_release_entry(entry);
>>  		}
> 
>>  		cur_idx = idx + 1;
> 
> Ideally this would be the last line in the loop, its how the iterator advances to the next
> bit in the bitmap.
> 
> 
>> +		/* x86's CLOSID and RMID are independent numbers, so the entry's
>> +		 * closid is a invalid CLOSID. But on arm64, the RMID value isn't
>> +		 * a unique number for each CLOSID. It's necessary to track both
>> +		 * CLOSID and RMID because there may be dependencies between each
>> +		 * other on some architectures */
>> +		trace_mon_llc_occupancy_limbo(entry->closid, entry->rmid, d->id, val);
> 
> I agree outputting both these values is the right thing to do.
> 
> resctrl_arch_rmid_read() could return an error, in which case val here is either
> uninitialised, or the value of another RMID.
> Could you put the tracepoint in the 'else' section of the if/else after
> resctrl_arch_rmid_read()? This way it will only output a value to user-space if it is correct.
> 
> 
>> diff --git a/arch/x86/kernel/cpu/resctrl/trace.h b/arch/x86/kernel/cpu/resctrl/trace.h
>> index 495fb90c8572..35149a75c951 100644
>> --- a/arch/x86/kernel/cpu/resctrl/trace.h
>> +++ b/arch/x86/kernel/cpu/resctrl/trace.h
>> @@ -35,6 +35,21 @@ TRACE_EVENT(pseudo_lock_l3,
>>  	    TP_printk("hits=%llu miss=%llu",
>>  		      __entry->l3_hits, __entry->l3_miss));
>>  
>> +TRACE_EVENT(mon_llc_occupancy_limbo,
>> +	    TP_PROTO(u32 ctrl_hw_id, u32 mon_hw_id, int id, u64 occupancy),
>> +	    TP_ARGS(ctrl_hw_id, mon_hw_id, id, occupancy),
>> +	    TP_STRUCT__entry(__field(u32, ctrl_hw_id)
>> +			     __field(u32, mon_hw_id)
> 
>> +			     __field(int, id)
> 
> Nit: Could we call this 'domain_id'? We've got two other ids already, so we should be
> clear what each one is!
> 
> 
>> +			     __field(u64, occupancy)),
> 
> Nit: 'occupancy_bytes'? Just to avoid user-space thinking it needs to convert from the
> hardware unit in Intel's SDM ... and just in case we ever have to add another parameter
> that is sort of occupancy too.
> 
> 
>> +	    TP_fast_assign(__entry->ctrl_hw_id = ctrl_hw_id;
>> +			   __entry->mon_hw_id = mon_hw_id;
>> +			   __entry->id = id;
>> +			   __entry->occupancy = occupancy;),
>> +	    TP_printk("ctrl_hw_id=%u mon_hw_id=%u domain=%d llc_occupancy=%llu",
>> +		      __entry->ctrl_hw_id, __entry->mon_hw_id, __entry->id, __entry->occupancy)
>> +	   );
>> +
> 
> Tracepoints always expose some implicit details of the kernel internals which can make
> supporting them a headache. In this case - I've had some discussion with folk about future
> work to defer the limbo work as late as possible - until a new control or monitor group is
> allocated. This would be to avoid interrupting user-space tasks to update the limbo list
> when the result isn't needed until alloc time.
> 
> In this case the tracepoint wouldn't be hit unless user-space made a mkdir() syscall to
> force an update - I think this can just be a documentation problem.
> 
> I also don't think we should commit to this outputting values for all dirty RMID - which
> it does today. If group creation were to fail because there weren't any free RMID you'd
> get all the values, I think this is the only case where user-space would care.

Yes. 

> 
> Could we document the properties of the the trace-point that can be relied on in
> Documentation/arch/x86/resctrl.rst ?
> 
> Something like:
> | This tracepoint gives you the precise occupancy values for a subset of RMID that are not
> | immediately available for allocation. This can't be relied on to produce output every
> | second, it may be necessary to attempt to create an empty monitor group to force an
> | update. Output may only be produced if creation of a control or monitor group fails.
> 
> I think we'll always walk a list of dirty RMID before failing to allocate an RMID, so this
> should be future proof.
> 
> 
> With the val and documentation bits:
> Reviewed-by: James Morse <james.morse@....com>
> 
> 
> Thanks,
> 
> James

Thanks for your suggestions.