[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ecc4420f-5a28-cf1f-2537-b651a31e6db3@intel.com>
Date: Wed, 27 Oct 2021 13:41:47 -0700
From: Reinette Chatre <reinette.chatre@...el.com>
To: James Morse <james.morse@....com>, <x86@...nel.org>,
<linux-kernel@...r.kernel.org>
CC: Fenghua Yu <fenghua.yu@...el.com>,
Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
H Peter Anvin <hpa@...or.com>,
Babu Moger <Babu.Moger@....com>,
<shameerali.kolothum.thodi@...wei.com>,
Jamie Iles <jamie@...iainc.com>,
"D Scott Phillips OS" <scott@...amperecomputing.com>,
<lcherian@...vell.com>, <bobo.shaobowang@...wei.com>,
<tan.shaopeng@...itsu.com>
Subject: Re: [PATCH v2 14/23] x86/resctrl: Calculate bandwidth from the
previous __mon_event_count() chunks
Hi James,
On 10/27/2021 9:50 AM, James Morse wrote:
> On 15/10/2021 23:28, Reinette Chatre wrote:
>> On 10/1/2021 9:02 AM, James Morse wrote:
>>> mbm_bw_count() is only called by the mbm_handle_overflow() worker once a
>>> second. It reads the hardware register, calculates the bandwidth and
>>> updates m->prev_bw_msr which is used to hold the previous hardware register
>>> value.
>>>
>>> Operating directly on hardware register values makes it difficult to make
>>> this code architecture independent, so that it can be moved to /fs/,
>>> making the mba_sc feature something resctrl supports with no additional
>>> support from the architecture.
>>> Prior to calling mbm_bw_count(), mbm_update() reads from the same hardware
>>> register using __mon_event_count().
>>
>> Looking back I think 06c5fe9b12dd ("x86/resctrl: Fix incorrect local bandwidth when mba_sc
>> is enabled") may explain how the code ended up the way it is.
>>
>>> Change mbm_bw_count() to use the current chunks value from
>>> __mon_event_count() to calculate bandwidth. This means it no longer
>>> operates on hardware register values.
>>
>> ok ... so could the patch just do this as it is stated here? The way it is implemented is
>> very complicated and hard (for me) to verify the correctness (more below).
>
>>> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
>>> index 6c8226987dd6..a1232462db14 100644
>>> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
>>> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
>
>>> static void mbm_bw_count(u32 rmid, struct rmid_read *rr)
>>> {
>>> struct rdt_hw_resource *hw_res = resctrl_to_arch_res(rr->r);
>>> struct mbm_state *m = &rr->d->mbm_local[rmid];
>>> - u64 tval, cur_bw, chunks;
>>> + u64 cur_bw, chunks, cur_chunks;
>>> - tval = __rmid_read(rmid, rr->evtid);
>>> - if (tval & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL))
>>> - return;
>>> + cur_chunks = rr->val;
>>> + chunks = cur_chunks - m->prev_bw_chunks;
>>> + m->prev_bw_chunks = cur_chunks;
>>> - chunks = mbm_overflow_count(m->prev_bw_msr, tval, hw_res->mbm_width);
>>> - cur_bw = (get_corrected_mbm_count(rmid, chunks) * hw_res->mon_scale) >> 20;
>>> + cur_bw = (chunks * hw_res->mon_scale) >> 20;
>
>> I find this quite confusing. What if a new m->prev_chunks is introduced instead and
>> initialized in __mon_event_count() to the value of chunks, and then here in mbm_bw_count
>> it could just retrieve it (chunks = m->prev_chunks).
>
> (I agree the diff is noisy, it may be easier as a new function as this is a replacement
> not a transform of the existing function)
>
> __mon_event_count() can also be triggered by user-space reading the file, so any of its
> 'prev' values should be ignored, as they aren't updated on the 1-second timer needed to
> get this in MB/s.
The resource group's mutex is taken before __mon_event_count() is called
via user-space or via the overflow handler so I think that
mbm_bw_count() can assume that the prev values are from the
__mon_event_count() called just before it.
> __mon_event_count()'s chunks values hasn't been through get_corrected_mbm_count(), so it
> would need to be rr->val, which is what this code starts with for the "number of chunks
> ever read by this counter".
The value could be corrected in mbm_bw_count(), no?
>
>
> The variable 'chunks' has been used too much here, so its lost its meaning. How about:
> | current_chunk_count = rr->val;
> | delta_counter = current_chunk_count - m->prev_chunk_count;
> | cur_bw = (delta_counter * hw_res->mon_scale) >> 20;
> |
> | m->prev_chunk_count = current_chunk_count;
>
>
> The 'delta_counter' step was previously hidden in mbm_overflow_count(), which also had to
> do with overflow of the hardware counter. Because a previously sanitised value is being
> used, the hardware counters resolution doesn't need to be considered.
> (which helps make mba_sc architecture independent)
This is the part that is not obvious to me: is the difference between
the two individually sanitized measurements the same as sanitizing the
difference between the two measurements?
Reinette
Powered by blists - more mailing lists