[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <bcdf2daf-ae8a-9047-e171-3dd36a98a66a@intel.com>
Date: Fri, 29 Oct 2021 15:22:01 -0700
From: Reinette Chatre <reinette.chatre@...el.com>
To: James Morse <james.morse@....com>, <x86@...nel.org>,
<linux-kernel@...r.kernel.org>
CC: Fenghua Yu <fenghua.yu@...el.com>,
Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
H Peter Anvin <hpa@...or.com>,
Babu Moger <Babu.Moger@....com>,
<shameerali.kolothum.thodi@...wei.com>,
Jamie Iles <jamie@...iainc.com>,
"D Scott Phillips OS" <scott@...amperecomputing.com>,
<lcherian@...vell.com>, <bobo.shaobowang@...wei.com>,
<tan.shaopeng@...itsu.com>
Subject: Re: [PATCH v2 14/23] x86/resctrl: Calculate bandwidth from the
previous __mon_event_count() chunks
Hi James,
On 10/29/2021 8:50 AM, James Morse wrote:
> On 27/10/2021 21:41, Reinette Chatre wrote:
>> On 10/27/2021 9:50 AM, James Morse wrote:
>>> On 15/10/2021 23:28, Reinette Chatre wrote:
>>>> On 10/1/2021 9:02 AM, James Morse wrote:
>>>>> mbm_bw_count() is only called by the mbm_handle_overflow() worker once a
>>>>> second. It reads the hardware register, calculates the bandwidth and
>>>>> updates m->prev_bw_msr which is used to hold the previous hardware register
>>>>> value.
>>>>>
>>>>> Operating directly on hardware register values makes it difficult to make
>>>>> this code architecture independent, so that it can be moved to /fs/,
>>>>> making the mba_sc feature something resctrl supports with no additional
>>>>> support from the architecture.
>>>>> Prior to calling mbm_bw_count(), mbm_update() reads from the same hardware
>>>>> register using __mon_event_count().
>>>>
>>>> Looking back I think 06c5fe9b12dd ("x86/resctrl: Fix incorrect local bandwidth when mba_sc
>>>> is enabled") may explain how the code ended up the way it is.
>>>>
>>>>> Change mbm_bw_count() to use the current chunks value from
>>>>> __mon_event_count() to calculate bandwidth. This means it no longer
>>>>> operates on hardware register values.
>>>>
>>>> ok ... so could the patch just do this as it is stated here? The way it is implemented is
>>>> very complicated and hard (for me) to verify the correctness (more below).
>>>
>>>>> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c
>>>>> b/arch/x86/kernel/cpu/resctrl/monitor.c
>>>>> index 6c8226987dd6..a1232462db14 100644
>>>>> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
>>>>> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
>>>
>>>>> static void mbm_bw_count(u32 rmid, struct rmid_read *rr)
>>>>> {
>>>>> struct rdt_hw_resource *hw_res = resctrl_to_arch_res(rr->r);
>>>>> struct mbm_state *m = &rr->d->mbm_local[rmid];
>>>>> - u64 tval, cur_bw, chunks;
>>>>> + u64 cur_bw, chunks, cur_chunks;
>>>>> - tval = __rmid_read(rmid, rr->evtid);
>>>>> - if (tval & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL))
>>>>> - return;
>>>>> + cur_chunks = rr->val;
>>>>> + chunks = cur_chunks - m->prev_bw_chunks;
>>>>> + m->prev_bw_chunks = cur_chunks;
>>>>> - chunks = mbm_overflow_count(m->prev_bw_msr, tval, hw_res->mbm_width);
>>>>> - cur_bw = (get_corrected_mbm_count(rmid, chunks) * hw_res->mon_scale) >> 20;
>>>>> + cur_bw = (chunks * hw_res->mon_scale) >> 20;
>>>
>>>> I find this quite confusing. What if a new m->prev_chunks is introduced instead and
>>>> initialized in __mon_event_count() to the value of chunks, and then here in mbm_bw_count
>>>> it could just retrieve it (chunks = m->prev_chunks).
>>>
>>> (I agree the diff is noisy, it may be easier as a new function as this is a replacement
>>> not a transform of the existing function)
>>>
>>> __mon_event_count() can also be triggered by user-space reading the file, so any of its
>>> 'prev' values should be ignored, as they aren't updated on the 1-second timer needed to
>>> get this in MB/s.
>
>> The resource group's mutex is taken before __mon_event_count() is called via user-space or
>> via the overflow handler so I think that mbm_bw_count() can assume that the prev values
>> are from the __mon_event_count() called just before it.
>
> That is true. But changing this to work with the overflow+corrected value directly means
> it doesn't need changing again as each of those steps are moved into the architecture
> specific function. Changing this would make the later patches noisier, and we would have
> the same discussion on a later patch.
ok
>
>
>>> __mon_event_count()'s chunks values hasn't been through get_corrected_mbm_count(), so it
>>> would need to be rr->val, which is what this code starts with for the "number of chunks
>>> ever read by this counter".
>
>> The value could be corrected in mbm_bw_count(), no?
>
> It could, but the aim of the series is to move all the architecture specific behaviour
> behind an arch helper.
ok - I am still working on understanding how these helpers are organized
>
> MPAMs counters read in bytes, and when they don't, its up to the MPAM architecture
> specific code to fix the hardware values before resctrl gets them.
>
> There is no reason for the mba_sc code to be architecture specific, it operates on the
> counters and controls.
>
>
>>> The variable 'chunks' has been used too much here, so its lost its meaning. How about:
>>> | current_chunk_count = rr->val;
>>> | delta_counter = current_chunk_count - m->prev_chunk_count;
>>> | cur_bw = (delta_counter * hw_res->mon_scale) >> 20;
>>> |
>>> | m->prev_chunk_count = current_chunk_count;
>>>
>>>
>>> The 'delta_counter' step was previously hidden in mbm_overflow_count(), which also had to
>>> do with overflow of the hardware counter. Because a previously sanitised value is being
>>> used, the hardware counters resolution doesn't need to be considered.
>>> (which helps make mba_sc architecture independent)
>
>> This is the part that is not obvious to me: is the difference between the two individually
>> sanitized measurements the same as sanitizing the difference between the two measurements?
>
> I agree get_corrected_mbm_count()'s rmid check and shift hide what it is doing, but it
> boils down to a multiply. The existing code is (a - b)*cf, which is the same as this a*cf
> - b*cf.
>
> I'm not worried about this going wrong after 18-and-a-bit Exabytes of data is transferred,
> at current memory speeds that would take decades. But: none of the 'cf' values are greater
> than two, and the hardware register has two bits taken for error codes, so there is no a
> or b that hardware can represent, with a cf less than two, that overflows a 64bit unsigned
> long.
Thank you for answering it in this way. This seems fair. Could the
commit message please elaborate more on the changes involved? The
current summary of "Change mbm_bw_count() to use the current chunks
value from __mon_event_count() to calculate bandwidth." is too cryptic
(for me).
Reinette
Powered by blists - more mailing lists