linux-kernel - Re: [PATCH v2 14/23] x86/resctrl: Calculate bandwidth from the previous __mon_event

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <ecc4420f-5a28-cf1f-2537-b651a31e6db3@intel.com>
Date:   Wed, 27 Oct 2021 13:41:47 -0700
From:   Reinette Chatre <reinette.chatre@...el.com>
To:     James Morse <james.morse@....com>, <x86@...nel.org>,
        <linux-kernel@...r.kernel.org>
CC:     Fenghua Yu <fenghua.yu@...el.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
        H Peter Anvin <hpa@...or.com>,
        Babu Moger <Babu.Moger@....com>,
        <shameerali.kolothum.thodi@...wei.com>,
        Jamie Iles <jamie@...iainc.com>,
        "D Scott Phillips OS" <scott@...amperecomputing.com>,
        <lcherian@...vell.com>, <bobo.shaobowang@...wei.com>,
        <tan.shaopeng@...itsu.com>
Subject: Re: [PATCH v2 14/23] x86/resctrl: Calculate bandwidth from the
 previous __mon_event_count() chunks

Hi James,

On 10/27/2021 9:50 AM, James Morse wrote:
> On 15/10/2021 23:28, Reinette Chatre wrote:
>> On 10/1/2021 9:02 AM, James Morse wrote:
>>> mbm_bw_count() is only called by the mbm_handle_overflow() worker once a
>>> second. It reads the hardware register, calculates the bandwidth and
>>> updates m->prev_bw_msr which is used to hold the previous hardware register
>>> value.
>>>
>>> Operating directly on hardware register values makes it difficult to make
>>> this code architecture independent, so that it can be moved to /fs/,
>>> making the mba_sc feature something resctrl supports with no additional
>>> support from the architecture.
>>> Prior to calling mbm_bw_count(), mbm_update() reads from the same hardware
>>> register using __mon_event_count().
>>
>> Looking back I think 06c5fe9b12dd ("x86/resctrl: Fix incorrect local bandwidth when mba_sc
>> is enabled") may explain how the code ended up the way it is.
>>
>>> Change mbm_bw_count() to use the current chunks value from
>>> __mon_event_count() to calculate bandwidth. This means it no longer
>>> operates on hardware register values.
>>
>> ok ... so could the patch just do this as it is stated here? The way it is implemented is
>> very complicated and hard (for me) to verify the correctness (more below).
> 
>>> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
>>> index 6c8226987dd6..a1232462db14 100644
>>> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
>>> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> 
>>>    static void mbm_bw_count(u32 rmid, struct rmid_read *rr)
>>>    {
>>>        struct rdt_hw_resource *hw_res = resctrl_to_arch_res(rr->r);
>>>        struct mbm_state *m = &rr->d->mbm_local[rmid];
>>> -    u64 tval, cur_bw, chunks;
>>> +    u64 cur_bw, chunks, cur_chunks;
>>>    -    tval = __rmid_read(rmid, rr->evtid);
>>> -    if (tval & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL))
>>> -        return;
>>> +    cur_chunks = rr->val;
>>> +    chunks = cur_chunks - m->prev_bw_chunks;
>>> +    m->prev_bw_chunks = cur_chunks;
>>>    -    chunks = mbm_overflow_count(m->prev_bw_msr, tval, hw_res->mbm_width);
>>> -    cur_bw = (get_corrected_mbm_count(rmid, chunks) * hw_res->mon_scale) >> 20;
>>> +    cur_bw = (chunks * hw_res->mon_scale) >> 20;
> 
>> I find this quite confusing. What if a new m->prev_chunks is introduced instead and
>> initialized in __mon_event_count() to the value of chunks, and then here in mbm_bw_count
>> it could just retrieve it (chunks = m->prev_chunks).
> 
> (I agree the diff is noisy, it may be easier as a new function as this is a replacement
> not a transform of the existing function)
> 
> __mon_event_count() can also be triggered by user-space reading the file, so any of its
> 'prev' values should be ignored, as they aren't updated on the 1-second timer needed to
> get this in MB/s.

The resource group's mutex is taken before __mon_event_count() is called 
via user-space or via the overflow handler so I think that 
mbm_bw_count() can assume that the prev values are from the 
__mon_event_count() called just before it.

> __mon_event_count()'s chunks values hasn't been through get_corrected_mbm_count(), so it
> would need to be rr->val, which is what this code starts with for the "number of chunks
> ever read by this counter".

The value could be corrected in mbm_bw_count(), no?

> 
> 
> The variable 'chunks' has been used too much here, so its lost its meaning. How about:
> |	current_chunk_count = rr->val;
> |	delta_counter = current_chunk_count - m->prev_chunk_count;
> |	cur_bw = (delta_counter * hw_res->mon_scale) >> 20;
> |
> |	m->prev_chunk_count = current_chunk_count;
> 
> 
> The 'delta_counter' step was previously hidden in mbm_overflow_count(), which also had to
> do with overflow of the hardware counter. Because a previously sanitised value is being
> used, the hardware counters resolution doesn't need to be considered.
> (which helps make mba_sc architecture independent)

This is the part that is not obvious to me: is the difference between 
the two individually sanitized measurements the same as sanitizing the 
difference between the two measurements?

Reinette