Message-ID: <c4ced27a-b3e9-4727-9c39-7d1fd0cb0dd2@intel.com>
Date: Thu, 7 Nov 2024 14:03:05 -0800
From: Reinette Chatre <reinette.chatre@...el.com>
To: "Luck, Tony" <tony.luck@...el.com>, Peter Newman <peternewman@...gle.com>
CC: "Yu, Fenghua" <fenghua.yu@...el.com>, "babu.moger@....com"
	<babu.moger@....com>, "bp@...en8.de" <bp@...en8.de>,
	"dave.hansen@...ux.intel.com" <dave.hansen@...ux.intel.com>, "Eranian,
 Stephane" <eranian@...gle.com>, "hpa@...or.com" <hpa@...or.com>,
	"james.morse@....com" <james.morse@....com>, "linux-kernel@...r.kernel.org"
	<linux-kernel@...r.kernel.org>, "mingo@...hat.com" <mingo@...hat.com>,
	"nert.pinx@...il.com" <nert.pinx@...il.com>, "tan.shaopeng@...itsu.com"
	<tan.shaopeng@...itsu.com>, "tglx@...utronix.de" <tglx@...utronix.de>,
	"x86@...nel.org" <x86@...nel.org>
Subject: Re: [PATCH v2 2/2] x86/resctrl: Don't workqueue local event counter
 reads

Hi Tony,

On 11/7/24 12:58 PM, Luck, Tony wrote:
>>> # cat /sys/fs/resctrl/info/L3_MON/mbm_snapshot/mbm_total_bytes_01
>>> <rdtgroup nameA> <MBM total count> <timestamp> <generation>
>>> <rdtgroup nameB> <MBM total count> <timestamp> <generation>
>>> ...
>>>
>>> Where <timestamp> tracks when this sample was captured. And
>>> <generation> is an integer that is incremented when data
>>> for this event is lost (e.g. due to ABMC counter re-assignment).
> 
> Maintaining separate timestamps for each group may be overkill.
> The overflow function walks through them all quite rapidly. On
> Intel Icelake with 100 groups there is only a 670 usec delta
> between the first and last.

If cached data is presented to the user I think the timestamp is
required to let user space know when the data was collected. This
timestamp would be unique per domain as it reflects the per-domain
overflow workers. As you state, it may be overkill if done for each
group, but I think it is valuable to have it for the particular domain.

It sounds as though the use case is for user space to query counters
every second. With the overflow handler and the user space thread
running queries at the same interval, the timestamp may help to ensure
that user space and the kernel do not get out of sync, for example, a
scenario where user space believes it queries once per second but
receives the same cached data in two consecutive queries.
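
To make the out-of-sync concern concrete, here is a minimal user space
sketch assuming the hypothetical snapshot format proposed earlier in
this thread (<rdtgroup name> <count> <timestamp> <generation> per
line). The file path and field layout are assumptions from this
discussion, not an existing kernel interface. The sampler skips the
bandwidth computation when it sees the same cached timestamp twice and
resets its baseline when the generation changes:

#include <inttypes.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* Hypothetical snapshot file from this thread, not a real interface. */
	const char *path =
		"/sys/fs/resctrl/info/L3_MON/mbm_snapshot/mbm_total_bytes_01";
	char name[64];
	uint64_t count, ts, gen;
	uint64_t lastcount = 0, lastts = 0, lastgen = 0;

	for (;;) {
		FILE *f = fopen(path, "r");

		if (!f)
			return 1;
		/* Only the first group's line is parsed, for brevity. */
		if (fscanf(f, "%63s %" SCNu64 " %" SCNu64 " %" SCNu64,
			   name, &count, &ts, &gen) == 4) {
			if (lastts && ts == lastts) {
				/* Same cached sample seen twice: skip. */
			} else if (lastts && gen == lastgen) {
				printf("%s: %" PRIu64 " bytes per time unit\n",
				       name, (count - lastcount) / (ts - lastts));
			}
			lastcount = count;
			lastts = ts;
			lastgen = gen;
		}
		fclose(f);
		sleep(1);
	}
}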

> 
>> It is not obvious to me how resctrl can provide a reliable
>> "generation" value.
> 
> Keep a generation count for each event in each group. Increment
> the count when taking the h/w counter away.

Since this is a snapshot of the counter, why not pass the exact value
or the issue encountered when the counter was read? For example,
"Error", "Unavailable", or "Unassigned" instead of an "MBM
<total|local> count"? We need to be careful when presenting cached
data to user space since the data becomes stale if any issue is
encountered during its query from hardware; that would make any cached
"MBM <total|local> count" invalid.
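
As a sketch of what that could look like (the format below is
hypothetical, extending the example earlier in this thread):

# cat /sys/fs/resctrl/info/L3_MON/mbm_snapshot/mbm_total_bytes_01
<rdtgroup nameA> 123456789 <timestamp>
<rdtgroup nameB> Unavailable <timestamp>
<rdtgroup nameC> Unassigned <timestamp>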

A generation value would be of most use if it can be understood by
user space. I think that would require something separate that lets
user space know which "generation" a counter is in after it is
assigned, so that a query of cached data can be matched to it.

I think maybe the issue you are trying to address is a user assigning
a counter, then reading the cached data and getting cached data from a
previous configuration? Please note that in the current implementation
the cached data is reset directly on counter assignment [1]. If a user
assigns a new counter and then immediately reads cached data, the
cached data will reflect the assignment even if the overflow worker
thread did not get a chance to run since the assignment.

> 
>>> Then a monitor application can compute bandwidth for each
>>> group by periodic sampling and for each group:
>>>
>>>     if (thisgeneration == lastgeneration) {
>>>             bw = (thiscount - lastcount) / (thistimestamp - lasttimestamp);
>>
>> If user space needs visibility into these internals then we could also
>> consider adding a trace event that logs the timestamped data right when it
>> is queried by the overflow handler.
> 
> That would provide accurate data at low overhead, assuming that
> the user wants bandwidth data every second. If they only need
> data over longer time intervals all the extra trace events aren't
> needed.

Using tracepoints comes with the benefit of the features supported by
its user space infrastructure. This is one more tool available when
exploring what would work best to address the use cases. The use case
presented in this thread is to collect monitoring data once per
second.
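
For illustration only, a hypothetical trace event along those lines
could follow the TRACE_EVENT() pattern already used elsewhere in
resctrl; the event and field names below are made up for this sketch:

/* Sketch for a resctrl trace header; names are hypothetical. */
TRACE_EVENT(mbm_overflow_count,
	TP_PROTO(u32 rmid, u64 chunks, u64 ts),
	TP_ARGS(rmid, chunks, ts),
	TP_STRUCT__entry(__field(u32, rmid)
			 __field(u64, chunks)
			 __field(u64, ts)),
	TP_fast_assign(__entry->rmid = rmid;
		       __entry->chunks = chunks;
		       __entry->ts = ts;),
	TP_printk("rmid=%u chunks=%llu ts=%llu",
		  __entry->rmid, __entry->chunks, __entry->ts)
);

The overflow handler could then call trace_mbm_overflow_count() each
time it reads the hardware counter, and user space could consume the
samples through the usual tracefs interface.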

Reinette
 
[1] https://lore.kernel.org/all/3851fbd6ccd1cdc504229e4c7f7d2575c13f5bd6.1730244116.git.babu.moger@amd.com/
