Message-ID: <3325de04-6e59-4c7a-ae44-c245c8edb93f@intel.com>
Date: Fri, 6 Sep 2024 17:05:13 -0700
From: Reinette Chatre <reinette.chatre@...el.com>
To: Ilpo Järvinen <ilpo.jarvinen@...ux.intel.com>
CC: <fenghua.yu@...el.com>, <shuah@...nel.org>, <tony.luck@...el.com>,
	<peternewman@...gle.com>, <babu.moger@....com>,
	Maciej Wieczór-Retman <maciej.wieczor-retman@...el.com>,
	<linux-kselftest@...r.kernel.org>, LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 2/6] selftests/resctrl: Ensure measurements skip
 initialization of default benchmark

Hi Ilpo,

On 9/6/24 3:00 AM, Ilpo Järvinen wrote:
> On Thu, 5 Sep 2024, Reinette Chatre wrote:
>> On 9/5/24 5:10 AM, Ilpo Järvinen wrote:
>>> On Wed, 4 Sep 2024, Reinette Chatre wrote:
>>>> On 9/4/24 4:57 AM, Ilpo Järvinen wrote:
>>>>> On Fri, 30 Aug 2024, Reinette Chatre wrote:
>>>>>> On 8/30/24 3:56 AM, Ilpo Järvinen wrote:
>>>>>>> On Thu, 29 Aug 2024, Reinette Chatre wrote:
>>>
>>>>>>>> @@ -699,111 +639,80 @@ int resctrl_val(const struct resctrl_test *test,
>>>>>>>>  		return ret;
>>>>>>>>  	}
>>>>>>>>  
>>>>>>>> -	/*
>>>>>>>> -	 * If benchmark wasn't successfully started by child, then child should
>>>>>>>> -	 * kill parent, so save parent's pid
>>>>>>>> -	 */
>>>>>>>>  	ppid = getpid();
>>>>>>>>  
>>>>>>>> -	if (pipe(pipefd)) {
>>>>>>>> -		ksft_perror("Unable to create pipe");
>>>>>>>> +	/* Taskset test to specified CPU. */
>>>>>>>> +	ret = taskset_benchmark(ppid, uparams->cpu, &old_affinity);
>>>>>>>
>>>>>>> Previously only CPU affinity for bm_pid was set but now it's set
>>>>>>> before fork(). Quickly checking the Internet, it seems that CPU
>>>>>>> affinity gets inherited on fork() so now both processes will have
>>>>>>> the same affinity, which might make the other process interfere
>>>>>>> with the measurement.
>>>>>>
>>>>>> Setting the affinity is intended to ensure that the buffer preparation
>>>>>> occurs in the same topology as where the runtime portion will run.
>>>>>> This preparation is done before the work to be measured starts.
>>>>>>
>>>>>> This does tie in with the association with the resctrl group and I
>>>>>> will elaborate more below ...
>>>>>
>>>>> Okay, that's useful to retain but thinking this further, now we're
>>>>> also going to do a non-trivial amount of work in between the setup
>>>>> and the test by
>>>>
>>>> Could you please elaborate on how the amount of work during setup can
>>>> be an issue? I have been focused on the measurements that are done
>>>> afterwards, which do have clear boundaries from what I can tell.
>>>
>>> Well, you used it as a justification: "Setting the affinity is intended
>>> to ensure that the buffer preparation occurs in the same topology as
>>> where the runtime portion will run." So I assumed you had some
>>> expectations about "preparations" done outside of those "clear
>>> boundaries", but now you seem to take the entirely opposite stance?
>>
>> I do not follow you here. By "clear boundaries" I meant the
>> measurements and associated variables that have a clear start/init and
>> stop/read while the other task sleeps. Yes, preparations are done
>> outside of this since they should not be measured.
> 
> Of course the preparations are not measured (at least not after this
> patch :-)).
> 
> But that's not what I meant. You said the preparations are to be done
> using the same topology as the test, but if that's outside of the
> measurement period, the topology during preparations only matters if you
> make some hidden assumption that some state carries over from
> preparations to the actual test. If no state carry-over is assumed, the
> topology during preparations is not important. So which way is it? Is
> the topology during preparations important or not?

Topology during preparations is important.

In the CMT test this is more relevant with the transition to using
memflush = false. The preparation phase and the measurement phase use the
same cache allocation configuration and, just as importantly, the same
monitoring configuration. During preparation, the cache portion that will
be used during measurement is filled with the contents of the buffer.
During measurement it is the same cache portion into which any new reads
will be allocated and measured. The preparation phase thus effectively
forms part of the measurement. If preparation was done with a different
configuration, then I see a problem with memflush = true as well as with
memflush = false. In the case of memflush = true there is the familiar
issue of the test needing to "guess" the workload settle time. In the
case of memflush = false the buffer will remain allocated into the cache
portion used during the preparation phase; when the workload runs it will
read the data from a cache portion that does not belong to it, and since
it does not need to allocate into its own cache portion, its LLC
occupancy counts will not be accurate (the LLC occupancy associated with
the buffer will be attributed to the prepare portion).
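
To illustrate the mechanics being relied on here, a rough standalone
sketch (illustrative only, not the selftest code; the CPU number and
buffer size are arbitrary): affinity set before fork() is inherited by
the child, and a read-only buffer stays shared via copy-on-write, so the
preparation and the measured reads target the same topology:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

#define SPAN	(4 * 1024 * 1024)

int main(void)
{
	volatile unsigned char sink = 0;
	unsigned char *buf;
	cpu_set_t cpus;
	pid_t pid;
	size_t i;

	/*
	 * Pin before fork() so the preparation below and the child's
	 * reads happen on the same CPU (same cache/NUMA topology).
	 * CPU affinity is inherited across fork().
	 */
	CPU_ZERO(&cpus);
	CPU_SET(1, &cpus);
	if (sched_setaffinity(0, sizeof(cpus), &cpus))
		perror("sched_setaffinity");

	/* Preparation: write the buffer once to pull it into the cache. */
	buf = malloc(SPAN);
	if (!buf)
		return 1;
	memset(buf, 0x5a, SPAN);

	pid = fork();
	if (pid == 0) {
		/*
		 * Child: read-only traversal. Copy-on-write keeps the
		 * pages shared with the parent since nothing writes.
		 */
		for (i = 0; i < SPAN; i++)
			sink = buf[i];
		_exit(0);
	}
	/* The parent would start/stop its counters around this wait. */
	waitpid(pid, NULL, 0);
	free(buf);
	return 0;
}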

I am not familiar with the details of memory allocation, but as I
understand it, Linux attempts to satisfy memory requests from the node to
which the requesting CPU is assigned. For the MBM and MBA tests I thus
believe it is important to allocate the memory from where it will be
used. I have encountered systems where CPU0 and CPU1 are on different
sockets and by default the workload is set to run on CPU1. If the
preparation phase runs on CPU0, memory may be allocated from a different
NUMA node than the one where the workload will run. By doing the
preparation within the same topology where the workload will run, I
believe memory will be allocated appropriately for the workload, which
should result in more reliable measurements.
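
A quick way to see this "first touch" placement (again an illustrative
sketch, not selftest code; move_pages() needs libnuma's numaif.h and
linking with -lnuma, and the CPU number is arbitrary):

#define _GNU_SOURCE
#include <numaif.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
	size_t span = 4 * 1024 * 1024;
	cpu_set_t cpus;
	int status = -1;
	void *page;
	char *buf;

	/* Pin to the CPU the workload will use before touching memory. */
	CPU_ZERO(&cpus);
	CPU_SET(1, &cpus);
	if (sched_setaffinity(0, sizeof(cpus), &cpus)) {
		perror("sched_setaffinity");
		return 1;
	}

	buf = aligned_alloc(4096, span);
	if (!buf)
		return 1;
	/*
	 * The memset() is the "first touch": physical pages are faulted
	 * in now, normally on the pinned CPU's node, not at allocation.
	 */
	memset(buf, 1, span);

	/* move_pages() with nodes == NULL reports where a page resides. */
	page = buf;
	if (move_pages(0, 1, &page, NULL, &status, 0))
		perror("move_pages");
	else
		printf("first page is on node %d\n", status);

	free(buf);
	return 0;
}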

> 
> You used the topology argument to justify why both parent and child are
> now in the same resctrl group unlike before when only the actual test was.
> 
>> You stated "now we're also going to do a non-trivial amount of work in
>> between the setup and the test" ... could you clarify what the problem
>> is with this? Before this work the "non-trivial amount of work" (for
>> "fill_buf") was done as part of the test with (wrong) guesses about how
>> long it takes. This work aims to improve this.
> 
> I understand what you're trying to do with this change.
> 
> However, I was trying to say that before, preparations occurred right
> before the test, which is no longer the case because there's a fork() in
> between.

If by "test" you mean the measurement phase then in the case of "fill_buf"
preparations only now reliably occur before the measurement. Original behavior
is maintained with user provided benchmark.

> 
> Does that extra work impact the state carry-over from preparations to the
> test?

It is not clear to me what extra work or state you are referring to.

> 
>>> fork(), quite a heavy operation as it has to copy various things
>>> including the address space, which you just made to contain a huge mem
>>> blob. :-)
>>
>> As highlighted in a comment found in the patch, the kernel uses copy-on-write
>> and all tests only read from the buffer after fork().
> 
> A write test is possible using "-b fill_buf ..." as mentioned in the
> comments to the other patch.

Yes, it is theoretically possible, but looking closer, it is not
supported. Note how measure_mem_bw() is always called with a hardcoded
"reads". Reading through the commit history, I do not believe modifying
the fill_buf parameters was ever intended to be a use case for the "-b"
parameter. It seems intended to provide an alternate benchmark to
fill_buf.

I am not interested in adding support for the write test because I do not
see how it helps us improve resctrl subsystem health. Considering that
nobody has ever complained about the write test being broken, I think all
that dead code should just be removed instead.

I prefer that the focus be on the health of the resctrl subsystem rather
than on additional hardware performance testing. I do not think further
tuning of these performance tests is energy well spent unless it benefits
resctrl subsystem health. These performance tests are difficult to
maintain and I have not yet seen how they benefit the resctrl subsystem.

>>> BTW, perhaps we could use some lighter-weight fork variant in the
>>> resctrl selftests; the processes don't really need to be that separated
>>> to justify using a full fork() (and custom benchmarks will do execvp()).
>>
>> Are you thinking about pthreads? It is not clear to me that this is
>> necessary. It is also not clear to me what problem you are describing that
>> needs this solution. When I understand that better it will be easier to
>> discuss solutions.
> 
> I was trying to say I see an advantage in retaining the address space,
> which is not possible with fork(), so perhaps using clone() with
> CLONE_VM would be useful, and maybe some other flags too. I think this
> would solve the write test case.

clone() brings with it the complexity of needing to manage the child's
stack. This again aims for a performance improvement. What does this fix?
What is the benefit to resctrl health? I would prefer that our energy
instead be focused on improving resctrl subsystem health.
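
For reference, this is roughly the stack management clone() would add (a
minimal illustrative sketch, not proposed code; the child function name
and sizes are arbitrary):

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE	(1024 * 1024)

static char *shared_buf;	/* visible to the child via CLONE_VM */

static int child_fn(void *arg)
{
	/* Shares the parent's address space, unlike with fork(). */
	shared_buf[0] = 1;
	return 0;
}

int main(void)
{
	char *stack;
	pid_t pid;

	shared_buf = malloc(4096);
	if (!shared_buf)
		return 1;

	/*
	 * The caller must allocate and size the child's stack itself and
	 * pass its top, since the stack grows down on most architectures.
	 * This is the extra management that fork() does not require.
	 */
	stack = malloc(STACK_SIZE);
	if (!stack)
		return 1;

	pid = clone(child_fn, stack + STACK_SIZE, CLONE_VM | SIGCHLD, NULL);
	if (pid == -1) {
		perror("clone");
		return 1;
	}
	waitpid(pid, NULL, 0);

	free(stack);
	free(shared_buf);
	return 0;
}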

Reinette
