Date: Sun, 7 Apr 2024 12:13:09 -0700
From: Reinette Chatre <reinette.chatre@...el.com>
To: Peter Newman <peternewman@...gle.com>
CC: Fenghua Yu <fenghua.yu@...el.com>, James Morse <james.morse@....com>,
	Stephane Eranian <eranian@...gle.com>, Thomas Gleixner <tglx@...utronix.de>,
	Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>, Dave Hansen
	<dave.hansen@...ux.intel.com>, <x86@...nel.org>, "H. Peter Anvin"
	<hpa@...or.com>, Peter Zijlstra <peterz@...radead.org>, Juri Lelli
	<juri.lelli@...hat.com>, Vincent Guittot <vincent.guittot@...aro.org>,
	Dietmar Eggemann <dietmar.eggemann@....com>, Steven Rostedt
	<rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>, Mel Gorman
	<mgorman@...e.de>, Daniel Bristot de Oliveira <bristot@...hat.com>, "Valentin
 Schneider" <vschneid@...hat.com>, Uros Bizjak <ubizjak@...il.com>, "Mike
 Rapoport" <rppt@...nel.org>, "Kirill A. Shutemov"
	<kirill.shutemov@...ux.intel.com>, Rick Edgecombe
	<rick.p.edgecombe@...el.com>, Xin Li <xin3.li@...el.com>, Babu Moger
	<babu.moger@....com>, Shaopeng Tan <tan.shaopeng@...itsu.com>, "Maciej
 Wieczor-Retman" <maciej.wieczor-retman@...el.com>, Jens Axboe
	<axboe@...nel.dk>, Christian Brauner <brauner@...nel.org>, Oleg Nesterov
	<oleg@...hat.com>, Andrew Morton <akpm@...ux-foundation.org>, Tycho Andersen
	<tandersen@...flix.com>, Nicholas Piggin <npiggin@...il.com>, Beau Belgrave
	<beaub@...ux.microsoft.com>, "Matthew Wilcox (Oracle)" <willy@...radead.org>,
	<linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v1 0/6] x86/resctrl: Avoid searching tasklist during
 mongrp_reparent

Hi Peter,

On 4/5/2024 2:30 PM, Peter Newman wrote:
> Hi Reinette,
> 
> On Thu, Apr 4, 2024 at 4:09 PM Reinette Chatre
> <reinette.chatre@...el.com> wrote:
>> On 3/25/2024 10:27 AM, Peter Newman wrote:
>>> I've been working with users of the recently-added mongroup rename
>>> operation[1] who have observed the impact of back-to-back operations on
>>> latency-sensitive, thread pool-based services. Because changing a
>>> resctrl group's CLOSID (or RMID) requires all task_structs in the system
>>> to be inspected with the tasklist_lock read-locked, a series of move
>>> operations can block out thread creation for long periods of time, as
>>> creating threads needs to write-lock the tasklist_lock.
>>
>> Could you please give some insight into the delays experienced? "Long
>> periods of time" means different things to different people, and this
>> series seems to get more ominous as it progresses: the cover letter
>> starts with "long periods of time" and by the time the final patch
>> appears it has become "disastrous".
> 
> There was an incident where 99.999p tail latencies of a service
> increased from 100 milliseconds to over 2 seconds when the container
> manager switched from our legacy downstream CLOSID-reuse technique[1]
> to mongrp_rename().
> 
> A more focused study, which benchmarked creating 128 threads with
> pthread_create() on a production host, found that moving mongroups
> unrelated to any of the benchmark threads increased the completion CPU
> time from 30ms to 183ms. Profiling showed that average contention on
> the tasklist_lock was about 70ms while mongroup move operations were
> taking place.
> 
> It's difficult for me to access real production workloads, but I
> estimated a crude figure by measuring the task time of "wc -l
> /sys/fs/resctrl" with perf stat on a relatively idle Intel(R) Xeon(R)
> Platinum 8273CL CPU @ 2.20GHz. As I increased the thread count, it
> converged to a line where every additional 1000 threads added about 1
> millisecond.

Thank you very much for capturing this. Could you please include it in
the next posting? This data motivates the work far more convincingly than
terms that cannot be measured.

> Incorporating kernfs_rename() into the solution for changing a group's
> class of service also contributes a lot of overhead (about 90% of a
> mongroup rename seems to be spent here), but the global impact is far
> less than that of the tasklist_lock contention.

Is the kernfs_rename() overhead in an acceptable range?

>>> context switch. Updating a group's ID then only requires the current
>>> task to be switched back in on all CPUs. On server hosts with very large
>>> thread counts, this is much less disruptive than making thread creation
>>> globally unavailable. However, this is less desirable on CPU-isolated,
>>> realtime workloads, so I am interested in suggestions on how to reach a
>>> compromise for the two use cases.
>>
>> As I understand it, this only impacts moving a monitor group? To me this
>> sounds like a user-space-triggered event, associated with a particular use
>> case, that may not be relevant to the workloads that you refer to. I think
>> this could be documented for users with this type of workload.
>> (please see patch #6)
> 
> All of the existing rmdir cases seem to have the same problem, but
> they must not be used frequently enough for any concerns to be raised.
> 
> It seems routine for host workloads to be scaled up until memory
> bandwidth saturates, so allocation restrictions are applied and removed
> rather frequently.

Thank you. 

Reinette
