linux-kernel - Re: [patch 04/12] clockevent unbind: use smp_call_func_single

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <Zc0Naa3pwTyndUvK@tpad>
Date: Wed, 14 Feb 2024 15:58:49 -0300
From: Marcelo Tosatti <mtosatti@...hat.com>
To: Thomas Gleixner <tglx@...utronix.de>
Cc: linux-kernel@...r.kernel.org,
	Daniel Bristot de Oliveira <bristot@...nel.org>,
	Juri Lelli <juri.lelli@...hat.com>,
	Valentin Schneider <vschneid@...hat.com>,
	Frederic Weisbecker <frederic@...nel.org>,
	Leonardo Bras <leobras@...hat.com>,
	Peter Zijlstra <peterz@...radead.org>
Subject: Re: [patch 04/12] clockevent unbind: use smp_call_func_single_fail

On Sun, Feb 11, 2024 at 09:52:35AM +0100, Thomas Gleixner wrote:
> On Wed, Feb 07 2024 at 09:51, Marcelo Tosatti wrote:
> > On Wed, Feb 07, 2024 at 12:55:59PM +0100, Thomas Gleixner wrote:
> >
> > OK, so the problem is the following: due to software complexity, one is
> > often not aware of all operations that might take place.
> 
> The problem is that people throw random crap on their systems and avoid
> proper system engineering and then complain that their realtime
> constraints are violated. So you are proliferating bad engineering
> practices and encourage people not to care.

Its more of a practicality and cost concern: one usually does not have
resources to fully review software before using that software.

> > Now think of all possible paths, from userspace, that lead to kernel
> > code that ends up in smp_call_function_* variants (or other functions
> > that cause IPIs to isolated CPUs).
> 
> So you need to analyze every possible code path and interface and add
> your magic functions there after figuring out whether that's valid or
> not.

"A magic function", yes.

> > The alternative, from blocking this in the kernel, would be to validate all 
> > userspace software involved in your application, to ensure it won't end
> > up in the kernel sending IPIs. Which is impractical, isnt it ?
> 
> It's absolutely not impractical. It's part of proper system
> engineering. The wet dream that you can run random docker containers and
> everything works magically is just a wet dream.

Unfortunately that is what people do.

I understand that "full software review" would be the ideal, but in most
situations it does not seem to happen.

> > (or rather, with such option in the kernel, it would be possible to run 
> > applications which have not been validated, since the kernel would fail
> > the operation that results in IPI to isolated CPU).
> 
> That's a fallacy because you _cannot_ define with a single CPU mask
> which interface is valid in a particular configuration to end up with an
> IPI and which one is not. There are legitimate reasons in realtime or
> latency constraint systems to invoke selective functionality which
> interferes with the overall system constraints.
> 
> How do you cover that with your magic CPU mask? You can't.
> 
> Aside of that there is a decent chance that you are subtly breaking user
> space that way. Just look at that hwmon/coretemp commit you pointed to:
> 
>   "Temperature information from the housekeeping cores should be
>    sufficient to infer die temperature."
> 
> That's just wishful thinking for various reasons:
> 
>   - The die temperature on larger packages is not evenly distributed and
>     you can run into situations where the housekeeping cores are sitting
>     "far" enough away from the worker core which creates the heat spot

I know.

>   - Some monitoring applications just stop to work when they can't read
>     the full data set, which means that they break subtly and you can
>     infer exactly nothing.
> 
> > So the idea would be an additional "isolation mode", which when enabled, 
> > would disallow the IPIs. Its still possible for root user to disable
> > this mode, and retry the operation.
> >
> > So lets say i want to read MSRs on a given CPU, as root.
> >
> > You'd have to: 
> >
> > 1) readmsr on given CPU (returns -EPERM or whatever), since the
> > "block interference" mode is enabled for that CPU.
> >
> > 2) Disable that CPU in the block interference cpumask.
> >
> > 3) readmsr on the given CPU (success).
> >
> > 4) Re-enable CPU in block interference cpumask, if desired.
> 
> That's just wrong. Why?
> 
> Once you enable it just to read the MSR you enable the operation for
> _ALL_ other non-validated crap too. So while the single MSR read might
> be OK under certain circumstances the fact that you open up a window for
> all other interfaces to do far more interfering operations is a red
> flag.
> 
> This whole thing is a really badly defined policy mechanism of very
> dubious value.
> 
> Thanks,

OK, fair enough. From your comments, it seems that per-callsite 
toggling would be ideal, for example:

/sys/kernel/interference_blocking/ directory containing one sub-directory
per call site. 

Inside each sub-directory, a "enabled" file, controlling a boolean 
to enable or disable interference blocking for that particular
callsite.

Also a "cpumask" file on each directory, by default containing the same
cpumask as the nohz_full CPUs, to control to which CPUs to block the
interference for.

How does that sound?