linux-kernel - Re: [RFC 4/5] x86, perf: implements lwp-perf-integration (rc1)

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20111220184004.GE8408@elte.hu>
Date:	Tue, 20 Dec 2011 19:40:04 +0100
From:	Ingo Molnar <mingo@...e.hu>
To:	Joerg Roedel <joro@...tes.org>
Cc:	Avi Kivity <avi@...hat.com>,
	Robert Richter <robert.richter@....com>,
	Benjamin Block <bebl@...eta.org>,
	Hans Rosenfeld <hans.rosenfeld@....com>, hpa@...or.com,
	tglx@...utronix.de, suresh.b.siddha@...el.com, eranian@...gle.com,
	brgerst@...il.com, Andreas.Herrmann3@....com, x86@...nel.org,
	linux-kernel@...r.kernel.org,
	Benjamin Block <benjamin.block@....com>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: [RFC 4/5] x86, perf: implements lwp-perf-integration (rc1)


* Joerg Roedel <joro@...tes.org> wrote:

> Hi Ingo,
> 
> On Tue, Dec 20, 2011 at 11:09:17AM +0100, Ingo Molnar wrote:
> > >
> > > No, it's worse.  They are ring 3 writeable, and ring 3 
> > > configurable.
> > 
> > Avi, i know that very well.
> 
> So you agree that your ideas presented in this thread of 
> integrating LWP into perf have serious security implications?

No, i do not agree at all - you are drastically misrepresending 
my position.

> > > btw, that means that the intended use case - 
> > > self-monitoring with no kernel support - cannot be done. 
> > > [...]
> > 
> > Arguably many years ago the hardware was designed for 
> > brain-dead instrumentation abstractions.
> 
> The point of LWP design is, that it doesn't require 
> abstractions except for the threshold interrupt.
> 
> I am fine with integrating LWP into perf as long as it makes 
> sense and does not break the intended usage scenario for LWP.

That's the wrong way around - in reality we'll integrate LWP 
upstream only once it makes sense and works well with the 
primary instrumentation abstraction we have in the kernel.

Otherwise my "sorry, it's not convincing enough yet" NAK against 
the new feature stands.

In fact as per Linus's rules about new kernel features, 
maintainers don't even have to justify NAK's by offering an 
implementation roadmap that would make the feature acceptable.

Me or PeterZ could just say "this feature is too limited and not 
convincing enough yet, sorry".

*You* who are pushing the feature have to convince the objecting 
maintainer that the feature is worth integrating.

But i'm being nice and helpful here by giving you a rough 
technical outline of how you could overcome my "sorry, this is 
not convincing in its current form yet" rejection of the current 
LWP patches.

> 	[ Because LWP is a user-space feature and designed as such,
> 	  forcing it into an abstraction makes software that uses LWP
> 	  unportable. ]
> 
> But Ingo, the ideas you presented in this thread are clearly 
> no-gos.

Nonsense.

> Having a shared per-cpu buffer for LWP data that is read by 
> perf obviously has very bad security implications, as Avi 
> already pointed out. [...]

Stop this stupidity already!

There's no "security implications" whatsoever. LWP is a ring-3 
hw feature and it can do nothing that the user-space app could 
not already do ...

> [...] It also destroys the intended use-case for LWP because 
> it disturbs any process that is doing self-profiling with LWP.

Why would it destroy that? Self-profiling can install events 
just fine, the kernel will arbitrate the resource.

The 'intended usecase' is meaningless to me - it was done in 
some closed process apparently not talking to anyone who knows a 
bit about Linux instrumentation. If you want this code upstream 
then you need to convince me that the feature makes sense in the 
general and current scheme of things.

I've outlined the (rather workable) technical roadmap for that.

> > Note that as i said user-space *can* acccess the area if it 
> > thinks it can do it better than the kernel (and we could 
> > export that information in a well defined way - we could do 
> > the same for PEBS as well) - i have no particular strong 
> > feelings about allowing that other than i think it's an 
> > obviously inferior model - *as long* as proper, generic, 
> > usable support is added.
> 
> LWP can't be compared in any serious way with PEBS. The only 
> common thing is the hardware-managed ring-buffer. [...]

Which ring-buffer is actually happens to be one of the main 
things that has to be managed ...

> [...] But PEBS is an addition to MSR based performance 
> monitoring resources (for which a kernel abstraction makes a 
> lot of sense) and can only be controlled from ring 0 while LWP 
> is a complete user-space controlled PMU which has no link at 
> all to the MSR-based, ring 0 controlled PMU.

It's a ring-3 controlled PMU feature, not a user-space PMU 
feature. It *can* be controlled by user-space - but it obviously 
can also (and i argue, it should be) - managed by the kernel, 
under Linux.

The kernel is running ring-3 code as well, and it's managing 
ring-3 accessible resources as well, there's nothing new about 
that.

> > From my perspective there's really just one realistic option 
> > to accept this feature: if it's properly fit into existing, 
> > modern instrumentation abstractions. I made that abundantly 
> > clear in my feedback so far.
> 
> The threshold interrupt fits well into the perf-abstraction 
> layer. Even self-monitoring of processes does, and Hans posted 
> patches from Benjamin for that.  What do you think about this 
> approach?

As as i said it's a promising first step - although the 
discussion here convinced me that it needs to be even more 
feature complete, i don't really see that you guys understand 
how such things should be implemented.

You seem to be dead set on supporting a weird special case 
'intended workload' while forgetting the *much* more common 
profilin workloads we have under Linux.

I don't mind supporting weird stuff as well, but you have to 
keep the common case in mind ...

I'd like to see the ring-buffer and the events managed by the 
kernel too, at least so that perf record works fine with this 
PMU feature.

> > User-space can smash it and make it not profile or profile 
> > the wrong thing or into the wrong buffer - but LWP itself 
> > runs with ring3 privileges so it won't do anything the user 
> > couldnt do already.
> 
> The point is, if user-space re-programs LWP it will continue 
> to write its samples to the new ring-buffer virtual-address 
> set up by user-space. It will still use that virtual address 
> in another address-space after a task-switch. This allows 
> processes to corrupt memory of other processes. [...]

That's nonsense. As i said it my previous mail the LWPC should 
be per task and switched on task switch - just like the DS/PEBS 
context is.

> [...] There are ways to hack around that but these have a 
> serious impact on task-switch costs so this is also no way to 
> go.

We are seeing no problems with this approach under PEBS.

> > Lack of protection against self-misconfiguration-damage is a 
> > benign hardware mis-feature - something for LWP v2 to 
> > specify i guess.
> 
> So what you are saying is (not just here, also in other emails 
> in this thread) that every hardware not designed for perf is 
> crap?

No - PMU hardware designed to not allow the profiling of the 
kernel is obviously a crappy aspect of it. Also, PMU hardware 
that does not allow 100% encapsulation by the kernel is 
obviously not very wisely done either.

Those limitations are not a big problem for usable Linux support 
- and future iterations of the LWP hardware can trivially 
address those shortcomings.

> > get_unmapped_area() + install_special_mapping() is probably 
> > better, but yeah.
> 
> get_unmapped_area() only works on current. [...]

Which is a perfectly fine first step to support the
'perf record' inheritance-tree case - which is a very
common profiling method.

> [...] So it can't be used for that purpose too. [...]

Hey, i wrote bits of get_unmapped_area(), way back. I had code 
on my machine that inserted vmas into other tasks's address 
spaces and can confirm that it works. Do you take my word for it 
that it's possible?

Firstly, the perf record case - which is an important, primary 
workflow - can work with the code as-is just fine.

Secondly, for system-wide profiling vmas can be inserted into 
another task's mm context just fine as well: technically we do 
that all the time, when a threaded program is running.

Inserting a vma into another task's mm where that mm is not ours 
is indeed not typical, but not unprecedented either, UML patches 
did that a couple of years ago. (In fact the upcoming uprobes 
patches are doing something far more intrusive.)

The VM modification is trivial AFAICS: an 'mm' parameter has to 
be added to a new do_mmap() variant, that's all - the code is 
already SMP-safe, due to the threaded case.

Otherwise using another task's mm is safe if you acquire it via 
get_task_mm()/mmput().

[ Sidenote: as a bonus this would put infrastructure in place to 
  have user-space accessible trace buffers, insertable via 
  the LWPINS instruction and recoverable via the regular kernel 
  perf event processing facilities. LWP has more potential than
  just self-profiling, if we use the right abstractions... ]

Thanks,

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/