Message-ID: <m1ircvswzz.fsf@ebiederm.dsl.xmission.com>
Date:	Tue, 20 Mar 2007 10:25:20 -0600
From:	ebiederm@...ssion.com (Eric W. Biederman)
To:	Andi Kleen <ak@...e.de>
Cc:	David Miller <davem@...emloft.net>, torvalds@...ux-foundation.org,
	virtualization@...ts.linux-foundation.org, jbeulich@...ell.com,
	jeremy@...p.org, xen-devel@...ts.xensource.com,
	netdev@...r.kernel.org, linux-kernel@...r.kernel.org,
	chrisw@...s-sol.org, virtualization@...ts.osdl.org,
	anthony@...emonkey.ws, akpm@...ux-foundation.org, mingo@...e.hu
Subject: Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable

Andi Kleen <ak@...e.de> writes:

>> I'm conflicted about the dwarf unwinder.  I was off doing other things
>> at the time so I missed the pain, but I do have a distinct recollection of
>> the back traces on x86_64 being distinctly worse than on i386.
>
> The only case where i386 was better was with frame pointers, which
> were never fully implemented for x86-64. However I find that hilarious: 
> people are spending a lot of time right here in this thread to squeeze
> out the best call sequences for the paravirt ops, but then accept
> losing a full frame pointer register on i386. I never found that
> acceptable, and that is why I preferred the unwinder instead. 
>
> That said, the big problem with frame pointers is mostly gone now:
> on older CPUs they tended to cause a pipeline stall early in the function.
> That is now fixed in the latest Intel and upcoming AMD CPUs, but there 
> are still millions and millions of older CPUs out there, so I still
> don't consider it acceptable.
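
(For reference: what frame pointers buy, at the cost of dedicating %ebp,
is that unwinding becomes a trivial pointer chase.  A minimal sketch,
assuming the usual i386 convention that %ebp points at the saved caller
%ebp with the return address just above it; the names here are
illustrative, not the actual show_trace code:)

	struct frame {
		struct frame *next;	/* saved caller %ebp */
		unsigned long ret;	/* return address pushed by the call */
	};

	/* Illustrative only: walk the %ebp chain within one kernel stack. */
	static void walk_frame_pointers(struct frame *fp,
					unsigned long stack_lo,
					unsigned long stack_hi)
	{
		while ((unsigned long)fp >= stack_lo &&
		       (unsigned long)fp <= stack_hi - sizeof(*fp)) {
			printk(" [<%08lx>]\n", fp->ret);
			if (fp->next <= fp)	/* frames must move up the stack */
				break;
			fp = fp->next;
		}
	}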

What I recall observing is call traces that made no sense.  Not just
extra noise in the stack trace but things like seeing a function that
has exactly one path to it, and not seeing all of the functions on
that path in the call trace.

In my later debugging I have been reasonably able to attribute those
kinds of things to compiler optimizations like inlining and tail call
optimization.
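
As a concrete illustration of the tail call case (made-up functions, not
kernel code): with -foptimize-sibling-calls, which gcc enables at -O2,
the tail call in middle_func() is compiled to a plain jmp, so no return
address for middle_func() ever lands on the stack, and a backtrace taken
inside leaf_func() shows top_func() as the immediate caller while
middle_func() never appears.  (That assumes the functions are not simply
inlined outright, which erases frames even more thoroughly.)

	/* Made-up example.  The compiler emits "jmp leaf_func" for the
	 * tail call below instead of "call leaf_func; ret", so
	 * middle_func leaves no frame behind.
	 */
	int leaf_func(int x)
	{
		return x * 2;			/* a backtrace here skips middle_func */
	}

	int middle_func(int x)
	{
		return leaf_func(x + 1);	/* tail call: compiled to a jmp */
	}

	int top_func(int x)
	{
		return middle_func(x);		/* shows up as the direct caller */
	}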

Now I will agree that having fewer or no false positives to weed
through is a good thing, if we can do it reliably.

>> Lately 
>> I haven't seen that so it may be I was misinterpreting what I was
>> seeing, and the compiler optimizations were what gave me such weird
>> back traces. 
>
> The main problem is that subsystems are getting more and more complex
> and especially callbacks seem to multiply far too quickly.
>
> In 2.4 it was often very reasonable to just sort out the false positives,
> but with the sometimes 20-30+ level deep call chains in 2.6, with many
> callbacks, that just gets far too tenuous. 

Hmm.  I haven't seen those traces, but I wonder if the size of those
stack traces indicates potential stack overflow problems.
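
Just as a rough illustration, not a measurement: 25-30 frames at even a
modest 100-200 bytes of locals and saved registers each is already in the
2.5-6KB range, which leaves little or no headroom on 4K stacks and not a
lot on 8K stacks once an interrupt lands on top.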
  
>> But if the quality of our backtraces has gone down and the dwarf unwinder
>> could give us better back traces, it is likely worth pursuing.  Of
>> course it would need to start with the assumption that its tables
>> may be borked (the kernel is busted, after all) and be much more
>> careful than Andi's last attempt.
>
> The latest version validates the stack always. It was only a few lines
> of change. I doubt it will make much difference though. The few true crashes
> we had were not actually due to the unwinder itself, but to the buggy fallback
> code (which was fixed quickly). But anyway it should satisfy everybody's
> paranoia now.

Do you also validate the unwind data?
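
(To be clear about terms: I read "validates the stack" above as per-step
sanity checks roughly like the sketch below; whether the unwind data, i.e.
the DWARF tables themselves, also gets checked is the separate question I
am asking.  The function and its arguments are made up for illustration,
not the real unwinder code:)

	/* Sketch only: does one unwind step look sane? */
	static int unwind_step_ok(unsigned long old_sp, unsigned long new_sp,
				  unsigned long stack_lo, unsigned long stack_hi,
				  unsigned long new_ip)
	{
		if (new_sp <= old_sp)			/* must move up the stack */
			return 0;
		if (new_sp < stack_lo || new_sp >= stack_hi)
			return 0;			/* wandered off this stack */
		if (!kernel_text_address(new_ip))	/* bogus return address */
			return 0;
		return 1;
	}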

> Although in future it would be good if people did some more analysis of the
> root causes of failures before letting the paranoia take over and reverting
> patches.
>
> We see a good example here of what I call the JFS/ACPI effect: code gets merged
> too early with some visible problems. It gets a bad name and afterwards people
> never look objectively at it again and just trust their prejudices. 

I don't know.  The impression I got was that the root cause analysis stopped 
when it was observed that the code was unsuitable for solving the problem.
When asked about it, it appeared the developer did not understand the
question.  Therefore the root cause was assumed to be the developer.

At least that is how I have read the few little bits I have seen.

> But that's not a good strategy for getting good code in the end, I think. If
> there is enough evidence that the early problems were fixed, then prejudices
> should be reevaluated.

Certainly.  However if the developer has lost a certain amount of
initial trust, the burden becomes much higher.

Eric
