Date:	Fri, 14 Feb 2014 14:17:45 +0100
From:	Peter Zijlstra <peterz@...radead.org>
To:	Stefan Bader <stefan.bader@...onical.com>
Cc:	Borislav Petkov <bp@...en8.de>,
	Paolo Bonzini <pbonzini@...hat.com>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	kvm@...r.kernel.org, Marcelo Tosatti <mtosatti@...hat.com>,
	MASAO TAKAHASHI <masao-takahashi@...no.co.jp>,
	Joerg Roedel <joro@...tes.org>
Subject: Re: Another preempt folding issue?

On Fri, Feb 14, 2014 at 11:55:25AM +0100, Stefan Bader wrote:
> Ok, I think I now got a log of the actual issue. It seems cpu#1 missed out
> on handling a reschedule interrupt but did send one to cpu#0. On cpu#0,
> while handling that interrupt, the TIF flag was not set (yet?), but by the
> time it is set there is no further interrupt left to do the folding...
> (sorry for the wrapping)...
> 
>  qemu-system-x86-2503  [000] d.h.  1145.887872: default_send_IPI_mask_logical <-native_smp_send_reschedule
>  qemu-system-x86-2503  [000] d...  1145.887888: resched_task <-check_preempt_curr
>           <idle>-0     [000] d.h.  1145.888001: resched_task <-check_preempt_curr

>           <idle>-0     [001] dn..  1145.888788: smp_reschedule_interrupt <-reschedule_interrupt
>           <idle>-0     [001] dn..  1145.888789: scheduler_ipi <-smp_reschedule_interrupt

>   gnome-terminal-2440  [000] d...  1145.888957: resched_task <-check_preempt_curr
>   gnome-terminal-2440  [000] d...  1145.888959: native_smp_send_reschedule <-resched_task
>   gnome-terminal-2440  [000] d...  1145.888959: default_send_IPI_mask_logical <-native_smp_send_reschedule

>           <idle>-0     [001] dn..  1145.889296: smp_reschedule_interrupt <-reschedule_interrupt
>           <idle>-0     [001] dn..  1145.889297: scheduler_ipi <-smp_reschedule_interrupt

>   gnome-terminal-2440  [000] d...  1145.889465: resched_task <-check_preempt_curr
>   gnome-terminal-2440  [000] d...  1145.889467: native_smp_send_reschedule <-resched_task
>   gnome-terminal-2440  [000] d...  1145.889467: default_send_IPI_mask_logical <-native_smp_send_reschedule

>             Xorg-1054  [001] d.h.  1145.889469: resched_task <-check_preempt_wakeup
>             Xorg-1054  [001] d.h.  1145.889469: native_smp_send_reschedule <-resched_task
>             Xorg-1054  [001] d.h.  1145.889469: default_send_IPI_mask_logical <-native_smp_send_reschedule

>  qemu-system-x86-2503  [000] d.s.  1145.889643: smp_reschedule_interrupt <-reschedule_interrupt
>  qemu-system-x86-2503  [000] d.s.  1145.889643: scheduler_ipi <-smp_reschedule_interrupt
>  qemu-system-x86-2503  [000] .n..  1145.889647: kvm_arch_vcpu_ioctl_run: sched inconsistency
>  qemu-system-x86-2503  [000] .n..  1145.889662: kvm_arch_vcpu_ioctl_run: sched inconsistency
>  qemu-system-x86-2503  [000] .n..  1145.889664: kvm_arch_vcpu_ioctl_run: sched inconsistency
>  qemu-system-x86-2503  [000] .n..  1145.889668: kvm_arch_vcpu_ioctl_run: need(1) != should(0)
>  qemu-system-x86-2503  [000] .n..  1145.889669: kvm_arch_vcpu_ioctl_run: exit_reason=2
>  qemu-system-x86-2503  [000] .n..  1145.889669: kvm_arch_vcpu_ioctl_run: preempt_count=80000000
> 

So afaiu you're running this on an AMD fam 0xf, which doesn't have a
synchronized TSC; so again the timing might be off... however:

The first block: cpu 0 sends an IPI to cpu 1, and cpu 1 sees 'n'.

Then again, cpu 0 sends to cpu 1, and we observe 'n'.

So far so good, and the timing seems to reflect causality.

Then, however, cpu 0 sends, but nothing in this trace answers on cpu 1...
odd.

Then cpu 1 sends to cpu 0; cpu 0 does receive the IPI but does _not_
observe 'n', which is mighty odd. However, after the IPI is handled, cpu 0
magically gets the 'n'.



So one thing I did wonder about: if cpu 0 sends an IPI while cpu 1 is
already processing the same interrupt, cpu 1 should latch the interrupt,
and it would re-trigger when we re-enable interrupts after we're done
with the first.

The only thing that would ruin that is a late ack, e.g. acking (clearing
the latch) when exiting the interrupt; but I checked, and we ack before
we call scheduler_ipi().
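
For reference, the x86 handler acks first thing; roughly (quoting from
memory, so take this as an approximation of arch/x86/kernel/smp.c rather
than gospel):

void smp_reschedule_interrupt(struct pt_regs *regs)
{
	ack_APIC_irq();			/* clear the latch... */
	inc_irq_stat(irq_resched_count);
	scheduler_ipi();		/* ...before folding TIF_NEED_RESCHED */
	/*
	 * KVM uses this interrupt to force a cpu out of guest mode.
	 */
}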

Now, if this were to happen (though I don't exactly see how on a 2-CPU
system, nor do I see it in the trace), there might be a hole...

CPU 0				|	CPU 1
--------------------------------+--------------------------------
                                |
/* ! resched_task */		|
smp_send_reschedule(1)		|L
				|L  smp_reschedule_interrupt()
resched_task()			|L
  set_tsk_need_resched()	|L
  smp_mb()			|L
  smp_send_reschedule(1)	|L
				|L    ack_APIC_irq()
				|     scheduler_ipi()
				|       need_resched();

Now, strictly speaking, that need_resched() load could creep up before
the store, and we'd fail to observe 'n'.

However, x86 forbids reads from passing an earlier write to the same
location (it allows passing earlier writes in general). At which point
the only other option is GCC emitting that load far earlier, in fact
before ack_APIC_irq(), but that would be a violation of the C std
because scheduler_ipi() is a (real) function call and acts as a compiler
barrier AFAIK (as opposed to inlining which doesn't act as a compiler
barrier).
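
To spell out which store and which load I mean: the sending side is
resched_task() doing the TIF store, the barrier and then the IPI; the
receiving side is the need_resched() test in scheduler_ipi(). Roughly
(paraphrased, so take the exact code as an approximation of
kernel/sched/core.c around v3.13):

void resched_task(struct task_struct *p)
{
	int cpu;

	if (test_tsk_need_resched(p))
		return;

	set_tsk_need_resched(p);	/* the store... */

	cpu = task_cpu(p);
	if (cpu == smp_processor_id()) {
		set_preempt_need_resched();
		return;
	}

	smp_mb();			/* ...must be visible before the IPI */
	if (!tsk_is_polling(p))
		smp_send_reschedule(cpu);	/* receiver then loads TIF in scheduler_ipi() */
}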

So colour me confused...


You could try putting an smp_mb() at the very start of scheduler_ipi()
just to make absolutely sure, but it _should_ not matter.
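
Something like this, purely as a debugging hack rather than a proposed fix:

void scheduler_ipi(void)
{
	/*
	 * Debugging only: full barrier up front, to rule out the
	 * need_resched() load being satisfied before the remote
	 * set_tsk_need_resched() store is visible. Should be redundant.
	 */
	smp_mb();

	/* ... rest of scheduler_ipi() unchanged ... */
}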

Of course, I could be completely off base about how interrupts work and
there's a wide gaping hole in there; in which case, someone please
educate me ;-)
