Date:   Sat, 8 Oct 2016 12:26:14 -0400
From:   Rich Felker <dalias@...c.org>
To:     Thomas Gleixner <tglx@...utronix.de>
Cc:     Daniel Lezcano <daniel.lezcano@...aro.org>,
        devicetree@...r.kernel.org, linux-kernel@...r.kernel.org,
        linux-sh@...r.kernel.org, Rob Herring <robh+dt@...nel.org>,
        Mark Rutland <mark.rutland@....com>,
        "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
Subject: Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver

On Sat, Oct 08, 2016 at 01:32:06PM +0200, Thomas Gleixner wrote:
> On Fri, 7 Oct 2016, Rich Felker wrote:
> > 
> > If I'm not mistaken, the bug is in tick_nohz_restart. According to the
> 
> I think you are mistaken. Let's look at CPU0 only:

OK.

>           <idle>-0     [000] d...   150.829698: __tick_nohz_idle_enter: can stop idle tick
>           <idle>-0     [000] d...   150.829774: __tick_nohz_idle_enter: entering stop_sched_tick
>           <idle>-0     [000] d...   150.830001: tick_stop: success=1 dependency=NONE
>           <idle>-0     [000] d...   150.830127: hrtimer_cancel: hrtimer=109e949c
>           <idle>-0     [000] d...   150.830380: jcore_pit_set_next_event: pit 109eccb0 0 set delta 82124660 en=4410000 cntr=75 throt=82124660
>           <idle>-0     [000] d...   150.830498: hrtimer_start: hrtimer=109e949c function=tick_sched_timer expires=150840000000 softexpires=150840000000
>           <idle>-0     [000] d...   150.830720: jcore_pit_set_next_event: pit 109eccb0 0 set delta 539823 en=4410000 cntr=34 throt=539823
> 
> Tick is programmed to fire at 150.84
> 
>           <idle>-0     [000] d...   150.830822: __tick_nohz_idle_enter: returning from stop_sched_tick
>           <idle>-0     [000] d...   150.830940: rcu_dyntick: Start 140000000000000 0
>           <idle>-0     [000] d...   150.831072: rcu_dyntick: End 0 1
>           <idle>-0     [000] d...   150.831165: cpu_idle: state=0 cpu_id=0
>           <idle>-0     [000] d...   150.831269: rcu_dyntick: Start 1 0
>           <idle>-0     [000] ....   150.831368: cpu_startup_entry: enabled irqs, spinning
> 
> CPU spins and waits for an interrupt to happen
> 
> 
>           <idle>-0     [000] d...   150.841530: rcu_dyntick: End 0 1
> 
> Dropping out of the spin about the time we expect the PIT interrupt to
> fire. And an interrupt is the reason why we drop out because the 'need
> resched' flag is not set and we end up in:

OK, I missed that.

>           <idle>-0     [000] d.s.   150.841724: tick_irq_enter: update jiffies via irq
> 
> which is called from irq_enter()
> 
>           <idle>-0     [000] d.s.   150.841918: tick_do_update_jiffies64: finished do_timer(1)
>           <idle>-0     [000] d.s.   150.842348: tick_do_update_jiffies64: finished updating jiffies
> 
> 
> So here we would expect:
> 
>     	 irq_handler_entry: irq=16 name=jcore_pit
> 	 ...
> 	 irq_handler_exit .....
>     	 
> 
> or any other interrupt being invoked. But we see this here:

According to the trace, the 'irq' is soft ('s'). I couldn't find the
code path from the idle loop to softirq, so maybe this is a bug. Is
it possible that in some cases the arch irq entry does not get
identified as a hard irq, or doesn't get traced, and the subsequent
tick code then thinks it's running in a softirq and behaves
differently? I could add more tracing around irq entry.
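Something along these lines, roughly -- a sketch, not the actual
J-Core entry code (jcore_aic_handle_irq is an illustrative name), just
to log the preempt-count context right after irq_enter() and before
the handler and tick code run:

```c
/* Hypothetical tracing hack at the arch irq entry path: check whether
 * irq_enter() actually put us in hardirq context before the tick code
 * sees the interrupt.  Names here are illustrative, not driver code. */
static void jcore_aic_handle_irq(unsigned int irq, struct pt_regs *regs)
{
	irq_enter();
	trace_printk("arch irq entry: irq=%u in_irq=%lu in_softirq=%lu\n",
		     irq, in_irq(), in_softirq());
	generic_handle_irq(irq);
	irq_exit();
}
```

If in_irq() comes back 0 there, the hardirq context is being lost
somewhere between the vector entry and the generic irq code.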

>           <idle>-0     [000] d...   150.842603: __tick_nohz_idle_enter: can stop idle tick
> 
> And that's just wrong. 

Can you explain?

> Now looking at CPU1 at the same time:
> 
>           <idle>-0     [001] d.h.   150.841829: irq_handler_entry: irq=16 name=jcore_
>           <idle>-0     [001] d.h.   150.842054: hrtimer_cancel: hrtimer=109f449c
>           <idle>-0     [001] d.h.   150.842218: hrtimer_expire_entry: hrtimer=109f449c function=tick_sched_timer now=150840579940
> 
> So there the PIT interrupt enters and runs the interrupt handler.
> 
> Now looking deeper at the PIT interrupts:
> 
>      kworker/0:1-208   [000] d.h.   150.641822: irq_handler_entry: irq=16 name=jcore_pit
>      kworker/0:1-208   [000] d.h.   150.644928: irq_handler_exit: irq=16 ret=handled
> 
> 
>           <idle>-0     [001] dnh.   150.645370: irq_handler_entry: irq=16 name=jcore_pit
>           <idle>-0     [001] dnh.   150.647444: irq_handler_exit: irq=16 ret=handled
> 
> Both CPUs have the same interrupt number for their per cpu PIT interrupt.
> 
> IIRC you said earlier when the percpu interrupt discussion happened, that
> your per cpu PITs have distinct interrupt numbers, but that does not seem
> the case here.

No, I said the opposite. They use the same irq number but they're each
wired to their own cpu, and there's no way for them to fire on the
wrong cpu.
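For reference, that's the usual shape for per-cpu timer interrupts in
the genirq framework -- one shared irq number registered once with
request_percpu_irq(), with each CPU's line delivered only locally.
This is a sketch with illustrative names, not the actual driver code:

```c
/* Sketch: per-cpu PIT registration where both CPUs share one irq
 * number but each line is hardwired to its own CPU, so the handler
 * can only ever run on the CPU whose local PIT fired. */
static DEFINE_PER_CPU(struct jcore_pit, jcore_pit_percpu);

static int __init jcore_pit_irq_init(unsigned int pit_irq)
{
	/* One irq number for all CPUs; the percpu dev_id gives the
	 * handler each CPU's own PIT state. */
	return request_percpu_irq(pit_irq, jcore_timer_interrupt,
				  "jcore_pit", &jcore_pit_percpu);
}
```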

Rich
