[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <20180419013200.wxkzqfdacfsijci5@ideak-desk.fi.intel.com>
Date: Thu, 19 Apr 2018 04:32:00 +0300
From: Imre Deak <imre.deak@...el.com>
To: Thomas Gleixner <tglx@...utronix.de>, linux-kernel@...r.kernel.org
Subject: Early timeouts due to inaccurate jiffies during system suspend/resume
Hi,
while checking bug [1], I noticed that jiffies based timing loops like
expire = jiffies + timeout + 1;
while (!time_after(jiffies, expire))
do_something;
can last shorter than expected (that is less than timeout). This happens
at least on an Intel Geminilake system when running the timing loop from
a driver's system suspend and resume hooks. To me it looks like expire
above is calculated with a stale jiffies value at the beginning and then
jiffies is updated - corresponding to the actual current time - with a >1
increment in the middle of the loop causing an early expiry.
With the following simplified testcase ran from a driver's suspend/resume
hooks
cpu = raw_smp_processor_id();
cpu_clock_start = cpu_clock(cpu); // 1.
jiffies_start = jiffies; // 2.
usleep_range(200, 200);
jiffies_end = jiffies; // 3.
cpu_clock_end = cpu_clock(cpu); // 4.
jiffies_delta = jiffies_end - jiffies_start;
cpu_clock_delta = cpu_clock_end - cpu_clock_start;
if (jiffies_to_nsecs(jiffies_delta) >
cpu_clock_delta + jiffies_to_nsecs(1))
pr_info("cpu %d jiffies-delta %llu ns (%llu->%llu) cpu_clock-delta %llu ns (%llu -> %llu)\n",
cpu,
jiffies_to_nsecs(jiffies_delta), jiffies_start, jiffies_end,
cpu_clock_delta, cpu_clock_start, cpu_clock_end);
and doing suspend/resume to mem cycles in a loop, I can trigger the
following:
[ 42.415713] cpu 1 jiffies-delta 11000000 ns (4294709700->4294709711) cpu_clock-delta 215738 ns (42415489466 -> 42415705204)
So according to jiffies the delay was 11ms while according to cpu_clock()
it was ~216usec. I have CONFIG_HZ=1000, so AFAIU - due to the ordering of
1.,2.,3.,4. - cpu_clock-delta should be bigger than jiffies-delta
minus 1ms.
Are the above timing loops/assumptions incorrect?
After some ftracing it seems like jiffies gets stale due to a missed
LAPIC timer interrupt after the interrupt is armed in
lapic_next_deadline() and before jiffies is sampled at 2. above.
Eventually the interrupt does get delivered, at which point jiffies gets
updated via tick_do_update_jiffies64() with a >1 ticks increment.
Between lapic_next_deadline() and the - late - delivery of the interrupt
the CPU on which the interrupt is armed doesn't go idle.
Booting with nolapic_timer I couldn't yet trigger the problem.
I'm still trying to do a bisect, without high hopes, since triggering
the problem can take rather long and I suspect this could also be some
HW issue.
Any idea what could go wrong or how to debug this further?
I attached my dmesg.
Thanks,
Imre
[1] https://bugs.freedesktop.org/show_bug.cgi?id=105771
View attachment "dmesg.txt" of type "text/plain" (76276 bytes)
Powered by blists - more mailing lists