[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20230406105529.GB392176@hirez.programming.kicks-ass.net>
Date: Thu, 6 Apr 2023 12:55:29 +0200
From: Peter Zijlstra <peterz@...radead.org>
To: Sebastian Andrzej Siewior <bigeasy@...utronix.de>
Cc: linux-kernel@...r.kernel.org, Ingo Molnar <mingo@...hat.com>,
Pavel Tatashin <pasha.tatashin@...cle.com>,
Thomas Gleixner <tglx@...utronix.de>
Subject: Re: [RFC] tick_sched_timer() is not properly aligned, fixed by chance
On Thu, Apr 06, 2023 at 11:57:35AM +0200, Sebastian Andrzej Siewior wrote:
> With HIGHRES enabled tick_sched_timer() is programmed every jiffy to
> expire the timer_list timers. This timer is programmed accurate in
> respect to CLOCK_MONOTONIC so that 0 is the first tick and the next one
> is 1000/CONFIG_HZ ms later. For HZ=250 every 4 ms and so the us/ns part
> of the timer is always 0. This can be observed by enabling hrtimer_start
> events and looking for the tick_sched_timer timer:
>
> | <idle>-0 [009] 137.085041: hrtimer_start: hrtimer=000000002df4f2bc function=tick_sched_timer expires=162784000000 softexpires=162784000000 mode=0xa
> | <idle>-0 [026] 137.085300: hrtimer_start: hrtimer=00000000b0911b5d function=tick_sched_timer expires=316384000000 softexpires=316384000000 mode=0xa
>
> The lower part (us, ns) is zero. This is important for certain work
> loads where it is needed to either align with the timer or hide after
> it.
>
> With commit
> 857baa87b6422 ("sched/clock: Enable sched clock early")
>
> merged into v4.19-rc1 this 0 offset is gone. The problematic part is the
> last hunk of the patch:
>
> |@@ -356,7 +374,7 @@ u64 sched_clock_cpu(int cpu)
> | return sched_clock() + __sched_clock_offset;
> |
> | if (unlikely(!sched_clock_running))
> |- return 0ull;
> |+ return sched_clock();
> |
> | preempt_disable_notrace();
> | scd = cpu_sdc(cpu);
>
> which returns the sched_clock() without any offset. The gain of this is
> the high resolution output of the bootlog very early, so we see something
> like this:
> |[ 0.000000] tsc: Detected 1995.083 MHz processor
> |[ 0.000893] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
> |[ 0.008179] e820: remove [mem 0x000a0000-0x000fffff] usable
> |[ 0.014395] last_pfn = 0xa40000 max_arch_pfn = 0x400000000
> |[ 0.020667] x86/PAT: Configuration [0-7]: WB WC UC- UC WB WP UC- WT
> |[ 0.029822] last_pfn = 0xbe000 max_arch_pfn = 0x400000000
> |[ 0.050770] found SMP MP-table at [mem 0x000fd2a0-0x000fd2af]
> |[ 0.057015] Kernel/User page tables isolation: disabled on command line.
> |[ 0.064452] Using GB pages for direct mapping
> |[ 0.070337] RAMDISK: [mem 0x32d23000-0x35688fff]
> |[ 0.075298] ACPI: Early table checksum verification disabled
> …
> |[ 1.662577] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
> |[ 1.687317] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x398412f243e, max_idle_ns: 881590620736 ns
> |[ 1.699071] Calibrating delay loop (skipped), value calculated using timer frequency.. 3990.16 BogoMIPS (lpj=7980332)
> |[ 1.703070] pid_max: default: 32768 minimum: 301
> |[ 1.707238] LSM: Security Framework initializing
>
> but the tick_sched_timer() is no longer properly aligned:
> | <idle>-0 [002] 131.286274: hrtimer_start: hrtimer=000000000b8c5637 function=tick_sched_timer expires=130680348871 softexpires=130680348871 mode=0xa
> | <idle>-0 [006] 131.286275: hrtimer_start: hrtimer=00000000e08369e4 function=tick_sched_timer expires=317880348871 softexpires=317880348871 mode=0xa
But how ?!? Why does sched_clock() affect the tick_sched_timer() offset?
/me reads on
> instead of 0 we have "348871" here. I consider this as a bug given that
> this important.
>
> Then PeterZ had other problems and committed
> 776f22913b8e5 ("sched/clock: Make local_clock() noinstr")
>
> which is part of v6.3-rc1. The important part here is (again) the last
> hunk, the local_clock() implementation:
>
> |noinstr u64 local_clock(void)
> |{
> | u64 clock;
> |
> | if (static_branch_likely(&__sched_clock_stable))
> | return sched_clock() + __sched_clock_offset;
> |
> | preempt_disable_notrace();
> | clock = sched_clock_local(this_scd());
> | preempt_enable_notrace();
> |
> | return clock;
> |}
And this morning I found:
https://lore.kernel.org/all/20230406040847.569970-1-dev@aaront.org/T/#u
that 'fixes' that :-)
> The tick_sched_timer story is that it uses ktime_get() to set the
But but but, ktime_get() does not use sched_clock(), it has it's own TSC
reader.
> initial expire time and then always incremented. If the returned value
> is jiffy-based then it will work as expected. If it has higher precision
> then the time has to be adjusted since it is not expected.
>
> So, what do we do?
> - We would need to provide stable patches for v4.19…v6.1 to address the
> timer issue.
>
> - The high precision timestamps during boot are gone. Is it important to
> bring that back? Or could that new local_clock() implementation be
> backported?
>
> - An alternative is to ensure that the time returned by ktime_get() in
> tick_setup_device() is aligned with zeros.
The last one I think, let me try and figure out how sched_clock()
behaviour makes a different at all -- this is not making sense.
Powered by blists - more mailing lists