[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <d0b42e38-c956-4140-926f-64fad527fbd1@linux.microsoft.com>
Date: Thu, 12 Sep 2024 21:07:16 +0530
From: Naman Jain <namjain@...ux.microsoft.com>
To: Michael Kelley <mhklinux@...look.com>,
"K . Y . Srinivasan" <kys@...rosoft.com>,
Haiyang Zhang <haiyangz@...rosoft.com>, Wei Liu <wei.liu@...nel.org>,
Dexuan Cui <decui@...rosoft.com>, Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
Dave Hansen <dave.hansen@...ux.intel.com>, "x86@...nel.org"
<x86@...nel.org>, "H . Peter Anvin" <hpa@...or.com>,
Daniel Lezcano <daniel.lezcano@...aro.org>
Cc: "linux-hyperv@...r.kernel.org" <linux-hyperv@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"stable@...r.kernel.org" <stable@...r.kernel.org>
Subject: Re: [PATCH v2] clocksource: hyper-v: Fix hv tsc page based
sched_clock for hibernation
On 9/12/2024 8:51 PM, Michael Kelley wrote:
> From: Naman Jain <namjain@...ux.microsoft.com> Sent: Wednesday, September 11, 2024 9:51 PM
>>
>> On 9/12/2024 9:09 AM, Michael Kelley wrote:
>>> From: Naman Jain <namjain@...ux.microsoft.com> Sent: Tuesday, September 10,
>> 2024 9:57 PM
>>>>
>>>
>>> This version of the patch looks good to me from the standpoint of
>>> separating the x86 specific functionality from the arch independent
>>> functionality. And I think the patch works as intended. But there
>>> are parts of the description and variable naming that don't align
>>> with my understanding of the problem and the fix. So I've added
>>> some additional comments below.
>>>
>>> Nit: Now that most of the code changes are in mshyperv.c, the
>>> patch Subject: prefix should perhaps be "x86/hyperv:" instead
>>> of "clocksource: hyperv:".
>>
>> Thanks a lot for reviewing Michael. As you rightly pointed out, these
>> comments and variable names made more sense when they were in
>> hyperv_timer.c. I'll change them accordingly in next patch.
>>
>> Will change commit msg subject as well.
>>
>>>
>>>> read_hv_sched_clock_tsc() assumes that the Hyper-V clock counter is
>>>> bigger than the variable hv_sched_clock_offset, which is cached during
>>>> early boot, but depending on the timing this assumption may be false
>>>> when a hibernated VM starts again (the clock counter starts from 0
>>>> again) and is resuming back (Note: hv_init_tsc_clocksource() is not
>>>> called during hibernation/resume); consequently,
>>>> read_hv_sched_clock_tsc() may return a negative integer (which is
>>>> interpreted as a huge positive integer since the return type is u64)
>>>> and new kernel messages are prefixed with huge timestamps before
>>>> read_hv_sched_clock_tsc() grows big enough (which typically takes
>>>> several seconds).
>>>
>>> Just so I'm clear on the sequence, when a new VM is created to
>>> resume the hibernated VM, I think the following happens:
>>>
>>> 1) The VM being used to resume the hibernation image boots a
>>> fresh instance of the Linux kernel. The sched clock and sched clock
>>> offset value are initialized as with any kernel, and kernel messages
>>> are printed with the correct timestamps starting at zero.
>>>
>>> 2) The new Linux kernel then loads the hibernation image and
>>> transfers control to it, whereupon the "resume" callbacks are run
>>> in the context of the hibernation image. At this point, any kernel
>>> timestamps are wrong, and might even be negative, because the
>>> sched clock value is calculated based on the new Hyper-V reference
>>> time (which started again at zero) minus the old sched clock offset.
>>> The goal is that the sched clock value should be continuous with
>>> the sched clock value from the original VM. If the original VM
>>> had been running for 1000 seconds when the hibernation was
>>> done, the sched clock value in the resumed hibernation image
>>> should continue, starting at ~1000 seconds.
>>>
>>> 3) The fix is to adjust the sched clock offset in the resumed
>>> hibernation image, and make it more negative by that ~1000
>>> seconds.
>>>
>>> Is that all correct? If so, then it seems like this patch is doing
>>> more than just cleaning up the negative values for sched clock.
>>> It's also making the sched clock values continuous with the
>>> sched clock values in the original VM rather than restarting
>>> near zero after hibernation image is resumed.
>>>
>>
>> Yes, that's exactly what this patch is trying to do. There was an option
>> to correct in suspend-resume callbacks of original VM in hyperv_timer.c,
>> but these are executed very late, and we still end up getting many logs
>> with these incorrect timestamps. We took reference from the code where
>> tsc clock correction takes place, and thought that similar should be
>> done here.
>
> Yes, agreed. I'm glad the mechanism for the TSC clock correction
> is available to use. :-)
Yes.
"arch/x86/kernel/tsc.c"
void tsc_save_sched_clock_state(void)
/*
* Even on processors with invariant TSC, TSC gets reset in some the
* ACPI system sleep states. And in some systems BIOS seem to reinit TSC to
* arbitrary value (still sync'd across cpu's) during resume from such
sleep
* states. To cope up with this, recompute the cyc2ns_offset for each
cpu so
* that sched_clock() continues from the point where it was left off during
* suspend.
*/
void tsc_restore_sched_clock_state(void)
called from "arch/x86/power/cpu.c" in
__restore_processor_state() -> x86_platform.restore_sched_clock_state();
>
>>
>> We can tweak the commit msg to add this additional detail.
>
> Thanks. I think the change to make the sched clock time (and
> therefore the dmesg log timestamps) continuous with the
> original VM is important to call out. It's a change that will be
> visible to users.
>
> [snip]
>
>>>> +/*
>>>> + * Hyper-V clock counter resets during hibernation. Save and restore clock
>>>> + * offset during suspend/resume, while also considering the time passed
>>>> + * before suspend. This is to make sure that sched_clock using hv tsc page
>>>> + * based clocksource, proceeds from where it left off during suspend and
>>>> + * it shows correct time for the timestamps of kernel messages after resume.
>>>> + */
>>
>> I added it here, but the same should be added in commit msg as well.
>> I'll add it.
>
> Ah, indeed, you did add it in the comment. But yes, it should go in
> the commit message as well.
>
>>
>>>> +static void save_hv_clock_tsc_state(void)
>>>> +{
>>>> + hv_sched_clock_offset_saved = hv_read_reference_counter();
>>>
>>> Naming this variable hv_sched_clock_offset_saved doesn't seem to match
>>> what it actually contains. The saved value is not a sched_clock_offset. It's
>>> the value of the Hyper-V reference counter at the time the original VM
>>> hibernates does "suspend". The sched_clock_offset in the original VM will
>>> typically be a pretty small value (a few seconds or even less). But the
>>> Hyper-V reference counter value might be thousands of seconds if the
>>> VM has been running a while before it hibernates.
>>
>> I'll change it to something that conveys the right information. Thanks
>> for the suggestion.
>>
>>>
>>>> +}
>>>> +
>>>> +static void restore_hv_clock_tsc_state(void)
>>>> +{
>>>> + /*
>>>> + * hv_sched_clock_offset = offset that is used by hyperv_timer clocksource driver
>>>> + * to get time.
>>>> + * Time passed before suspend = hv_sched_clock_offset_saved
>>>> + * - hv_sched_clock_offset (old)
>>>> + *
>>>> + * After Hyper-V clock counter resets, hv_sched_clock_offset needs a correction.
>>>> + *
>>>> + * New time = hv_read_reference_counter() (future) - hv_sched_clock_offset (new)
>>>> + * New time = Time passed before suspend + hv_read_reference_counter() (future)
>>>> + * - hv_read_reference_counter() (now)
>>>> + *
>>>> + * Solving the above two equations gives:
>>>> + *
>>>> + * hv_sched_clock_offset (new) = hv_sched_clock_offset (old)
>>>> + * - hv_sched_clock_offset_saved
>>>> + * + hv_read_reference_counter() (now))
>>>> + */
>>>> + hv_adj_sched_clock_offset(hv_sched_clock_offset_saved - hv_read_reference_counter());
>>>
>>> The argument passed to hv_adj_sched_clock_offset() makes sense to me if I think
>>> of it as:
>>>
>>> hv_ref_time_at_hibernate - hv_read_reference_counter()
>>>
>>> where hv_read_reference_counter() is just "ref time now".
>>>
>>> I think of it like this: The Hyper-V reference counter value changed underneath
>>> the resumed hibernation image when it starts running in the new VM. The adjustment
>>> changes the sched clock offset to compensate for that change so that sched clock
>>> values are continuous across the suspend/resume hibernation sequence.
>>>
>>> I don't completely understand what you've explained with the two equations and
>>> solving them, though the result matches my expectations.
>>
>> Yeah :) it made more sense when we look at it from hyperv_timer.c driver
>> POV because these offsets are nothing but reference counters at
>> different points of time.
>
> Well, yes and no. The value of the Hyper-V reference counter is
> an absolute value at the time it is read. But the hv_sched_clock_offset
> is a "delta" value -- the difference between two absolute time values.
> While hv_sched_clock_offset is initially set to the current value of the
> Hyper-V reference counter, it is used as a delta value. And after
> the adjustment is applied when resuming from hibernation,
> hv_sched_clock_offset is definitely no longer a reference counter
> value from some point in time -- it's clearly only a delta value.
>
Details matter :) Thanks for your feedback. I'll wait for a few days
before posting v3, in case there are some more review comments.
>> Having said that, I think we can go with a
>> comment explaining the intention, and skip adding these equations which
>> may be confusing here as there is no concept of offsets here, as you
>> rightly pointed out in your previous reply as well.
>
> Works for me.
>
> Thanks,
>
> Michael
Regards,
Naman
Powered by blists - more mailing lists