linux-kernel - Re: [Xen-devel] [PATCH v5 1/1] xen/time: do not decrease steal time after live migration on xen

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <bf45fadd-950d-edf4-164b-8b23dbbad89a@oracle.com>
Date:   Tue, 31 Oct 2017 08:14:56 +0800
From:   Dongli Zhang <dongli.zhang@...cle.com>
To:     Boris Ostrovsky <boris.ostrovsky@...cle.com>,
        xen-devel@...ts.xenproject.org, linux-kernel@...r.kernel.org
Cc:     jgross@...e.com, joao.m.martins@...cle.com
Subject: Re: [Xen-devel] [PATCH v5 1/1] xen/time: do not decrease steal time
 after live migration on xen

Hi Boris,

On 10/30/2017 09:34 PM, Boris Ostrovsky wrote:
> On 10/30/2017 04:03 AM, Dongli Zhang wrote:
>> After guest live migration on xen, steal time in /proc/stat
>> (cpustat[CPUTIME_STEAL]) might decrease because steal returned by
>> xen_steal_lock() might be less than this_rq()->prev_steal_time which is
>> derived from previous return value of xen_steal_clock().
>>
>> For instance, steal time of each vcpu is 335 before live migration.
>>
>> cpu  198 0 368 200064 1962 0 0 1340 0 0
>> cpu0 38 0 81 50063 492 0 0 335 0 0
>> cpu1 65 0 97 49763 634 0 0 335 0 0
>> cpu2 38 0 81 50098 462 0 0 335 0 0
>> cpu3 56 0 107 50138 374 0 0 335 0 0
>>
>> After live migration, steal time is reduced to 312.
>>
>> cpu  200 0 370 200330 1971 0 0 1248 0 0
>> cpu0 38 0 82 50123 500 0 0 312 0 0
>> cpu1 65 0 97 49832 634 0 0 312 0 0
>> cpu2 39 0 82 50167 462 0 0 312 0 0
>> cpu3 56 0 107 50207 374 0 0 312 0 0
>>
>> Since runstate times are cumulative and cleared during xen live migration
>> by xen hypervisor, the idea of this patch is to accumulate runstate times
>> to global percpu variables before live migration suspend. Once guest VM is
>> resumed, xen_get_runstate_snapshot_cpu() would always return the sum of new
>> runstate times and previously accumulated times stored in global percpu
>> variables.
>>
>> Similar and more severe issue would impact prior linux 4.8-4.10 as
>> discussed by Michael Las at
>> https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest,
>> which would overflow steal time and lead to 100% st usage in top command
>> for linux 4.8-4.10. A backport of this patch would fix that issue.
>>
>> References: https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest
>> Signed-off-by: Dongli Zhang <dongli.zhang@...cle.com>
>>
>> ---
>> Changed since v1:
>>   * relocate modification to xen_get_runstate_snapshot_cpu
>>
>> Changed since v2:
>>   * accumulate runstate times before live migration
>>
>> Changed since v3:
>>   * do not accumulate times in the case of guest checkpointing
>>
>> Changed since v4:
>>   * allocate array of vcpu_runstate_info to reduce number of memory allocation
>>
>> ---
>>  drivers/xen/manage.c         |  2 ++
>>  drivers/xen/time.c           | 68 ++++++++++++++++++++++++++++++++++++++++++--
>>  include/xen/interface/vcpu.h |  2 ++
>>  include/xen/xen-ops.h        |  1 +
>>  4 files changed, 71 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/xen/manage.c b/drivers/xen/manage.c
>> index c425d03..3dc085d 100644
>> --- a/drivers/xen/manage.c
>> +++ b/drivers/xen/manage.c
>> @@ -72,6 +72,7 @@ static int xen_suspend(void *data)
>>  	}
>>  
>>  	gnttab_suspend();
>> +	xen_accumulate_runstate_time(-1);
>>  	xen_arch_pre_suspend();
>>  
>>  	/*
>> @@ -84,6 +85,7 @@ static int xen_suspend(void *data)
>>                                             : 0);
>>  
>>  	xen_arch_post_suspend(si->cancelled);
>> +	xen_accumulate_runstate_time(si->cancelled);
> 
> I am not convinced that the comment above HYPERVISOR_suspend() is
> correct. The call can return an error code and so if it returns -EPERM
> (which AFAICS it can't now but might in the future) then
> xen_accumulate_runstate_time() will do wrong thing.

I would split xen_accumulate_runstate_time() into two functions to avoid the
-EPERM issue, as one is for saving and another is for accumulation, respectively.

Otherwise, can you use xen_accumulate_runstate_time(2) for saving before suspend
and xen_accumulate_runstate_time(si->cancelled) after resume?

> 
> 
>>  	gnttab_resume();
>>  
>>  	if (!si->cancelled) {
>> diff --git a/drivers/xen/time.c b/drivers/xen/time.c
>> index ac5f23f..cf3afb9 100644
>> --- a/drivers/xen/time.c
>> +++ b/drivers/xen/time.c
>> @@ -19,6 +19,9 @@
>>  /* runstate info updated by Xen */
>>  static DEFINE_PER_CPU(struct vcpu_runstate_info, xen_runstate);
>>  
>> +static DEFINE_PER_CPU(u64[RUNSTATE_max], old_runstate_time);
>> +static struct vcpu_runstate_info *runstate_delta;
> 
> I'd move this inside xen_accumulate_runstate_time() since that's the

If we split xen_accumulate_runstate_time() into two functions, we would leave
runstate_delta as global static.

> only function that uses it. And why does it need to be
> vcpu_runstate_info and not u64[4]?

This was suggested by Juergen to avoid the allocation and reclaim of the second
dimensional array as in v4 of this patch?

Or would you like to allocate sizeof(u64[4]) * num_possible_cpus() and emulate
the 2d array with this 1d array and move the pointer forward sizeof(u64[4]) in
each iteration?

> 
>> +
>>  /* return an consistent snapshot of 64-bit time/counter value */
>>  static u64 get64(const u64 *p)
>>  {
>> @@ -47,8 +50,8 @@ static u64 get64(const u64 *p)
>>  	return ret;
>>  }
>>  
>> -static void xen_get_runstate_snapshot_cpu(struct vcpu_runstate_info *res,
>> -					  unsigned int cpu)
>> +static void xen_get_runstate_snapshot_cpu_delta(
>> +			struct vcpu_runstate_info *res, unsigned int cpu)
>>  {
>>  	u64 state_time;
>>  	struct vcpu_runstate_info *state;
>> @@ -66,6 +69,67 @@ static void xen_get_runstate_snapshot_cpu(struct vcpu_runstate_info *res,
>>  		 (state_time & XEN_RUNSTATE_UPDATE));
>>  }
>>  
>> +static void xen_get_runstate_snapshot_cpu(struct vcpu_runstate_info *res,
>> +					  unsigned int cpu)
>> +{
>> +	int i;
>> +
>> +	xen_get_runstate_snapshot_cpu_delta(res, cpu);
>> +
>> +	for (i = 0; i < RUNSTATE_max; i++)
>> +		res->time[i] += per_cpu(old_runstate_time, cpu)[i];
>> +}
>> +
>> +void xen_accumulate_runstate_time(int action)
>> +{
>> +	struct vcpu_runstate_info state;
>> +	int cpu, i;
>> +
>> +	switch (action) {
>> +	case -1: /* backup runstate time before suspend */
>> +		WARN_ON_ONCE(unlikely(runstate_delta));
> 
> pr_warn_once(), to be consistent with the rest of the file. And then
> should you return if this is true?

I would prefer to not return if it is true but just warn the administrator that
there is memory leakage issue while leaving runstate accumulation works normally.

> 
>> +
>> +		runstate_delta = kcalloc(num_possible_cpus(),
>> +					 sizeof(*runstate_delta),
>> +					 GFP_KERNEL);
>> +		if (unlikely(!runstate_delta)) {
>> +			pr_alert("%s: failed to allocate runstate_delta\n",
>> +				    __func__);
> 
> pr_warn() should be sufficient. Below too.
> 
> Also, as a side question --- can we do kmalloc() at this point?

Yes. kmalloc_array() is better than kcalloc, unless we have 2 dimensional array
and we need to guarantee the value of first dimension is always 0.

> 
>> +			return;
>> +		}
>> +
>> +		for_each_possible_cpu(cpu) {
>> +			xen_get_runstate_snapshot_cpu_delta(&state, cpu);
>> +			memcpy(runstate_delta[cpu].time, state.time,
>> +			      RUNSTATE_max * sizeof(*runstate_delta[cpu].time));
> 
> sizeof(runstate_delta[cpu].time).
> 
>> +		}
>> +
>> +		break;
>> +
>> +	case 0: /* backup runstate time after resume */
>> +		if (unlikely(!runstate_delta)) {
>> +			pr_alert("%s: cannot accumulate runstate time as runstate_delta is NULL\n",
>> +				    __func__);
>> +			return;
>> +		}
>> +
>> +		for_each_possible_cpu(cpu) {
>> +			for (i = 0; i < RUNSTATE_max; i++)
>> +				per_cpu(old_runstate_time, cpu)[i] +=
>> +					runstate_delta[cpu].time[i];
>> +		}
>> +		break;
>> +
>> +	default: /* do not accumulate runstate time for checkpointing */
>> +		break;
>> +	}
>> +
>> +	if (action != -1 && runstate_delta) {
>> +		kfree(runstate_delta);
>> +		runstate_delta = NULL;
>> +	}
>> +}
>> +
>>  /*
>>   * Runstate accounting
>>   */
>> diff --git a/include/xen/interface/vcpu.h b/include/xen/interface/vcpu.h
>> index 98188c8..85e81ce 100644
>> --- a/include/xen/interface/vcpu.h
>> +++ b/include/xen/interface/vcpu.h
>> @@ -110,6 +110,8 @@ DEFINE_GUEST_HANDLE_STRUCT(vcpu_runstate_info);
>>   */
>>  #define RUNSTATE_offline  3
>>  
>> +#define RUNSTATE_max      4
> 
> This file is part of Xen ABI. While this macro technically doesn't
> change anything I'd rather have those updates first appear in Xen code.
> 
> Besides, this change leaves vcpu_runstate_info.time[4] definition
> intact. I think all RUNSTATE_* macros would need to be moved above
> vcpu_runstate_info definition.
> 
> TBH, for purposes of this patch I'd use 4.
> 
> 
> -boris
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@...ts.xen.org
> https://lists.xen.org/xen-devel
> 

Thank you very much for your suggestions!

Dongli Zhang