linux-kernel - Re: [PATCH v2 2/3] vmstat: skip periodic vmstat update for nohz full CPUs

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <ZH2VDaF9uODTqAfV@dhcp22.suse.cz>
Date:   Mon, 5 Jun 2023 09:55:57 +0200
From:   Michal Hocko <mhocko@...e.com>
To:     Marcelo Tosatti <mtosatti@...hat.com>
Cc:     Christoph Lameter <cl@...ux.com>,
        Aaron Tomlin <atomlin@...mlin.com>,
        Frederic Weisbecker <frederic@...nel.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        linux-kernel@...r.kernel.org, linux-mm@...ck.org,
        Vlastimil Babka <vbabka@...e.cz>
Subject: Re: [PATCH v2 2/3] vmstat: skip periodic vmstat update for nohz full
 CPUs

On Fri 02-06-23 15:57:59, Marcelo Tosatti wrote:
> The interruption caused by vmstat_update is undesirable 
> for certain aplications:
> 
> oslat   1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000)
> oslat   1094.456971: workqueue_queue_work: ... function=vmstat_update ...
> oslat   1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ...
> kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ...
> 
> The example above shows an additional 7us for the
> 
>        	oslat -> kworker -> oslat
> 
> switches. In the case of a virtualized CPU, and the vmstat_update  
> interruption in the host (of a qemu-kvm vcpu), the latency penalty
> observed in the guest is higher than 50us, violating the acceptable
> latency threshold.

I personally find the above problem description insufficient. I have
asked several times and only got piece by piece information each time.
Maybe there is a reason to be secretive but it would be great to get at
least some basic expectations described  and what they are based on.

E.g. workloads are running on isolated cpus with nohz full mode to
shield off any kernel interruption. Yet there are operations that update
counters (like mlock, but not mlock alone) that update per cpu counters
that will eventually get flushed and that will cause some interference.
Now the host/guest transition and intereference. How that happens when
the guest is running on an isolated and dedicated cpu?

> Skip periodic updates for nohz full CPUs. Any callers who
> need precise values should use a snapshot of the per-CPU
> counters, or use the global counters with measures to 
> handle errors up to thresholds (see calculate_normal_threshold).

I would rephrase this paragraph. 
In kernel users of vmstat counters either require the precise value and
they are using zone_page_state_snapshot interface or they can live with
an imprecision as the regular flushing can happen at arbitrary time and
cumulative error can grow (see calculate_normal_threshold).

>From that POV the regular flushing can be postponed for CPUs that have
been isolated from the kernel interference withtout critical
infrastructure ever noticing. Skip regular flushing from vmstat_shepherd
for all isolated CPUs to avoid interference with the isolated workload.

> Suggested by Michal Hocko.
> 
> Signed-off-by: Marcelo Tosatti <mtosatti@...hat.com>

Acked-by: Michal Hocko <mhocko@...e.com>

> 
> ---
> 
> v2: use cpu_is_isolated		(Michal Hocko)
> 
> Index: linux-vmstat-remote/mm/vmstat.c
> ===================================================================
> --- linux-vmstat-remote.orig/mm/vmstat.c
> +++ linux-vmstat-remote/mm/vmstat.c
> @@ -28,6 +28,7 @@
>  #include <linux/mm_inline.h>
>  #include <linux/page_ext.h>
>  #include <linux/page_owner.h>
> +#include <linux/sched/isolation.h>
>  
>  #include "internal.h"
>  
> @@ -2022,6 +2023,16 @@ static void vmstat_shepherd(struct work_
>  	for_each_online_cpu(cpu) {
>  		struct delayed_work *dw = &per_cpu(vmstat_work, cpu);
>  
> +		/*
> +		 * Skip periodic updates for isolated CPUs.
> +		 * Any callers who need precise values should use
> +		 * a snapshot of the per-CPU counters, or use the global
> +		 * counters with measures to handle errors up to
> +		 * thresholds (see calculate_normal_threshold).
> +		 */
> +		if (cpu_is_isolated(cpu))
> +			continue;
> +
>  		if (!delayed_work_pending(dw) && need_update(cpu))
>  			queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);
>  
> 

-- 
Michal Hocko
SUSE Labs