linux-kernel - Re: [PATCH v2 2/3] vmstat: skip periodic vmstat update for nohz full CPUs

Message-ID: <ZH33BI9//tAbLvz5@tpad>
Date:   Mon, 5 Jun 2023 11:53:56 -0300
From:   Marcelo Tosatti <mtosatti@...hat.com>
To:     Michal Hocko <mhocko@...e.com>
Cc:     Christoph Lameter <cl@...ux.com>,
        Aaron Tomlin <atomlin@...mlin.com>,
        Frederic Weisbecker <frederic@...nel.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        linux-kernel@...r.kernel.org, linux-mm@...ck.org,
        Vlastimil Babka <vbabka@...e.cz>
Subject: Re: [PATCH v2 2/3] vmstat: skip periodic vmstat update for nohz full
 CPUs

On Mon, Jun 05, 2023 at 09:55:57AM +0200, Michal Hocko wrote:
> On Fri 02-06-23 15:57:59, Marcelo Tosatti wrote:
> > The interruption caused by vmstat_update is undesirable
> > for certain applications:
> > 
> > oslat   1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000)
> > oslat   1094.456971: workqueue_queue_work: ... function=vmstat_update ...
> > oslat   1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ...
> > kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ...
> > 
> > The example above shows an additional 7us for the
> > 
> >        	oslat -> kworker -> oslat
> > 
> > switches. In the case of a virtualized CPU, and the vmstat_update  
> > interruption in the host (of a qemu-kvm vcpu), the latency penalty
> > observed in the guest is higher than 50us, violating the acceptable
> > latency threshold.
> 
> I personally find the above problem description insufficient. I have
> asked several times and only got piece by piece information each time.
> Maybe there is a reason to be secretive but it would be great to get at
> least some basic expectations described and what they are based on.

There is no reason to be secretive. 

> 
> E.g. workloads are running on isolated cpus with nohz full mode to
> shield off any kernel interruption. Yet there are operations that update
> counters (like mlock, but not mlock alone) that update per cpu counters
> that will eventually get flushed and that will cause some interference.
> Now the host/guest transition and interference. How does that happen when
> the guest is running on an isolated and dedicated cpu?

The updated changelog follows. Does it contain the requested
information?

----

Performance details for the kworker interruption:

Consider workloads running on isolated CPUs in nohz full mode, which is
meant to shield them from any kernel interruption. For example, a VM
running a time-sensitive application with a 50us maximum acceptable
interruption (use case: soft PLC).

oslat   1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000)
oslat   1094.456971: workqueue_queue_work: ... function=vmstat_update ...
oslat   1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ...
kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ...

The example above shows an additional 7us for the

        oslat -> kworker -> oslat

switches. In the case of a virtualized CPU, and the vmstat_update
interruption in the host (of a qemu-kvm vcpu), the latency penalty
observed in the guest is higher than 50us, violating the acceptable
latency threshold.
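
The 7us figure can be read off the trace timestamps. A minimal sketch in
Python, assuming the trace lines quoted above are the input (the helper
name timestamps_us is hypothetical, and the elided "..." fields are kept
as they appear):

```python
import re

# Trace lines copied from the changelog above (elided fields kept as "...").
TRACE = """\
oslat   1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000)
oslat   1094.456971: workqueue_queue_work: ... function=vmstat_update ...
oslat   1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ...
kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ...
"""

def timestamps_us(trace):
    """Return each line's timestamp as integer microseconds."""
    stamps = []
    for line in trace.splitlines():
        m = re.match(r"\S+\s+(\d+)\.(\d{6}):", line)
        if m:
            stamps.append(int(m.group(1)) * 1_000_000 + int(m.group(2)))
    return stamps

stamps = timestamps_us(TRACE)
# From queueing vmstat_update to the switch back to oslat.
interruption_us = stamps[-1] - stamps[1]
print(interruption_us)
```

The interval is measured from the workqueue_queue_work event to the
final sched_switch back to oslat.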

The isolated vCPU can perform operations that modify per-CPU page counters,
for example to complete I/O operations:

      CPU 11/KVM-9540    [001] dNh1.  2314.248584: mod_zone_page_state <-__folio_end_writeback
      CPU 11/KVM-9540    [001] dNh1.  2314.248585: <stack trace>
 => 0xffffffffc042b083
 => mod_zone_page_state
 => __folio_end_writeback
 => folio_end_writeback
 => iomap_finish_ioend
 => blk_mq_end_request_batch
 => nvme_irq
 => __handle_irq_event_percpu
 => handle_irq_event
 => handle_edge_irq
 => __common_interrupt
 => common_interrupt
 => asm_common_interrupt
 => vmx_do_interrupt_nmi_irqoff
 => vmx_handle_exit_irqoff
 => vcpu_enter_guest
 => vcpu_run
 => kvm_arch_vcpu_ioctl_run
 => kvm_vcpu_ioctl
 => __x64_sys_ioctl
 => do_syscall_64
 => entry_SYSCALL_64_after_hwframe
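
To illustrate why such updates leave work behind for vmstat_update, here
is a toy model in Python of a counter with per-CPU deltas. It is a sketch
only, not kernel code: the class ZoneCounter, its method names, and the
threshold value 8 are all assumptions made for illustration (the kernel
derives its threshold in calculate_normal_threshold).

```python
class ZoneCounter:
    """Toy model of one zone counter with per-CPU deltas (not kernel code)."""

    def __init__(self, nr_cpus, threshold):
        self.global_count = 0
        self.cpu_diff = [0] * nr_cpus
        self.threshold = threshold  # hypothetical value, chosen by the caller

    def mod(self, cpu, delta):
        # Accumulate locally; fold into the global count only once the
        # local delta exceeds the threshold.
        x = self.cpu_diff[cpu] + delta
        if abs(x) > self.threshold:
            self.global_count += x
            x = 0
        self.cpu_diff[cpu] = x

    def flush(self, cpu):
        # What a periodic per-CPU update would do: fold the leftover delta.
        self.global_count += self.cpu_diff[cpu]
        self.cpu_diff[cpu] = 0

    def read(self):
        # Cheap read: ignores deltas still pending on each CPU.
        return self.global_count

    def snapshot(self):
        # Precise read: global count plus every CPU's pending delta.
        return self.global_count + sum(self.cpu_diff)

zone = ZoneCounter(nr_cpus=4, threshold=8)
for _ in range(5):
    zone.mod(cpu=1, delta=1)   # isolated CPU, never flushed
for _ in range(10):
    zone.mod(cpu=0, delta=1)   # housekeeping CPU
zone.flush(0)                  # periodic update runs only on CPU 0
print(zone.read(), zone.snapshot())
```

In this model the five updates made on CPU 1 stay in its local delta, so
the cheap read and the snapshot differ by exactly that pending amount.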

> > Skip periodic updates for nohz full CPUs. Any callers who
> > need precise values should use a snapshot of the per-CPU
> > counters, or use the global counters with measures to 
> > handle errors up to thresholds (see calculate_normal_threshold).
> 
> I would rephrase this paragraph.
> In-kernel users of vmstat counters either require the precise value and
> use the zone_page_state_snapshot interface, or they can live with
> an imprecision, as the regular flushing can happen at an arbitrary time
> and the cumulative error can grow (see calculate_normal_threshold).

> From that POV the regular flushing can be postponed for CPUs that have
> been isolated from the kernel interference without critical
> infrastructure ever noticing. Skip regular flushing from vmstat_shepherd
> for all isolated CPUs to avoid interference with the isolated workload.
> 
> > Suggested by Michal Hocko.
> > 
> > Signed-off-by: Marcelo Tosatti <mtosatti@...hat.com>
> 
> Acked-by: Michal Hocko <mhocko@...e.com>

OK, updated comment, thanks.
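
For reference, the idea of skipping isolated CPUs in the shepherd can be
sketched as follows. This is a simplified Python model, not the patch
itself: the function cpus_to_flush and its arguments are hypothetical
names, and the CPU numbers are made up for the example.

```python
def cpus_to_flush(online_cpus, isolated_cpus, pending_diff):
    """Return the CPUs a shepherd pass would queue a flush for."""
    selected = []
    for cpu in online_cpus:
        if cpu in isolated_cpus:
            continue  # leave isolated CPUs undisturbed
        if pending_diff.get(cpu, 0) != 0:
            selected.append(cpu)
    return selected

queued = cpus_to_flush(
    online_cpus=[0, 1, 2, 3],
    isolated_cpus={1, 2},
    pending_diff={0: 3, 1: 5, 3: 0},
)
print(queued)
```

CPU 1 has a pending delta but is isolated, so it is skipped; a reader
that needs the precise value would still have to sum the per-CPU deltas.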