Message-ID: <20170418212756.GC1452@cmadams.net>
Date:   Tue, 18 Apr 2017 16:27:56 -0500
From:   Chris Adams <linux@...dams.net>
To:     linux-kernel@...r.kernel.org
Subject: Latency in logical volume layer?

I am trying to figure out a storage latency issue I am seeing with oVirt
and iSCSI storage, and I am looking for a little help (or to be told
"you're doing it wrong" as usual).

I have an oVirt virtualization cluster running with 7 CentOS 7 servers,
a dedicated storage LAN (separate switches), and iSCSI multipath running
to a SAN.  Occasionally, at times when there's no apparent load spike or
anything, oVirt will report 5+ second latency accessing a storage
domain.  I can't see any network issue or problem at the SAN, so I
started looking at Linux.

oVirt reports this when it tries to read the storage domain metadata.
With iSCSI storage, oVirt accesses it via multipath and treats the whole
device as a PV for Linux LVM (no partitioning).  The metadata is a small
LV from which each node reads the first 4K every few seconds (using
O_DIRECT to avoid caching).  I wrote a Perl script to replicate this
access pattern (open with O_DIRECT, read the first 4K, close) and report
times.  I do see higher-than-expected latency: 50-200ms reads happen
fairly regularly.
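
Since the Perl script itself isn't attached, here is a rough C
equivalent of the same timing loop, for reference (the device path
below is just a placeholder; point it at the LV or the multipath PV as
appropriate):

/*
 * Open the device with O_DIRECT, read the first 4K into an aligned
 * buffer, close, report the elapsed time, and repeat every few seconds.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    /* placeholder path; pass the real LV or PV device as argv[1] */
    const char *dev = argc > 1 ? argv[1] : "/dev/mapper/example-lv";
    void *buf;

    /* O_DIRECT needs an aligned buffer; 4096 covers typical block sizes */
    if (posix_memalign(&buf, 4096, 4096) != 0) {
        perror("posix_memalign");
        return 1;
    }

    for (;;) {
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);

        int fd = open(dev, O_RDONLY | O_DIRECT);
        if (fd < 0) {
            perror("open");
            return 1;
        }
        if (read(fd, buf, 4096) < 0) {
            perror("read");
            close(fd);
            return 1;
        }
        close(fd);

        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ms = (t1.tv_sec - t0.tv_sec) * 1000.0 +
                    (t1.tv_nsec - t0.tv_nsec) / 1e6;
        printf("%.3f ms\n", ms);
        fflush(stdout);

        sleep(5);   /* roughly "every few seconds", like the metadata reads */
    }
}

Build with something like "gcc -O2 -o directread directread.c" and run
it against the device you want to measure.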

I added the same open/read/close test on the PV (the multipath device),
and I do not see the same latency there: it is a very consistent
0.25-0.55ms.  I put a host in maintenance mode and disabled multipath,
and I saw similar behavior (comparing reads from the raw SCSI device
and the LV device).

I am testing on a host with no VMs.  I do sometimes (not always) see
similar latency on multiple hosts simultaneously (the others are
running VMs).

That's where I'm lost - how does going up the stack from the multipath
device to the LV add so much latency (but not all the time)?

I recognize that the CentOS 7 kernel is not mainline, but I was hoping
that maybe somebody would say "that's a known thing", or "that's
expected", or "you're measuring wrong".

Any suggestions, places to look, etc.?  Thanks.
-- 
Chris Adams <linux@...dams.net>
