Message-ID: <20140327064158.GA17563@redhat.com>
Date:	Thu, 27 Mar 2014 08:41:58 +0200
From:	"Michael S. Tsirkin" <mst@...hat.com>
To:	Alejandro Comisario <alejandro.comisario@...cadolibre.com>
Cc:	Stefan Hajnoczi <stefanha@...il.com>, kvm@...r.kernel.org,
	linux-kernel@...r.kernel.org, Jason Wang <jasowang@...hat.com>,
	ghammer@...hat.com, qemu-devel@...gnu.org
Subject: Re: Massive read only kvm guests when backing file was missing

On Wed, Mar 26, 2014 at 11:08:03PM -0300, Alejandro Comisario wrote:
> Hi List!
> Hope someone can help me. We had a big issue in our cloud the other
> day: a couple of our OpenStack regions (+2000 KVM guests with qcow2)
> went read-only on the guest filesystem because the backing
> files directory (the OpenStack _base directory) was compromised and
> the data was lost. When we realized the data was lost, it took us 5
> minutes to restore a backup of the backing files, but by that time all
> the KVM guests had received some kind of I/O error from the hypervisor
> layer and gone read-only on the root filesystem.
> 
> My question would be: is there a way to hold the I/O operations against
> the backing files (I assume those would be 99% READ operations) for
> a little longer (I'm asking because I don't quite understand what
> the process is and when the error is raised) in the case where the backing
> files are missing (no I/O possible) but recoverable within minutes?
> 
> Any tip on how to achieve this, if possible, or information about how
> backing files work with KVM, would be amazing.
> Waiting for feedback!
> 
> Kindest regards,
> Alejandro Comisario


I'm guessing this is what happened: the guests timed out in the meantime.
You can increase the disk timeout within the guest:
echo 600 > /sys/block/sda/device/timeout
to time out after 10 minutes (600 seconds).
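If a guest has several disks, the same setting has to be applied per device. A minimal sketch, assuming standard SCSI sysfs paths and sd* device naming (adjust for your layout; run as root):

```shell
# Sketch: raise the I/O timeout to 600 s (10 minutes) on every sd* disk
# in the guest. Each SCSI block device exposes its own timeout file
# under /sys/block/<dev>/device/timeout.
for dev in /sys/block/sd*/device/timeout; do
    echo 600 > "$dev"
done
```

Note this is not persistent across reboots; a udev rule would be needed to make it permanent.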

If you have the qemu guest agent installed on your system, you can do this
from the host. Unfortunately, by default the agent's memory can be pushed out
to swap, and then, on a disk error, accessing it might fail :(
Maybe we should consider mlock()ing all of its memory, at least as an option.
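For reference, a hedged sketch of driving this from the host via the guest agent, assuming a qemu-ga build that provides the guest-exec command (not all versions do) and a hypothetical libvirt domain name "guest1":

```shell
# Sketch (assumptions: domain "guest1", qemu-ga with guest-exec support):
# ask the in-guest agent to run the timeout change on sda.
virsh qemu-agent-command guest1 \
  '{"execute":"guest-exec",
    "arguments":{"path":"/bin/sh",
                 "arg":["-c","echo 600 > /sys/block/sda/device/timeout"]}}'
```

As noted above, this only works while the agent itself can still be paged in, which is exactly what a disk error can prevent.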

You could pause your guests and restart them after the issue is resolved,
and I guess we could add functionality to pause a VM on disk errors
automatically.
Stefan?
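
The manual pause/resume can be done with libvirt; a sketch, with "guest1" as a hypothetical domain name:

```shell
# Sketch (domain name "guest1" is hypothetical): pause the guest while
# the backing files are unavailable, resume once they are restored.
virsh suspend guest1
# ... restore the backing files ...
virsh resume guest1
```

While suspended, the guest's vCPUs are stopped, so no I/O is issued and no in-guest timeout can fire.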


-- 
MST
