Message-ID: <1935251498.52851607.1532091766703.JavaMail.zimbra@redhat.com>
Date: Fri, 20 Jul 2018 09:02:46 -0400 (EDT)
From: Pankaj Gupta <pagupta@...hat.com>
To: David Hildenbrand <david@...hat.com>
Cc: Stefan Hajnoczi <stefanha@...hat.com>,
Luiz Capitulino <lcapitulino@...hat.com>, kwolf@...hat.com,
dan j williams <dan.j.williams@...el.com>, jack@...e.cz,
xiaoguangrong eric <xiaoguangrong.eric@...il.com>,
kvm@...r.kernel.org, riel@...riel.com, linux-nvdimm@...1.01.org,
ross zwisler <ross.zwisler@...el.com>,
linux-kernel@...r.kernel.org, qemu-devel@...gnu.org,
hch@...radead.org, imammedo@...hat.com, mst@...hat.com,
niteshnarayanlal@...mail.com, pbonzini@...hat.com, nilal@...hat.com
Subject: Re: [Qemu-devel] [RFC v3] qemu: Add virtio pmem device
> >>>> /*
> >>>> * virtio-balloon-pci: This extends VirtioPCIProxy.
> >>>> */
> >>>> diff --git a/hw/virtio/virtio-pmem.c b/hw/virtio/virtio-pmem.c
> >>>> new file mode 100644
> >>>> index 0000000000..08c96d7e80
> >>>> --- /dev/null
> >>>> +++ b/hw/virtio/virtio-pmem.c
> >>>> @@ -0,0 +1,241 @@
> >>>> +/*
> >>>> + * Virtio pmem device
> >>>> + *
> >>>> + * Copyright (C) 2018 Red Hat, Inc.
> >>>> + * Copyright (C) 2018 Pankaj Gupta <pagupta@...hat.com>
> >>>> + *
> >>>> + * This work is licensed under the terms of the GNU GPL, version 2.
> >>>> + * See the COPYING file in the top-level directory.
> >>>> + *
> >>>> + */
> >>>> +
> >>>> +#include "qemu/osdep.h"
> >>>> +#include "qapi/error.h"
> >>>> +#include "qemu-common.h"
> >>>> +#include "qemu/error-report.h"
> >>>> +#include "hw/virtio/virtio-access.h"
> >>>> +#include "hw/virtio/virtio-pmem.h"
> >>>> +#include "hw/mem/memory-device.h"
> >>>> +#include "block/aio.h"
> >>>> +#include "block/thread-pool.h"
> >>>> +
> >>>> +typedef struct VirtIOPMEMresp {
> >>>> + int ret;
> >>>> +} VirtIOPMEMResp;
> >>>> +
> >>>> +typedef struct VirtIODeviceRequest {
> >>>> + VirtQueueElement elem;
> >>>> + int fd;
> >>>> + VirtIOPMEM *pmem;
> >>>> + VirtIOPMEMResp resp;
> >>>> +} VirtIODeviceRequest;
> >>>> +
> >>>> +static int worker_cb(void *opaque)
> >>>> +{
> >>>> + VirtIODeviceRequest *req = opaque;
> >>>> + int err = 0;
> >>>> +
> >>>> + /* flush raw backing image */
> >>>> + err = fsync(req->fd);
> >>>> + if (err != 0) {
> >>>> + err = errno;
> >>>> + }
> >>>> + req->resp.ret = err;
> >>>
> >>> Host question: are you returning the guest errno code to the host?
> >>
> >> No. I am returning error code from the host in-case of host fsync
> >> failure, otherwise returning zero.
> >
> > I think that's what Luiz meant. errno constants are not portable
> > between operating systems and architectures. Therefore they cannot be
> > used in external interfaces in software that expects to communicate with
> > other systems.
> >
> > It will be necessary to define specific constants for virtio-pmem
> > instead of passing errno from the host to guest.
> >
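A minimal sketch of device-specific status codes could look like the following. The
VIRTIO_PMEM_RESP_TYPE_* names are only illustrative (not part of this RFC), and it
reuses the VirtIODeviceRequest type from the patch above:

#define VIRTIO_PMEM_RESP_TYPE_OK    0
#define VIRTIO_PMEM_RESP_TYPE_EIO   1

static int worker_cb(void *opaque)
{
    VirtIODeviceRequest *req = opaque;

    /* flush the raw backing image and map the host result to a
     * device-defined status code instead of a host errno value */
    req->resp.ret = (fsync(req->fd) == 0) ? VIRTIO_PMEM_RESP_TYPE_OK
                                          : VIRTIO_PMEM_RESP_TYPE_EIO;
    return 0;
}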
>
> In general, I wonder if we should report errors at all or rather *kill*
> the guest. That might sound harsh, but think about the following scenario:
>
> fsync() fails due to some block that cannot e.g. be written (e.g.
> network connection failed). What happens if our guest tries to
> read/write that mmaped block? (e.g. network connection failed).
>
> I assume we'll get a signal and get killed? So we are trying to optimize
> one special case (fsync()) although every read/write is prone to kill
> the guest. And as soon as the guest will try to access the block that
> made fsync fail, we will crash the guest either way.
>
> I assume the main problem is that we are trying to take a file (with all
> the errors that can happen during read/write/fsync) and make it look
> like memory (dax). On ordinary block access, we can forward errors, but
> not if it's memory (maybe using MCE, but it's complicated and
> architecture specific).

There are two points which you highlighted:

1] Memory hardware errors:
These types of errors are reported by MCA. When the hardware detects an uncorrectable
error, KVM gets SIGBUS and injects an mce into the guest vCPU. If the guest cannot
recover, it can decide to kill the user-space process (see the SIGBUS handler sketch
below).
The default mce tolerance level is '1':
1: panic or SIGBUS on uncorrected errors, log corrected errors

2] read/write/fsync failure because of e.g. a network connection failure:
I assume you are talking about something like an NFS mount, where read/write/fsync
handling is taken care of by NFS. This can happen to any application accessing a
network filesystem; it will get an appropriate error back or wait. Until 'fsync' is
performed there is no guarantee that the data in RAM is backed by storage. I think it
is the responsibility of the application to perform fsync after a write operation or
a transaction (see the write + fsync sketch below).
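For reference, a simplified stand-alone sketch (not QEMU's actual handler) of how a
host process can tell the two SIGBUS cases apart before deciding to inject an mce
into the vCPU or kill the process:

#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void sigbus_handler(int sig, siginfo_t *info, void *ctx)
{
    (void)sig; (void)ctx;

    if (info->si_code == BUS_MCEERR_AR) {
        /* action required: the faulting access hit a poisoned page;
         * a hypervisor would inject an mce into the vCPU here */
        fprintf(stderr, "uncorrected memory error at %p\n", info->si_addr);
        _exit(EXIT_FAILURE);
    } else if (info->si_code == BUS_MCEERR_AO) {
        /* action optional: error reported asynchronously, just log it */
        fprintf(stderr, "async memory error reported at %p\n", info->si_addr);
    }
}

int main(void)
{
    struct sigaction act = {
        .sa_sigaction = sigbus_handler,
        .sa_flags     = SA_SIGINFO,
    };

    sigaction(SIGBUS, &act, NULL);
    /* ... mmap the backing file and run the workload ... */
    pause();
    return 0;
}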
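On the application side the expected pattern is simply write + fsync, e.g. (the path
below is just an example):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    const char buf[] = "transaction data";
    int fd = open("/mnt/pmem0/data", O_WRONLY | O_CREAT, 0644);

    if (fd < 0) {
        perror("open");
        return 1;
    }
    if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
        perror("write");
        return 1;
    }
    /* only after fsync() returns success is the data guaranteed to be
     * durable; a failure here must be handled by the application */
    if (fsync(fd) != 0) {
        perror("fsync");
        return 1;
    }
    close(fd);
    return 0;
}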
>
> So I wonder if we should rather assume that our backend file is placed
> on some stable storage that cannot easily fail.
>
> (we might have the same problem with NVDIMM right now, at least the
> memory reading/writing part)

NVDIMM NFIT registers an mce notifier and checks whether any SPA falls in the range
of mce->addr. It creates a list of bad blocks (corresponding to the nd_region), which
is consulted in function 'pmem_do_bvec', used by 'pmem_make_request' & 'pmem_rw_page':

void nfit_mce_register(void)
{
        mce_register_decode_chain(&nfit_mce_dec);
}
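
As a rough self-contained illustration of that idea (not the actual kernel code), the
I/O path consults the region's bad block list before touching the media and returns
-EIO instead of letting the access hit a poisoned page:

#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct badblock { unsigned long sector; unsigned int len; };

/* illustrative bad block list for one region; for real NVDIMMs it is
 * populated from the mce notifications decoded by the NFIT driver */
static const struct badblock badblocks[] = { { 128, 8 } };

static bool range_is_bad(unsigned long sector, unsigned int len)
{
    for (size_t i = 0; i < sizeof(badblocks) / sizeof(badblocks[0]); i++) {
        if (sector < badblocks[i].sector + badblocks[i].len &&
            badblocks[i].sector < sector + len) {
            return true;
        }
    }
    return false;
}

/* simplified read path: refuse to touch known-bad media */
static int pmem_read(void *dst, const void *pmem_base,
                     unsigned long sector, unsigned int len)
{
    if (range_is_bad(sector, len)) {
        return -EIO;
    }
    memcpy(dst, (const char *)pmem_base + sector * 512, (size_t)len * 512);
    return 0;
}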
In 'fake DAX' we bypass NFIT/ACPI and register the memory region through virtio & the
nvdimm_bus. By default a poisoned access should kill the user-space process or, at
worst, cause a guest reboot.
I am thinking about how we can integrate the NFIT bad block handling with the mce
handler approach for fake DAX. I think we can do this, but I would like input from
the NVDIMM folks.
Thanks,
Pankaj
>
> It's complicated and I am not a block level expert :)