Message-ID: <68a4a96b-9c66-6509-e75d-b1bea6cd55d1@redhat.com>
Date: Tue, 24 May 2022 20:59:09 +0200
From: David Hildenbrand <david@...hat.com>
To: zhenwei pi <pizhenwei@...edance.com>, akpm@...ux-foundation.org,
naoya.horiguchi@....com, mst@...hat.com
Cc: linux-mm@...ck.org, linux-kernel@...r.kernel.org,
jasowang@...hat.com, virtualization@...ts.linux-foundation.org,
pbonzini@...hat.com, peterx@...hat.com, qemu-devel@...gnu.org
Subject: Re: [PATCH 0/3] recover hardware corrupted page by virtio balloon
On 20.05.22 09:06, zhenwei pi wrote:
> Hi,
>
> I'm trying to recover hardware-corrupted pages through the virtio balloon. The
> workflow of this feature looks like this:
>
>  Guest                5.MF -> 6.RVQ FE       10.Unpoison page
>                        /          \              /
> ----------------------+------------+------------+-----------
>                       |            |            |
>                     4.MCE       7.RVQ BE    9.RVQ Event
>  QEMU                /              \          /
>               3.SIGBUS               8.Remap
>                 /
> ---------------+--------------------------------------------
>                |
>             +--2.MF
>  Host      /
>      1.HW error
>
> 1, A hardware page error occurs at random.
> 2, The host handles the corrupted page through the Memory Failure mechanism and
> sends SIGBUS to the user process if early-kill is enabled.
> 3, QEMU handles SIGBUS; if the address belongs to guest RAM, then:
> 4, QEMU tries to inject an MCE into the guest.
> 5, The guest handles the memory failure again.
>
> Steps 1-5 have been supported for a long time; the next steps are added by
> this patch set (together with the related driver patch):
>
> 6, The guest balloon driver is notified of the corrupted PFN and sends a
> request to the host through the Recover VQ FrontEnd.
> 7, QEMU handles the request in the Recover VQ BackEnd, then:
> 8, QEMU remaps the corrupted HVA to fix the memory failure, then:
> 9, QEMU acks the result to the guest through the Recover VQ.
> 10, The guest unpoisons the page if the corrupted page was recovered
> successfully.
>
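Steps 6-10 above amount to a small request/ack exchange. The following is a minimal userspace sketch of that exchange, not the driver code from the patch set. The names recover_msg, recover_status, guest_make_request, host_handle_request and guest_should_unpoison are hypothetical, and the message layout is an assumption (it is not the layout of 'struct virtio_balloon_recover', which the cover letter says is still undefined).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical message: NOT the layout of 'struct virtio_balloon_recover'. */
enum recover_status { RECOVER_PENDING, RECOVER_OK, RECOVER_FAILED };

struct recover_msg {
    uint64_t pfn;               /* guest PFN reported as corrupted */
    enum recover_status status; /* filled in by the host side */
};

/* Step 6: the guest front end builds a request for a poisoned PFN. */
static struct recover_msg guest_make_request(uint64_t pfn)
{
    struct recover_msg msg = { .pfn = pfn, .status = RECOVER_PENDING };
    return msg;
}

/*
 * Steps 7-9: the host back end tries to remap the backing memory and
 * acknowledges the result. 'remap_ok' stands in for the real remap outcome.
 */
static void host_handle_request(struct recover_msg *msg, bool remap_ok)
{
    msg->status = remap_ok ? RECOVER_OK : RECOVER_FAILED;
}

/* Step 10: the guest unpoisons the page only after a successful ack. */
static bool guest_should_unpoison(const struct recover_msg *msg)
{
    return msg->status == RECOVER_OK;
}
```

The point of the sketch is that the guest keeps the page poisoned while the request is pending or after a failed ack, and only a successful ack from the host leads to step 10.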
> Test:
> This patch set can be tested with QEMU (also under development):
> https://github.com/pizhenwei/qemu/tree/balloon-recover
>
> Emulate an MCE with QEMU (normal guest RAM pages only; hugepages are not supported):
> virsh qemu-monitor-command vm --hmp mce 0 9 0xbd000000000000c0 0xd 0x61646678 0x8c
>
> The guest works fine (on an Intel Platinum 8260):
> mce: [Hardware Error]: Machine check events logged
> Memory failure: 0x61646: recovery action for dirty LRU page: Recovered
> virtio_balloon virtio5: recovered pfn 0x61646
> Unpoison: Unpoisoned page 0x61646 by virtio-balloon
> MCE: Killing stress:24502 due to hardware memory corruption fault at 7f5be2e5a010
>
> The 'HardwareCorrupted' field in /proc/meminfo also shows 0 kB.
>
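The PFN 0x61646 in the log above appears to follow from the address 0x61646678 passed to the mce command. A minimal sketch of that arithmetic, assuming 4 KiB base pages and assuming that 0x61646678 is the guest physical address argument (the helper names are hypothetical):

```c
#include <stdint.h>

#define PAGE_SHIFT_4K 12 /* assumption: 4 KiB base pages, as on x86-64 */

/* Convert a guest physical address to its page frame number. */
static uint64_t addr_to_pfn(uint64_t addr)
{
    return addr >> PAGE_SHIFT_4K;
}

/* Byte offset of the address inside its 4 KiB page. */
static uint64_t addr_to_page_offset(uint64_t addr)
{
    return addr & ((1ULL << PAGE_SHIFT_4K) - 1);
}
```

With these helpers, addr_to_pfn(0x61646678) yields 0x61646, matching the "recovered pfn 0x61646" line, and the offset within the page is 0x678.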
> The protocol of the virtio balloon recover VQ is not yet defined and is
> still under development:
> - 'struct virtio_balloon_recover' defines the structure used to exchange
> messages between guest and host.
> - '__le32 corrupted_pages' in struct virtio_balloon_config is intended for
> the next step:
> 1, A VM uses RAM backed by 2M huge pages. Once an MCE occurs, the whole 2M
> becomes inaccessible. By reporting 512 * 4K 'corrupted_pages' to the guest,
> the guest has a chance to isolate the 512 pages ahead of time.
>
> 2, After migrating to another host, the corrupted pages are actually recovered.
> Once the guest reads 'corrupted_pages' as 0, it can unpoison all the
> poisoned pages recorded in the balloon driver.
>
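The "512 * 4K" figure in case 1 is just 2M / 4K. A small sketch of how the affected 4K page range could be derived from one faulting address, assuming the guest physical range backed by the huge page is 2M-aligned (the struct and function names are hypothetical and not part of the patch set):

```c
#include <stdint.h>

#define SZ_4K 4096ULL
#define SZ_2M (2ULL * 1024 * 1024)

/* Hypothetical helper type: a run of 4K page frames. */
struct pfn_range {
    uint64_t first_pfn; /* first 4K PFN covered by the huge page */
    uint64_t nr_pages;  /* number of 4K pages in the huge page */
};

/*
 * Given an address inside a 2M huge page, return the 4K PFNs that
 * become inaccessible together with it.
 */
static struct pfn_range hugepage_pfn_range(uint64_t addr)
{
    struct pfn_range r;
    uint64_t base = addr & ~(SZ_2M - 1); /* round down to 2M boundary */

    r.first_pfn = base / SZ_4K;
    r.nr_pages = SZ_2M / SZ_4K;          /* 512 */
    return r;
}
```

For the address used in the test above, 0x61646678, this would give 512 pages starting at PFN 0x61600, if that address were backed by a 2M huge page.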
Hi,
I'm still on vacation this week; I'll try to have a look when I'm back
(and have flushed out my overflowing inbox :D).
--
Thanks,
David / dhildenb