Message-ID: <78c613e8-b600-119e-0d33-b049dd7c35ce@redhat.com>
Date:   Fri, 26 Aug 2022 16:05:28 +1000
From:   Gavin Shan <gshan@...hat.com>
To:     Marc Zyngier <maz@...nel.org>, Peter Xu <peterx@...hat.com>
Cc:     kvmarm@...ts.cs.columbia.edu, linux-arm-kernel@...ts.infradead.org,
        kvm@...r.kernel.org, linux-kernel@...r.kernel.org,
        linux-doc@...r.kernel.org, linux-kselftest@...r.kernel.org,
        pbonzini@...hat.com, corbet@....net, james.morse@....com,
        alexandru.elisei@....com, suzuki.poulose@....com,
        oliver.upton@...ux.dev, catalin.marinas@....com, will@...nel.org,
        shuah@...nel.org, seanjc@...gle.com, dmatlack@...gle.com,
        bgardon@...gle.com, ricarkol@...gle.com, zhenyzha@...hat.com,
        shan.gavin@...il.com
Subject: Re: [PATCH v1 1/5] KVM: arm64: Enable ring-based dirty memory
 tracking

Hi Marc,

On 8/25/22 6:57 AM, Marc Zyngier wrote:
> On Wed, 24 Aug 2022 17:21:50 +0100,
> Peter Xu <peterx@...hat.com> wrote:
>>
>> On Wed, Aug 24, 2022 at 03:45:11PM +0100, Marc Zyngier wrote:
>>> On Wed, 24 Aug 2022 00:19:04 +0100,
>>> Peter Xu <peterx@...hat.com> wrote:
>>>>
>>>> On Tue, Aug 23, 2022 at 11:47:03PM +0100, Marc Zyngier wrote:
>>>>> Atomicity doesn't guarantee ordering, unfortunately.
>>>>
>>>> Right, sorry to be misleading.  By "atomicity" I was trying to say that
>>>> the kernel will always see consistent updates on the fields.
>>>>
>>>> The ordering should also be guaranteed, because things must happen in
>>>> the sequence below:
>>>>
>>>>    (1) kernel publish dirty GFN data (slot, offset)
>>>>    (2) kernel publish dirty GFN flag (set to DIRTY)
>>>>    (3) user sees DIRTY, collects (slot, offset)
>>>>    (4) user sets it to RESET
>>>>    (5) kernel reads RESET
>>>
>>> Maybe. Maybe not. The reset could well be sitting in the CPU write
>>> buffer for as long as it wants and not be seen by the kernel if the
>>> read occurs on another CPU. And that's the crucial bit: single-CPU is
>>> fine, but cross CPU isn't. Unfortunately, the userspace API is per-CPU
>>> on collection, and global on reset (this seems like a bad decision,
>>> but it is too late to fix this).
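
Let me make sure I follow the concern. With plain accesses, the harvesting
side would look roughly like the sketch below (flag and field names as I
understand them from the uAPI header; ring, size, next and collect() are
just placeholders for illustration):

    struct kvm_dirty_gfn {
            __u32 flags;    /* KVM_DIRTY_GFN_F_DIRTY / KVM_DIRTY_GFN_F_RESET */
            __u32 slot;
            __u64 offset;
    };

    /* userspace harvesting with plain (unordered) accesses */
    struct kvm_dirty_gfn *gfn;

    while ((gfn = &ring[next % size])->flags & KVM_DIRTY_GFN_F_DIRTY) {
            collect(gfn->slot, gfn->offset);        /* (3) */
            gfn->flags = KVM_DIRTY_GFN_F_RESET;     /* (4), plain store */
            next++;
    }

If KVM_RESET_DIRTY_RINGS is then issued from another thread, nothing orders
the kernel's read of gfn->flags against the plain store in (4), which I
think is exactly the cross-CPU case you're describing.
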
>>
>> Regarding the last statement, that's something I had questions about too
>> and discussed with Paolo, even though at that time it wasn't an outcome of
>> discussing memory ordering issues.
>>
>> IIUC the initial design was trying to avoid a TLB flush flood when the
>> vcpu count is large (a per-ring RESET, even for a single page, would need
>> all vcpus to flush, so O(N^2) flushes; with a global RESET it's O(N)).  So
>> it's kind of a trade-off, and even now I'm not sure which one is better.
>> E.g., with a per-ring reset we could also get locality in userspace: each
>> vcpu thread might be able to recycle its own ring without holding global
>> locks.
> 
> I don't get that. On x86, each CPU must perform the TLB invalidation
> (there is an IPI for that). So whether you do a per-CPU scan of the
> ring or a global scan is irrelevant: each entry you find in any of the
> rings must result in a global invalidation, since you've updated the
> PTE to make the page RO.
> 
> The same is true on ARM, except that the broadcast is done in HW
> instead of being tracked in SW.
> 
> But anyway, this is all moot. The API is what it is, and it isn't
> going to change any time soon. All we can do is add some
> clarifications to the API for the more relaxed architectures, and make
> sure the kernel behaves accordingly.
> 
> [...]
> 
>>> It may be safe, but it isn't what the userspace API promises.
>>
>> The document says:
>>
>>    After processing one or more entries in the ring buffer, userspace calls
>>    the VM ioctl KVM_RESET_DIRTY_RINGS to notify the kernel about it, so that
>>    the kernel will reprotect those collected GFNs.  Therefore, the ioctl
>>    must be called *before* reading the content of the dirty pages.
>>
>> I'd say it's not an explicit promise, but I agree with you that the
>> behavior is at least unclear.
> 
> This is the least problematic part of the documentation. The bit I
> literally choke on is this:
> 
> <quote>
> It's not necessary for userspace to harvest the all dirty GFNs at once.
> However it must collect the dirty GFNs in sequence, i.e., the userspace
> program cannot skip one dirty GFN to collect the one next to it.
> </quote>
> 
> This is the core of the issue. Without ordering rules, the consumer on
> the other side cannot observe the updates correctly, even if userspace
> follows the above to the letter. Of course, the kernel itself must do
> the right thing (I guess it amounts to the kernel doing a
> load-acquire, and userspace doing a store-release -- effectively
> emulating x86...).
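
Just to make sure I pick the right primitives for the next revision: is
the pairing below roughly what you have in mind?  (Only a sketch;
reprotect_and_reuse() and collect() are placeholders, and I'm using the
GCC __atomic builtins on the userspace side purely for illustration.)

    /* kernel side: publish the data, then the flag, with release semantics */
    gfn->slot   = slot;                                         /* (1) */
    gfn->offset = offset;                                       /* (1) */
    smp_store_release(&gfn->flags, KVM_DIRTY_GFN_F_DIRTY);      /* (2) */

    /* kernel side: only reprotect/reuse an entry after an acquire read */
    if (smp_load_acquire(&gfn->flags) & KVM_DIRTY_GFN_F_RESET)  /* (5) */
            reprotect_and_reuse(gfn);

    /* userspace side */
    if (__atomic_load_n(&gfn->flags, __ATOMIC_ACQUIRE) & KVM_DIRTY_GFN_F_DIRTY) {
            collect(gfn->slot, gfn->offset);                    /* (3) */
            __atomic_store_n(&gfn->flags, KVM_DIRTY_GFN_F_RESET,
                             __ATOMIC_RELEASE);                 /* (4) */
    }
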
> 
>> Since we have a global recycle mechanism, most likely the app (e.g. the
>> current QEMU impl) will use the same thread to collect/reset dirty GFNs
>> and to trigger the RESET ioctl().  In that case it's safe, IIUC, because
>> there are no cross-core operations.
>>
>> QEMU even guarantees this by checking it (kvm_dirty_ring_reap_locked):
>>
>>      if (total) {
>>          ret = kvm_vm_ioctl(s, KVM_RESET_DIRTY_RINGS);
>>          assert(ret == total);
>>      }
>>
>> I think the assert() should never trigger, as mentioned above.  But
>> ideally maybe it should just loop until the number of cleared GFNs matches
>> total.
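
In case it's useful, the loop version could look something like this
(untested sketch on top of the existing QEMU helper; error handling is
only indicative):

    if (total) {
        do {
            ret = kvm_vm_ioctl(s, KVM_RESET_DIRTY_RINGS);
            if (ret <= 0) {
                break;          /* error, or no forward progress */
            }
            total -= ret;
        } while (total > 0);
    }

i.e. keep calling KVM_RESET_DIRTY_RINGS until the cleared GFNs add up to
the collected total, bailing out if the ioctl stops making progress.
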
> 
> Right. If userspace calls the ioctl on every vcpu, then things should
> work correctly. It is only that the overhead is higher than what it
> should be if multiple vcpus perform a reset at the same time.
> 
>>
>>> In other words, without further straightening of the API, this doesn't
>>> work as expected on relaxed memory architectures. So before this gets
>>> enabled on arm64, this whole ordering issue must be addressed.
>>
>> How about adding some more documentation for KVM_RESET_DIRTY_RINGS on the
>> possibility of recycling only part of the pages, especially when collection
>> and the ioctl() aren't done from the same thread?
> 
> I'd rather tell people about the ordering rules. That will come at
> zero cost on x86.
> 
>> Any suggestions will be greatly welcomed.
> 
> I'll write a couple of patches when I get the time, most likely next
> week. Gavin will hopefully be able to take them as part of his series.
> 

Thanks, Marc. Please let me know where I can check out the patches
when they're ready. I can include them in the next revision of this
series :)

Thanks,
Gavin

