Message-ID: <14c2c6d3-00fc-1507-9dd3-c25605717d3d@redhat.com>
Date: Wed, 25 Dec 2019 11:23:23 +0800
From: Jason Wang <jasowang@...hat.com>
To: Peter Xu <peterx@...hat.com>
Cc: kvm@...r.kernel.org, linux-kernel@...r.kernel.org,
"Dr . David Alan Gilbert" <dgilbert@...hat.com>,
Christophe de Dinechin <dinechin@...hat.com>,
Sean Christopherson <sean.j.christopherson@...el.com>,
Paolo Bonzini <pbonzini@...hat.com>,
"Michael S . Tsirkin" <mst@...hat.com>,
Vitaly Kuznetsov <vkuznets@...hat.com>,
Lei Cao <lei.cao@...atus.com>
Subject: Re: [PATCH RESEND v2 08/17] KVM: X86: Implement ring-based dirty
memory tracking
On 2019/12/24 11:08 PM, Peter Xu wrote:
> On Tue, Dec 24, 2019 at 02:16:04PM +0800, Jason Wang wrote:
>>> +struct kvm_dirty_ring {
>>> +	u32 dirty_index;
>>
>> Does this always equal indices->avail_index?
> Yes, but here we keep dirty_index as the internal copy, so we never
> need to worry about illegal userspace writes to avail_index (the kernel
> never reads it back).
I get you, but I'm not sure it's worth the bother. We hit a similar issue
in virtio: the used_idx is not expected to be written by userspace, and we
simply add checks.
But anyway, I'm fine if you want to keep it (maybe with a comment to
explain).
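
(For illustration, the kind of check meant here is just a bounds test on
the untrusted, userspace-writable index before using it; the names below
are hypothetical placeholders, not actual vhost code:)

	/* Sketch: never trust an index field that userspace can write. */
	u16 idx = READ_ONCE(shared->used_idx);
	if ((u16)(idx - priv->last_used_idx) > ring_size)
		return -EINVAL;	/* userspace wrote a bogus index */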
>
>>
>>> +	u32 reset_index;
>>> +	u32 size;
>>> +	u32 soft_limit;
>>> +	struct kvm_dirty_gfn *dirty_gfns;
>>> +	struct kvm_dirty_ring_indices *indices;
>>
>> Any reason to keep the dirty gfns and indices in different places? I guess it is
>> because you want to map dirty_gfns as a read-only page, but I couldn't find such
>> code...
> That's a good point! We should actually map the dirty gfns as read
> only. I've added the check, something like this:
>
> static int kvm_vcpu_mmap(struct file *file, struct vm_area_struct *vma)
> {
> 	struct kvm_vcpu *vcpu = file->private_data;
> 	unsigned long pages = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
>
> 	/* Refuse to map any page of the dirty ring writable */
> 	if ((kvm_page_in_dirty_ring(vcpu->kvm, vma->vm_pgoff) ||
> 	     kvm_page_in_dirty_ring(vcpu->kvm, vma->vm_pgoff + pages - 1)) &&
> 	    vma->vm_flags & VM_WRITE)
> 		return -EINVAL;
>
> 	vma->vm_ops = &kvm_vcpu_vm_ops;
> 	return 0;
> }
>
> I also changed the test code to cover this case.
>
> [...]
Looks good.
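
(For reference, the userspace side of such a test could look roughly like
the sketch below; vcpu_fd, ring_size and dirty_ring_page_offset are
hypothetical placeholders, not the actual selftest code:)

	/* A writable mapping of the dirty ring should now be rejected. */
	void *ring = mmap(NULL, ring_size, PROT_READ | PROT_WRITE, MAP_SHARED,
			  vcpu_fd, dirty_ring_page_offset);
	TEST_ASSERT(ring == MAP_FAILED && errno == EINVAL,
		    "writable mmap of the dirty ring must fail");

	/* A read-only mapping should still succeed. */
	ring = mmap(NULL, ring_size, PROT_READ, MAP_SHARED,
		    vcpu_fd, dirty_ring_page_offset);
	TEST_ASSERT(ring != MAP_FAILED, "read-only mmap of the dirty ring");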
>
>>> +struct kvm_dirty_ring_indices {
>>> +	__u32 avail_index; /* set by kernel */
>>> +	__u32 fetch_index; /* set by userspace */
>>
>> Would it be better to make those two cacheline aligned?
> Yes, Paolo should have mentioned that but I must have missed it! I
> hope I didn't miss anything else.
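
(As a sketch of what that could look like: since this is a UAPI struct, one
option is to pad each index out to its own line instead of using a
kernel-only alignment macro; 64-byte cachelines assumed below:)

	struct kvm_dirty_ring_indices {
		__u32 avail_index;	/* set by kernel */
		__u8  padding1[60];	/* keep the two indices on separate cachelines */
		__u32 fetch_index;	/* set by userspace */
		__u8  padding2[60];
	};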
>
> [...]
>
>>> +int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring)
>>> +{
>>> +	u32 cur_slot, next_slot;
>>> +	u64 cur_offset, next_offset;
>>> +	unsigned long mask;
>>> +	u32 fetch;
>>> +	int count = 0;
>>> +	struct kvm_dirty_gfn *entry;
>>> +	struct kvm_dirty_ring_indices *indices = ring->indices;
>>> +	bool first_round = true;
>>> +
>>> +	fetch = READ_ONCE(indices->fetch_index);
>>> +
>>> +	/*
>>> +	 * Note that fetch_index is written by userspace and must not be
>>> +	 * trusted.  If this check fires, userspace has most likely
>>> +	 * written a bogus fetch_index.
>>> +	 */
>>> +	if (fetch - ring->reset_index > ring->size)
>>> +		return -EINVAL;
>>> +
>>> +	if (fetch == ring->reset_index)
>>> +		return 0;
>>> +
>>> +	/* This is only needed to make compilers happy */
>>> +	cur_slot = cur_offset = mask = 0;
>>> +	while (ring->reset_index != fetch) {
>>> +		entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
>>> +		next_slot = READ_ONCE(entry->slot);
>>> +		next_offset = READ_ONCE(entry->offset);
>>> +		ring->reset_index++;
>>> +		count++;
>>> +		/*
>>> +		 * Try to coalesce the reset operations when the guest is
>>> +		 * scanning pages in the same slot.
>>> +		 */
>>> +		if (!first_round && next_slot == cur_slot) {
>>
>> initialize cur_slot to -1 then we can drop first_round here?
> cur_slot is unsigned.  We could force cur_slot to be s64, but maybe we
> can also simply keep first_round, whose name makes the intent clear.
>
> [...]
Sure.
>
>>> +int kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset)
>>> +{
>>> +	struct kvm_dirty_gfn *entry;
>>> +	struct kvm_dirty_ring_indices *indices = ring->indices;
>>> +
>>> +	/*
>>> +	 * Note: without a vcpu context we start waiting (return -EBUSY)
>>> +	 * already when the ring is only soft full, because we can't risk
>>> +	 * making it completely full: vcpu0 could use the ring right after
>>> +	 * us, and if it finds the ring completely full it could deadlock
>>> +	 * if it has to wait with the mmu_lock held.
>>> +	 */
>>> +	if (kvm_get_running_vcpu() == NULL &&
>>> +	    kvm_dirty_ring_soft_full(ring))
>>> +		return -EBUSY;
>>> +
>>> +	/* It should never get completely full when we have a vcpu context */
>>> +	WARN_ON_ONCE(kvm_dirty_ring_full(ring));
>>> +
>>> +	entry = &ring->dirty_gfns[ring->dirty_index & (ring->size - 1)];
>>> +	entry->slot = slot;
>>> +	entry->offset = offset;
>>> +	smp_wmb();
>>
>> Better to add a comment to explain this barrier, e.g. what it pairs with.
> Will do.
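
(The comment would essentially document a pairing like the one sketched
below; the consumer side is hypothetical userspace code, not part of this
patch:)

	/* Producer (kernel), as in kvm_dirty_ring_push(): */
	entry->slot = slot;
	entry->offset = offset;
	smp_wmb();	/* publish the entry before advancing avail_index */
	WRITE_ONCE(indices->avail_index, ring->dirty_index);

	/* Consumer (userspace), pairs with the smp_wmb() above: */
	avail = READ_ONCE(indices->avail_index);
	smp_rmb();	/* read the index before the entries it covers */
	slot = READ_ONCE(entry->slot);
	offset = READ_ONCE(entry->offset);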
>
>>
>>> +	ring->dirty_index++;
>>> +	WRITE_ONCE(indices->avail_index, ring->dirty_index);
>>
>> Is WRITE_ONCE() a must here?
> I think not, but it seems clearer that we're publishing something
> explicitly to userspace.  Since you asked, I'm actually curious whether
> immediate memory writes like this could start to affect performance,
> based on any of your previous perf work?
I never measured the impact of a specific WRITE_ONCE(). But we don't do
this in virtio/vhost. Maybe the maintainers can give more comments on this.
Thanks
>
> Thanks,
>