lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Fri, 16 Jun 2017 06:19:22 +0300
From:   "Michael S. Tsirkin" <mst@...hat.com>
To:     Wei Wang <wei.w.wang@...el.com>
Cc:     virtio-dev@...ts.oasis-open.org, linux-kernel@...r.kernel.org,
        qemu-devel@...gnu.org, virtualization@...ts.linux-foundation.org,
        kvm@...r.kernel.org, linux-mm@...ck.org, david@...hat.com,
        dave.hansen@...el.com, cornelia.huck@...ibm.com,
        akpm@...ux-foundation.org, mgorman@...hsingularity.net,
        aarcange@...hat.com, amit.shah@...hat.com, pbonzini@...hat.com,
        liliang.opensource@...il.com, Matthew Wilcox <willy@...radead.org>
Subject: Re: [virtio-dev] Re: [PATCH v11 3/6] virtio-balloon:
 VIRTIO_BALLOON_F_PAGE_CHUNKS

On Thu, Jun 15, 2017 at 04:10:17PM +0800, Wei Wang wrote:
> On 06/14/2017 01:56 AM, Michael S. Tsirkin wrote:
> > On Fri, Jun 09, 2017 at 06:41:38PM +0800, Wei Wang wrote:
> > > Add a new feature, VIRTIO_BALLOON_F_PAGE_CHUNKS, which enables
> > > the transfer of the ballooned (i.e. inflated/deflated) pages in
> > > chunks to the host.
> > so now these chunks are just s/g list entry.
> > So let's rename this VIRTIO_BALLOON_F_SG with a comment:
> > * Use standard virtio s/g instead of PFN lists *
> 
> Actually, it's not using the standard s/g list in the implementation,
> because:
> using the standard s/g will need kmalloc() the indirect table on
> demand (i.e. when virtqueue_add() converts s/g to indirect table);
> 
> The implementation directly pre-allocates an indirect desc table,
> and uses a entry (i.e. vring_desc) to describe a chunk. This
> avoids the overhead of kmalloc() the indirect table.

It's a separate API but the host/guest interface is standard.

> 
> > > +/*
> > > + * Callulates how many pfns can a page_bmap record. A bit corresponds to a
> > > + * page of PAGE_SIZE.
> > > + */
> > > +#define VIRTIO_BALLOON_PFNS_PER_PAGE_BMAP \
> > > +	(VIRTIO_BALLOON_PAGE_BMAP_SIZE * BITS_PER_BYTE)
> > > +
> > > +/* The number of page_bmap to allocate by default. */
> > > +#define VIRTIO_BALLOON_PAGE_BMAP_DEFAULT_NUM	1
> > It's not by default, it's at probe time, right?
> It is the number of page bitmap being kept throughout the whole
> lifecycle of the driver. The page bmap will be temporarily extended
> due to insufficiency during a ballooning process, but when that
> ballooning finishes, the extended part will be freed.
> > > +/* The maximum number of page_bmap that can be allocated. */
> > Not really, this is the size of the array we use to keep them.
> 
> This is the max number of the page bmap that can be
> extended temporarily.

That's just a confusing way to say the same.

> > > +#define VIRTIO_BALLOON_PAGE_BMAP_MAX_NUM	32
> > > +
> > So you still have a home-grown bitmap. I'd like to know why
> > isn't xbitmap suggested for this purpose by Matthew Wilcox
> > appropriate. Please add a comment explaining the requirements
> > from the data structure.
> 
> I didn't find his xbitmap being upstreamed, did you?

It's from dax tree - Matthew?

> > > +/*
> > > + * QEMU virtio implementation requires the desc table size less than
> > > + * VIRTQUEUE_MAX_SIZE, so minus 1 here.
> > I think it doesn't, the issue is probably that you add a header
> > as a separate s/g. In any case see below.
> > 
> > > + */
> > > +#define VIRTIO_BALLOON_MAX_PAGE_CHUNKS (VIRTQUEUE_MAX_SIZE - 1)
> > This is wrong, virtio spec says s/g size should not exceed VQ size.
> > If you want to support huge VQ sizes, you can add a fallback to
> > smaller sizes until it fits in 1 page.
> 
> Probably no need for huge VQ size, 1024 queue size should be
> enough. And we can have 1024 descriptors in the indirect
> table, so the above size doesn't exceed the vq size, right?

You need to look at vq size, you shouldn't assume it's > 1024.


> 
> > +static unsigned int extend_page_bmap_size(struct virtio_balloon *vb,
> > +					  unsigned long pfn_num)
> > what's this API doing?  Pls add comments. this seems to assume
> > it will only be called once.
> OK, I will add some comments here. This is the function to extend
> the number of page bitmap when the original 1 page bmap is
> not sufficient during a ballooning process. As mentioned above,
> at the end of this ballooning process, the extended part will be freed.
> 
> > it would be better to avoid making
> > this assumption, just look at what has been allocated
> > and extend it.
> Actually it's not an assumption. The rule here is that we always keep
> "1" page bmap. "1" is defined by the
> VIRTIO_BALLOON_PAGE_BMAP_DEFAULT_NUM. So when freeing, it also
> references VIRTIO_BALLOON_PAGE_BMAP_DEFAULT_NUM (not assuming
> any number)

When allocating, why don't you check what's allocated already?
why assume VIRTIO_BALLOON_PAGE_BMAP_DEFAULT_NUM was allocated?
Then calling extend_page_bmap_size many times would be idempotent.

> > +}
> > +
> > +/* Add a chunk to the buffer. */
> > +static void add_one_chunk(struct virtio_balloon *vb, struct virtqueue *vq,
> > +			  u64 base_addr, u32 size)
> > +{
> > +	unsigned int *num = &vb->balloon_page_chunk.chunk_num;
> > +	struct vring_desc *desc = &vb->balloon_page_chunk.desc_table[*num];
> > +
> > +	desc->addr = cpu_to_virtio64(vb->vdev, base_addr);
> > +	desc->len = cpu_to_virtio32(vb->vdev, size);
> > +	*num += 1;
> > +	if (*num == VIRTIO_BALLOON_MAX_PAGE_CHUNKS)
> > +		send_page_chunks(vb, vq);
> > +}
> > +
> > Poking at virtio internals like this is not nice. Pls move to virtio
> > code.  Also, pages must be read descriptors as host might modify them.
> > 
> > This also lacks viommu support but this is not mandatory as
> > that is borken atm anyway. I'll send a patch to at least fail cleanly.
> OK, thanks.
> 
> > > +static void convert_bmap_to_chunks(struct virtio_balloon *vb,
> > > +				   struct virtqueue *vq,
> > > +				   unsigned long *bmap,
> > > +				   unsigned long pfn_start,
> > > +				   unsigned long size)
> > > +{
> > > +	unsigned long next_one, next_zero, pos = 0;
> > > +	u64 chunk_base_addr;
> > > +	u32 chunk_size;
> > > +
> > > +	while (pos < size) {
> > > +		next_one = find_next_bit(bmap, size, pos);
> > > +		/*
> > > +		 * No "1" bit found, which means that there is no pfn
> > > +		 * recorded in the rest of this bmap.
> > > +		 */
> > > +		if (next_one == size)
> > > +			break;
> > > +		next_zero = find_next_zero_bit(bmap, size, next_one + 1);
> > > +		/*
> > > +		 * A bit in page_bmap corresponds to a page of PAGE_SIZE.
> > > +		 * Convert it to be pages of 4KB balloon page size when
> > > +		 * adding it to a chunk.
> > This looks wrong. add_one_chunk assumes size in bytes. So should be just
> > PAGE_SIZE.
> 
> It's intended to be "chunk size", which is the number of pfns. The benefit
> is
> that the 32-bit desc->len won't be overflow, as you mentioned below.
> 

You can safely assume PAGE_SIZE >= 4K. Just pass # of pages.

> > 
> > > +		 */
> > > +		chunk_size = (next_zero - next_one) *
> > > +			     VIRTIO_BALLOON_PAGES_PER_PAGE;
> > How do you know this won't overflow a 32 bit integer? Needs a comment.
> 
> If it stores size in bytes, it has the possibility to overflow.
> If storing number of pfns, the 32-bit value can support 2^32*4KB=8TB
> memory, unlikely to overflow.

Do you put in len in the descriptor in 4K chunks then? That
needs some thought. Also, processors support up to 256TB now,
I don't think we can just assume it won't overflow anymore.
All in all I'd prefer we just split everything up to 2G chunks.
We can discuss extending len to more bits or specifying alignment
in the descriptor separately down the road.

> > +
> > +static int balloon_page_chunk_init(struct virtio_balloon *vb)
> > +{
> > +	int i;
> > +
> > +	vb->balloon_page_chunk.desc_table = alloc_indirect(vb->vdev,
> > +						VIRTIO_BALLOON_MAX_PAGE_CHUNKS,
> > +						GFP_KERNEL);
> > This one's problematic, you aren't supposed to use APIs when device
> > is not inited yet. Seems to work by luck here. I suggest moving
> > this to probe, that's where we do a bunch of inits.
> > And then you can move private init back to allocate too.
> 
> This is just to allocate an indirect desc table. If allocation fails, we
> need to clear
> the related feature bit in ->validate(), right?

Failing probe on OOM is ok too.

-- 
MST

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ