linux-kernel - Re: [RFC PATCH v2 1/1] mm/vmap: keep track of free blocks for vmap allocation

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20190322215413.GA15943@tower.DHCP.thefacebook.com>
Date:   Fri, 22 Mar 2019 21:54:17 +0000
From:   Roman Gushchin <guro@...com>
To:     "Uladzislau Rezki (Sony)" <urezki@...il.com>
CC:     Andrew Morton <akpm@...ux-foundation.org>,
        Michal Hocko <mhocko@...e.com>,
        Matthew Wilcox <willy@...radead.org>,
        "linux-mm@...ck.org" <linux-mm@...ck.org>,
        LKML <linux-kernel@...r.kernel.org>,
        Thomas Garnier <thgarnie@...gle.com>,
        Oleksiy Avramchenko <oleksiy.avramchenko@...ymobile.com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Joel Fernandes <joelaf@...gle.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...e.hu>, Tejun Heo <tj@...nel.org>
Subject: Re: [RFC PATCH v2 1/1] mm/vmap: keep track of free blocks for vmap
 allocation

On Thu, Mar 21, 2019 at 08:03:27PM +0100, Uladzislau Rezki (Sony) wrote:
> Currently an allocation of the new vmap area is done over busy
> list iteration(complexity O(n)) until a suitable hole is found
> between two busy areas. Therefore each new allocation causes
> the list being grown. Due to over fragmented list and different
> permissive parameters an allocation can take a long time. For
> example on embedded devices it is milliseconds.
> 
> This patch organizes the KVA memory layout into free areas of the
> 1-ULONG_MAX range. It uses an augment red-black tree that keeps
> blocks sorted by their offsets in pair with linked list keeping
> the free space in order of increasing addresses.
> 
> Each vmap_area object contains the "subtree_max_size" that reflects
> a maximum available free block in its left or right sub-tree. Thus,
> that allows to take a decision and traversal toward the block that
> will fit and will have the lowest start address, i.e. sequential
> allocation.
> 
> Allocation: to allocate a new block a search is done over the
> tree until a suitable lowest(left most) block is large enough
> to encompass: the requested size, alignment and vstart point.
> If the block is bigger than requested size - it is split.
> 
> De-allocation: when a busy vmap area is freed it can either be
> merged or inserted to the tree. Red-black tree allows efficiently
> find a spot whereas a linked list provides a constant-time access
> to previous and next blocks to check if merging can be done. In case
> of merging of de-allocated memory chunk a large coalesced area is
> created.
> 
> Complexity: ~O(log(N))

Hi, Uladzislau!

Definitely a clever idea and very promising numbers!

The overall approach makes total sense to me.

I tried to go through the code, but it was a bit challenging.
I wonder, if you can split it into smaller parts to simplify the review?

Idk how easy is to split the core (maybe the area merging code can be
separated?), but at least the optionally compiled debug code can
be moved into separate patches.

Some small nits/questions below.

> 
> Signed-off-by: Uladzislau Rezki (Sony) <urezki@...il.com>
> ---
>  include/linux/vmalloc.h |    6 +-
>  mm/vmalloc.c            | 1109 ++++++++++++++++++++++++++++++++++++-----------
>  2 files changed, 871 insertions(+), 244 deletions(-)
> 
> diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> index 398e9c95cd61..ad483378fdd1 100644
> --- a/include/linux/vmalloc.h
> +++ b/include/linux/vmalloc.h
> @@ -45,12 +45,16 @@ struct vm_struct {
>  struct vmap_area {
>  	unsigned long va_start;
>  	unsigned long va_end;
> +
> +	/*
> +	 * Largest available free size in subtree.
> +	 */
> +	unsigned long subtree_max_size;
>  	unsigned long flags;
>  	struct rb_node rb_node;         /* address sorted rbtree */
>  	struct list_head list;          /* address sorted list */
>  	struct llist_node purge_list;    /* "lazy purge" list */
>  	struct vm_struct *vm;
> -	struct rcu_head rcu_head;
>  };
>  
>  /*
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 755b02983d8d..29e9786299cf 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -31,6 +31,7 @@
>  #include <linux/compiler.h>
>  #include <linux/llist.h>
>  #include <linux/bitops.h>
> +#include <linux/rbtree_augmented.h>
>  
>  #include <linux/uaccess.h>
>  #include <asm/tlbflush.h>
> @@ -320,8 +321,9 @@ unsigned long vmalloc_to_pfn(const void *vmalloc_addr)
>  }
>  EXPORT_SYMBOL(vmalloc_to_pfn);
>  
> -
>  /*** Global kva allocator ***/
> +#define DEBUG_AUGMENT_PROPAGATE_CHECK 0
> +#define DEBUG_AUGMENT_LOWEST_MATCH_CHECK 0
>  
>  #define VM_LAZY_FREE	0x02
>  #define VM_VM_AREA	0x04
> @@ -331,14 +333,76 @@ static DEFINE_SPINLOCK(vmap_area_lock);
>  LIST_HEAD(vmap_area_list);
>  static LLIST_HEAD(vmap_purge_list);
>  static struct rb_root vmap_area_root = RB_ROOT;
> +static bool vmap_initialized __read_mostly;
> +
> +/*
> + * This kmem_cache is used for vmap_area objects. Instead of
> + * allocating from slab we reuse an object from this cache to
> + * make things faster. Especially in "no edge" splitting of
> + * free block.
> + */
> +static struct kmem_cache *vmap_area_cachep;
> +
> +/*
> + * This linked list is used in pair with free_vmap_area_root.
> + * It gives O(1) access to prev/next to perform fast coalescing.
> + */
> +static LIST_HEAD(free_vmap_area_list);
> +
> +/*
> + * This augment red-black tree represents the free vmap space.
> + * All vmap_area objects in this tree are sorted by va->va_start
> + * address. It is used for allocation and merging when a vmap
> + * object is released.
> + *
> + * Each vmap_area node contains a maximum available free block
> + * of its sub-tree, right or left. Therefore it is possible to
> + * find a lowest match of free area.
> + */
> +static struct rb_root free_vmap_area_root = RB_ROOT;
> +
> +static inline unsigned long
> +__va_size(struct vmap_area *va)
> +{
> +	return (va->va_end - va->va_start);
> +}
> +
> +static unsigned long
> +get_subtree_max_size(struct rb_node *node)
> +{
> +	struct vmap_area *va;
> +
> +	va = rb_entry_safe(node, struct vmap_area, rb_node);
> +	return va ? va->subtree_max_size : 0;
> +}
> +
> +/*
> + * Gets called when remove the node and rotate.
> + */
> +static unsigned long
> +compute_subtree_max_size(struct vmap_area *va)
> +{
> +	unsigned long max_size = __va_size(va);
> +	unsigned long child_max_size;
> +
> +	child_max_size = get_subtree_max_size(va->rb_node.rb_right);
> +	if (child_max_size > max_size)
> +		max_size = child_max_size;
>  
> -/* The vmap cache globals are protected by vmap_area_lock */
> -static struct rb_node *free_vmap_cache;
> -static unsigned long cached_hole_size;
> -static unsigned long cached_vstart;
> -static unsigned long cached_align;
> +	child_max_size = get_subtree_max_size(va->rb_node.rb_left);
> +	if (child_max_size > max_size)
> +		max_size = child_max_size;
>  
> -static unsigned long vmap_area_pcpu_hole;
> +	return max_size;
> +}
> +
> +RB_DECLARE_CALLBACKS(static, free_vmap_area_rb_augment_cb,
> +	struct vmap_area, rb_node, unsigned long, subtree_max_size,
> +	compute_subtree_max_size)
> +
> +static void purge_vmap_area_lazy(void);
> +static BLOCKING_NOTIFIER_HEAD(vmap_notify_list);
> +static unsigned long lazy_max_pages(void);
>  
>  static struct vmap_area *__find_vmap_area(unsigned long addr)
>  {
> @@ -359,41 +423,623 @@ static struct vmap_area *__find_vmap_area(unsigned long addr)
>  	return NULL;
>  }
>  
> -static void __insert_vmap_area(struct vmap_area *va)
> +/*
> + * This function returns back addresses of parent node
> + * and its left or right link for further processing.
> + */
> +static inline void
> +__find_va_links(struct vmap_area *va,
> +	struct rb_root *root, struct rb_node *from,
> +	struct rb_node **parent, struct rb_node ***link)

This function returns a pointer to parent, and it doesn't use
the initial value of (*parent). Can't it just return it?
I mean something like:

static struct rb_node *__find_va_links(struct vmap_area *va,
	struct rb_root *root, struct rb_node *from,
	struct rb_node ***link) { ... }

It would simplify things a bit.

Also this triple pointer looks scary.
If it returns a parent and one of two children, can't it return
a desired child instead? Or it can be NULL with a non-NULL parent?

>  {
> -	struct rb_node **p = &vmap_area_root.rb_node;
> -	struct rb_node *parent = NULL;
> -	struct rb_node *tmp;
> +	struct vmap_area *tmp_va;
>  
> -	while (*p) {
> -		struct vmap_area *tmp_va;
> +	if (root) {
> +		*link = &root->rb_node;
> +		if (unlikely(!**link)) {
> +			*parent = NULL;
> +			return;
> +		}
> +	} else {
> +		*link = &from;
> +	}
>  
> -		parent = *p;
> -		tmp_va = rb_entry(parent, struct vmap_area, rb_node);
> -		if (va->va_start < tmp_va->va_end)
> -			p = &(*p)->rb_left;
> -		else if (va->va_end > tmp_va->va_start)
> -			p = &(*p)->rb_right;
> +	/*
> +	 * Go to the bottom of the tree.
> +	 */
> +	do {
> +		tmp_va = rb_entry(**link, struct vmap_area, rb_node);
> +
> +		/*
> +		 * During the traversal we also do some sanity check.
> +		 * Trigger the BUG() if there are sides(left/right)
> +		 * or full overlaps.
> +		 */
> +		if (va->va_start < tmp_va->va_end &&
> +				va->va_end <= tmp_va->va_start)
> +			*link = &(**link)->rb_left;
> +		else if (va->va_end > tmp_va->va_start &&
> +				va->va_start >= tmp_va->va_end)
> +			*link = &(**link)->rb_right;
>  		else
>  			BUG();
> +	} while (**link);
> +
> +	*parent = &tmp_va->rb_node;
> +}
> +
> +static inline void
> +__find_va_free_siblings(struct rb_node *parent, struct rb_node **link,
> +	struct list_head **prev, struct list_head **next)
> +{
> +	struct list_head *list;
> +
> +	if (likely(parent)) {
> +		list = &rb_entry(parent, struct vmap_area, rb_node)->list;
> +		if (&parent->rb_right == link) {
> +			*next = list->next;
> +			*prev = list;
> +		} else {
> +			*prev = list->prev;
> +			*next = list

So, does it mean that this function always returns two following elements?
Can't it return a single element using the return statement instead?
The second one can be calculated as ->next?

> +		}
> +	} else {
> +		/*
> +		 * The red-black tree where we try to find VA neighbors
> +		 * before merging or inserting is empty, i.e. it means
> +		 * there is no free vmap space. Normally it does not
> +		 * happen but we handle this case anyway.
> +		 */
> +		*prev = *next = &free_vmap_area_list;

And for example, return NULL in this case.

>  	}
> +}
>  
> -	rb_link_node(&va->rb_node, parent, p);
> -	rb_insert_color(&va->rb_node, &vmap_area_root);
> +static inline void
> +__link_va(struct vmap_area *va, struct rb_root *root,
> +	struct rb_node *parent, struct rb_node **link, struct list_head *head)
> +{
> +	/*
> +	 * VA is still not in the list, but we can
> +	 * identify its future previous list_head node.
> +	 */
> +	if (likely(parent)) {
> +		head = &rb_entry(parent, struct vmap_area, rb_node)->list;
> +		if (&parent->rb_right != link)
> +			head = head->prev;
> +	}
>  
> -	/* address-sort this list */
> -	tmp = rb_prev(&va->rb_node);
> -	if (tmp) {
> -		struct vmap_area *prev;
> -		prev = rb_entry(tmp, struct vmap_area, rb_node);
> -		list_add_rcu(&va->list, &prev->list);
> -	} else
> -		list_add_rcu(&va->list, &vmap_area_list);
> +	/* Insert to the rb-tree */
> +	rb_link_node(&va->rb_node, parent, link);
> +	if (root == &free_vmap_area_root) {
> +		/*
> +		 * Some explanation here. Just perform simple insertion
> +		 * to the tree. We do not set va->subtree_max_size to
> +		 * its current size before calling rb_insert_augmented().
> +		 * It is because of we populate the tree from the bottom
> +		 * to parent levels when the node _is_ in the tree.
> +		 *
> +		 * Therefore we set subtree_max_size to zero after insertion,
> +		 * to let __augment_tree_propagate_from() puts everything to
> +		 * the correct order later on.
> +		 */
> +		rb_insert_augmented(&va->rb_node,
> +			root, &free_vmap_area_rb_augment_cb);
> +		va->subtree_max_size = 0;
> +	} else {
> +		rb_insert_color(&va->rb_node, root);
> +	}
> +
> +	/* Address-sort this list */
> +	list_add(&va->list, head);
>  }
>  
> -static void purge_vmap_area_lazy(void);
> +static inline void
> +__unlink_va(struct vmap_area *va, struct rb_root *root)
> +{
> +	/*
> +	 * During merging a VA node can be empty, therefore
> +	 * not linked with the tree nor list. Just check it.
> +	 */
> +	if (!RB_EMPTY_NODE(&va->rb_node)) {
> +		if (root == &free_vmap_area_root)
> +			rb_erase_augmented(&va->rb_node,
> +				root, &free_vmap_area_rb_augment_cb);
> +		else
> +			rb_erase(&va->rb_node, root);
>  
> -static BLOCKING_NOTIFIER_HEAD(vmap_notify_list);
> +		list_del(&va->list);
> +	}
> +}
> +
> +#if DEBUG_AUGMENT_PROPAGATE_CHECK
> +static void
> +augment_tree_propagate_do_check(struct rb_node *n)
> +{
> +	struct vmap_area *va;
> +	struct rb_node *node;
> +	unsigned long size;
> +	bool found = false;
> +
> +	if (n == NULL)
> +		return;
> +
> +	va = rb_entry(n, struct vmap_area, rb_node);
> +	size = va->subtree_max_size;
> +	node = n;
> +
> +	while (node) {
> +		va = rb_entry(node, struct vmap_area, rb_node);
> +
> +		if (get_subtree_max_size(node->rb_left) == size) {
> +			node = node->rb_left;
> +		} else {
> +			if (__va_size(va) == size) {
> +				found = true;
> +				break;
> +			}
> +
> +			node = node->rb_right;
> +		}
> +	}
> +
> +	if (!found) {
> +		va = rb_entry(n, struct vmap_area, rb_node);
> +		pr_emerg("tree is corrupted: %lu, %lu\n",
> +			__va_size(va), va->subtree_max_size);
> +	}
> +
> +	augment_tree_propagate_do_check(n->rb_left);
> +	augment_tree_propagate_do_check(n->rb_right);
> +}
> +
> +static void augment_tree_propagate_from_check(void)
> +{
> +	augment_tree_propagate_do_check(free_vmap_area_root.rb_node);
> +}
> +#endif
> +
> +/*
> + * This function populates subtree_max_size from bottom to upper
> + * levels starting from VA point. The propagation must be done
> + * when VA size is modified by changing its va_start/va_end. Or
> + * in case of newly inserting of VA to the tree.
> + *
> + * It means that __augment_tree_propagate_from() must be called:
> + * - After VA has been inserted to the tree(free path);
> + * - After VA has been shrunk(allocation path);
> + * - After VA has been increased(merging path).
> + *
> + * Please note that, it does not mean that upper parent nodes
> + * and their subtree_max_size are recalculated all the time up
> + * to the root node.
> + *
> + *       4--8
> + *        /\
> + *       /  \
> + *      /    \
> + *    2--2  8--8
> + *
> + * For example if we modify the node 4, shrinking it to 2, then
> + * no any modification is required. If we shrink the node 2 to 1
> + * its subtree_max_size is updated only, and set to 1. If we shrink
> + * the node 8 to 6, then its subtree_max_size is set to 6 and parent
> + * node becomes 4--6.
> + */
> +static inline void
> +__augment_tree_propagate_from(struct vmap_area *va)
> +{
> +	struct rb_node *node = &va->rb_node;
> +	unsigned long new_va_sub_max_size;
> +
> +	while (node) {
> +		va = rb_entry(node, struct vmap_area, rb_node);
> +		new_va_sub_max_size = compute_subtree_max_size(va);
> +
> +		/*
> +		 * If the newly calculated maximum available size of the
> +		 * subtree is equal to the current one, then it means that
> +		 * the tree is propagated correctly. So we have to stop at
> +		 * this point to save cycles.
> +		 */
> +		if (va->subtree_max_size == new_va_sub_max_size)
> +			break;
> +
> +		va->subtree_max_size = new_va_sub_max_size;
> +		node = rb_parent(&va->rb_node);
> +	}
> +
> +#if DEBUG_AUGMENT_PROPAGATE_CHECK
> +	augment_tree_propagate_from_check();
> +#endif
> +}
> +
> +static void
> +__insert_vmap_area(struct vmap_area *va,
> +	struct rb_root *root, struct list_head *head)
> +{
> +	struct rb_node **link;
> +	struct rb_node *parent;
> +
> +	__find_va_links(va, root, NULL, &parent, &link);
> +	__link_va(va, root, parent, link, head);
> +}
> +
> +static void
> +__insert_vmap_area_augment(struct vmap_area *va,
> +	struct rb_node *from, struct rb_root *root,
> +	struct list_head *head)
> +{
> +	struct rb_node **link;
> +	struct rb_node *parent;
> +
> +	if (from)
> +		__find_va_links(va, NULL, from, &parent, &link);
> +	else
> +		__find_va_links(va, root, NULL, &parent, &link);
> +
> +	__link_va(va, root, parent, link, head);
> +	__augment_tree_propagate_from(va);
> +}
> +
> +static inline void
> +__remove_vmap_area_common(struct vmap_area *va,
> +	struct rb_root *root)
> +{
> +	__unlink_va(va, root);
> +}
> +
> +/*
> + * Merge de-allocated chunk of VA memory with previous
> + * and next free blocks. If coalesce is not done a new
> + * free area is inserted. If VA has been merged, it is
> + * freed.
> + */
> +static inline void
> +__merge_or_add_vmap_area(struct vmap_area *va,
> +	struct rb_root *root, struct list_head *head)
> +{
> +	struct vmap_area *sibling;
> +	struct list_head *next, *prev;
> +	struct rb_node **link;
> +	struct rb_node *parent;
> +	bool merged = false;
> +
> +	/*
> +	 * Find a place in the tree where VA potentially will be
> +	 * inserted, unless it is merged with its sibling/siblings.
> +	 */
> +	__find_va_links(va, root, NULL, &parent, &link);
> +
> +	/*
> +	 * Get next/prev nodes of VA to check if merging can be done.
> +	 */
> +	__find_va_free_siblings(parent, link, &prev, &next);
> +
> +	/*
> +	 * start            end
> +	 * |                |
> +	 * |<------VA------>|<-----Next----->|
> +	 *                  |                |
> +	 *                  start            end
> +	 */
> +	if (next != head) {
> +		sibling = list_entry(next, struct vmap_area, list);
> +		if (sibling->va_start == va->va_end) {
> +			sibling->va_start = va->va_start;
> +
> +			/* Check and update the tree if needed. */
> +			__augment_tree_propagate_from(sibling);
> +
> +			/* Remove this VA, it has been merged. */
> +			__remove_vmap_area_common(va, root);
> +
> +			/* Free vmap_area object. */
> +			kmem_cache_free(vmap_area_cachep, va);
> +
> +			/* Point to the new merged area. */
> +			va = sibling;
> +			merged = true;
> +		}
> +	}
> +
> +	/*
> +	 * start            end
> +	 * |                |
> +	 * |<-----Prev----->|<------VA------>|
> +	 *                  |                |
> +	 *                  start            end
> +	 */
> +	if (prev != head) {
> +		sibling = list_entry(prev, struct vmap_area, list);
> +		if (sibling->va_end == va->va_start) {
> +			sibling->va_end = va->va_end;
> +
> +			/* Check and update the tree if needed. */
> +			__augment_tree_propagate_from(sibling);
> +
> +			/* Remove this VA, it has been merged. */
> +			__remove_vmap_area_common(va, root);
> +
> +			/* Free vmap_area object. */
> +			kmem_cache_free(vmap_area_cachep, va);
> +
> +			return;
> +		}
> +	}
> +
> +	if (!merged) {
> +		__link_va(va, root, parent, link, head);
> +		__augment_tree_propagate_from(va);
> +	}
> +}
> +
> +static inline bool
> +is_within_this_va(struct vmap_area *va, unsigned long size,
> +	unsigned long align, unsigned long vstart)
> +{
> +	unsigned long nva_start_addr;
> +
> +	if (va->va_start > vstart)
> +		nva_start_addr = ALIGN(va->va_start, align);
> +	else
> +		nva_start_addr = ALIGN(vstart, align);
> +
> +	/* Can be overflowed due to big size or alignment. */
> +	if (nva_start_addr + size < nva_start_addr ||
> +			nva_start_addr < vstart)
> +		return false;
> +
> +	return (nva_start_addr + size <= va->va_end);
> +}
> +
> +/*
> + * Find the first free block(lowest start address) in the tree,
> + * that will accomplish the request corresponding to passing
> + * parameters.
> + */
> +static inline struct vmap_area *
> +__find_vmap_lowest_match(unsigned long size,
> +	unsigned long align, unsigned long vstart)
> +{
> +	struct vmap_area *va;
> +	struct rb_node *node;
> +	unsigned long length;
> +
> +	/* Start from the root. */
> +	node = free_vmap_area_root.rb_node;
> +
> +	/* Adjust the search size for alignment overhead. */
> +	length = size + align - 1;
> +
> +	while (node) {
> +		va = rb_entry(node, struct vmap_area, rb_node);
> +
> +		if (get_subtree_max_size(node->rb_left) >= length &&
> +				vstart < va->va_start) {
> +			node = node->rb_left;
> +		} else {
> +			if (is_within_this_va(va, size, align, vstart))
> +				return va;
> +
> +			/*
> +			 * Does not make sense to go deeper towards the right
> +			 * sub-tree if it does not have a free block that is
> +			 * equal or bigger to the requested search length.
> +			 */
> +			if (get_subtree_max_size(node->rb_right) >= length) {
> +				node = node->rb_right;
> +				continue;
> +			}
> +
> +			/*
> +			 * OK. We roll back and find the fist right sub-tree,
> +			 * that will satisfy the search criteria. It can happen
> +			 * only once due to "vstart" restriction.
> +			 */
> +			while ((node = rb_parent(node))) {
> +				va = rb_entry(node, struct vmap_area, rb_node);
> +				if (is_within_this_va(va, size, align, vstart))
> +					return va;
> +
> +				if (get_subtree_max_size(node->rb_right) >= length &&
> +						vstart <= va->va_start) {
> +					node = node->rb_right;
> +					break;
> +				}
> +			}
> +		}
> +	}
> +
> +	return NULL;
> +}
> +
> +#if DEBUG_AUGMENT_LOWEST_MATCH_CHECK
> +#include <linux/random.h>
> +
> +static struct vmap_area *
> +__find_vmap_lowest_linear_match(unsigned long size,
> +	unsigned long align, unsigned long vstart)
> +{
> +	struct vmap_area *va;
> +
> +	list_for_each_entry(va, &free_vmap_area_list, list) {
> +		if (!is_within_this_va(va, size, align, vstart))
> +			continue;
> +
> +		return va;
> +	}
> +
> +	return NULL;
> +}
> +
> +static void
> +__find_vmap_lowest_match_check(unsigned long size)
> +{
> +	struct vmap_area *va_1, *va_2;
> +	unsigned long vstart;
> +	unsigned int rnd;
> +
> +	get_random_bytes(&rnd, sizeof(rnd));
> +	vstart = VMALLOC_START + rnd;
> +
> +	va_1 = __find_vmap_lowest_match(size, 1, vstart);
> +	va_2 = __find_vmap_lowest_linear_match(size, 1, vstart);
> +
> +	if (va_1 != va_2)
> +		pr_emerg("not lowest: t: 0x%p, l: 0x%p, v: 0x%lx\n",
> +			va_1, va_2, vstart);
> +}
> +#endif
> +
> +enum alloc_fit_type {
> +	NOTHING_FIT = 0,
> +	FL_FIT_TYPE = 1,	/* full fit */
> +	LE_FIT_TYPE = 2,	/* left edge fit */
> +	RE_FIT_TYPE = 3,	/* right edge fit */
> +	NE_FIT_TYPE = 4		/* no edge fit */
> +};
> +
> +static inline u8
> +__classify_va_fit_type(struct vmap_area *va,
> +	unsigned long nva_start_addr, unsigned long size)
> +{
> +	u8 fit_type;
> +
> +	/* Check if it is within VA. */
> +	if (nva_start_addr < va->va_start ||
> +			nva_start_addr + size > va->va_end)
> +		return NOTHING_FIT;
> +
> +	/* Now classify. */
> +	if (va->va_start == nva_start_addr) {
> +		if (va->va_end == nva_start_addr + size)
> +			fit_type = FL_FIT_TYPE;
> +		else
> +			fit_type = LE_FIT_TYPE;
> +	} else if (va->va_end == nva_start_addr + size) {
> +		fit_type = RE_FIT_TYPE;
> +	} else {
> +		fit_type = NE_FIT_TYPE;
> +	}
> +
> +	return fit_type;
> +}
> +
> +static inline int
> +__adjust_va_to_fit_type(struct vmap_area *va,
> +	unsigned long nva_start_addr, unsigned long size, u8 fit_type)
> +{
> +	struct vmap_area *lva;
> +
> +	if (fit_type == FL_FIT_TYPE) {
> +		/*
> +		 * No need to split VA, it fully fits.
> +		 *
> +		 * |               |
> +		 * V      NVA      V
> +		 * |---------------|
> +		 */
> +		__remove_vmap_area_common(va, &free_vmap_area_root);
> +		kmem_cache_free(vmap_area_cachep, va);
> +	} else if (fit_type == LE_FIT_TYPE) {
> +		/*
> +		 * Split left edge of fit VA.
> +		 *
> +		 * |       |
> +		 * V  NVA  V   R
> +		 * |-------|-------|
> +		 */
> +		va->va_start += size;
> +	} else if (fit_type == RE_FIT_TYPE) {
> +		/*
> +		 * Split right edge of fit VA.
> +		 *
> +		 *         |       |
> +		 *     L   V  NVA  V
> +		 * |-------|-------|
> +		 */
> +		va->va_end = nva_start_addr;
> +	} else if (fit_type == NE_FIT_TYPE) {
> +		/*
> +		 * Split no edge of fit VA.
> +		 *
> +		 *     |       |
> +		 *   L V  NVA  V R
> +		 * |---|-------|---|
> +		 */
> +		lva = kmem_cache_alloc(vmap_area_cachep, GFP_NOWAIT);
> +		if (unlikely(!lva))
> +			return -1;
> +
> +		/*
> +		 * Build the remainder.
> +		 */
> +		lva->va_start = va->va_start;
> +		lva->va_end = nva_start_addr;
> +
> +		/*
> +		 * Shrink this VA to remaining size.
> +		 */
> +		va->va_start = nva_start_addr + size;
> +	} else {
> +		return -1;
> +	}
> +
> +	if (fit_type != FL_FIT_TYPE) {
> +		__augment_tree_propagate_from(va);
> +
> +		if (fit_type == NE_FIT_TYPE)
> +			__insert_vmap_area_augment(lva, &va->rb_node,
> +				&free_vmap_area_root, &free_vmap_area_list);
> +	}
> +
> +	return 0;
> +}
> +
> +/*
> + * Returns a start address of the newly allocated area, if success.
> + * Otherwise a vend is returned that indicates failure.
> + */
> +static inline unsigned long
> +__alloc_vmap_area(unsigned long size, unsigned long align,
> +	unsigned long vstart, unsigned long vend, int node)
> +{
> +	unsigned long nva_start_addr;
> +	struct vmap_area *va;
> +	u8 fit_type;
> +	int ret;
> +
> +	va = __find_vmap_lowest_match(size, align, vstart);
> +	if (unlikely(!va))
> +		return vend;
> +
> +	if (va->va_start > vstart)
> +		nva_start_addr = ALIGN(va->va_start, align);
> +	else
> +		nva_start_addr = ALIGN(vstart, align);
> +
> +	/* Check the "vend" restriction. */
> +	if (nva_start_addr + size > vend)
> +		return vend;
> +
> +	/* Classify what we have found. */
> +	fit_type = __classify_va_fit_type(va, nva_start_addr, size);
> +	if (unlikely(fit_type == NOTHING_FIT)) {
> +		WARN_ON_ONCE(true);

Nit: WARN_ON_ONCE() has unlikely() built-in and returns the value,
so it can be something like:

  if (WARN_ON_ONCE(fit_type == NOTHING_FIT))
  	return vend;

The same comment applies for a couple of other place, where is a check
for NOTHING_FIT.


I do not see any issues so far, just some places in the code are a bit
hard to follow. I'll try to spend more time on the patch in the next
couple of weeks.

Thank you!

Roman