linux-kernel - Re: [PATCH v3 04/11] mm: vmalloc: Remove global vmap_area

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <238e63cd-e0e8-4fbf-852f-bc4d5bc35d5a@linux.alibaba.com>
Date: Fri, 5 Jan 2024 16:10:32 +0800
From: Wen Gu <guwen@...ux.alibaba.com>
To: "Uladzislau Rezki (Sony)" <urezki@...il.com>
Cc: shaozhengchao <shaozhengchao@...wei.com>, linux-mm@...ck.org,
 LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v3 04/11] mm: vmalloc: Remove global vmap_area_root
 rb-tree


On 2024/01/03 02:46, Uladzislau Rezki wrote:

> Store allocated objects in a separate nodes. A va->va_start
> address is converted into a correct node where it should
> be placed and resided. An addr_to_node() function is used
> to do a proper address conversion to determine a node that
> contains a VA.
> 
> Such approach balances VAs across nodes as a result an access
> becomes scalable. Number of nodes in a system depends on number
> of CPUs.
> 
> Please note:
> 
> 1. As of now allocated VAs are bound to a node-0. It means the
>    patch does not give any difference comparing with a current
>    behavior;
> 
> 2. The global vmap_area_lock, vmap_area_root are removed as there
>    is no need in it anymore. The vmap_area_list is still kept and
>    is _empty_. It is exported for a kexec only;
> 
> 3. The vmallocinfo and vread() have to be reworked to be able to
>    handle multiple nodes.
> 
> Reviewed-by: Baoquan He <bhe@...hat.com>
> Signed-off-by: Uladzislau Rezki (Sony) <urezki@...il.com>
> ---

<...>

>  struct vmap_area *find_vmap_area(unsigned long addr)
>  {
> +	struct vmap_node *vn;
>  	struct vmap_area *va;
> +	int i, j;
>  
> -	spin_lock(&vmap_area_lock);
> -	va = __find_vmap_area(addr, &vmap_area_root);
> -	spin_unlock(&vmap_area_lock);
> +	/*
> +	 * An addr_to_node_id(addr) converts an address to a node index
> +	 * where a VA is located. If VA spans several zones and passed
> +	 * addr is not the same as va->va_start, what is not common, we
> +	 * may need to scan an extra nodes. See an example:
> +	 *
> +	 *      <--va-->
> +	 * -|-----|-----|-----|-----|-
> +	 *     1     2     0     1
> +	 *
> +	 * VA resides in node 1 whereas it spans 1 and 2. If passed
> +	 * addr is within a second node we should do extra work. We
> +	 * should mention that it is rare and is a corner case from
> +	 * the other hand it has to be covered.
> +	 */
> +	i = j = addr_to_node_id(addr);
> +	do {
> +		vn = &vmap_nodes[i];
>  
> -	return va;
> +		spin_lock(&vn->busy.lock);
> +		va = __find_vmap_area(addr, &vn->busy.root);
> +		spin_unlock(&vn->busy.lock);
> +
> +		if (va)
> +			return va;
> +	} while ((i = (i + 1) % nr_vmap_nodes) != j);
> +
> +	return NULL;
>  }
>  

Hi Uladzislau Rezki,

I really like your work, it is great and helpful!

Currently, I am working on using shared memory communication (SMC [1])
to transparently accelerate TCP communication between two peers within
the same OS instance[2].

In this scenario, a vzalloced kernel buffer acts as a shared memory and
will be simultaneous read or written by two SMC sockets, thus forming an
SMC connection.


           socket1                    socket2
              |                          ^
              |                          |    userspace
       ---- write -------------------- read ------
              |    +-----------------+   |    kernel
              +--->|  shared memory  |---+
                   | (vzalloced now) |
                   +-----------------+

Then I encountered the performance regression caused by lock contention
in find_vmap_area() when multiple threads transfer data through multiple
SMC connections on machines with many CPUs[3].

According to perf, the performance bottleneck is caused by the global
vmap_area_lock contention[4]:

- writer:

   smc_tx_sendmsg
   -> memcpy_from_msg
      -> copy_from_iter
         -> check_copy_size
            -> check_object_size
               -> if (CONFIG_HARDENED_USERCOPY is set) check_heap_object
                  -> if(vm) find_vmap_area
                     -> try to hold vmap_area_lock lock
- reader:

   smc_rx_recvmsg
    -> memcpy_to_msg
       -> copy_to_iter
          -> check_copy_size
             -> check_object_size
                -> if (CONFIG_HARDENED_USERCOPY is set) check_heap_object
                   -> if(vm) find_vmap_area
                      -> try to hold vmap_area_lock lock

Fortunately, thank you for this patch set, the global vmap_area_lock was
removed and per node lock vn->busy.lock is introduced. it is really helpful:

In 48 CPUs qemu environment, the Requests/s increased by 5 times:
- nginx
- wrk -c 1000 -t 96 -d 30 http://127.0.0.1:80

                 vzalloced shmem      vzalloced shmem(with this patch set)
Requests/sec          113536.56            583729.93


But it also has some overhead, compared to using kzalloced shared memory
or unsetting CONFIG_HARDENED_USERCOPY, which won't involve finding vmap area:

                 kzalloced shmem      vzalloced shmem(unset CONFIG_HARDENED_USERCOPY)
Requests/sec          831950.39            805164.78


So, as a newbie in Linux-mm, I would like to ask for some suggestions:

Is it possible to further eliminate the overhead caused by lock contention
in find_vmap_area() in this scenario (maybe this is asking too much), or the
only way out is not setting CONFIG_HARDENED_USERCOPY or not using vzalloced
buffer in the situation where cocurrent kernel-userspace-copy happens?

Any feedback will be appreciated. Thanks again for your time.


[1] Shared Memory Communications (SMC) enables two SMC capable peers to
     communicate by using memory buffers that each peer allocates for the
     partner's use. It improves throughput, lowers latency and cost, and
     maintains existing functions. See details in https://www.ibm.com/support/pages/node/7009315

[2] https://lore.kernel.org/netdev/1702214654-32069-1-git-send-email-guwen@linux.alibaba.com/

[3] issues: https://lore.kernel.org/all/1fbd6b74-1080-923a-01c1-689c3d65f880@huawei.com/
     analysis: https://lore.kernel.org/all/3189e342-c38f-6076-b730-19a6efd732a5@linux.alibaba.com/

[4] Some flamegraphs are attached,
     - SMC using vzalloced buffer: vzalloc_t96.svg
     - SMC using vzalloced buffer and with this patchset: vzalloc_t96_improve.svg
     - SMC using vzalloced buffer and unset CONFIG_HARDENED_USERCOPY: vzalloc_t96_nocheck.svg
     - SMC using kzalloced buffer: kzalloc_t96.svg


Best regards,
Wen Gu
Download attachment "vzalloc_t96.svg" of type "image/svg+xml" (202182 bytes)

Download attachment "vzalloc_t96_improve.svg" of type "image/svg+xml" (276330 bytes)

Download attachment "vzalloc_t96_nocheck.svg" of type "image/svg+xml" (283784 bytes)

Download attachment "kzalloc_t96.svg" of type "image/svg+xml" (296847 bytes)