linux-kernel - [RFC] Another Para-Virtualization page recycler. Empty Guest OS free pages every few seconds

Message-ID: <3c1478e4-dd75-44f1-82f8-179ba9a17806.ljy@baibantech.com.cn>
Date:   Wed, 20 Sep 2017 11:25:23 +0800
From:   "XaviLi" <ljy@...bantech.com.cn>
To:     "linux-kernel" <linux-kernel@...r.kernel.org>
Subject: [RFC] Another Para-Virtualization page recycler. Empty Guest OS free pages every few seconds

PPR (Per Page Recycler) is a para-virtualization driver currently available for KVM hosts and Linux/Windows guests. With PPR, every page freed in the guest OS can be recycled by the hypervisor within seconds. VMs can therefore dynamically allocate pages from, and free pages to, the hypervisor according to application demand. In many cases this yields a memory density boost at little CPU cost. PPR is a component of an open source solution named Dynamic VM:
https://github.com/baibantech/dynamic_vm.git

The patch can be found here: https://github.com/baibantech/dynamic_vm/tree/master/dynamic_vm_0.5

The guest side is very simple. PPR has a hook at the entry of the guest OS page-free interface that marks every freed page. A marked page can be recycled and replaced by a read-only zero page within one scan period. If the page is allocated again before the recycling happens, the mark is simply canceled by PPR’s hook in the page-allocation interface.
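
The mark/cancel/recycle life cycle described above can be illustrated with a toy model. This is only a sketch: `PageState`, `ToyGuest`, and `host_scan` are hypothetical names invented for illustration, the model is single-threaded (so it says nothing about the lockless implementation), and how the host backs a recycled page again after reallocation is outside its scope.

```python
from enum import Enum


class PageState(Enum):
    IN_USE = "in_use"      # allocated to some user inside the guest
    MARKED = "marked"      # freed in the guest, waiting for the next host scan
    RECYCLED = "recycled"  # host has replaced it with the read-only zero page


class ToyGuest:
    """Hypothetical stand-in for a guest's pages; not PPR's real data structure."""

    def __init__(self, num_pages):
        self.state = [PageState.IN_USE] * num_pages

    def on_page_free(self, pfn):
        # Hook at the entry of the page-free path: mark the freed page.
        self.state[pfn] = PageState.MARKED

    def on_page_alloc(self, pfn):
        # Hook in the page-allocation path: cancel any pending mark.
        # (A page that was already recycled is simply treated as in use again.)
        self.state[pfn] = PageState.IN_USE


def host_scan(guest):
    """One scan period on the host side: recycle every page that is still marked."""
    recycled = 0
    for pfn, state in enumerate(guest.state):
        if state is PageState.MARKED:
            guest.state[pfn] = PageState.RECYCLED
            recycled += 1
    return recycled


guest = ToyGuest(num_pages=4)
guest.on_page_free(0)
guest.on_page_free(1)
guest.on_page_alloc(1)   # page 1 is reused before the scan, so its mark is canceled
print(host_scan(guest))  # only page 0 is still marked when the scan runs
```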

The guest driver works per page, rather than in batches as the VirtIO balloon does. Recycling is based on a lockless operation, and the guest and host do not send each other any interrupts for it.

We ran a test on our machine. The host uses about 6500 CPU cycles on average to recycle each page, of which 3900 cycles are spent in the Linux kernel page-table operation. The guest hook costs about 300 cycles to mark or unmark each page, which is cheap enough in many cases. The daemon costs 5% of one CPU when a scan finds nothing to recycle. The other threads only run when there is work to do.
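
To put these cycle counts in wall-clock terms, the following sketch converts them using the nominal 2.20 GHz clock of the test CPU listed under "Test environment". Treating the nominal clock as the effective clock is an assumption (turbo and frequency scaling are ignored), and the constant and function names are illustrative only.

```python
CPU_HZ = 2.20e9               # nominal clock of the test CPU (assumed to be the effective clock)
HOST_CYCLES_PER_PAGE = 6500   # host cost to recycle one page, from the measurement above
PAGE_TABLE_CYCLES = 3900      # portion spent in the kernel page-table operation
GUEST_HOOK_CYCLES = 300       # guest cost to mark or unmark one page


def cycles_to_us(cycles, hz=CPU_HZ):
    """Convert a cycle count to microseconds at the given clock rate."""
    return cycles / hz * 1e6


host_us = cycles_to_us(HOST_CYCLES_PER_PAGE)
guest_us = cycles_to_us(GUEST_HOOK_CYCLES)
page_table_share = PAGE_TABLE_CYCLES / HOST_CYCLES_PER_PAGE

print(f"host recycle cost: {host_us:.2f} us per page")
print(f"guest hook cost:   {guest_us:.3f} us per page")
print(f"page-table share:  {page_table_share:.0%} of the host cost")
```

Under these assumptions the host cost is roughly 3 microseconds per page, with about 60% of it going to the page-table operation.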

Saving memory usually raises the question of a CPU/memory tradeoff. Here things are different. Memory-intensive deployments often apply page deduplication, and recycling a page is much cheaper than deduplicating it. In that case PPR can save both CPU and memory.

We ran a test to demonstrate the density boost. In it, PPR works with a multithreaded page deduplication engine named PageONE rather than with KSM. PageONE will be discussed in a separate thread.

Test environment
DELL R730
Intel® Xeon® E5-2650 v4 (2.20 GHz, 12 cores, 24 threads)
256GB RAM
Host OS: Ubuntu server 14.04 Host kernel: 4.4.1
Qemu: 2.9.0
Guest OS: Ubuntu server 16.04 Guest kernel: 4.4.76

Each test VM was configured with 4GB of memory. Each VM ran an application that allocates and frees memory every 30 minutes. The peak allocation size in each period is 3GB and the average size is 150MB. We ran many such VMs at the same time on the same machine to measure the total host-side memory cost.

Raw KVM
Result: 67 VMs used 230GB memory. 

KVM + KSM 
Configuration: sleep_millisecs = 0, pages_to_scan = 1000000
Result: 73 VMs used 230GB memory. The KSM daemon used 100% of one CPU.

KVM + PPR + PageONE:
Configuration: work threads = 12, reclaim_period = 10 secs, merge_period = 600 secs
(This means freed pages were recycled every 10 secs, and pages were merged if not modified for 2 merge_periods.)
Result: 516 VMs ran together. Total used memory fluctuated between 139GB and 218GB, averaging 182GB.

PPR and PageONE share one daemon thread and 12 work threads. We measured their CPU cost together because the two features share threads and offload work for each other.

The average CPU cost of the work threads is 6.1%. The daemon thread’s cost is 22.1%.
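
The three results can be compared on a per-VM basis with a little arithmetic. The sketch below uses only the figures reported above; it assumes that "average CPU cost of work threads" means the average per work thread, so the 12 threads are summed to estimate the total. All variable names are illustrative.

```python
# (number of VMs, host memory used in GB) for each configuration reported above.
# For PPR + PageONE the average memory figure (182GB) is used.
results = {
    "Raw KVM": (67, 230.0),
    "KVM + KSM": (73, 230.0),
    "KVM + PPR + PageONE": (516, 182.0),
}

baseline_vms = results["Raw KVM"][0]

per_vm_gb = {name: mem / vms for name, (vms, mem) in results.items()}
density_vs_raw = {name: vms / baseline_vms for name, (vms, _) in results.items()}

for name in results:
    print(f"{name:22s} {per_vm_gb[name]:.2f} GB per VM, "
          f"{density_vs_raw[name]:.1f}x the raw KVM VM count")

# Assumption: 6.1% is the average per work thread, so total = 12 threads + daemon.
WORK_THREADS = 12
ppr_total_cpu_pct = WORK_THREADS * 6.1 + 22.1
print(f"PPR + PageONE total CPU: about {ppr_total_cpu_pct:.1f}% of one CPU")
```

Read this way, PPR with PageONE runs about 7.7 times as many VMs as raw KVM at roughly 0.35GB of host memory per VM, for a combined CPU cost close to the one full CPU that KSM used.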

Here we tried other combinations in the same test case:
reclaim_period (sec)  merge_period (sec)  average memory (GB)  work threads' average CPU  daemon's CPU
5                     200                 164                  6.8%                       34%
5                     600                 176                  6.3%                       35%
10                    200                 171                  6.6%                       22%
10                    600                 182                  6.1%                       22%
NOTE: This is a high-pressure test. PPR was recycling 3.2TB of memory per hour. In normal cases PPR costs far less CPU than that.
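
As a rough cross-check, the recycling rate in the note can be combined with the per-page cost of about 6500 cycles reported earlier in this message. The sketch assumes 4 KiB pages, decimal terabytes (10^12 bytes), and the nominal 2.20 GHz clock; none of these is stated explicitly for this measurement, so the result is an order-of-magnitude estimate only.

```python
RECYCLED_BYTES_PER_HOUR = 3.2e12  # 3.2TB per hour, assuming decimal terabytes
PAGE_SIZE = 4096                  # assumed 4 KiB pages
HOST_CYCLES_PER_PAGE = 6500       # host recycling cost reported earlier in this message
CPU_HZ = 2.20e9                   # nominal clock of the test CPU (assumption)

pages_per_sec = RECYCLED_BYTES_PER_HOUR / PAGE_SIZE / 3600
cycles_per_sec = pages_per_sec * HOST_CYCLES_PER_PAGE
core_fraction = cycles_per_sec / CPU_HZ

print(f"pages recycled per second: {pages_per_sec:,.0f}")
print(f"estimated recycling load:  {core_fraction:.0%} of one core")
```

This estimate covers recycling alone, whereas the measured CPU figures also include PageONE's merging work, so the two are not directly comparable.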