Date:   Thu, 21 Sep 2017 16:25:39 +0800
From:   "XaviLi" <ljy@...bantech.com.cn>
To:     "kvm" <kvm@...r.kernel.org>,
        "linux-kernel" <linux-kernel@...r.kernel.org>,
        "Jan Kiszka" <jan.kiszka@...mens.com>
Cc:     "杨泽昕" <yzx@...bantech.com.cn>,
        "王斌" <wb@...bantech.com.cn>,
        "李珅" <lishen@...bantech.com.cn>
Subject: [RFC] [Resend] Another Para-Virtualization page recycler -- Code details, Trap-less way to return free pages to kernel

We previously raised a topic about PPR (Per Page Recycler); thanks to Jan Kiszka for his advice. Here we break the patch up and explain the code in detail. There is too much to cover in one message, so we would like to do it part by part. The content of the original mails and patches can be found at the links at the end.

1.  Why another page recycler?

Freed memory is normally returned to the kernel in batches. User-mode applications call munmap only once a freed chunk has accumulated to a big enough size. In the VM world, the balloon driver is triggered only when the amount of free memory is worth collecting. PPR offers a way to reclaim each individual free-able page, because each reclaim costs little CPU and requires no trap.

Letting APPs or VMs release every freed page to the kernel, instead of keeping it reserved, means memory can be used more efficiently. We started testing with the virtual machine scenario because the effect is most obvious there: in our experiment we can run 516 VMs with PPR, in contrast to 60+ without it. The same approach also works for normal applications. For simplicity we refer to both VMs and applications as "APP" below.

2.  Basic Method:

Let us begin with a question: can an APP write a "freed" mark into the first bytes of a page so that the kernel can take a glance and know the page is reclaimable? On its own this is NOT possible, because the memory content is arbitrary; no particular value can be reserved to stand for "reclaimable". Instead, we let the first bytes of a freed page indicate the location of a freed-page pointer pool, and the pointers stored in that pool are the reliable proof that a page is free-able. A wrong indicator only leads to a non-matching pointer and causes no further trouble. We call this method "PIP" (Pointer Indicator Pair). A simplified sketch of the structures involved is shown below.
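
The structures below are only a simplified sketch of the PIP layout, distilled from the patch code in section 3. The struct names (virt_release_mark, virt_mem_pool) and field names appear in the patch, but the exact layout and types shown here are our shorthand illustration, not the authoritative definitions:

/* written into the first bytes of a page when it is freed */
struct virt_release_mark {
    unsigned long long pool_id;   /* which pointer pool the indicator refers to */
    unsigned long long alloc_id;  /* slot index inside that pool */
    unsigned long long desc;      /* pool identity value; non-zero means "marked" */
};

/* shared pointer pool; pointers stored here are the proof a page is free-able */
struct virt_mem_pool {
    unsigned long long pool_id;
    unsigned long long pool_max;
    unsigned long long mem_ind;   /* identity value copied into mark->desc */
    unsigned long long alloc_idx; /* monotonic counter used to hand out slots */
    unsigned long long desc_max;  /* number of slots in desc[] */
    unsigned long long desc[];    /* desc[idx] == pfn of a freed page, 0 if unused */
};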

In some cases page content is already scanned periodically; page deduplication is one example. If the scanner can tell from the first bytes that a page is recyclable, the rest of the work on that page can be skipped. PPR on its own is very cheap, and it saves both CPU and memory when it works together with other scanners. The cost and test results can be found in the original mails linked below.

3.  Code Break Up:

The APP side:
Page Free hook: virt_mark_page_release() (page_reclaim_guest.c)
The free-page hook is called when a page is about to be freed. It marks the beginning of the page with the indicator, allocates a pointer slot from the pool, and sets that pointer to the freed page, so the page can be recycled within seconds. The slot allocation is quite simple because we can assume the pool is big enough and reclaim happens in time, so in most cases the head never catches up with the tail. The pool is large but does not consume much memory: while it is empty and zeroed, it can be shrunk by page deduplication.

int virt_mark_page_release(struct page *page)
{
    int pool_id ;
    unsigned long long alloc_id;
    unsigned long long state;
    unsigned long long idx ;
    volatile struct virt_release_mark *mark ;
    unsigned long long time_begin = 0;
    if(!guest_mem_pool)
    {
        clear_page_content((void*)page);
        set_guest_page_clear_ok();
        return -1;
    }
    if(!pone_page_reclaim_enable)
    {
        reset_guest_page_clear_ok();
        return -1;
    }
    time_begin = rdtsc_ordered();
    pool_id = guest_mem_pool->pool_id;
    /* allocate a slot in the shared memory pool; an unused slot contains 0 */
    alloc_id = atomic64_add_return(1,(atomic64_t*)&guest_mem_pool->alloc_idx)-1;
    idx = alloc_id%guest_mem_pool->desc_max;
    state = guest_mem_pool->desc[idx];
    
    mark = get_page_content((void*)page);
    if(0 == state)
    {
        /* store the reclaim identification (the pfn/gfn) in the shared-memory slot */
        if(0 != atomic64_cmpxchg((atomic64_t*)&guest_mem_pool->desc[idx],0,page_to_pfn(page)))
        {
            /* the allocated slot was already taken by another thread; make the release mark invalid */
            pool_id = guest_mem_pool->pool_max  +1;
            idx = guest_mem_pool->desc_max +1;
            //atomic64_add(1,(atomic64_t*)&guest_mem_pool->mark_release_err_conflict);
            
        }
        else
        {
            //atomic64_add(1,(atomic64_t*)&guest_mem_pool->mark_release_ok);
        }
        /* write the release mark at the beginning of the freed page */
        mark->pool_id = pool_id;
        mark->alloc_id = idx;
        barrier();
        mark->desc = guest_mem_pool->mem_ind;
        barrier();
        put_page_content((void*)mark);
        PONE_TIMEPOINT_SET(page_reclaim_free_ok , rdtsc_ordered()- time_begin);
        return 0;
    }
    else
    {
        /* the allocated slot is in use by another thread; write an invalid release mark */
        mark->pool_id = guest_mem_pool->pool_max +1;
        mark->alloc_id = guest_mem_pool->desc_max +1;
        barrier();
        mark->desc = guest_mem_pool->mem_ind;
        barrier();
        put_page_content((void*)mark);
    }
    //atomic64_add(1,(atomic64_t*)&guest_mem_pool->mark_release_err_state);
    PONE_TIMEPOINT_SET(page_reclaim_free_fail , rdtsc_ordered()- time_begin);
    return -1;
}

Page Allocation hook: virt_mark_page_alloc() (page_reclaim_guest.c)
The allocation hook is called when a page is allocated. If the page has not been recycled yet, it still begins with the indicator; in that case the hook uses a lockless operation to undo the pointer and the indicator. If the beginning is zero, the page has already been reclaimed; it can safely be handed to the user, leaving the real allocation work to a future Copy-On-Write.

int virt_mark_page_alloc(struct page *page)
{
    unsigned long long state;
    unsigned long long idx ;
    volatile struct virt_release_mark *mark ;
    unsigned long long time_begin = 0;  
    if(!guest_mem_pool)
    {
        return 0;
    }

    if(!pone_page_reclaim_enable)
    {
        return 0;
    }
    time_begin = rdtsc_ordered();
    mark = get_page_content((void*)page);
    
    if(mark->desc == guest_mem_pool->mem_ind)
    {
        if(mark->pool_id == guest_mem_pool->pool_id)
        {
            if(mark->alloc_id < guest_mem_pool->desc_max)
            {
                idx = mark->alloc_id;
                state = guest_mem_pool->desc[mark->alloc_id];
                if(state == page_to_pfn(page))
                {
                    /* clear the reclaim identification in the shared memory pool */
                    if(state == atomic64_cmpxchg((atomic64_t*)&guest_mem_pool->desc[idx],state,0))
                    {
                        //atomic64_add(1,(atomic64_t*)&guest_mem_pool->mark_alloc_ok);
                    }
                    else
                    {
                        /* the identification was already cleared: the host kernel is reclaiming (or has reclaimed) this page; wait for it to finish */
                        while(mark->desc != 0)
                        {
                            barrier();
                        }
                    }
                }
            }
        }
        /*clear the release mark in the page*/
        mark->pool_id = 0;
        mark->alloc_id = 0;
        barrier();
        mark->desc = 0;
        barrier();
        put_page_content((void*)mark);
        PONE_TIMEPOINT_SET(page_reclaim_alloc_ok,rdtsc_ordered()-time_begin);
        return 0;
    }
    else
    {
    }
    //atomic64_add(1,(atomic64_t*)&guest_mem_pool->mark_alloc_err_state);
    put_page_content((void*)mark);
    //PONE_TIMEPOINT_SET(page_reclaim_alloc_fail,rdtsc_ordered()-time_begin);
    return -1;
}

Kernel reclaim process: process_virt_page_release() (page_reclaim_host.c)
This function is called from a kernel thread when a page is found to be reclaimable. It uses a lockless operation to undo the pointer and replace the page with a zero page. This is protected by the assumption that a page pointed to by a PIP pointer can never begin with zero, since a freed page still carries the non-zero release mark in its first bytes.
int process_virt_page_release(void *page_mem, unsigned long identification)
{
    int pool_id = 0;
    unsigned long long alloc_id = 0;
    unsigned long dsc_page_off = 0;
    void *page = NULL;
    void *dsc_page = NULL;
    unsigned long long *dsc = NULL;
    unsigned long new_ident = 0;
    unsigned long cmp_args[8] = {0};
    struct virt_release_mark *mark = page_mem;
    struct virt_mem_pool *pool = NULL;
    unsigned long time_begin = 0;   
    pool_id = mark->pool_id;
    alloc_id = mark->alloc_id;
    if(pool_id > MEM_POOL_MAX)
    {
        if(pool_id != MEM_POOL_MAX +1)
        {
            PONE_DEBUG("virt mem error \r\n");
        }
        return VIRT_MEM_FAIL;
    }
    if(NULL == mem_pool_addr[pool_id])
    {
        return VIRT_MEM_FAIL;
    }
    pool =  mem_pool_addr[pool_id];
    
    if(alloc_id > pool->desc_max)
    {
        return VIRT_MEM_FAIL;
    }

    time_begin = rdtsc_ordered();
    /* get the shared-memory pool page referenced by the release mark, where the page reclaim identification is recorded */
    page = get_reclaim_identification_page(pool,mark,&dsc_page_off); 
    if(NULL == page)
    {
        return VIRT_MEM_FAIL;
    }
    PONE_TIMEPOINT_SET(ppr_get_ident_page ,rdtsc_ordered()- time_begin);
    dsc_page = get_page_content(page);
    dsc = dsc_page+dsc_page_off;
    /*get the page reclaim ident from the share mem pool page*/
    new_ident = *dsc;

    time_begin = rdtsc_ordered();
    /* compare the identification from the pool with the identification argument; if they match, reclaim the page */
    if(VIRT_MEM_OK == compare_reclaim_identification(pool,new_ident,identification,cmp_args))   
    {
        PONE_TIMEPOINT_SET(ppr_cmp_ident ,rdtsc_ordered()- time_begin);
        /* clear the identification in the shared memory pool; if it was already cleared, the guest kernel has allocated this page again */
        if(new_ident == atomic64_cmpxchg((atomic64_t*)dsc,new_ident,0))
        {
            time_begin = rdtsc_ordered();
            /*reclaim the page*/
            if(VIRT_MEM_OK ==replace_reclaim_page(pool,identification,cmp_args))
            {
                PONE_TIMEPOINT_SET(ppr_replace_page ,rdtsc_ordered()- time_begin);
                put_page_content(page);
                put_page(page);
                return VIRT_MEM_OK;
            }
        }
        else
        {
            free_reclaim_cmp_args(pool,cmp_args);
        }
    }

    put_page_content(page);
    put_page(page);
    return VIRT_MEM_FAIL;
}


Kernel Scanning process: splitter_daemon_thread() (slice_state_daemon.c)
This is the body of the daemon thread. It periodically scans memory for deduplication purposes. When it finds a page that begins with the PIP indicator, it delivers the page to the reclaim entry; a sketch of what the indicator check itself might look like is given after the function.
static int splitter_daemon_thread(void *data)
{
    int i = 0;
    int j = 0;
    long long slice_num = 0;
    long long slice_state = 0;
    unsigned long slice_idx = 0;
    unsigned long slice_begin = 0;
    int volatile_oper = 0;
    int need_repeat =0 ;
    unsigned int scan_count = 0;
    unsigned long long start_jiffies = 0;
    unsigned long long end_jiffies = 0;
    unsigned long long cost_time = 0;
    unsigned long long slice_vcnt = 0;
    unsigned long long slice_scan_cnt = 0;
    long que_id;
    struct page *page = NULL;
    void    *page_addr = NULL;
    unsigned long long time_begin = 0;
    __set_current_state(TASK_RUNNING);

    do
    {

        volatile_oper = 0;
        need_repeat =0;
        scan_count++;
        if((scan_count % pone_daemon_merge_scan) == 0)
        {
            volatile_oper = 1;
        }
        start_jiffies = get_jiffies_64();
        if(pone_daemon_run)
        {
            for(i = 0 ; i<global_block->node_num;i++)
        {   
            slice_num = global_block->slice_node[i].slice_num;
            slice_begin = global_block->slice_node[i].slice_start;
            
            for(j = 0;j<slice_num;j++)
            {
                slice_state = get_slice_state(i,j);
                
                if((SLICE_VOLATILE == slice_state) ||(SLICE_WATCH == slice_state))
                {
                    slice_idx = slice_begin + j; 
                    page = pfn_to_page(slice_idx);

                    if((SLICE_VOLATILE == slice_state) && (pone_page_reclaim_enable ==1))
                    {
                        /* if the page state is volatile, check whether this page carries a PPR release mark */
                        page_addr = kmap_atomic(page);
                        if(PONE_OK == is_virt_page_release(page_addr))
                        {
                            /* the page has a release mark: send it to the queue for reclaim processing; otherwise fall through and check whether the merge period is reached */
                            kunmap_atomic(page_addr);
                            goto get_que;
                        }
                        kunmap_atomic(page_addr);   
                    }

                    if(!volatile_oper)
                    {
                        continue;
                    }
                    /* the merge period is reached: watch-state merge processing */
                    if(SLICE_WATCH == slice_state)
                    {
                        /* change the state from watch to watch_que and send the page to the daemon order queue for processing */
                        if(0 != change_slice_state(i,j,SLICE_WATCH,SLICE_WATCH_QUE))
                        {
                            need_repeat++;
                            continue;
                        }
                        slice_daemon_find_watch++;
                        /* processing of this queue is load-balanced */
                        lfo_write(slice_daemon_order_que,48,(unsigned long)page);   
                        continue;
                    }

                    /* volatile-state merge processing; the volatile count is an
                     * optimization: when a watched page is modified the count is
                     * incremented, and the page is scanned again in the next
                     * merge period */
get_cnt:
                    if(0 != (slice_vcnt = get_slice_volatile_cnt(i,j)))
                    {
                        slice_scan_cnt  = get_slice_scan_cnt(i,j);
                        if(slice_scan_cnt == slice_vcnt)
                        {   
                            if(0 != change_slice_scan_cnt(i,j,slice_scan_cnt,0))
                            {
                                goto get_cnt;
                            }
                            atomic64_add(1,(atomic64_t*)&slice_daemon_volatile_cnt[slice_vcnt]);
                        }
                        else
                        {
                            if(slice_scan_cnt > slice_vcnt)
                            {
                                printk("daemon cnt bug bug bug bug %lld,%lld \r\n",slice_scan_cnt,slice_vcnt);
                                if(0 == slice_vcnt)
                                {
                                    continue;
                                }
                            }
                            if(0 != change_slice_scan_cnt(i,j,slice_scan_cnt,slice_scan_cnt+1))
                            {
                                goto get_cnt;
                            }
                            continue;

                        }
                    }

                    /* when processing the volatile-state queue, pages may be write-protected; to avoid page-table lock contention, pages of the same process are dispatched to the same queue */
get_que:
                    que_id = pone_get_slice_que_id(page);
                    if((-1 == que_id) || (0 == que_id))
                    {
                        continue;
                    }
                    que_id = hash_64(que_id,48);
                    que_id = pone_que_stat_lookup(que_id);
                    if(SLICE_VOLATILE == slice_state)
                    {
                        if(0 != change_slice_state(i,j,SLICE_VOLATILE,SLICE_ENQUE))
                        {
                            need_repeat++;
                            continue;
                        }
                        slice_daemon_find_volatile++;
                        time_begin = rdtsc_ordered();
                        lfo_write(slice_order_que[que_id],0,(unsigned long)page);
                        PONE_TIMEPOINT_SET(lf_order_que_write,(rdtsc_ordered()- time_begin));
                    }
                }
            }
        }
        }
        
        end_jiffies = get_jiffies_64();
        
        cost_time = jiffies_to_msecs(end_jiffies - start_jiffies);
        daemon_sleep_period_in_loop++;
        if(cost_time >pone_daemon_base_scan_period)
        {
            msleep(pone_daemon_base_scan_period);
        
        }
        else
        {   
            msleep(pone_daemon_base_scan_period - cost_time);
        }
    }while(!kthread_should_stop());
    return 0;
}
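
The is_virt_page_release() check called above is not included in this mail. As a rough illustration only, assuming the simplified mark/pool layout sketched in section 2 (PONE_FAIL here is a placeholder return value, not necessarily the name used in the patch), the check could look roughly like this:

static int is_virt_page_release_sketch(void *page_addr)
{
    struct virt_release_mark *mark = page_addr;
    struct virt_mem_pool *pool;

    /* a page that was never marked (or was already reclaimed) begins with zero */
    if (0 == mark->desc)
        return PONE_FAIL;
    /* the indicator must name a pool this host actually knows about */
    if (mark->pool_id > MEM_POOL_MAX)
        return PONE_FAIL;
    if (NULL == mem_pool_addr[mark->pool_id])
        return PONE_FAIL;
    pool = mem_pool_addr[mark->pool_id];
    /* and the indicator value must match that pool's identity */
    if (mark->desc != pool->mem_ind)
        return PONE_FAIL;
    /* plausibility check only; the authoritative validation of the pool
     * slot itself happens later in process_virt_page_release() */
    return PONE_OK;
}
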
The content of the original emails and patches can be found here:
PPR description:
https://github.com/baibantech/dynamic_vm/wiki/PPR-Details
Patch:
https://github.com/baibantech/dynamic_vm/tree/master/dynamic_vm_0.5
DynamicVM project (includes both technologies):
https://github.com/baibantech/dynamic_vm.git
User's guide:
https://github.com/baibantech/dynamic_vm/wiki/Dynamic-Vm-Usage
