lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Thu, 20 Jan 2022 21:49:13 +0100
From:   David Hildenbrand <david@...hat.com>
To:     Nadav Amit <nadav.amit@...il.com>
Cc:     Matthew Wilcox <willy@...radead.org>,
        "zhangliang (AG)" <zhangliang5@...wei.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Linux-MM <linux-mm@...ck.org>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        wangzhigang17@...wei.com,
        Linus Torvalds <torvalds@...ux-foundation.org>
Subject: Re: [PATCH] mm: reuse the unshared swapcache page in do_wp_page

On 20.01.22 21:46, Nadav Amit wrote:
> 
> 
>> On Jan 20, 2022, at 12:37 PM, David Hildenbrand <david@...hat.com> wrote:
>>
>> On 20.01.22 21:09, David Hildenbrand wrote:
>>> On 20.01.22 21:07, Matthew Wilcox wrote:
>>>> On Thu, Jan 20, 2022 at 08:55:12PM +0100, David Hildenbrand wrote:
>>>>>>>> David, does any of it regards the lru_cache_add() reference issue that I
>>>>>>>> mentioned? [1]
>>>>
>>>>> +++ b/mm/memory.c
>>>>> @@ -3291,19 +3291,28 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
>>>>>        if (PageAnon(vmf->page)) {
>>>>>                struct page *page = vmf->page;
>>>>>
>>>>> -               /* PageKsm() doesn't necessarily raise the page refcount */
>>>>> -               if (PageKsm(page) || page_count(page) != 1)
>>>>> +               /*
>>>>> +                * PageKsm() doesn't necessarily raise the page refcount.
>>>>> +                *
>>>>> +                * These checks are racy as long as we haven't locked the page;
>>>>> +                * they are a pure optimization to avoid trying to lock the page
>>>>> +                * and trying to free the swap cache when there is little hope
>>>>> +                * it will actually result in a refcount of 1.
>>>>> +                */
>>>>> +               if (PageKsm(page) || page_count(page) > 1 + PageSwapCache(page))
>>>>>                        goto copy;
>>>>>                if (!trylock_page(page))
>>>>>                        goto copy;
>>>>> -               if (PageKsm(page) || page_mapcount(page) != 1 || page_count(page) != 1) {
>>>>> +               if (PageSwapCache(page))
>>>>> +                       try_to_free_swap(page);
>>>>> +               if (PageKsm(page) || page_count(page) != 1) {
>>>>>                        unlock_page(page);
>>>>>                        goto copy;
>>>>>                }
>>>>>                /*
>>>>> -                * Ok, we've got the only map reference, and the only
>>>>> -                * page count reference, and the page is locked,
>>>>> -                * it's dark out, and we're wearing sunglasses. Hit it.
>>>>> +                * Ok, we've got the only page reference from our mapping
>>>>> +                * and the page is locked, it's dark out, and we're wearing
>>>>> +                * sunglasses. Hit it.
>>>>>                 */
>>>>>                unlock_page(page);
>>>>>                wp_page_reuse(vmf);
>>>>>
>>>>>
>>>>> I added some vmstats that monitor various paths. After one run of
>>>>> 	./forceswap 2 1000000 1
>>>>> I'm left with a rough delta (including some noise) of
>>>>> 	anon_wp_copy_count 1799
>>>>> 	anon_wp_copy_count_early 1
>>>>> 	anon_wp_copy_lock 983396
>>>>> 	anon_wp_reuse 0
>>>>>
>>>>> The relevant part of your reproducer is
>>>>>
>>>>> 	for (i = 0; i < nops; i++) {
>>>>> 		if (madvise((void *)p, PAGE_SIZE * npages, MADV_PAGEOUT)) {
>>>>> 			perror("madvise");
>>>>> 			exit(-1);
>>>>> 		}
>>>>>
>>>>> 		for (j = 0; j < npages; j++) {
>>>>> 			c = p[j * PAGE_SIZE];
>>>>> 			c++;
>>>>> 			time -= rdtscp();
>>>>> 			p[j * PAGE_SIZE] = c;
>>>>> 			time += rdtscp();
>>>>> 		}
>>>>> 	}
>>>>>
>>>>> For this specific reproducer at least, the page lock seems to be the thingy that prohibits
>>>>> reuse if I interpret the numbers correctly. We pass the initial page_count() check.
>>>>>
>>>>> Haven't looked into the details, and I would be curious how that performs with actual
>>>>> workloads, if we can reproduce similar behavior.
>>>>
>>>> I don't see how that patch addresses the lru issue.  Wouldn't we need
>>>> something like ...
>>>>
>>>> 	if (!PageLRU(page))
>>>> 		lru_add_drain_all();
>>>>
>>
>> lru_add_drain_all() takes a mutex ... best we can do I guess is drain
>> the local CPU using lru_add_drain(). I'll go play with it and see what
>> breaks :)
>>
> 
> I did hack something similar and it solved the problem, but I felt it is
> a hack. If the thread is scheduled on another core, or if the write fault
> is triggered by another thread it wouldn’t work.

Yes, it will not match easily. One question would be how often it would
help in practice and if it would be worth the price.

> 
> If you look for a real-world workload that behaves similarly, you can try
> memcached with memory pressure and low latency device (I used 
> pmem-emulated). This is the workload in which I encountered the issue
> first.

Yes, I agree. So PageAnonExclusive is our best bet ... hopefully.

-- 
Thanks,

David / dhildenb

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ