linux-kernel - Re: [PATCH] mm: make fault_around

Message-ID: <20160516142900.GB9540@node.shutemov.name>
Date:	Mon, 16 May 2016 17:29:00 +0300
From:	"Kirill A. Shutemov" <kirill@...temov.name>
To:	Minchan Kim <minchan@...nel.org>
Cc:	Vinayak Menon <vinmenon@...eaurora.org>,
	Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org,
	linux-kernel@...r.kernel.org, dan.j.williams@...el.com,
	mgorman@...e.de, vbabka@...e.cz, kirill.shutemov@...ux.intel.com,
	dave.hansen@...ux.intel.com, hughd@...gle.com
Subject: Re: [PATCH] mm: make fault_around_bytes configurable

On Mon, May 16, 2016 at 11:18:54PM +0900, Minchan Kim wrote:
> On Tue, May 10, 2016 at 11:48:42AM +0900, Minchan Kim wrote:
> > On Mon, May 09, 2016 at 04:32:51PM +0900, Minchan Kim wrote:
> > > Hello,
> > > 
> > > On Mon, Apr 25, 2016 at 05:21:11PM +0530, Vinayak Menon wrote:
> > > > 
> > > > 
> > > > On 4/22/2016 3:14 PM, Kirill A. Shutemov wrote:
> > > > > On Fri, Apr 22, 2016 at 02:15:08PM +0530, Vinayak Menon wrote:
> > > > >> On 04/22/2016 05:31 AM, Andrew Morton wrote:
> > > > >>> On Mon, 18 Apr 2016 20:47:16 +0530 Vinayak Menon <vinmenon@...eaurora.org> wrote:
> > > > >>>
> > > > >>>> Mapping pages around fault is found to cause performance degradation
> > > > >>>> in certain use cases. The test performed here is launch of 10 apps
> > > > >>>> one by one, doing something with the app each time, and then repeating
> > > > >>>> the same sequence once more, on an ARM 64-bit Android device with 2GB
> > > > >>>> of RAM. The time taken to launch the apps is found to be better when
> > > > >>>> fault around feature is disabled by setting fault_around_bytes to page
> > > > >>>> size (4096 in this case).
> > > > >>> Well that's one workload, and a somewhat strange one.  What is the
> > > > >>> effect on other workloads (of which there are a lot!).
> > > > >>>
> > > > >> This workload emulates the way a user would use his mobile device, opening
> > > > >> an application, using it for some time, switching to next, and then coming
> > > > >> back to the same application later. Another stat which shows significant
> > > > >> degradation on Android with fault_around is device boot up time. I have not
> > > > >> tried any other workload other than these.
> > > > >>
> > > > >>>> The tests were done on 3.18 kernel. 4 extra vmstat counters were added
> > > > >>>> for debugging. pgpgoutclean accounts the clean pages reclaimed via
> > > > >>>> __delete_from_page_cache. pageref_activate, pageref_activate_vm_exec,
> > > > >>>> and pageref_keep accounts the mapped file pages activated and retained
> > > > >>>> by page_check_references.
> > > > >>>>
> > > > >>>> === Without swap ===
> > > > >>>>                           3.18             3.18-fault_around_bytes=4096
> > > > >>>> -----------------------------------------------------------------------
> > > > >>>> workingset_refault        691100           664339
> > > > >>>> workingset_activate       210379           179139
> > > > >>>> pgpgin                    4676096          4492780
> > > > >>>> pgpgout                   163967           96711
> > > > >>>> pgpgoutclean              1090664          990659
> > > > >>>> pgalloc_dma               3463111          3328299
> > > > >>>> pgfree                    3502365          3363866
> > > > >>>> pgactivate                568134           238570
> > > > >>>> pgdeactivate              752260           392138
> > > > >>>> pageref_activate          315078           121705
> > > > >>>> pageref_activate_vm_exec  162940           55815
> > > > >>>> pageref_keep              141354           51011
> > > > >>>> pgmajfault                24863            23633
> > > > >>>> pgrefill_dma              1116370          544042
> > > > >>>> pgscan_kswapd_dma         1735186          1234622
> > > > >>>> pgsteal_kswapd_dma        1121769          1005725
> > > > >>>> pgscan_direct_dma         12966            1090
> > > > >>>> pgsteal_direct_dma        6209             967
> > > > >>>> slabs_scanned             1539849          977351
> > > > >>>> pageoutrun                1260             1333
> > > > >>>> allocstall                47               7
> > > > >>>>
> > > > >>>> === With swap ===
> > > > >>>>                           3.18             3.18-fault_around_bytes=4096
> > > > >>>> -----------------------------------------------------------------------
> > > > >>>> workingset_refault        597687           878109
> > > > >>>> workingset_activate       167169           254037
> > > > >>>> pgpgin                    4035424          5157348
> > > > >>>> pgpgout                   162151           85231
> > > > >>>> pgpgoutclean              928587           1225029
> > > > >>>> pswpin                    46033            17100
> > > > >>>> pswpout                   237952           127686
> > > > >>>> pgalloc_dma               3305034          3542614
> > > > >>>> pgfree                    3354989          3592132
> > > > >>>> pgactivate                626468           355275
> > > > >>>> pgdeactivate              990205           771902
> > > > >>>> pageref_activate          294780           157106
> > > > >>>> pageref_activate_vm_exec  141722           63469
> > > > >>>> pageref_keep              121931           63028
> > > > >>>> pgmajfault                67818            45643
> > > > >>>> pgrefill_dma              1324023          977192
> > > > >>>> pgscan_kswapd_dma         1825267          1720322
> > > > >>>> pgsteal_kswapd_dma        1181882          1365500
> > > > >>>> pgscan_direct_dma         41957            9622
> > > > >>>> pgsteal_direct_dma        25136            6759
> > > > >>>> slabs_scanned             689575           542705
> > > > >>>> pageoutrun                1234             1538
> > > > >>>> allocstall                110              26
> > > > >>>>
> > > > >>>> Looks like with fault_around, there is more pressure on reclaim because
> > > > >>>> of the presence of more mapped pages, resulting in more IO activity,
> > > > >>>> more faults, more swapping, and allocstalls.
> > > > >>> A few of those things did get a bit worse?
> > > > > >> I think some numbers (workingset, pgpgin, pgpgoutclean, etc.) look
> > > > > >> better with fault_around because the increased number of mapped pages
> > > > > >> results in fewer file pages being reclaimed (pageref_activate,
> > > > > >> pageref_activate_vm_exec, pageref_keep above), but in more swapping.
> > > > > >> Latency numbers are far worse with fault_around_bytes + swap, possibly because
> > > > > >> of increased swapping, decreased kswapd efficiency, and more
> > > > > >> allocstalls.
> > > > > >> So the problem looks to be that unwanted pages are mapped around the fault
> > > > > >> and page_check_references is unaware of this.
> > > > > > Hm. It makes me think we should make ptes set up by faultaround old.
> > > > > >
> > > > > > Although, it would defeat (to some extent) the purpose of faultaround on
> > > > > > architectures without HW accessed bit :-/
> > > > >
> > > > > Could you check if the patch below changes the situation?
> > > > > It would require some more work to not mark the pte we've got fault for old.
> > > > 
> > > > Column at the end shows the values with the patch
> > > > 
> > > >                           3.18     3.18-fab=4096  3.18-Kirill's-fix
> > > > -------------------------------------------------------------------
> > > > workingset_refault        597687   878109         790207
> > > > workingset_activate       167169   254037         207912
> > > > pgpgin                    4035424  5157348        4793116
> > > > pgpgout                   162151   85231          85539
> > > > pgpgoutclean              928587   1225029        1129088
> > > > pswpin                    46033    17100          8926
> > > > pswpout                   237952   127686         103435
> > > > pgalloc_dma               3305034  3542614        3401000
> > > > pgfree                    3354989  3592132        3457783
> > > > pgactivate                626468   355275         326716
> > > > pgdeactivate              990205   771902         697392
> > > > pageref_activate          294780   157106         138451
> > > > pageref_activate_vm_exec  141722   63469          64585
> > > > pageref_keep              121931   63028          65811
> > > > pgmajfault                67818    45643          34944
> > > > pgrefill_dma              1324023  977192         874497
> > > > pgscan_kswapd_dma         1825267  1720322        1577483
> > > > pgsteal_kswapd_dma        1181882  1365500        1243968
> > > > pgscan_direct_dma         41957    9622           9387
> > > > pgsteal_direct_dma        25136    6759           7108
> > > > slabs_scanned             689575   542705         618839
> > > > pageoutrun                1234     1538           1450
> > > > allocstall                110      26             13
> > > > 
> > > > Everything seems to have improved except slabs_scanned, possibly because
> > > > of this check, which Minchan pointed out, that results in higher pressure on slabs:
> > > > 
> > > > if (page_mapped(page) || PageSwapCache(page))
> > > >     sc->nr_scanned++;
> > > > 
> > > > I had added some traces to monitor the vmpressure values. Those also seem to
> > > > be high, possibly for the same reason.
> > > > 
> > > > Should the pressure be doubled only if the page is mapped and referenced?
> > > 
> > > Yes, pte_mkold is not perfect at the moment.
> > > 
> > > Anyway, the above heuristic has been there for a long time, maybe since I was
> > > born :) (I don't want to argue about why it's there or whether it's right.) So
> > > I'm really hesitant to change it, because doing so might bite some workloads.
> > > (I'm not against the change; I just don't want to make it myself
> > > and take the potential blame.) IOW, Kirill's fault_around broke it too, so it
> > > could bite some workloads.
> > > 
> > > At least, as Vinayak mentioned, it would change the vmpressure level, so users of
> > > vmpressure can be affected. AFAIK, some embedded vendors rely on
> > > vmpressure to control memory management, so it will hurt them.
> > > Slab shrinking behavior was changed as well. Unfortunately, I don't
> > > know of any workload that depends on it.
> > > 
> > > As another regression, in my company's product we snapshot a process
> > > together with its workingset for fast resume later. For that, we have treated
> > > pte-mapped pages as the workingset, but the snapshot started to include
> > > non-workingset pages once fault-around was merged. That means the snapshot
> > > image is larger, so we need more storage space and things start
> > > to slow down. I guess mincore(2) users will be affected.
> > > 
> > > Additional note: in the embedded world there are lots of ARM products without a
> > > HW access bit, although ARM started to support it recently, and there a
> > > sequential file access workload is not important compared to memory reclaim.
> > > So fault_around's benefit could be highly limited compared to HW-access-bit
> > > architectures running server workloads.
> > > 
> > > I want to ask again.
> > > I guess we could disable fault_around by kernel parameter, but does it
> > > sound reasonable to enable fault_around by default for every arch
> > > at the cost of the above regressions?
> > > 
> > > I'm not against that. What I want is for some fixes for the
> > > regression to go to -stable.
> > > 
> > > > 
> > > > There is big improvement in avg latency, but still 5% higher than with fault_around
> > > > disabled. I will try to debug this further.
> > 
> > I did a quick test on my ARM machine.
> > 
> > 512M file, mmap, sequential read of every word
> > 
> > = vanilla fault_around=4096 =
> > minor fault: 131291
> > elapsed time(usec): 6686236
> > 
> > = vanilla fault_around=65536 =
> > minor fault: 12577
> > elapsed time(usec): 6586959
> > 
> > I tested 3 times and the results seemed stable.
> > Minor faults were reduced by 90%. That's a huge win, but looking at elapsed time,
> > the win is not huge: just about 1.5%.
> > 
> > = pte_mkold applied fault_around=4096 =
> > minor fault: 131291
> > elapsed time(usec): 6608358
> > 
> > = pte_mkold applied fault_around=65536 =
> > minor fault: 143609
> > elapsed time(usec): 6772520
> > 
> > I tested 3 times and the results seemed stable.
> > Minor faults actually increased, and elapsed time was slower, with
> > fault_around.
> > The gain is really not clear.
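
For reference, the percentages in the two quoted result sets can be rechecked with a few lines of C. This is only a sketch; `pct_change` is a hypothetical helper name, not something from the test itself.

```c
#include <stdio.h>

/*
 * Relative change from `before` to `after`, in percent.
 * Negative means the value went down. Hypothetical helper,
 * used only to recheck the figures quoted in this thread.
 */
double pct_change(double before, double after)
{
    return (after - before) / before * 100.0;
}

/* Print one comparison line for a counter measured at 4096 and 65536. */
void report(const char *label, double at_4096, double at_65536)
{
    printf("%-28s %+.2f%%\n", label, pct_change(at_4096, at_65536));
}
```

Calling `report("vanilla minor faults", 131291, 12577)` and `report("vanilla elapsed", 6686236, 6586959)` gives roughly -90.4% and -1.5%. With pte_mkold applied, the same comparison gives about +9.4% for minor faults (131291 to 143609) and about +2.5% for elapsed time (6608358 to 6772520).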
> 
> Kirill,
> You wanted a test on a system without a HW access bit, and I did one.
> What's your opinion?

Sorry for the late response.

My patch is incomplete: we need to find a way to not mark the pte as old if we
handle the page fault for the address the pte represents.
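
To make the intended rule concrete, here is a toy user-space model of it. It is not kernel code and not the patch: `struct toy_pte` and `toy_fault_around` are made-up names, and the `young` flag only stands in for the accessed bit.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a page table entry; this is NOT the kernel's pte_t. */
struct toy_pte {
    bool present;
    bool young;   /* models the "accessed" bit */
};

/*
 * Map the entries in [start, end) around a fault at index fault_idx.
 * Entries that are already present are left alone. Newly mapped
 * entries are marked old, except the one the fault was actually
 * taken for, which is marked young.
 * Returns the number of entries newly mapped.
 */
size_t toy_fault_around(struct toy_pte *ptes, size_t start, size_t end,
                        size_t fault_idx)
{
    size_t mapped = 0;

    for (size_t i = start; i < end; i++) {
        if (ptes[i].present)
            continue;
        ptes[i].present = true;
        ptes[i].young = (i == fault_idx);
        mapped++;
    }
    return mapped;
}
```

In this model, a fault at index 5 over a 16-entry window leaves only entry 5 young. On a machine without a hardware accessed bit, that is the entry that would otherwise take a second fault just to be marked accessed.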

Once this is done, the number of page faults shouldn't be higher with
fault-around enabled, even on machines without a hardware accessed bit. This
will address the performance regression with the patch on such machines.
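
For anyone who wants to recheck this, the sequential-read test quoted above could be approximated in user space roughly as follows. This is a sketch under assumptions, not the program that produced the quoted numbers: `struct seq_read_result` and `seq_read_file` are hypothetical names, and the minor-fault count is taken from `getrusage()`.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical result record; the field names are not from the original test. */
struct seq_read_result {
    size_t bytes;            /* size of the mapping */
    long minor_faults;       /* ru_minflt delta across the read loop */
    long long elapsed_usec;  /* wall-clock time of the read loop */
    uint64_t checksum;       /* sum of all words, keeps the loop from being elided */
};

static long long now_usec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000000LL + ts.tv_nsec / 1000;
}

static long minor_faults_so_far(void)
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_minflt;
}

/* mmap the file read-only and read every word once, in order.
 * Returns 0 on success, -1 on error. */
int seq_read_file(const char *path, struct seq_read_result *out)
{
    assert(out != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return -1;
    }

    size_t len = (size_t)st.st_size;
    void *map = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  /* the mapping stays valid after close */
    if (map == MAP_FAILED)
        return -1;

    const unsigned long *words = map;
    size_t nwords = len / sizeof(unsigned long);
    uint64_t sum = 0;

    long faults_before = minor_faults_so_far();
    long long t_before = now_usec();
    for (size_t i = 0; i < nwords; i++)
        sum += words[i];
    out->elapsed_usec = now_usec() - t_before;
    out->minor_faults = minor_faults_so_far() - faults_before;
    out->bytes = len;
    out->checksum = sum;

    munmap(map, len);
    return 0;
}
```

Running it on the same file once per fault_around_bytes setting, with the file already in the page cache so that the faults are minor ones, should give numbers comparable in kind to the ones quoted above.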

I'll try to find time to update the patch soon.

-- 
 Kirill A. Shutemov