Date:	Mon, 16 May 2016 23:18:54 +0900
From:	Minchan Kim <minchan@...nel.org>
To:	Minchan Kim <minchan@...nel.org>,
	"Kirill A. Shutemov" <kirill@...temov.name>
Cc:	Vinayak Menon <vinmenon@...eaurora.org>,
	"Kirill A. Shutemov" <kirill@...temov.name>,
	Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org,
	linux-kernel@...r.kernel.org, dan.j.williams@...el.com,
	mgorman@...e.de, vbabka@...e.cz, kirill.shutemov@...ux.intel.com,
	dave.hansen@...ux.intel.com, hughd@...gle.com
Subject: Re: [PATCH] mm: make fault_around_bytes configurable

On Tue, May 10, 2016 at 11:48:42AM +0900, Minchan Kim wrote:
> On Mon, May 09, 2016 at 04:32:51PM +0900, Minchan Kim wrote:
> > Hello,
> > 
> > On Mon, Apr 25, 2016 at 05:21:11PM +0530, Vinayak Menon wrote:
> > > 
> > > 
> > > On 4/22/2016 3:14 PM, Kirill A. Shutemov wrote:
> > > > On Fri, Apr 22, 2016 at 02:15:08PM +0530, Vinayak Menon wrote:
> > > >> On 04/22/2016 05:31 AM, Andrew Morton wrote:
> > > >>> On Mon, 18 Apr 2016 20:47:16 +0530 Vinayak Menon <vinmenon@...eaurora.org> wrote:
> > > >>>
> > > >>>> Mapping pages around a fault is found to cause performance degradation
> > > >>>> in certain use cases. The test performed here is the launch of 10 apps
> > > >>>> one by one, doing something with each app, and then repeating the same
> > > >>>> sequence once more, on an ARM 64-bit Android device with 2GB of RAM.
> > > >>>> The time taken to launch the apps is found to be better when the
> > > >>>> fault-around feature is disabled by setting fault_around_bytes to the
> > > >>>> page size (4096 in this case).
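> > > >>>>
> > > >>>> For reference, a minimal sketch (not part of the patch) of how the knob
> > > >>>> was set for these tests, assuming debugfs is mounted at /sys/kernel/debug:
> > > >>>>
> > > >>>> /* Illustrative only: write the page size to the fault_around_bytes
> > > >>>>  * debugfs knob so fault-around maps just the faulting page. */
> > > >>>> #include <stdio.h>
> > > >>>> #include <unistd.h>
> > > >>>>
> > > >>>> int main(void)
> > > >>>> {
> > > >>>> 	long page_size = sysconf(_SC_PAGESIZE);	/* 4096 on this device */
> > > >>>> 	FILE *f = fopen("/sys/kernel/debug/fault_around_bytes", "w");
> > > >>>>
> > > >>>> 	if (!f) {
> > > >>>> 		perror("fault_around_bytes");
> > > >>>> 		return 1;
> > > >>>> 	}
> > > >>>> 	fprintf(f, "%ld\n", page_size);
> > > >>>> 	return fclose(f) ? 1 : 0;
> > > >>>> }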
> > > >>> Well that's one workload, and a somewhat strange one.  What is the
> > > >>> effect on other workloads (of which there are a lot!).
> > > >>>
> > > >> This workload emulates the way a user would use a mobile device: opening
> > > >> an application, using it for some time, switching to the next, and then
> > > >> coming back to the same application later. Another stat which shows
> > > >> significant degradation on Android with fault_around is device boot-up
> > > >> time. I have not tried any workloads other than these.
> > > >>
> > > >>>> The tests were done on a 3.18 kernel. 4 extra vmstat counters were added
> > > >>>> for debugging. pgpgoutclean accounts for the clean pages reclaimed via
> > > >>>> __delete_from_page_cache. pageref_activate, pageref_activate_vm_exec,
> > > >>>> and pageref_keep account for the mapped file pages activated and retained
> > > >>>> by page_check_references.
> > > >>>>
> > > >>>> === Without swap ===
> > > >>>>                           3.18             3.18-fault_around_bytes=4096
> > > >>>> -----------------------------------------------------------------------
> > > >>>> workingset_refault        691100           664339
> > > >>>> workingset_activate       210379           179139
> > > >>>> pgpgin                    4676096          4492780
> > > >>>> pgpgout                   163967           96711
> > > >>>> pgpgoutclean              1090664          990659
> > > >>>> pgalloc_dma               3463111          3328299
> > > >>>> pgfree                    3502365          3363866
> > > >>>> pgactivate                568134           238570
> > > >>>> pgdeactivate              752260           392138
> > > >>>> pageref_activate          315078           121705
> > > >>>> pageref_activate_vm_exec  162940           55815
> > > >>>> pageref_keep              141354           51011
> > > >>>> pgmajfault                24863            23633
> > > >>>> pgrefill_dma              1116370          544042
> > > >>>> pgscan_kswapd_dma         1735186          1234622
> > > >>>> pgsteal_kswapd_dma        1121769          1005725
> > > >>>> pgscan_direct_dma         12966            1090
> > > >>>> pgsteal_direct_dma        6209             967
> > > >>>> slabs_scanned             1539849          977351
> > > >>>> pageoutrun                1260             1333
> > > >>>> allocstall                47               7
> > > >>>>
> > > >>>> === With swap ===
> > > >>>>                           3.18             3.18-fault_around_bytes=4096
> > > >>>> -----------------------------------------------------------------------
> > > >>>> workingset_refault        597687           878109
> > > >>>> workingset_activate       167169           254037
> > > >>>> pgpgin                    4035424          5157348
> > > >>>> pgpgout                   162151           85231
> > > >>>> pgpgoutclean              928587           1225029
> > > >>>> pswpin                    46033            17100
> > > >>>> pswpout                   237952           127686
> > > >>>> pgalloc_dma               3305034          3542614
> > > >>>> pgfree                    3354989          3592132
> > > >>>> pgactivate                626468           355275
> > > >>>> pgdeactivate              990205           771902
> > > >>>> pageref_activate          294780           157106
> > > >>>> pageref_activate_vm_exec  141722           63469
> > > >>>> pageref_keep              121931           63028
> > > >>>> pgmajfault                67818            45643
> > > >>>> pgrefill_dma              1324023          977192
> > > >>>> pgscan_kswapd_dma         1825267          1720322
> > > >>>> pgsteal_kswapd_dma        1181882          1365500
> > > >>>> pgscan_direct_dma         41957            9622
> > > >>>> pgsteal_direct_dma        25136            6759
> > > >>>> slabs_scanned             689575           542705
> > > >>>> pageoutrun                1234             1538
> > > >>>> allocstall                110              26
> > > >>>>
> > > >>>> Looks like with fault_around, there is more pressure on reclaim because
> > > >>>> of the presence of more mapped pages, resulting in more IO activity,
> > > >>>> more faults, more swapping, and allocstalls.
> > > >>> A few of those things did get a bit worse?
> > > >> I think some numbers (like workingset, pgpgin, pgpgoutclean etc.) look
> > > >> better with fault_around because the increased number of mapped pages
> > > >> results in fewer file pages being reclaimed (pageref_activate,
> > > >> pageref_activate_vm_exec, pageref_keep above), but at the cost of
> > > >> increased swapping. Latency numbers are far worse with fault_around_bytes
> > > >> + swap, possibly because of the increased swapping, the decrease in kswapd
> > > >> efficiency and the increase in allocstalls.
> > > >> So the problem looks to be that unwanted pages are mapped around the fault
> > > >> and page_check_references is unaware of this.
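> > > >>
> > > >> To illustrate the decision being defeated here, a grossly simplified
> > > >> sketch written from memory (not the kernel's actual code): any
> > > >> referenced pte makes reclaim keep or activate a file page, so pages
> > > >> mapped young by fault-around look like workingset even if the
> > > >> application never touched them.
> > > >>
> > > >> enum pageref_sketch { SKETCH_RECLAIM, SKETCH_KEEP, SKETCH_ACTIVATE };
> > > >>
> > > >> static enum pageref_sketch
> > > >> check_references_sketch(int referenced_ptes, int referenced_page, int vm_exec)
> > > >> {
> > > >> 	if (referenced_ptes) {
> > > >> 		if (referenced_page || referenced_ptes > 1)
> > > >> 			return SKETCH_ACTIVATE;	/* promote to active LRU */
> > > >> 		if (vm_exec)
> > > >> 			return SKETCH_ACTIVATE;	/* executable file pages favored */
> > > >> 		return SKETCH_KEEP;		/* keep on the inactive LRU */
> > > >> 	}
> > > >> 	return SKETCH_RECLAIM;			/* no pte references: reclaimable */
> > > >> }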
> > > > Hm. It makes me think we should make the ptes set up by faultaround old.
> > > >
> > > > Although that would defeat (to some extent) the purpose of faultaround on
> > > > architectures without a HW accessed bit :-/
> > > >
> > > > Could you check if the patch below changes the situation?
> > > > It would require some more work to avoid marking the pte we actually got
> > > > the fault for as old.
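> > > >
> > > > (The patch itself is trimmed from this quote. Purely as a rough sketch of
> > > > the idea and not the actual patch -- `prefaulted` is a hypothetical flag
> > > > marking ptes populated by fault-around rather than by the real fault:)
> > > >
> > > > 	entry = mk_pte(page, vma->vm_page_prot);
> > > > 	if (prefaulted)				/* hypothetical flag */
> > > > 		entry = pte_mkold(entry);	/* start with accessed bit clear */
> > > > 	set_pte_at(vma->vm_mm, address, pte, entry);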
> > > 
> > > Column at the end shows the values with the patch
> > > 
> > >                           3.18      3.18-fab=4096   3.18-Kirill's-fix
> > > -----------------------------------------------------------------------
> > > workingset_refault        597687    878109          790207
> > > workingset_activate       167169    254037          207912
> > > pgpgin                    4035424   5157348         4793116
> > > pgpgout                   162151    85231           85539
> > > pgpgoutclean              928587    1225029         1129088
> > > pswpin                    46033     17100           8926
> > > pswpout                   237952    127686          103435
> > > pgalloc_dma               3305034   3542614         3401000
> > > pgfree                    3354989   3592132         3457783
> > > pgactivate                626468    355275          326716
> > > pgdeactivate              990205    771902          697392
> > > pageref_activate          294780    157106          138451
> > > pageref_activate_vm_exec  141722    63469           64585
> > > pageref_keep              121931    63028           65811
> > > pgmajfault                67818     45643           34944
> > > pgrefill_dma              1324023   977192          874497
> > > pgscan_kswapd_dma         1825267   1720322         1577483
> > > pgsteal_kswapd_dma        1181882   1365500         1243968
> > > pgscan_direct_dma         41957     9622            9387
> > > pgsteal_direct_dma        25136     6759            7108
> > > slabs_scanned             689575    542705          618839
> > > pageoutrun                1234      1538            1450
> > > allocstall                110       26              13
> > > 
> > > Everything seems to have improved except slabs_scanned, possibly because
> > > of this check which Minchan pointed out, which results in higher pressure
> > > on slabs:
> > > 
> > > if (page_mapped(page) || PageSwapCache(page))
> > > 	sc->nr_scanned++;
> > > 
> > > I had added some traces to monitor the vmpressure values. Those also seem
> > > to be high, possibly because of the same reason.
> > > 
> > > Should the pressure be doubled only if the page is mapped and referenced?
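> > >
> > > i.e. something along these lines (hypothetical sketch, not a patch;
> > > `referenced` would have to come from a page_referenced()-style check):
> > >
> > > if ((page_mapped(page) && referenced) || PageSwapCache(page))
> > > 	sc->nr_scanned++;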
> > 
> > Yes, pte_mkold is not perfect at the moment.
> > 
> > Anyway, the heuristic above has been in there for a long time (since I was
> > born, maybe :)). I don't want to argue about why it's there or whether it's
> > right, but I'm really hesitant to change it because it might bite some
> > workloads. (That doesn't mean I'm against changing it; I just don't want to
> > make the change myself, to avoid potential blame.) IOW, Kirill's fault_around
> > broke it too, so it could bite some workloads.
> > 
> > At least, as Vinayak mentioned, it would change the vmpressure level, so
> > users of vmpressure could be affected. AFAIK, some vendors on the embedded
> > side rely on vmpressure to control memory management, so it would hurt them.
> > Slab shrinking behavior was changed as well. Unfortunately, I don't know of
> > any workload that depends on it.
> > 
> > As another regression, in one of my company's products we snapshot a process
> > together with its workingset for fast resume later. For that, we have treated
> > pte-mapped pages as the workingset for the snapshot, but since fault-around
> > was merged the snapshot has started to include non-workingset pages. That
> > means the snapshot image size has grown, so we need more storage space and
> > the whole thing slows down. I guess mincore(2) users will be affected, too.
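> >
> > For reference, this is roughly what such mincore(2) users do to estimate
> > which pages of a mapping are resident (a minimal sketch, not our actual
> > code):
> >
> > #include <stdlib.h>
> > #include <sys/mman.h>
> > #include <unistd.h>
> >
> > /* Count how many pages of a mapping mincore(2) reports as resident. */
> > static long resident_pages(void *addr, size_t len)
> > {
> > 	long page = sysconf(_SC_PAGESIZE);
> > 	size_t npages = (len + page - 1) / page;
> > 	unsigned char *vec = malloc(npages);
> > 	long resident = 0;
> >
> > 	if (!vec)
> > 		return -1;
> > 	if (mincore(addr, len, vec) != 0) {
> > 		free(vec);
> > 		return -1;
> > 	}
> > 	for (size_t i = 0; i < npages; i++)
> > 		if (vec[i] & 1)		/* low bit set: page is resident */
> > 			resident++;
> > 	free(vec);
> > 	return resident;
> > }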
> > 
> > Additional note: in the embedded world there are lots of ARM-based products
> > without a HW access bit (although ARM has started to support one recently),
> > and sequential file access workloads matter less there than memory reclaim
> > does. So fault_around's benefit could be highly limited compared to
> > HW-access-bit architectures running server workloads.
> > 
> > I want to ask again:
> > I guess we could disable fault_around via a kernel parameter, but does it
> > sound reasonable to enable fault_around by default for every arch at the
> > cost of the regressions above?
> > 
> > I'm not against that. I just want the fixes for the regression to go to
> > -stable.
> > 
> > > 
> > > There is a big improvement in average latency, but it is still 5% higher
> > > than with fault_around disabled. I will try to debug this further.
> 
> I did a quick test on my ARM machine.
> 
> 512M file mmap, sequential read of every word
> 
> = vanilla fault_around=4096 =
> minor fault: 131291
> elapsed time(usec): 6686236
> 
> = vanilla fault_around=65536 =
> minor fault: 12577
> elapsed time(usec): 6586959
> 
> I tested 3 times and the results seemed stable.
> Minor faults were reduced by 90%. That is a huge win, but looking at the
> elapsed time, the win is not huge: just about 1.5%.
> 
> = pte_mkold applied fault_around=4096 =
> minor fault: 131291
> elapsed time(usec): 6608358
> 
> = pte_mkold applied fault_around=65536 =
> minor fault: 143609
> elapsed time(usec): 6772520
> 
> I tested 3 times and the results seemed stable.
> With pte_mkold applied, minor faults actually increased and the elapsed time
> got slower with fault_around.
> The gain is really not clear.
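> 
> For what it's worth, the test above is essentially the following kind of
> loop (a rough sketch; the file path is a placeholder, and minor faults are
> read from getrusage(2)):
> 
> #include <fcntl.h>
> #include <stdio.h>
> #include <sys/mman.h>
> #include <sys/resource.h>
> #include <sys/stat.h>
> #include <sys/time.h>
> 
> int main(void)
> {
> 	int fd = open("/data/test-512m.dat", O_RDONLY);	/* placeholder path */
> 	struct stat st;
> 	struct rusage ru;
> 	struct timeval t0, t1;
> 	volatile long sum = 0;
> 
> 	if (fd < 0 || fstat(fd, &st) < 0)
> 		return 1;
> 	long *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
> 	if (p == MAP_FAILED)
> 		return 1;
> 
> 	gettimeofday(&t0, NULL);
> 	for (size_t i = 0; i < st.st_size / sizeof(long); i++)
> 		sum += p[i];				/* touch every word */
> 	gettimeofday(&t1, NULL);
> 
> 	getrusage(RUSAGE_SELF, &ru);
> 	printf("minor fault: %ld\n", ru.ru_minflt);
> 	printf("elapsed time(usec): %ld\n",
> 	       (long)((t1.tv_sec - t0.tv_sec) * 1000000L +
> 		      (t1.tv_usec - t0.tv_usec)));
> 	return 0;
> }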

Kirill,
You wanted a test on a non-HW access bit system, and I did one above.
What's your opinion?
