linux-kernel - excessive CPU utilization by isolate

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <4FEB8237.6030402@sandia.gov>
Date:	Wed, 27 Jun 2012 15:59:19 -0600
From:	"Jim Schutt" <jaschut@...dia.gov>
To:	"Andrew Morton" <akpm@...ux-foundation.org>,
	"Mel Gorman" <mel@....ul.ie>,
	"KAMEZAWA Hiroyuki" <kamezawa.hiroyu@...fujitsu.com>,
	"Rik van Riel" <riel@...hat.com>,
	"Minchan Kim" <minchan@...nel.org>
cc:	linux-mm@...ck.org, linux-kernel@...r.kernel.org,
	"ceph-devel@...r.kernel.org" <ceph-devel@...r.kernel.org>
Subject: excessive CPU utilization by isolate_freepages?

Hi,

I'm running into trouble with systems going unresponsive,
and perf suggests it's excessive CPU usage by isolate_freepages().
I'm currently testing 3.5-rc4, but I think this problem may have
first shown up in 3.4.  I'm only just learning how to use perf,
so I only currently have results to report for 3.5-rc4.

(FWIW I'm running the my distro version of perf; please let me know
if I need to compile the tools/perf version to match my kernel.)

The systems in question have 24 SAS drives spread across 3 HBAs,
running 24 Ceph OSD instances, one per drive.  FWIW these servers
are dual-socket Intel 5675 Xeons w/48 GB memory.  I've got ~160
Ceph Linux clients doing dd simultaneously to a Ceph file system
backed by 12 of these servers.

In the early phase of such a test, when things are running well,
here's what vmstat reports for the state of one of these servers:

2012-06-27 13:56:58.356-06:00
vmstat -w 4 16
procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
  r  b       swpd       free       buff      cache   si   so    bi    bo   in   cs  us sy  id wa st
31 15          0     287216        576   38606628    0    0     2  1158    2   14   1  3  95  0  0
27 15          0     225288        576   38583384    0    0    18 2222016 203357 134876  11 56  17 15  0
28 17          0     219256        576   38544736    0    0    11 2305932 203141 146296  11 49  23 17  0
  6 18          0     215596        576   38552872    0    0     7 2363207 215264 166502  12 45  22 20  0
22 18          0     226984        576   38596404    0    0     3 2445741 223114 179527  12 43  23 22  0
30 12          0     230844        576   38461648    0    0    14 2298537 216580 166661  12 45  23 20  0
29 19          0     237856        576   38422884    0    0     5 2332741 209408 157138  12 42  25 22  0
17 11          0     222156        576   38483992    0    0     4 2380495 210312 173121  12 39  28 22  0
11 13          0     216152        576   38463872    0    0    44 2362186 215236 176454  12 39  27 22  0
12 14          0     223704        576   38546720    0    0     2 2395530 214684 177518  12 39  28 22  0
39 17          0     219932        576   38598184    0    0     4 2428231 223284 179095  12 42  24 21  0
11 10          0     219956        576   38521048    0    0     4 2323853 207472 166665  12 39  28 21  0
23 15          0     216580        576   38451904    0    0     3 2241800 201049 163496  11 37  31 21  0
  9 13          0     225792        576   38451916    0    0    13 2281900 204869 171814  11 36  30 23  0
14 12          0     233820        576   38492728    0    0     4 2293423 207686 173019  11 37  31 21  0
11 20          0     213796        576   38533208    0    0     3 2288636 205605 168118  11 37  31 21  0


The system begins to struggle over the next several
minutes; here's what vmstat has to say:

2012-06-27 13:57:58.831-06:00
vmstat -w 4 16
procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
  r  b       swpd       free       buff      cache   si   so    bi    bo   in   cs  us sy  id wa st
21 16          0     224628        576   38526872    0    0     2  1233    9   19   1  4  95  0  0
12 19          0     232060        576   38501020    0    0     4 2366769 221418 159890  12 48  20 19  0
124  8          0     218548        576   38379656    0    0    13 2103075 199660 108618  11 65  13 12  0
24 10          0     300476        576   38230288    0    0    31 1966282 177472 84572  10 76   7  7  0
20 16          0     217584        576   38296700    0    0     9 2062571 195936 128810  10 55  20 15  0
53 12          0     235720        576   38247968    0    0    30 2035407 196973 133921  10 52  23 15  0
20 16          0     360340        576   38067992    0    0     6 2192179 208692 136784  11 54  19 15  0
26 10          0     310800        576   38093884    0    0    43 2138772 207105 118718  11 64  12 13  0
24 15          0     261108        576   38030828    0    0    68 2174015 205793 135302  11 56  18 15  0
  9 17          0     241816        576   37982072    0    0    20 2076145 194971 120285  10 60  16 14  0
37 15          0     255972        576   37892868    0    0    14 2225076 205694 126954  11 59  16 13  0
27 16          0     243212        576   37872704    0    0     6 2249476 210885 134684  12 60  15 14  0
30 10          0     217572        576   37795388    0    0     3 2128688 205027 118319  11 66  12 11  0
26 11          0     236420        576   37740164    0    0    23 2109709 205105 133925  10 56  19 14  0
45 15          0     330056        576   37619896    0    0    15 1948311 196188 119330  10 62  15 13  0
54 15          0     242696        576   37631500    0    0     4 2159530 202595 132588  11 59  16 15  0

2012-06-27 13:58:59.569-06:00
vmstat -w 4 16
procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
  r  b       swpd       free       buff      cache   si   so    bi    bo   in   cs  us sy  id wa st
14 15          0     274932        576   37621548    0    0     2  1301   15    0   1  4  95  0  0
44 16          0     278748        576   37509516    0    0     6 2097643 196722 112697  10 66  12 11  0
88 14          0     228088        576   37412008    0    0     6 2089559 202206 116146  10 66  13 11  0
94 11          0     348348        576   37270624    0    0     7 1906390 181488 87333   9 76   9  6  0
42 11          0     215996        576   37288556    0    0     6 1782459 184375 95901   9 74   9  7  0
45 11          0     323112        576   37146500    0    0    12 1868376 187814 103947   9 71  11  9  0
51  7          0     244560        576   37100124    0    0    35 1767496 181383 107259   9 68  12 11  0
74 12          0     221584        576   37016420    0    0    18 1884986 183376 93425   9 75   9  7  0
45 10          0     275564        576   36985324    0    0    23 1683688 167223 97036   8 75   9  8  0
19 10          0     322176        576   36813176    0    0    14 1747378 177594 97218   8 72  12  8  0
122  7          0     225256        576   36838084    0    0    26 1730643 177915 92621   8 75   9  8  0
243 10          0     223464        576   36765460    0    0    18 1730158 173059 79373   8 80   6  5  0
100 10          0     307528        576   36598456    0    0     4 1738567 174077 79585   9 82   6  4  0
243  6          0     370064        576   36358576    0    0     9 1586528 174680 85353   8 81   6  5  0
267  2          0     322640        576   36254044    0    0    40 1011650 129389 42277   5 93   1  1  0
210  4          0     505092        576   35865460    0    0    25 720825 116356 32422   3 96   1  0  0

2012-06-27 14:00:03.219-06:00
vmstat -w 4 16
procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
  r  b       swpd       free       buff      cache   si   so    bi    bo   in   cs  us sy  id wa st
75  1          0     566988        576   35664800    0    0     2  1355   21    3   1  4  95  0  0
433  1          0     964052        576   35069112    0    0     7 456359 102256 20901   2 98   0  0  0
547  3          0     820116        576   34893932    0    0    57 560507 114878 28115   3 96   0  0  0
806  2          0     606992        576   34848180    0    0   339 309668 101230 21056   2 98   0  0  0
708  1          0     529624        576   34708000    0    0   248 370886 101327 20062   2 97   0  0  0
231  5          0     504772        576   34663880    0    0   305 334824 95045 20407   2 97   1  1  0
158  6          0    1063088        576   33518536    0    0   531 847435 130696 47140   4 92   1  2  0
193  0          0    1449156        576   33035572    0    0   363 371279 94470 18955   2 96   1  1  0
266  6          0    1623512        576   32728164    0    0    77 241114 95730 15483   2 98   0  0  0
243  8          0    1629504        576   32653080    0    0    81 471018 100223 20920   3 96   0  1  0
70 11          0    1342140        576   33084020    0    0   100 925869 139876 56599   6 88   3  3  0
211  7          0    1130316        576   33470432    0    0   290 1008984 150699 74320   6 83   6  5  0
365  3          0     776736        576   34072772    0    0   182 747167 139436 67135   5 88   4  3  0
29  1          0    1528412        576   34110640    0    0    50 612181 137403 77609   4 87   6  3  0
266  5          0    1657688        576   34105696    0    0     3 258307 62879 38508   2 93   3  2  0
1159  2          0    2002256        576   33775476    0    0    19 88554 42112 14230   1 98   0  0  0


Right around 14:00 I was able to get a "perf -a -g"; here's the
beginning of what "perf report --sort symbol --call-graph fractal,5"
had to say:

#
     64.86%  [k] _raw_spin_lock_irqsave
             |
             |--97.94%-- isolate_freepages
             |          compaction_alloc
             |          unmap_and_move
             |          migrate_pages
             |          compact_zone
             |          |
             |          |--99.56%-- try_to_compact_pages
             |          |          __alloc_pages_direct_compact
             |          |          __alloc_pages_slowpath
             |          |          __alloc_pages_nodemask
             |          |          alloc_pages_vma
             |          |          do_huge_pmd_anonymous_page
             |          |          handle_mm_fault
             |          |          do_page_fault
             |          |          page_fault
             |          |          |
             |          |          |--53.53%-- skb_copy_datagram_iovec
             |          |          |          tcp_recvmsg
             |          |          |          inet_recvmsg
             |          |          |          sock_recvmsg
             |          |          |          sys_recvfrom
             |          |          |          system_call_fastpath
             |          |          |          __recv
             |          |          |          |
             |          |          |           --100.00%-- (nil)
             |          |          |
             |          |          |--27.80%-- __pthread_create_2_1
             |          |          |          (nil)
             |          |          |
             |          |           --18.67%-- memcpy
             |          |                     |
             |          |                     |--57.38%-- 0x50d000005
             |          |                     |
             |          |                     |--34.52%-- 0x3b300bf271940a35
             |          |                     |
             |          |                      --8.10%-- 0x1500000000000009
             |           --0.44%-- [...]
              --2.06%-- [...]

      6.15%  [k] isolate_freepages_block
             |
             |--99.95%-- isolate_freepages
             |          compaction_alloc
             |          unmap_and_move
             |          migrate_pages
             |          compact_zone
             |          |
             |          |--99.54%-- try_to_compact_pages
             |          |          __alloc_pages_direct_compact
             |          |          __alloc_pages_slowpath
             |          |          __alloc_pages_nodemask
             |          |          alloc_pages_vma
             |          |          do_huge_pmd_anonymous_page
             |          |          handle_mm_fault
             |          |          do_page_fault
             |          |          page_fault
             |          |          |
             |          |          |--54.40%-- skb_copy_datagram_iovec
             |          |          |          tcp_recvmsg
             |          |          |          inet_recvmsg
             |          |          |          sock_recvmsg
             |          |          |          sys_recvfrom
             |          |          |          system_call_fastpath
             |          |          |          __recv
             |          |          |          |
             |          |          |           --100.00%-- (nil)
             |          |          |
             |          |          |--25.19%-- __pthread_create_2_1
             |          |          |          (nil)
             |          |          |
             |          |           --20.41%-- memcpy
             |          |                     |
             |          |                     |--40.24%-- 0x3b300bf271940a35
             |          |                     |
             |          |                     |--38.29%-- 0x1500000000000009
             |          |                     |
             |          |                      --21.47%-- 0x50d000005
             |           --0.46%-- [...]
              --0.05%-- [...]

      3.96%  [.] ceph_crc32c_le
             |
             |--99.99%-- 0xb8057558d0065990
              --0.01%-- [...]

A different system in the same test had a slightly different
call tree, but isolate_freepages() still seems to show up
prominently:

#
     32.32%  [k] _raw_spin_lock_irqsave
             |
             |--97.64%-- isolate_freepages
             |          compaction_alloc
             |          unmap_and_move
             |          migrate_pages
             |          compact_zone
             |          try_to_compact_pages
             |          __alloc_pages_direct_compact
             |          __alloc_pages_slowpath
             |          __alloc_pages_nodemask
             |          alloc_pages_vma
             |          do_huge_pmd_anonymous_page
             |          handle_mm_fault
             |          do_page_fault
             |          page_fault
             |          |
             |          |--65.31%-- skb_copy_datagram_iovec
             |          |          tcp_recvmsg
             |          |          inet_recvmsg
             |          |          sock_recvmsg
             |          |          sys_recvfrom
             |          |          system_call_fastpath
             |          |          __recv
             |          |          |
             |          |           --100.00%-- (nil)
             |          |
             |          |--30.98%-- memcpy
             |          |          |
             |          |          |--50.60%-- 0x50d0000
             |          |          |
             |          |           --49.40%-- 0x50d000005
             |           --3.70%-- [...]
              --2.36%-- [...]

     17.10%  [k] _raw_spin_lock_irq
             |
             |--98.27%-- isolate_migratepages_range
             |          compact_zone
             |          try_to_compact_pages
             |          __alloc_pages_direct_compact
             |          __alloc_pages_slowpath
             |          __alloc_pages_nodemask
             |          alloc_pages_vma
             |          do_huge_pmd_anonymous_page
             |          handle_mm_fault
             |          do_page_fault
             |          page_fault
             |          |
             |          |--99.85%-- __pthread_create_2_1
             |          |          (nil)
             |           --0.15%-- [...]
              --1.73%-- [...]

      4.59%  [k] mutex_spin_on_owner
             |
             --- __mutex_lock_slowpath
                 mutex_lock
                |
                |--50.14%-- page_lock_anon_vma
                |          |
                |          |--99.99%-- try_to_unmap_anon
                |          |          try_to_unmap
                |          |          __unmap_and_move
                |          |          unmap_and_move
                |          |          migrate_pages
                |          |          compact_zone
                |          |          try_to_compact_pages
                |          |          __alloc_pages_direct_compact
                |          |          __alloc_pages_slowpath
                |          |          __alloc_pages_nodemask
                |          |          alloc_pages_vma
                |          |          do_huge_pmd_anonymous_page
                |          |          handle_mm_fault
                |          |          do_page_fault
                |          |          page_fault
                |          |          |
                |          |          |--99.91%-- __pthread_create_2_1
                |          |          |          (nil)
                |          |           --0.09%-- [...]
                |           --0.01%-- [...]
                |
                |--49.67%-- rmap_walk
                |          move_to_new_page
                |          __unmap_and_move
                |          unmap_and_move
                |          migrate_pages
                |          compact_zone
                |          try_to_compact_pages
                |          __alloc_pages_direct_compact
                |          __alloc_pages_slowpath
                |          __alloc_pages_nodemask
                |          alloc_pages_vma
                |          do_huge_pmd_anonymous_page
                |          handle_mm_fault
                |          do_page_fault
                |          page_fault
                |          |
                |          |--99.69%-- __pthread_create_2_1
                |          |          (nil)
                |           --0.31%-- [...]
                 --0.20%-- [...]

      4.10%  [k] isolate_freepages_block
             |
             |--99.95%-- isolate_freepages
             |          compaction_alloc
             |          unmap_and_move
             |          migrate_pages
             |          compact_zone
             |          try_to_compact_pages
             |          __alloc_pages_direct_compact
             |          __alloc_pages_slowpath
             |          __alloc_pages_nodemask
             |          alloc_pages_vma
             |          do_huge_pmd_anonymous_page
             |          handle_mm_fault
             |          do_page_fault
             |          page_fault
             |          |
             |          |--46.97%-- skb_copy_datagram_iovec
             |          |          tcp_recvmsg
             |          |          inet_recvmsg
             |          |          sock_recvmsg
             |          |          sys_recvfrom
             |          |          system_call_fastpath
             |          |          __recv
             |          |          |
             |          |           --100.00%-- (nil)
             |          |
             |          |--31.79%-- __pthread_create_2_1
             |          |          (nil)
             |          |
             |           --21.24%-- memcpy
             |                     |
             |                     |--61.90%-- 0x50d000005
             |                     |
             |                      --38.10%-- 0x50d0000
              --0.05%-- [...]

      3.65%  [.] ceph_crc32c_le
             |
             |--99.86%-- 0xb8057558d0065990
              --0.14%-- [...]



I seem to be able to recreate this issue at will, so please
let me know what I can do to help learn what is going on.

Thanks -- Jim

P.S. I got the recipients list via "scripts/get_maintainer.pl -f mm/compaction.c";
      please let me know if I should have done something else.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/