Message-ID: <20170327105514.1ed5b1ba@redhat.com>
Date: Mon, 27 Mar 2017 10:55:14 +0200
From: Jesper Dangaard Brouer <brouer@...hat.com>
To: Pankaj Gupta <pagupta@...hat.com>
Cc: Tariq Toukan <ttoukan.linux@...il.com>,
Mel Gorman <mgorman@...hsingularity.net>,
Tariq Toukan <tariqt@...lanox.com>, netdev@...r.kernel.org,
akpm@...ux-foundation.org, linux-mm <linux-mm@...ck.org>,
Saeed Mahameed <saeedm@...lanox.com>, brouer@...hat.com
Subject: Re: Page allocator order-0 optimizations merged
On Mon, 27 Mar 2017 03:32:47 -0400 (EDT)
Pankaj Gupta <pagupta@...hat.com> wrote:
> Hello,
>
> It looks like a race between softirq and normal process context.
>
> Just thinking: do we really want allocations from softirqs to be
> done using the per-CPU list?
Yes, softirqs need fast page allocs. The softirq use-case is refilling
the DMA RX rings, which is time critical, especially for NIC drivers.
For this reason, most drivers implement various page recycling tricks.
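(To illustrate, this kind of recycling trick looks roughly like the
sketch below -- hypothetical driver code, not taken from any specific
driver: keep a small per-ring cache of pages, reuse a page once the
stack has dropped its reference, and only fall back to the page
allocator on a miss.)

	/* Hypothetical per-RX-ring page cache; a sketch of the
	 * recycling trick, not code from an actual driver. */
	struct rx_page_cache {
		struct page *pages[64];
		unsigned int head, tail;
	};

	static struct page *rx_get_page(struct rx_page_cache *c)
	{
		if (c->head != c->tail) {
			struct page *page = c->pages[c->tail];

			/* Reuse only if the stack has dropped its
			 * reference, i.e. we hold the last refcount. */
			if (page_ref_count(page) == 1) {
				c->tail = (c->tail + 1) % ARRAY_SIZE(c->pages);
				return page;
			}
		}
		/* Cache miss: fall back to the page allocator (the
		 * fast path this thread is discussing). */
		return dev_alloc_page();
	}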
> Or we could add a check in 'free_hot_cold_page' for softirqs: if we
> are on a path returning from a hard interrupt, don't allocate from
> the per-CPU list.
A possible solution would be to use the local_bh_{disable,enable}
calls instead of preempt_{disable,enable}. But that is slower: using
the numbers from [1] (19 vs 11 cycles), the expected cycle saving
shrinks to 38-19=19.

The problematic part of using local_bh_enable is that it adds a
softirq/bottom-halves rescheduling point (as it checks for pending
BHs). Thus, this might affect real workloads.
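For illustration, the local_bh variant of the pcp free path would look
something like this (an untested sketch against the v4.11-rc1 code,
not a real patch):

	/* Sketch: protect the per-cpu lists against softirq too, at
	 * the cost of ~19 cycles plus a BH rescheduling point on the
	 * enable side. */
	local_bh_disable();
	pcp = &this_cpu_ptr(zone->pageset)->pcp;
	list_add(&page->lru, &pcp->lists[migratetype]);
	pcp->count++;
	if (pcp->count >= pcp->high) {
		unsigned long batch = READ_ONCE(pcp->batch);
		free_pcppages_bulk(zone, batch, pcp);
		pcp->count -= batch;
	}
	local_bh_enable();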
I'm unsure what the best option is. I'm leaning towards partly
reverting [1] and going back to the slower local_irq_save +
local_irq_restore, as before.

Afterwards we can add a bulk page alloc+free call that amortizes the
38-cycle cost of local_irq_{save,restore}. Or add a function call that
MUST only be called from contexts with IRQs enabled, which allows
using the unconditional local_irq_{disable,enable}, as that only
costs 7 cycles.
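To illustrate the amortization idea, a bulk alloc call could look
roughly like this (hypothetical API, name and signature made up for
this sketch; __rmqueue_pcplist as introduced in [1]):

	/* Hypothetical sketch: pay the IRQ save/restore cost once
	 * for nr pages instead of once per page. */
	static unsigned int alloc_pages_bulk(struct zone *zone,
					     int migratetype,
					     unsigned int nr,
					     struct page **pages)
	{
		struct per_cpu_pages *pcp;
		struct list_head *list;
		unsigned long flags;
		unsigned int i;

		local_irq_save(flags);	/* ~38 cycles, paid once */
		pcp = &this_cpu_ptr(zone->pageset)->pcp;
		list = &pcp->lists[migratetype];
		for (i = 0; i < nr; i++) {
			pages[i] = __rmqueue_pcplist(zone, migratetype,
						     false, pcp, list);
			if (!pages[i])
				break;
		}
		local_irq_restore(flags);
		return i;		/* pages actually allocated */
	}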
[1] commit 374ad05ab64d ("mm, page_alloc: only use per-cpu allocator for irq-safe requests")
https://git.kernel.org/torvalds/c/374ad05ab64d
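For readers following the in_interrupt() vs in_irq() discussion below,
the distinction comes from the preempt-count macros in
include/linux/preempt.h:

	#define in_irq()	(hardirq_count())  /* hard IRQ only */
	#define in_softirq()	(softirq_count())  /* softirq, or BHs disabled */
	#define in_interrupt()	(irq_count())	   /* hard IRQ, softirq or NMI */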
> > On 26/03/2017 11:21 AM, Tariq Toukan wrote:
> > >
> > >
> > > On 23/03/2017 4:51 PM, Mel Gorman wrote:
> > >> On Thu, Mar 23, 2017 at 02:43:47PM +0100, Jesper Dangaard Brouer wrote:
> > >>> On Wed, 22 Mar 2017 23:40:04 +0000
> > >>> Mel Gorman <mgorman@...hsingularity.net> wrote:
> > >>>
> > >>>> On Wed, Mar 22, 2017 at 07:39:17PM +0200, Tariq Toukan wrote:
> > >>>>>>>> This modification may slow allocations from IRQ context
> > >>>>>>>> slightly, but the main gain from the per-cpu allocator is
> > >>>>>>>> that it scales better for allocations from multiple
> > >>>>>>>> contexts. There is an implicit assumption that intensive
> > >>>>>>>> allocations from IRQ contexts on multiple CPUs from a
> > >>>>>>>> single NUMA node are rare
> > >>>>> Hi Mel, Jesper, and all.
> > >>>>>
> > >>>>> This assumption contradicts regular multi-stream traffic
> > >>>>> that is naturally handled over close NUMA cores. I compared
> > >>>>> iperf TCP multistream (8 streams) over CX4 (mlx5 driver)
> > >>>>> with kernel v4.10 (before this series) vs kernel v4.11-rc1
> > >>>>> (with this series).
> > >>>>> I disabled the page-cache (recycle) mechanism to stress the
> > >>>>> page allocator, and saw a drastic degradation in BW, from
> > >>>>> 47.5 G in v4.10 to 31.4 G in v4.11-rc1 (34% drop).
> > >>>>> I noticed queued_spin_lock_slowpath occupies 62.87% of CPU
> > >>>>> time.
> > >>>>
> > >>>> Can you get the stack trace for the spin lock slowpath to confirm it's
> > >>>> from IRQ context?
> > >>>
> > >>> AFAIK allocations happen in softirq. Argh, during review I
> > >>> missed that in_interrupt() also covers softirq. To Mel: can we
> > >>> use an in_irq() check instead?
> > >>>
> > >>> (p.s. just landed and got home)
> > >
> > > Glad to hear. Thanks for your suggestion.
> > >
> > >>
> > >> Not built or even boot tested. I'm unable to run tests at the moment
> > >
> > > Thanks Mel, I will test it soon.
> > >
> > Crashed in an iperf single-stream test:
> >
> > [ 3974.123386] ------------[ cut here ]------------
> > [ 3974.128778] WARNING: CPU: 2 PID: 8754 at lib/list_debug.c:53
> > __list_del_entry_valid+0xa3/0xd0
> > [ 3974.138751] list_del corruption. prev->next should be
> > ffffea0040369c60, but was dead000000000100
> > [ 3974.149016] Modules linked in: netconsole nfsv3 nfs fscache dm_mirror
> > dm_region_hash dm_log dm_mod sb_edac edac_core x86_pkg_temp_thermal
> > coretemp i2c_diolan_u2c kvm irqbypass ipmi_si ipmi_devintf crc32_pclmul
> > iTCO_wdt ghash_clmulni_intel ipmi_msghandler dcdbas iTCO_vendor_support
> > sg pcspkr lpc_ich shpchp wmi mfd_core acpi_power_meter nfsd auth_rpcgss
> > nfs_acl lockd grace sunrpc binfmt_misc ip_tables mlx4_en sr_mod cdrom
> > sd_mod i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt
> > fb_sys_fops mlx5_core ttm tg3 ahci libahci mlx4_core drm libata ptp
> > megaraid_sas crc32c_intel i2c_core pps_core [last unloaded: netconsole]
> > [ 3974.212743] CPU: 2 PID: 8754 Comm: iperf Not tainted 4.11.0-rc2+ #30
> > [ 3974.220073] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS
> > 1.5.4 10/002/2015
> > [ 3974.228974] Call Trace:
> > [ 3974.231925] <IRQ>
> > [ 3974.234405] dump_stack+0x63/0x8c
> > [ 3974.238355] __warn+0xd1/0xf0
> > [ 3974.241891] warn_slowpath_fmt+0x4f/0x60
> > [ 3974.246494] __list_del_entry_valid+0xa3/0xd0
> > [ 3974.251583] get_page_from_freelist+0x84c/0xb40
> > [ 3974.256868] ? napi_gro_receive+0x38/0x140
> > [ 3974.261666] __alloc_pages_nodemask+0xca/0x200
> > [ 3974.266866] mlx5e_alloc_rx_wqe+0x49/0x130 [mlx5_core]
> > [ 3974.272862] mlx5e_post_rx_wqes+0x84/0xc0 [mlx5_core]
> > [ 3974.278725] mlx5e_napi_poll+0xc7/0x450 [mlx5_core]
> > [ 3974.284409] net_rx_action+0x23d/0x3a0
> > [ 3974.288819] __do_softirq+0xd1/0x2a2
> > [ 3974.293054] irq_exit+0xb5/0xc0
> > [ 3974.296783] do_IRQ+0x51/0xd0
> > [ 3974.300353] common_interrupt+0x89/0x89
> > [ 3974.304859] RIP: 0010:free_hot_cold_page+0x228/0x280
> > [ 3974.310629] RSP: 0018:ffffc9000ea07c90 EFLAGS: 00000202 ORIG_RAX:
> > ffffffffffffffa8
> > [ 3974.319565] RAX: 0000000000000001 RBX: ffff88103f85f158 RCX:
> > ffffea0040369c60
> > [ 3974.327764] RDX: ffffea0040369c60 RSI: ffff88103f85f168 RDI:
> > ffffea0040369ca0
> > [ 3974.335961] RBP: ffffc9000ea07cc0 R08: ffff88103f85f168 R09:
> > 00000000000005a8
> > [ 3974.344178] R10: 00000000000005a8 R11: 0000000000010468 R12:
> > ffffea0040369c80
> > [ 3974.352387] R13: ffff88103f85f168 R14: ffff88107ffdeb80 R15:
> > ffffea0040369ca0
> > [ 3974.360577] </IRQ>
> > [ 3974.363145] __put_page+0x34/0x40
> > [ 3974.367068] skb_release_data+0xca/0xe0
> > [ 3974.371575] skb_release_all+0x24/0x30
> > [ 3974.375984] __kfree_skb+0x12/0x20
> > [ 3974.380003] tcp_recvmsg+0x6ac/0xaf0
> > [ 3974.384251] inet_recvmsg+0x3c/0xa0
> > [ 3974.388394] sock_recvmsg+0x3d/0x50
> > [ 3974.392511] SYSC_recvfrom+0xd3/0x140
> > [ 3974.396826] ? handle_mm_fault+0xce/0x240
> > [ 3974.401535] ? SyS_futex+0x71/0x150
> > [ 3974.405653] SyS_recvfrom+0xe/0x10
> > [ 3974.409673] entry_SYSCALL_64_fastpath+0x1a/0xa9
> > [ 3974.415056] RIP: 0033:0x7f04ca9315bb
> > [ 3974.419309] RSP: 002b:00007f04c955de70 EFLAGS: 00000246 ORIG_RAX:
> > 000000000000002d
> > [ 3974.428243] RAX: ffffffffffffffda RBX: 0000000000020000 RCX:
> > 00007f04ca9315bb
> > [ 3974.436450] RDX: 0000000000020000 RSI: 00007f04bc0008f0 RDI:
> > 0000000000000004
> > [ 3974.444653] RBP: 0000000000000000 R08: 0000000000000000 R09:
> > 0000000000000000
> > [ 3974.452851] R10: 0000000000000000 R11: 0000000000000246 R12:
> > 00007f04bc0008f0
> > [ 3974.461051] R13: 0000000000034ac8 R14: 00007f04bc020910 R15:
> > 000000000001c480
> > [ 3974.469297] ---[ end trace 6fd472c9e1973d53 ]---
> >
> >
> > >>
> > >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > >> index 6cbde310abed..f82225725bc1 100644
> > >> --- a/mm/page_alloc.c
> > >> +++ b/mm/page_alloc.c
> > >> @@ -2481,7 +2481,7 @@ void free_hot_cold_page(struct page *page, bool
> > >> cold)
> > >> unsigned long pfn = page_to_pfn(page);
> > >> int migratetype;
> > >>
> > >> - if (in_interrupt()) {
> > >> + if (in_irq()) {
> > >> __free_pages_ok(page, 0);
> > >> return;
> > >> }
> > >> @@ -2647,7 +2647,7 @@ static struct page *__rmqueue_pcplist(struct
> > >> zone *zone, int migratetype,
> > >> {
> > >> struct page *page;
> > >>
> > >> - VM_BUG_ON(in_interrupt());
> > >> + VM_BUG_ON(in_irq());
> > >>
> > >> do {
> > >> if (list_empty(list)) {
> > >> @@ -2704,7 +2704,7 @@ struct page *rmqueue(struct zone *preferred_zone,
> > >> unsigned long flags;
> > >> struct page *page;
> > >>
> > >> - if (likely(order == 0) && !in_interrupt()) {
> > >> + if (likely(order == 0) && !in_irq()) {
> > >> page = rmqueue_pcplist(preferred_zone, zone, order,
> > >> gfp_flags, migratetype);
> > >> goto out;
> > >>
> >
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer