Message-ID: <CY4PR21MB0741B78A465208A457E64457CEBA0@CY4PR21MB0741.namprd21.prod.outlook.com>
Date: Fri, 6 Sep 2019 01:22:48 +0000
From: Long Li <longli@...rosoft.com>
To: Daniel Lezcano <daniel.lezcano@...aro.org>,
Ming Lei <ming.lei@...hat.com>
CC: Bart Van Assche <bvanassche@....org>, Jens Axboe <axboe@...com>,
Hannes Reinecke <hare@...e.com>,
Sagi Grimberg <sagi@...mberg.me>,
"linux-scsi@...r.kernel.org" <linux-scsi@...r.kernel.org>,
Peter Zijlstra <peterz@...radead.org>,
John Garry <john.garry@...wei.com>,
LKML <linux-kernel@...r.kernel.org>,
"linux-nvme@...ts.infradead.org" <linux-nvme@...ts.infradead.org>,
Keith Busch <keith.busch@...el.com>,
Ingo Molnar <mingo@...hat.com>,
Thomas Gleixner <tglx@...utronix.de>,
Christoph Hellwig <hch@....de>
Subject: RE: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
>Subject: Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
>
>
>Hi Ming,
>
>On 05/09/2019 11:06, Ming Lei wrote:
>> On Wed, Sep 04, 2019 at 07:31:48PM +0200, Daniel Lezcano wrote:
>>> Hi,
>>>
>>> On 04/09/2019 19:07, Bart Van Assche wrote:
>>>> On 9/3/19 12:50 AM, Daniel Lezcano wrote:
>>>>> On 03/09/2019 09:28, Ming Lei wrote:
>>>>>> On Tue, Sep 03, 2019 at 08:40:35AM +0200, Daniel Lezcano wrote:
>>>>>>> It is a scheduler problem then?
>>>>>>
>>>>>> The scheduler can do nothing if the CPU is taken up completely by
>>>>>> handling interrupts & softirqs, so it does not seem to be a scheduler problem, IMO.
>>>>>
>>>>> Why? If there is IRQ pressure on one CPU reducing its capacity,
>>>>> the scheduler will balance the tasks onto another CPU, no?
>>>>
>>>> Only if CONFIG_IRQ_TIME_ACCOUNTING has been enabled. However, I
>>>> don't know of any Linux distro that enables that option. That's
>>>> probably because the option introduces two rdtsc() calls in each
>>>> interrupt. Given the overhead this option introduces, I don't
>>>> think it is the solution Ming is looking for.
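
[For readers following along: the accounting Bart refers to is
irqtime_account_irq() in kernel/sched/cputime.c, mentioned below. Roughly,
in my simplified paraphrase of the 5.x code (details trimmed, so take it
as a sketch), it runs on every interrupt entry and exit:

static void irqtime_account_irq(struct task_struct *curr)
{
	struct irqtime *irqtime = this_cpu_ptr(&cpu_irqtime);
	s64 delta;

	if (!sched_clock_irqtime)
		return;

	/* the timestamp read (rdtsc on x86): once on entry, once on exit */
	delta = sched_clock_cpu(smp_processor_id()) - irqtime->irq_start_time;
	irqtime->irq_start_time += delta;

	/* charge the elapsed time to hardirq or softirq time */
	if (hardirq_count())
		irqtime_account_delta(irqtime, delta, CPUTIME_IRQ);
	else if (in_serving_softirq() && curr != this_cpu_ksoftirqd())
		irqtime_account_delta(irqtime, delta, CPUTIME_SOFTIRQ);
}

So the per-interrupt cost is essentially those two clock reads.]
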
>>>
>>> Was this overhead reported somewhere?
>>
>> The gettimeofday() syscall calls ktime_get_real_ts64(), which
>> finally calls tk_clock_read(), which calls rdtsc too.
>>
>> But gettimeofday() is often used in fast paths, and block IO_STAT needs
>> to read it too.
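
[To put a rough number on that per-read cost, here is a throwaway
userspace microbenchmark of mine -- not from the patch set.
clock_gettime(CLOCK_REALTIME) ends in the same clocksource read (rdtsc
on x86) whether it goes through the vDSO or the syscall:

#include <stdio.h>
#include <time.h>

int main(void)
{
	struct timespec ts, t0, t1;
	const long iters = 10 * 1000 * 1000;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (long i = 0; i < iters; i++)
		clock_gettime(CLOCK_REALTIME, &ts); /* same clock gettimeofday() uses */
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("%.1f ns per call\n",
	       ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / iters);
	return 0;
}

]
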
>>
>>>
>>>> See also irqtime_account_irq() in kernel/sched/cputime.c.
>>>
>>> From my POV, this framework could be useful for detecting this situation.
>>
>> Now we are talking about IRQ_TIME_ACCOUNTING instead of IRQ_TIMINGS,
>> and the former could be used to implement the detection. The only
>> thing shared would be the read of the timestamp.
>
>You have not yet shared the analysis of the problem (the kernel warnings give
>the symptoms) or the reasoning behind the solution. It is hard to
>understand exactly what you are looking for and how to connect the dots.
>
>AFAIU, there are fast media where responses to requests arrive faster
>than they can be processed, right?
>
>I don't see how detecting IRQ flooding and using a threaded IRQ is the
>solution; can you explain?
>
>If the responses are coming at a very high rate, whatever the solution
>(interrupts, threaded interrupts, polling), we are still in the same situation.
>
>My initial suggestion was to see if the interrupt load could be taken into
>account in the CPU load, favoring task migration via the scheduler's load
>balancing to a less loaded CPU, so that the CPU processing interrupts ends up
>doing only that while other CPUs handle the "threaded" side.
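
[For concreteness, the "threaded" side here is the stock
request_threaded_irq() model. A minimal sketch -- the "foo" names are
hypothetical, not from any real driver:

#include <linux/interrupt.h>

/* hard handler: runs in interrupt context, does the bare minimum */
static irqreturn_t foo_hardirq(int irq, void *dev)
{
	return IRQ_WAKE_THREAD;	/* defer the real work */
}

/* thread handler: runs in a schedulable kthread, so its CPU time is
 * visible to the scheduler and other tasks can be balanced away */
static irqreturn_t foo_thread_fn(int irq, void *dev)
{
	/* heavy completion processing goes here */
	return IRQ_HANDLED;
}

/* in the driver's probe path: */
ret = request_threaded_irq(irq, foo_hardirq, foo_thread_fn,
			   IRQF_ONESHOT, "foo", dev);

]
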
>
>Besides that, I'm wondering if the block scheduler should somehow be
>involved in this [1].
>
> -- Daniel
Hi Daniel,

I want to share some test results with IRQ_TIME_ACCOUNTING. During the tests, the kernel warned about RCU stalls and soft lockups.

An example of an RCU stall:
[ 3016.148250] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[ 3016.148299] rcu: 66-....: (1 GPs behind) idle=cc2/0/0x3 softirq=10037/10037 fqs=4717
[ 3016.148299] (detected by 27, t=15011 jiffies, g=45173, q=17194)
[ 3016.148299] Sending NMI from CPU 27 to CPUs 66:
[ 3016.148299] NMI backtrace for cpu 66
[ 3016.148299] CPU: 66 PID: 0 Comm: swapper/66 Tainted: G L 5.3.0-rc6+ #68
[ 3016.148299] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090007 05/18/2018
[ 3016.148299] RIP: 0010:0xffff9c4740013003
[ 3016.148299] Code: Bad RIP value.
[ 3016.148299] RSP: 0018:ffff9c4759acc8d0 EFLAGS: 00000046
[ 3016.148299] RAX: 0000000000000000 RBX: 0000000000000080 RCX: 000000000001000b
[ 3016.148299] RDX: 00000000000000fb RSI: ffff9c4740013000 RDI: ffff9c4759acc920
[ 3016.148299] RBP: ffff9c4759acc920 R08: 0000000000000008 R09: 01484a845c6de350
[ 3016.148299] R10: ffff9c4759accd30 R11: 0000000000000001 R12: 00000000000000fb
[ 3016.148299] R13: 0000000000000042 R14: ffff8a7d9b771f80 R15: 00000000000001e1
[ 3016.148299] FS: 0000000000000000(0000) GS:ffff8afd9f880000(0000) knlGS:0000000000000000
[ 3016.148299] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3016.148299] CR2: ffff9c4740012fd9 CR3: 000000208b9bc000 CR4: 00000000003406e0
[ 3016.148299] Call Trace:
[ 3016.148299] <IRQ>
[ 3016.148299] ? __send_ipi_mask+0x145/0x2e0
[ 3016.148299] ? __send_ipi_one+0x3a/0x60
[ 3016.148299] ? hv_send_ipi+0x10/0x30
[ 3016.148299] ? generic_exec_single+0x63/0xe0
[ 3016.148299] ? smp_call_function_single_async+0x1f/0x40
[ 3016.148299] ? blk_mq_complete_request+0xdf/0xf0
[ 3016.148299] ? nvme_irq+0x144/0x240 [nvme]
[ 3016.148299] ? tick_sched_do_timer+0x80/0x80
[ 3016.148299] ? __handle_irq_event_percpu+0x40/0x190
[ 3016.148299] ? handle_irq_event_percpu+0x30/0x70
[ 3016.148299] ? handle_irq_event+0x36/0x60
[ 3016.148299] ? handle_edge_irq+0x7e/0x190
[ 3016.148299] ? handle_irq+0x1c/0x30
[ 3016.148299] ? do_IRQ+0x49/0xd0
[ 3016.148299] ? common_interrupt+0xf/0xf
[ 3016.148299] ? common_interrupt+0xa/0xf
[ 3016.148299] ? __do_softirq+0x76/0x2e3
[ 3016.148299] ? __do_softirq+0x4b/0x2e3
[ 3016.148299] ? sched_clock_local+0x12/0x80
[ 3016.148299] ? irq_exit+0xdd/0xf0
[ 3016.148299] ? hv_stimer0_vector_handler+0x62/0x70
[ 3016.148299] ? hv_stimer0_callback_vector+0xf/0x20
[ 3016.148299] </IRQ>
[ 3016.148299] ? __sched_text_end+0x2/0x2
[ 3016.148299] ? default_idle+0x25/0x150
[ 3016.148299] ? do_idle+0x1ad/0x250
[ 3016.148299] ? cpu_startup_entry+0x19/0x20
[ 3016.148299] ? start_secondary+0x156/0x1a0
[ 3016.148299] ? secondary_startup_64+0xa4/0xb0
An example of a soft lockup:
[ 3082.490820] watchdog: BUG: soft lockup - CPU#66 stuck for 22s! [swapper/66:0]
[ 3082.501353] Modules linked in: xt_owner xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_security nls_iso8859_1 nvme nvme_core pci_hyperv hv_balloon serio_raw joydev sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_decompress zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel hyperv_fb hid_generic aes_x86_64 crypto_simd hid_hyperv cfbfillrect cryptd hv_netvsc glue_helper cfbimgblt hid hv_utils pata_acpi hyperv_keyboard cfbcopyarea
[ 3082.501353] CPU: 66 PID: 0 Comm: swapper/66 Tainted: G L 5.3.0-rc6+ #68
[ 3082.501353] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090007 05/18/2018
[ 3082.501353] RIP: 0010:__do_softirq+0x76/0x2e3
[ 3082.501353] Code: 81 05 72 5c e1 55 00 01 00 00 c7 44 24 20 0a 00 00 00 44 89 74 24 04 48 c7 c0 80 98 02 00 65 66 c7 00 00 00 fb b8 ff ff ff ff <0f> bc 44 24 04 83 c0 01 89 44 24 10 0f 84 a0 00 00 00 48 c7 44 24
[ 3082.501353] RSP: 0018:ffff9c4759accf78 EFLAGS: 00000286 ORIG_RAX: ffffffffffffff12
[ 3082.501353] RAX: 00000000ffffffff RBX: 0000000000000000 RCX: 000002aff0e90b1d
[ 3082.501353] RDX: 00000000000000b5 RSI: 000002aff0e90bd2 RDI: 00000000000000b5
[ 3082.501353] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000029900
[ 3082.501353] R10: ffff9c4759accf20 R11: 0000000002b140e3 R12: 0000000000000000
[ 3082.501353] R13: 0000000000000000 R14: 0000000000000282 R15: 0000000000000000
[ 3082.501353] FS: 0000000000000000(0000) GS:ffff8afd9f880000(0000) knlGS:0000000000000000
[ 3082.501353] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3082.501353] CR2: 0000000000638a60 CR3: 000000208b9bc000 CR4: 00000000003406e0
[ 3082.501353] Call Trace:
[ 3082.501353] <IRQ>
[ 3082.501353] ? sched_clock_local+0x12/0x80
[ 3082.501353] irq_exit+0xdd/0xf0
[ 3082.501353] hv_stimer0_vector_handler+0x62/0x70
[ 3082.501353] hv_stimer0_callback_vector+0xf/0x20
[ 3082.501353] </IRQ>
[ 3082.501353] RIP: 0010:default_idle+0x25/0x150
[ 3082.501353] Code: f2 75 ff 90 90 0f 1f 44 00 00 41 55 41 54 55 53 65 8b 2d 7e 49 0f 56 0f 1f 44 00 00 e9 07 00 00 00 0f 00 2d 5f 9a 4e 00 fb f4 <65> 8b 2d 64 49 0f 56 0f 1f 44 00 00 5b 5d 41 5c 41 5d c3 65 8b 05
[ 3082.501353] RSP: 0018:ffff9c4758b5bec0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff12
[ 3082.501353] RAX: ffffffffa9f1b9c0 RBX: 0000000000000042 RCX: 0000000000000000
[ 3082.501353] RDX: 0000000000000042 RSI: 0000000000000000 RDI: 000002af5fd2e9ef
[ 3082.501353] RBP: 0000000000000042 R08: 0000000000000000 R09: 0000000000029900
[ 3082.501353] R10: ffff9c4758b5be98 R11: 0000000000008a0c R12: 0000000000000000
[ 3082.501353] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 3082.501353] ? __sched_text_end+0x2/0x2
[ 3082.501353] do_idle+0x1ad/0x250
[ 3082.501353] cpu_startup_entry+0x19/0x20
[ 3082.501353] start_secondary+0x156/0x1a0
[ 3082.501353] secondary_startup_64+0xa4/0xb0
Tracing shows that the CPU was in either hardirq or softirq context all the time before the warnings appeared. During the tests, the system was unresponsive at times.

Ming's patch fixed this problem; the system was responsive throughout the tests.

As for the performance hit, both approaches resulted in a small drop in peak IOPS: with IRQ_TIME_ACCOUNTING I see a 3% drop, and with Ming's patch a 1% drop.
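
[To be clear about what the detection does at a conceptual level: this is
only my simplified sketch of the flood-detection idea, not the actual
patch code. Measure how much of a sampling window each CPU spends in hard
interrupt context, and treat the CPU as flooded when that fraction
saturates, at which point completion work can be deferred out of
interrupt context:

/* sketch of the idea only -- not the actual patch */
DEFINE_PER_CPU(u64, irq_ns);	/* ns spent in hardirq in this window */

/* at interrupt exit, accumulate the time measured for this interrupt */
static void irq_flood_account(u64 entry_ts, u64 exit_ts)
{
	this_cpu_add(irq_ns, exit_ts - entry_ts);
}

/* flooded if, say, more than 90% of the window went to hardirq */
static bool irq_flood_detected(u64 window_ns)
{
	return this_cpu_read(irq_ns) * 10 > window_ns * 9;
}

]
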
For the tests, I used the following fio command on 10 NVMe disks:
fio --bs=4k --ioengine=libaio --iodepth=128 --filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1:/dev/nvme3n1:/dev/nvme4n1:/dev/nvme5n1:/dev/nvme6n1:/dev/nvme7n1:/dev/nvme8n1:/dev/nvme9n1 --direct=1 --runtime=12000 --numjobs=80 --rw=randread --name=test --group_reporting --gtod_reduce=1
Thanks
Long
>
>[1] https://www.linaro.org/blog/io-bandwidth-management-for-production-quality-services/
>
>
>
>--
>
><http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs
>
>Follow Linaro:
><http://www.facebook.com/pages/Linaro> Facebook |
><http://twitter.com/#!/linaroorg> Twitter |
><http://www.linaro.org/linaro-blog/> Blog