Message-ID: <990f76d8-ae4b-0b09-023b-2467a238c0e1@yandex-team.ru>
Date: Mon, 26 Mar 2018 17:18:15 +0300
From: Konstantin Khlebnikov <khlebnikov@...dex-team.ru>
To: Paolo Valente <paolo.valente@...aro.org>
Cc: linux-block <linux-block@...r.kernel.org>,
Jens Axboe <axboe@...nel.dk>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
'Paolo Valente' via bfq-iosched
<bfq-iosched@...glegroups.com>, Mark Brown <broonie@...nel.org>,
Ulf Hansson <ulf.hansson@...aro.org>,
Linus Walleij <linus.walleij@...aro.org>
Subject: Re: [PATCH] block, bfq: keep peak_rate estimation within range
1..2^32-1
On 26.03.2018 17:06, Paolo Valente wrote:
>
>
>> Il giorno 26 mar 2018, alle ore 12:28, Konstantin Khlebnikov <khlebnikov@...dex-team.ru> ha scritto:
>>
>>
>>
>> On 26.03.2018 11:01, Paolo Valente wrote:
>>>> Il giorno 21 mar 2018, alle ore 00:49, Paolo Valente <paolo.valente@...aro.org> ha scritto:
>>>>
>>>>
>>>>
>>>>> Il giorno 20 mar 2018, alle ore 15:41, Konstantin Khlebnikov <khlebnikov@...dex-team.ru> ha scritto:
>>>>>
>>>>> On 20.03.2018 06:00, Paolo Valente wrote:
>>>>>>> Il giorno 19 mar 2018, alle ore 14:28, Konstantin Khlebnikov <khlebnikov@...dex-team.ru> ha scritto:
>>>>>>>
>>>>>>> On 19.03.2018 09:03, Paolo Valente wrote:
>>>>>>>>> Il giorno 05 mar 2018, alle ore 04:48, Konstantin Khlebnikov <khlebnikov@...dex-team.ru> ha scritto:
>>>>>>>>>
>>>>>>>>> Rate should never overflow or become zero because it is used as a divisor.
>>>>>>>>> This patch accumulates it with saturation.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Konstantin Khlebnikov <khlebnikov@...dex-team.ru>
>>>>>>>>> ---
>>>>>>>>> block/bfq-iosched.c | 8 +++++---
>>>>>>>>> 1 file changed, 5 insertions(+), 3 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
>>>>>>>>> index aeca22d91101..a236c8d541b5 100644
>>>>>>>>> --- a/block/bfq-iosched.c
>>>>>>>>> +++ b/block/bfq-iosched.c
>>>>>>>>> @@ -2546,7 +2546,8 @@ static void bfq_reset_rate_computation(struct bfq_data *bfqd,
>>>>>>>>>
>>>>>>>>> static void bfq_update_rate_reset(struct bfq_data *bfqd, struct request *rq)
>>>>>>>>> {
>>>>>>>>> - u32 rate, weight, divisor;
>>>>>>>>> + u32 weight, divisor;
>>>>>>>>> + u64 rate;
>>>>>>>>>
>>>>>>>>> /*
>>>>>>>>> * For the convergence property to hold (see comments on
>>>>>>>>> @@ -2634,9 +2635,10 @@ static void bfq_update_rate_reset(struct bfq_data *bfqd, struct request *rq)
>>>>>>>>> */
>>>>>>>>> bfqd->peak_rate *= divisor-1;
>>>>>>>>> bfqd->peak_rate /= divisor;
>>>>>>>>> - rate /= divisor; /* smoothing constant alpha = 1/divisor */
>>>>>>>>> + do_div(rate, divisor); /* smoothing constant alpha = 1/divisor */
>>>>>>>>>
>>>>>>>>> - bfqd->peak_rate += rate;
>>>>>>>>> + /* rate should never overlow or become zero */
>>>>>>>> It is bfqd->peak_rate that is used as a divisor, and bfqd->peak_rate doesn't risk becoming zero even if the variable 'rate' is zero here.
>>>>>>>> So I guess the reason why you consider the possibility that bfqd->peak_rate becomes zero is an overflow when summing 'rate'. But, according to my calculations, this should be impossible with devices with sensible speeds.
>>>>>>>> These are the reasons why I decided I could make do with a 32-bit variable, without any additional clamping. Did I make any mistake in my evaluation?
>>>>>>>
>>>>>>> According to Murphy's law this is inevitable..
>>>>>>>
>>>>>> Yep. Actually Murphy has been even clement this time, by making the
>>>>>> failure occur to a kernel expert :)
>>>>>>> I've seen a couple of division-by-zero crashes in bfq_wr_duration.
>>>>>>> Unfortunately logs weren't recorded.
>>>>>>>
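The crashes mentioned above are divide faults inside bfq_wr_duration(), which divides by the estimated peak rate. A minimal sketch of the hazard and the obvious guard follows; the function name, reference values, and formula here are illustrative assumptions, not the actual BFQ code:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative sketch: a wr_duration-style computation that scales a
 * reference duration by ref_rate / peak_rate.  If peak_rate can ever
 * reach zero, the division faults -- matching the "divide error" oopses
 * reported in this thread.  Guarding here (or clamping the estimate to
 * >= 1 when it is updated, as the patch does) avoids the fault. */
static uint64_t wr_duration_ms(uint64_t ref_rate, uint64_t ref_wr_ms,
                               uint32_t peak_rate)
{
	if (peak_rate == 0)	/* guard absent in the crashing kernels */
		peak_rate = 1;
	return ref_rate * ref_wr_ms / peak_rate;
}
```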
>>>>>>>> Anyway, even if I made some mistake about the maximum possible value of the device rate, and the latter turns out to be too high for bfqd->peak_rate to contain it, then I guess the right solution would not be to clamp the actual rate to U32_MAX, but to move bfqd->peak_rate to 64 bits. Or am I missing something else?
>>>>>>>>>> + bfqd->peak_rate = clamp_t(u64, rate + bfqd->peak_rate, 1, U32_MAX);
>>>>>>>
>>>>>>> 32-bit should be enough and better for division.
>>>>>>> My patch makes sure it never overflows/underflows.
>>>>>>> That's cheaper than full 64-bit/64-bit division.
>>>>>>> Anyway 64-bit speed could overflow too. =)
>>>>>>>
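The saturating update Konstantin proposes can be sketched as follows. This is an illustrative model of the clamp_t approach (64-bit accumulation, result forced into 1..2^32-1), with assumed names and a fixed divisor, not the exact BFQ code:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch of the saturating EWMA update under discussion:
 * accumulate in 64 bits, then clamp the result into 1..U32_MAX so the
 * stored peak rate can neither overflow nor become zero (it is later
 * used as a divisor). */
static uint32_t ewma_clamped(uint32_t peak_rate, uint64_t sample,
                             uint32_t divisor)
{
	uint64_t rate = sample;
	uint64_t acc;

	/* low-pass filter: new = old * (divisor-1)/divisor + sample/divisor */
	acc = (uint64_t)peak_rate * (divisor - 1) / divisor;
	rate /= divisor;	/* smoothing constant alpha = 1/divisor */
	acc += rate;

	/* saturate into the representable, nonzero range */
	if (acc == 0)
		acc = 1;
	if (acc > UINT32_MAX)
		acc = UINT32_MAX;
	return (uint32_t)acc;
}
```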
>>>>>> I see your point. Still, if the mistake is not in sizing, then you
>>>>>> bumped into some odd bug. In this respect, I don't much like the idea
>>>>>> of sweeping the dust under the carpet. So, let me ask you for a
>>>>>> little bit more help. With your patch applied, and thus with no risk
>>>>>> of crashes, what about adding, right before your clamp_t, something
>>>>>> like:
>>>>>> if (!bfqd->peak_rate)
>>>>>> pr_crit(<dump of all the variables involved in updating bfqd->peak_rate>);
>>>>>> Once the failure shows up (Murphy permitting), we might have hints to
>>>>>> the bug causing it.
>>>>>> Apart from that, I have no problem with patches that make bfq more
>>>>>> robust, even in a sort of black-box way.
>>>>>
>>>>> This rate estimator already works as a black box, with smoothing and a
>>>>> low-pass filter inside.
>>>>
>>>> Actually, it is just what you say: a low-pass filter, which works by
>>>> deflating two non-null quantities (the previous and the current rate),
>>>> and adding these quantities to each other (to get the new
>>>> estimated rate). In addition to being larger than 0, both quantities
>>>> are much lower than U32_MAX, so this computation should never yield
>>>> zero because of an overflow.
>>>>
>>>> Even some occasional zeros read for the sampled rate (because of the
>>>> suspicion you express below), cannot lead to a null estimated rate.
>>>> In fact, this is a low-pass filter, meant to cut off occasional
>>>> spurious values.
>>>>
>>>> I'm insisting on this part, just because, maybe, you say that it "acts
>>>> as a black-box" because there is something more, which I'm overlooking
>>>> here, and maybe the bug is exactly in the part I'm overlooking. But,
>>>> if this is not the case, I'll just stop now for this point.
>>>>
>>>>> It has limitations, thus it is ok to declare a
>>>>> range of expected/sane results.
>>>>>
>>>>
>>>> Correct, but if the odd value you cut off is a random value that
>>>> follows from a bug (and I see no other possibility, in view of the
>>>> above arguments), then the result may still be completely wrong. For
>>>> example:
>>>> - the device may have a rate of, say, 200 MB/s
>>>> - the computation of the new estimated rate yields 0
>>>> - you limit this result to ~0 MB/s but >0 MB/s; this avoids division
>>>> errors, but still leaves you with a completely wrong rate, and a
>>>> consequently inconsistent behavior of the scheduler
>>>>
>>>>> It would be nice to show the estimated rate in sysfs, to let the user
>>>>> diagnose whether it is wrong for a long period of time.
>>>>
>>>> Absolutely. This is actually already shown indirectly, through the
>>>> budget_max parameter, which is the number of sectors served at peak
>>>> rate during slice_sync.
>>>>
>>>>> Printing the message all the time, even ratelimited, is a wrong idea.
>>>>> If this should never happen, WARN_ONCE is more than enough.
>>>>>
>>>>
>>>> I'm sorry, I was totally unclear here. My proposal for you is:
>>>> since this failure does occur in your system, please add, temporarily, the
>>>> printing of all involved values *all the time*, so that you can track down
>>>> the bug and fix the actual problem leading to this division by zero.
>>>>
>>>> What do you think?
>>>>
>>>>> I suspect the crashes might be caused by a bumpy timer affected by SMI/NMI from MCE.
>>>>
>>>> Yep. The problem is that, for the estimated rate to go to zero, a
>>>> null value must be sampled for several consecutive times.
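Paolo's claim can be checked numerically: with an exponential filter, each null sample only decays the estimate geometrically, so it takes many consecutive zeros before integer division finally drives it to 0. A small sketch (the divisor value is an assumed example, not the BFQ constant):

```c
#include <assert.h>
#include <stdint.h>

/* Count how many consecutive all-zero samples are needed before an
 * EWMA of the form new = old * (divisor - 1) / divisor reaches zero
 * under integer division.  Illustrative model, not BFQ code. */
static unsigned zero_samples_until_null(uint64_t peak_rate, uint32_t divisor)
{
	unsigned steps = 0;

	while (peak_rate > 0) {
		peak_rate = peak_rate * (divisor - 1) / divisor;
		steps++;
	}
	return steps;
}
```

For example, starting from an estimate of 1,000,000 (arbitrary units) with divisor 10, well over a hundred consecutive null samples are needed before the estimate hits zero.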
>>>>
>>>>> I'll try to find evidence.
>>>>
>>>> Getting to the bottom of this would be really great.
>>>>
>>>> Looking forward to your findings,
>>> I guess no news yet. Anyway, your bug report has pushed me to think
>>> over the representation of the rate. And I may have found a problem,
>>> which might be the cause of the zeros you have seen.
>>> With a little math, it turns out that, for a sampled rate below about
>>> 5 KB/s, 0 is stored in the variable 'rate'. The reason is that such
>>> low values cannot be represented with the maximum precision currently
>>> available in BFQ for rates. So, if these very low values are sampled
>>> for a few consecutive times, then the estimated peak rate may actually
>>> become zero.
>>> In the system in which you have seen null estimated rates, might the
>>> virtual or physical storage device occasionally be so slow?
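The truncation effect Paolo describes can be illustrated with a toy fixed-point conversion. The format below (512-byte sectors per microsecond with a 16-bit fractional shift, giving roughly 8 KB/s granularity) is an assumption chosen to show how a rate of a few KB/s rounds down to zero; it is not necessarily the exact BFQ representation:

```c
#include <assert.h>
#include <stdint.h>

/* Toy fixed-point rate: sectors (512 B) transferred per microsecond,
 * left-shifted by RATE_SHIFT fractional bits.  With these assumed
 * constants, the smallest nonzero value corresponds to roughly
 * 7.8 KB/s, so slower sampled rates truncate to 0. */
static uint64_t to_fixed_rate(uint64_t bytes, uint64_t usecs)
{
	const unsigned RATE_SHIFT = 16;

	return ((bytes / 512) << RATE_SHIFT) / usecs;
}
```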
>>
>> I've found a couple of crashes from qemu-kvm virtual machines.
>> Their disks are slow for sure. A zero rate for a minute wouldn't be a surprise.
>>
>
> Ok, looks like we are converging. Hoping that it is ok with you, I
> took the liberty of preparing a patch that only lower-bounds the
> estimated rate to 1 and that, to prevent future problems of this
> kind, adds comments reporting the range of peak rates representable.
> I have added you as a signer of the patch, since this new patch is just a
> subset of yours.
>
> If this patch is the right solution to the failure you detected, then
> the latter should just go away.
>
> Thank you,
> Paolo
Ok. Thanks. I'll pick your patch into our kernel.
>
>> <4>[125253.240885] divide error: 0000 [#1] SMP PTI
>> <4>[125253.241031] Modules linked in: netconsole configfs tcp_diag inet_diag unix_diag nf_log_ipv6 nf_log_common xt_LOG xt_u32 ip6t_REJECT nf_reject_ipv6 xt_conntrack xt_multiport ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_filter ip6_tables ipt_REJECT nf_reject_ipv4 xt_tcpudp xt_mark xt_owner iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_filter ip_tables x_tables nfsd auth_rpcgss nfs_acl nfs lockd grace sunrpc 8021q garp mrp stp llc drm_kms_helper syscopyarea sysfillrect sysimgblt snd_hda_codec_generic fb_sys_fops snd_hda_intel ttm drm snd_hda_codec snd_hwdep snd_hda_core snd_pcm snd_timer snd soundcore i2c_piix4 ppdev kvm_intel kvm irqbypass input_leds parport_pc fuse lp parport xfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor
>> <4>[125253.242257] async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear ata_generic pata_acpi floppy [last unloaded: netconsole]
>> <4>[125253.242537] CPU: 0 PID: 36766 Comm: push-client Not tainted 4.14.15-5 #1
>> <4>[125253.242731] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
>> <4>[125253.242928] task: ffff880210ed3500 task.stack: ffffc90002d78000
>> <4>[125253.243092] RIP: 0010:bfq_wr_duration.isra.15.part.16+0xc/0x40
>> <4>[125253.243240] RSP: 0018:ffffc90002d7b9e0 EFLAGS: 00010046
>> <4>[125253.243357] RAX: 00000000000d59f8 RBX: ffff88023014af00 RCX: 000000000000001e
>> <4>[125253.243567] RDX: 0000000000000000 RSI: ffff8802302685a0 RDI: 0000000000000000
>> <4>[125253.243785] RBP: ffff880230268400 R08: 0000000000000001 R09: 0000000000000001
>> <4>[125253.244004] R10: ffff8802305341b0 R11: ffff880230268400 R12: ffff88023066cb60
>> <4>[125253.244222] R13: 0000000000000000 R14: ffff8802305341b0 R15: ffffc90002d7ba88
>> <4>[125253.244489] FS: 00007f74d59e6700(0000) GS:ffff88023fc00000(0000) knlGS:0000000000000000
>> <4>[125253.244784] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> <4>[125253.245073] CR2: 00007f78e7261258 CR3: 000000021275e000 CR4: 00000000000006f0
>> <4>[125253.245406] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> <4>[125253.245696] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> <4>[125253.245961] Call Trace:
>> <4>[125253.246097] bfq_insert_requests+0xda7/0xf60
>> <4>[125253.246313] blk_mq_sched_insert_request+0xf9/0x150
>> <4>[125253.246520] blk_mq_make_request+0x1da/0x630
>> <4>[125253.246692] generic_make_request+0xfb/0x2d0
>> <4>[125253.246847] ? submit_bio+0x73/0x150
>> <4>[125253.246984] submit_bio+0x73/0x150
>> <4>[125253.247229] xfs_submit_ioend+0x80/0x1a0 [xfs]
>> <4>[125253.247466] xfs_do_writepage+0x4d7/0x710 [xfs]
>> <4>[125253.247636] ? invalid_page_referenced_vma+0x90/0x90
>> <4>[125253.247906] write_cache_pages+0x23a/0x4e0
>> <4>[125253.248166] ? xfs_add_to_ioend+0x240/0x240 [xfs]
>> <4>[125253.248345] ? dequeue_entity+0x2f4/0xa90
>> <4>[125253.248596] xfs_vm_writepages+0x94/0xd0 [xfs]
>> <4>[125253.248793] do_writepages+0x17/0x70
>> <4>[125253.249020] __filemap_fdatawrite_range+0xa7/0xe0
>> <4>[125253.249269] file_write_and_wait_range+0x4c/0xa0
>> <4>[125253.249514] xfs_file_fsync+0x55/0x220 [xfs]
>> <4>[125253.249719] do_fsync+0x38/0x60
>> <4>[125253.249877] SyS_fsync+0xc/0x10
>> <4>[125253.250098] entry_SYSCALL_64_fastpath+0x20/0x83
>> <4>[125253.250285] RIP: 0033:0x7f74e92a6a7d
>> <4>[125253.250484] RSP: 002b:00007f74d59dcc90 EFLAGS: 00000293
>> <4>[125253.250485] Code: ff ff 8b 07 4d 29 c8 4c 0f af c0 e9 27 ff ff ff 48 89 c8 c3 90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 8b 06 31 d2 89 ff <48> f7 f7 48 89 c1 b8 b2 0c 00 00 48 81 f9 b2 0c 00 00 77 0f 48
>> <1>[125253.251304] RIP: bfq_wr_duration.isra.15.part.16+0xc/0x40 RSP: ffffc90002d7b9e0
>>
>> And this one definitely has timer issues.
>>
>> <4>[69620.113311] hpet1: lost 1123 rtc interrupts
>> <4>[69725.670214] hpet1: lost 1257 rtc interrupts
>> <4>[69730.301092] hpet1: lost 295 rtc interrupts
>> <4>[69785.817074] hpet1: lost 804 rtc interrupts
>> <4>[70276.099325] hpet1: lost 1138 rtc interrupts
>> <4>[70338.962431] hpet1: lost 1274 rtc interrupts
>> <4>[70687.531960] hpet1: lost 314 rtc interrupts
>> <4>[70708.789681] hpet1: lost 1359 rtc interrupts
>> <4>[70892.699792] hpet1: lost 775 rtc interrupts
>> <4>[70985.631630] hpet1: lost 448 rtc interrupts
>> <4>[71045.952024] hpet1: lost 1109 rtc interrupts
>> <4>[71236.781185] hpet1: lost 1216 rtc interrupts
>> <4>[71253.521384] hpet1: lost 1071 rtc interrupts
>> <4>[71477.087387] hpet1: lost 563 rtc interrupts
>> <4>[71568.587212] hpet1: lost 357 rtc interrupts
>> <4>[71837.330897] hpet1: lost 705 rtc interrupts
>> <4>[72029.244809] hpet1: lost 1286 rtc interrupts
>> <4>[72251.173477] hpet1: lost 458 rtc interrupts
>> <4>[72398.366782] hpet1: lost 1172 rtc interrupts
>> <4>[73218.363935] hpet1: lost 249 rtc interrupts
>> <4>[73280.472760] hpet1: lost 1225 rtc interrupts
>> <4>[73458.506121] hpet1: lost 398 rtc interrupts
>> <4>[73504.331306] hpet1: lost 183 rtc interrupts
>> <4>[73596.270009] hpet1: lost 386 rtc interrupts
>> <4>[73858.989240] hpet1: lost 320 rtc interrupts
>> <4>[73903.326725] hpet1: lost 88 rtc interrupts
>> <4>[74170.760022] hpet1: lost 622 rtc interrupts
>> <4>[74517.835154] hpet1: lost 216 rtc interrupts
>> <4>[74609.630248] hpet1: lost 375 rtc interrupts
>> <4>[74884.937850] hpet1: lost 1126 rtc interrupts
>> <4>[74886.291818] hpet1: lost 86 rtc interrupts
>> <4>[74902.415909] hpet1: lost 1031 rtc interrupts
>> <4>[75134.991339] hpet1: lost 1140 rtc interrupts
>> <4>[75139.948051] hpet1: lost 316 rtc interrupts
>> <4>[75183.709393] hpet1: lost 50 rtc interrupts
>> <4>[75486.178692] hpet1: lost 115 rtc interrupts
>> <4>[75634.770692] hpet1: lost 1261 rtc interrupts
>> <4>[76322.311822] hpet1: lost 18 rtc interrupts
>> <4>[76941.322201] hpet1: lost 1128 rtc interrupts
>> <4>[77341.716930] hpet1: lost 884 rtc interrupts
>> <4>[77347.532830] hpet1: lost 371 rtc interrupts
>> <4>[77348.620826] hpet1: lost 68 rtc interrupts
>> <4>[77362.430004] divide error: 0000 [#1] SMP PTI
>> <4>[77362.430139] Modules linked in: netconsole configfs tcp_diag inet_diag unix_diag nf_log_ipv6 nf_log_common xt_LOG xt_u32 ip6t_REJECT nf_reject_ipv6 xt_conntrack xt_multiport ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_filter ip6_tables ipt_REJECT nf_reject_ipv4 xt_tcpudp xt_mark xt_owner iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_filter ip_tables x_tables nfsd auth_rpcgss nfs_acl nfs lockd grace sunrpc snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core snd_pcm snd_timer drm_kms_helper snd syscopyarea sysfillrect sysimgblt fb_sys_fops ttm soundcore drm 8021q garp mrp stp llc ppdev i2c_piix4 kvm_intel input_leds kvm irqbypass parport_pc fuse lp parport xfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor
>> <4>[77362.431347] async_tx xor raid6_pq libcrc32c raid1 raid0 floppy multipath linear ata_generic pata_acpi [last unloaded: netconsole]
>> <4>[77362.431545] CPU: 0 PID: 39134 Comm: skynet.heartbea Not tainted 4.14.15-5 #1
>> <4>[77362.431720] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
>> <4>[77362.432013] task: ffff8800964adcc0 task.stack: ffffc900030bc000
>> <4>[77362.432185] RIP: 0010:bfq_wr_duration.isra.15.part.16+0xc/0x40
>> <4>[77362.432348] RSP: 0018:ffffc900030bfa90 EFLAGS: 00010046
>> <4>[77362.432552] RAX: 00000000000d59f8 RBX: ffff8801348d6900 RCX: 000000000000001e
>> <4>[77362.432725] RDX: 0000000000000000 RSI: ffff8801348c1da0 RDI: 0000000000000000
>> <4>[77362.432992] RBP: ffff8801348c1c00 R08: 0000000000000001 R09: 0000000000000001
>> <4>[77362.433214] R10: ffff8800b9eacf30 R11: ffff8801348c1c00 R12: ffff880134decb60
>> <4>[77362.433466] R13: 0000000000000000 R14: ffff8800b9eacf30 R15: ffffc900030bfb38
>> <4>[77362.433696] FS: 00007fe28c721700(0000) GS:ffff88013fc00000(0000) knlGS:0000000000000000
>> <4>[77362.433948] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> <4>[77362.434125] CR2: 0000000001370ec8 CR3: 0000000135b2c000 CR4: 00000000000006f0
>> <4>[77362.434328] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> <4>[77362.434571] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> <4>[77362.434828] Call Trace:
>> <4>[77362.434997] bfq_insert_requests+0xda7/0xf60
>> <4>[77362.435150] blk_mq_sched_insert_request+0xf9/0x150
>> <4>[77362.435366] blk_mq_make_request+0x1da/0x630
>> <4>[77362.435516] generic_make_request+0xfb/0x2d0
>> <4>[77362.435701] ? submit_bio+0x73/0x150
>> <4>[77362.435827] submit_bio+0x73/0x150
>> <4>[77362.436100] xfs_submit_ioend+0x80/0x1a0 [xfs]
>> <4>[77362.436339] xfs_vm_writepages+0xa8/0xd0 [xfs]
>> <4>[77362.436549] do_writepages+0x17/0x70
>> <4>[77362.436692] __filemap_fdatawrite_range+0xa7/0xe0
>> <4>[77362.436884] filemap_write_and_wait_range+0x30/0x80
>> <4>[77362.437106] xfs_setattr_size+0x121/0x380 [xfs]
>> <4>[77362.437313] xfs_vn_setattr+0x6b/0x90 [xfs]
>> <4>[77362.437472] notify_change+0x2da/0x400
>> <4>[77362.437663] do_truncate+0x5b/0x90
>> <4>[77362.437842] do_sys_ftruncate.constprop.18+0x10b/0x170
>> <4>[77362.438033] ? SyS_write+0x66/0x90
>> <4>[77362.438251] entry_SYSCALL_64_fastpath+0x20/0x83
>> <4>[77362.438429] RIP: 0033:0x7fe293e8bd87
>> <4>[77362.438586] RSP: 002b:00007fe28c71fb28 EFLAGS: 00000246
>> <4>[77362.438587] Code: ff ff 8b 07 4d 29 c8 4c 0f af c0 e9 27 ff ff ff 48 89 c8 c3 90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 8b 06 31 d2 89 ff <48> f7 f7 48 89 c1 b8 b2 0c 00 00 48 81 f9 b2 0c 00 00 77 0f 48
>>
>>
>>> Thanks,
>>> Paolo
>>>> Paolo
>>>>
>>>>>
>>>>>> Thanks a lot,
>>>>>> Paolo
>>>>>>>
>>>>>>>>> update_thr_responsiveness_params(bfqd);
>>>>>>>>>
>>>>>>>>> reset_computation:
>