linux-kernel - Antw: 3.0.101: "blk_rq_check_limits: over max size limit."

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [day] [month] [year] [list]
Message-Id: <584815F8020000A100023940@gwsmtp1.uni-regensburg.de>
Date:   Wed, 07 Dec 2016 14:00:24 +0100
From:   "Ulrich Windl" <Ulrich.Windl@...uni-regensburg.de>
To:     <linux-kernel@...r.kernel.org>
Cc:     <axboe@...nel.dk>,
        "Ulrich Windl" <Ulrich.Windl@...uni-regensburg.de>
Subject: Antw: 3.0.101: "blk_rq_check_limits: over max size limit."

Hi once more!

I managed to get the call traces of involved processes:

1) The process doing read():
Dec  7 13:51:16 h10 kernel: [183809.594434] SysRq : Show Blocked State
Dec  7 13:51:16 h10 kernel: [183809.594447]   task                        PC stack   pid father
Dec  7 13:51:16 h10 kernel: [183809.594750] randio          D ffff8801703a9d68     0  2762  53250 0x00000004
Dec  7 13:51:16 h10 kernel: [183809.594758]  ffff880100887ad8 0000000000000046 ffff880100886010 0000000000010900
Dec  7 13:51:16 h10 kernel: [183809.594765]  0000000000010900 0000000000010900 0000000000010900 ffff880100887fd8
Dec  7 13:51:16 h10 kernel: [183809.594772]  ffff880100887fd8 0000000000010900 ffff88016bb6a280 ffff88017670c300
Dec  7 13:51:16 h10 kernel: [183809.594778] Call Trace:
Dec  7 13:51:16 h10 kernel: [183809.594806]  [<ffffffff814682ec>] io_schedule+0x9c/0xf0
Dec  7 13:51:16 h10 kernel: [183809.594820]  [<ffffffff810fbb93>] __lock_page+0x93/0xc0
Dec  7 13:51:16 h10 kernel: [183809.594834]  [<ffffffff8110a004>] truncate_inode_pages_range+0x294/0x460
Dec  7 13:51:16 h10 kernel: [183809.594845]  [<ffffffff811948b7>] __blkdev_put+0x1d7/0x210
Dec  7 13:51:16 h10 kernel: [183809.594856]  [<ffffffff811613b3>] __fput+0xb3/0x200
Dec  7 13:51:16 h10 kernel: [183809.594868]  [<ffffffff8115cddc>] filp_close+0x5c/0x90
Dec  7 13:51:16 h10 kernel: [183809.594880]  [<ffffffff810652ea>] put_files_struct+0x7a/0xd0
Dec  7 13:51:16 h10 kernel: [183809.594889]  [<ffffffff81066fd0>] do_exit+0x1d0/0x470
Dec  7 13:51:16 h10 kernel: [183809.594897]  [<ffffffff810672ad>] do_group_exit+0x3d/0xb0
Dec  7 13:51:16 h10 kernel: [183809.594907]  [<ffffffff810782e7>] get_signal_to_deliver+0x247/0x480
Dec  7 13:51:16 h10 kernel: [183809.594919]  [<ffffffff81002951>] do_signal+0x71/0x1b0
Dec  7 13:51:16 h10 kernel: [183809.594928]  [<ffffffff81002b28>] do_notify_resume+0x98/0xb0
Dec  7 13:51:16 h10 kernel: [183809.594940]  [<ffffffff81472740>] int_signal+0x12/0x17
Dec  7 13:51:16 h10 kernel: [183809.594988]  [<00007f64f28cbba0>] 0x7f64f28cbb9f

2) The process trying to modify the queue scheduler:
Dec  7 13:51:16 h10 kernel: [183809.594995] blocktune       D ffff88014e048000     0 58867      1 0x00000000
Dec  7 13:51:16 h10 kernel: [183809.595000]  ffff880128defd18 0000000000000086 ffff880128dee010 0000000000010900
Dec  7 13:51:16 h10 kernel: [183809.595007]  0000000000010900 0000000000010900 0000000000010900 ffff880128deffd8
Dec  7 13:51:16 h10 kernel: [183809.595013]  ffff880128deffd8 0000000000010900 ffff88012889a3c0 ffff8801767bc1c0
Dec  7 13:51:16 h10 kernel: [183809.595019] Call Trace:
Dec  7 13:51:16 h10 kernel: [183809.595026]  [<ffffffff81468680>] schedule_timeout+0x1b0/0x2a0
Dec  7 13:51:16 h10 kernel: [183809.595040]  [<ffffffff8107379d>] msleep+0x1d/0x30
Dec  7 13:51:16 h10 kernel: [183809.595052]  [<ffffffff8122c4fc>] __blk_drain_queue+0xbc/0x140
Dec  7 13:51:16 h10 kernel: [183809.595063]  [<ffffffff812281e1>] elv_quiesce_start+0x51/0x90
Dec  7 13:51:16 h10 kernel: [183809.595071]  [<ffffffff8122826a>] elevator_switch+0x4a/0x150
Dec  7 13:51:16 h10 kernel: [183809.595079]  [<ffffffff812283dd>] elevator_change+0x6d/0xb0
Dec  7 13:51:16 h10 kernel: [183809.595086]  [<ffffffff81228447>] elv_iosched_store+0x27/0x60
Dec  7 13:51:16 h10 kernel: [183809.595096]  [<ffffffff81230887>] queue_attr_store+0x67/0xc0
Dec  7 13:51:16 h10 kernel: [183809.595106]  [<ffffffff811d134b>] sysfs_write_file+0xcb/0x160
Dec  7 13:51:16 h10 kernel: [183809.595115]  [<ffffffff8115fade>] vfs_write+0xce/0x140
Dec  7 13:51:16 h10 kernel: [183809.595123]  [<ffffffff8115fc53>] sys_write+0x53/0xa0
Dec  7 13:51:16 h10 kernel: [183809.595131]  [<ffffffff81472472>] system_call_fastpath+0x16/0x1b
Dec  7 13:51:16 h10 kernel: [183809.595140]  [<00007f12b7f70c00>] 0x7f12b7f70bff

3) The process trying to read the queue scheduler:
Dec  7 13:51:16 h10 kernel: [183809.595149] cat             D ffff880147873718     0 45053   5957 0x00000004
Dec  7 13:51:16 h10 kernel: [183809.595155]  ffff880122f7be00 0000000000000082 ffff880122f7a010 0000000000010900
Dec  7 13:51:16 h10 kernel: [183809.595161]  0000000000010900 0000000000010900 0000000000010900 ffff880122f7bfd8
Dec  7 13:51:16 h10 kernel: [183809.595167]  ffff880122f7bfd8 0000000000010900 ffff8801154ea1c0 ffffffff81a11020
Dec  7 13:51:16 h10 kernel: [183809.595174] Call Trace:
Dec  7 13:51:16 h10 kernel: [183809.595181]  [<ffffffff81468fe0>] __mutex_lock_slowpath+0x160/0x1f0
Dec  7 13:51:16 h10 kernel: [183809.595189]  [<ffffffff81468b1a>] mutex_lock+0x1a/0x40
Dec  7 13:51:16 h10 kernel: [183809.595196]  [<ffffffff81230929>] queue_attr_show+0x49/0xb0
Dec  7 13:51:16 h10 kernel: [183809.595203]  [<ffffffff811d14de>] sysfs_read_file+0xfe/0x1c0
Dec  7 13:51:16 h10 kernel: [183809.595212]  [<ffffffff8115fd67>] vfs_read+0xc7/0x130
Dec  7 13:51:16 h10 kernel: [183809.595219]  [<ffffffff8115fed3>] sys_read+0x53/0xa0
Dec  7 13:51:16 h10 kernel: [183809.595226]  [<ffffffff81472472>] system_call_fastpath+0x16/0x1b
Dec  7 13:51:16 h10 kernel: [183809.595235]  [<00007fb04560dba0>] 0x7fb04560db9f



>>> Ulrich Windl schrieb am 07.12.2016 um 13:23 in Nachricht <5847FF5E.7E4 : 161 :
60728>:
> Hi again!
> 
> An addition: Processes doing such I/O seem to be unkillable, and I also 
> cannot change the queue parameters while this problem occurs (the process 
> trying to write (e.g.: to queue/scheduler) is also blocked. The process 
> status of the process doing I/O looks like this:
> # cat /proc/2762/status
> Name:   randio
> State:  D (disk sleep)
> Tgid:   2762
> Pid:    2762
> PPid:   53250
> TracerPid:      0
> Uid:    0       0       0       0
> Gid:    0       0       0       0
> FDSize: 0
> Groups: 0 105
> Threads:        1
> SigQ:   5/38340
> SigPnd: 0000000000000000
> ShdPnd: 0000000000004102
> SigBlk: 0000000000000000
> SigIgn: 0000000000000000
> SigCgt: 0000000180000000
> CapInh: 0000000000000000
> CapPrm: ffffffffffffffff
> CapEff: ffffffffffffffff
> CapBnd: ffffffffffffffff
> Cpus_allowed:   ffffffff,ffffffff
> Cpus_allowed_list:      0-63
> Mems_allowed:   
> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000
> 000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0
> 0000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,000000
> 00,00000000,00000000,00000000,00000000,00000000,00000001
> Mems_allowed_list:      0
> voluntary_ctxt_switches:        5
> nonvoluntary_ctxt_switches:     1
> 
> Best regards,
> Ulrich
> 
> >>> Ulrich Windl schrieb am 07.12.2016 um 13:19 in Nachricht <5847FE66.7E4 : 161 
> :
> 60728>:
> > Hi again!
> > 
> > Maybe someone can confirm this:
> > If you have a device (e.g. multipath map) that limits max_sectors_kb to 
> > maybe 64, and then define an LVM LV using that multipath map as PV, the LV 
> > still has a larger max_sectors_kb. If you send a big request (read in my 
> > case), the kernel will complain:
> > 
> > kernel: [173116.098798] blk_rq_check_limits: over max size limit.
> > 
> > Note that this message does not give any clue to the device being involved, 
> 
> > nor the size of the IO attempted, nor the limit of the IO size.
> > 
> > My expectation would be that the higher layer reports back an I/O error, 
> and 
> > the user process receives an I/O error, or, alternatively, the big request 
> is 
> > split into acceptable chunks before passing it to the lower layers.
> > 
> > However none of the above happens; instead the request seems to block the 
> > request queue, because later TUR-checks also fail:
> > kernel: [173116.105701] device-mapper: multipath: Failing path 66:384.
> > kernel: [173116.105714] device-mapper: multipath: Failing path 66:352.
> > multipathd: 66:384: mark as failed
> > multipathd: NAP_S11: remaining active paths: 1
> > multipathd: 66:352: mark as failed
> > multipathd: NAP_S11: Entering recovery mode: max_retries=6
> > multipathd: NAP_S11: remaining active paths: 0
> > 
> > (somewhat later)
> >  multipathd: NAP_S11: sdkh - tur checker reports path is up
> > multipathd: 66:336: reinstated
> > multipathd: NAP_S11: Recovered to normal mode
> > kernel: [173117.286712] device-mapper: multipath: Could not failover device 
> 
> > 66:368: Handler scsi_dh_alua error 8.
> > (I don't know the implications of this)
> > 
> > Of course this error does not appear as long as all devices use the same 
> > maximum request size, but tests have shown that different SAN disk systems 
> > prefer different request sizes (as they split large requests internally to 
> > handle them in chunks anyway).
> > 
> > Last seen with this kernel (SLES11 SP4 on x86_64): Linux version 
> > 3.0.101-88-default (geeko@...ldhost) (gcc version 4.3.4 [gcc-4_3-branch 
> > revision 152973] (SUSE Linux) ) #1 SMP Fri Nov 4 22:07:35 UTC 2016 
> (b45f205)
> > 
> > Regards,
> > Ulrich
> > 
> > >>> Ulrich Windl schrieb am 23.08.2016 um 17:03 in Nachricht <57BC65CD.D1A : 
> 161 
> > :
> > 60728>:
> > > Hello!
> > > 
> > > While performance-testing a 3PARdata StorServ 8400 with SLES11SP4, I 
> > noticed 
> > > that I/Os dropped, until everything stood still more or less. Looking into 
> 
> > > the syslog I found that multipath's TUR-checker considered the paths (FC, 
> > > BTW) as dead. Amazingly I did not have this problem when I did read-only 
> > > tests.
> > > 
> > > The start looks like this:
> > > Aug 23 14:44:58 h10 multipathd: 8:32: mark as failed
> > > Aug 23 14:44:58 h10 multipathd: FirstTest-32: remaining active paths: 3
> > > Aug 23 14:44:58 h10 kernel: [  880.159425] blk_rq_check_limits: over max 
> > > size limit.
> > > Aug 23 14:44:58 h10 kernel: [  880.159611] blk_rq_check_limits: over max 
> > > size limit.
> > > Aug 23 14:44:58 h10 kernel: [  880.159615] blk_rq_check_limits: over max 
> > > size limit.
> > > Aug 23 14:44:58 h10 kernel: [  880.159623] device-mapper: multipath: 
> > Failing 
> > > path 8:32.
> > > Aug 23 14:44:58 h10 kernel: [  880.186609] blk_rq_check_limits: over max 
> > > size limit.
> > > Aug 23 14:44:58 h10 kernel: [  880.186626] blk_rq_check_limits: over max 
> > > size limit.
> > > Aug 23 14:44:58 h10 kernel: [  880.186628] blk_rq_check_limits: over max 
> > > size limit.
> > > Aug 23 14:44:58 h10 kernel: [  880.186631] device-mapper: multipath: 
> > Failing 
> > > path 129:112.
> > > [...]
> > > It seems the TUR-checker does some ping-pong-like game: Paths go up and 
> > down
> > > 
> > > Now for the Linux part: I found the relevant message in blk-core.c 
> > > (blk_rq_check_limits()).
> > > First s/agaist/against/ in " *    Such request stacking drivers should 
> > check 
> > > those requests agaist", the there's the problem that the message neither 
> > > outputs the blk_rq_sectors(), nor the blk_queue_get_max_sectors(), nor the 
> 
> > > underlying device. That makes debugging somewhat difficult if you 
> customize 
> > 
> > > the block queue settings per device as I did:
> > > 
> > > Aug 23 14:32:33 h10 blocktune: (notice) start: activated tuning of 
> > > queue/rotational for FirstTest-31 (0)
> > > Aug 23 14:32:33 h10 blocktune: (notice) start: activated tuning of 
> > > queue/add_random for FirstTest-31 (0)
> > > Aug 23 14:32:33 h10 blocktune: (notice) start: activated tuning of 
> > > queue/scheduler for FirstTest-31 (noop)
> > > Aug 23 14:32:33 h10 blocktune: (notice) start: activated tuning of 
> > > queue/max_sectors_kb for FirstTest-31 (128)
> > > Aug 23 14:32:33 h10 blocktune: (notice) start: activated tuning of 
> > > queue/rotational for FirstTest-32 (0)
> > > Aug 23 14:32:33 h10 blocktune: (notice) start: activated tuning of 
> > > queue/add_random for FirstTest-32 (0)
> > > Aug 23 14:32:33 h10 blocktune: (notice) start: activated tuning of 
> > > queue/scheduler for FirstTest-32 (noop)
> > > Aug 23 14:32:34 h10 blocktune: (notice) start: activated tuning of 
> > > queue/max_sectors_kb for FirstTest-32 (128)
> > > 
> > > I suspect the "queue/max_sectors_kb=128" is the culprit:
> > > # multipath -ll FirstTest-32
> > > FirstTest-32 (360002ac000000000000000040001b383) dm-7 3PARdata,VV
> > > size=10G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
> > > `-+- policy='service-time 0' prio=50 status=active
> > >   |- 2:0:0:1  sdet 129:80  failed ready running
> > >   |- 2:0:2:1  sdev 129:112 failed ready running
> > >   |- 1:0:0:1  sdb  8:16    failed ready running
> > >   `- 1:0:1:1  sdc  8:32    failed ready running
> > > # cat /sys/block/{dm-7,sd{b,c},sde{t,v}}/queue/max_sectors_kb
> > > 128
> > > 128
> > > 128
> > > 128
> > > 128
> > > 
> > > While writing this message, I noticed that I had created a primary 
> > partition 
> > > of dm-7:
> > > # dmsetup ls |grep Fi
> > > FirstTest-32_part1      (253:8)
> > > FirstTest-32    (253:7)
> > > # cat /sys/block/dm-8/queue/max_sectors_kb
> > > 1024
> > > 
> > > After "# echo 128 >/sys/block/dm-8/queue/max_sectors_kb" things still did 
> > not 
> > > get better.
> > > 
> > > Can't blk_rq_check_limits() do anything more clever than returning -EIO?
> > > 
> > > Regards,
> > > Ulrich
> > > P.S: Keep me in CC:, please!
> > > 
> > > 
> > > 
> > 
> > 
> > 
> > 
> 
> 
> 
>