Message-ID: <TYWP286MB226752FFC0E0E33777AB319FB9709@TYWP286MB2267.JPNP286.PROD.OUTLOOK.COM>
Date: Thu, 9 Dec 2021 04:49:37 +0000
From: 小川 修一
<ogawa.syuuichi@...d-mse.com>
To: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
CC: 小川 修一
<ogawa.syuuichi@...d-mse.com>,
山本 達史
<yamamoto.tatsushi@...d-mse.com>,
"Natsume, Wataru (ADITJ/SWG)" <wnatsume@...adit-jv.com>
Subject: invalid counter value in request_queue
Hi, all
This is my first time posting to the list, so please let me know if there is any problem with my mail.
I am studying the Linux kernel, using kdump, to investigate a system performance problem.
I found a strange value in request_queue->q_usage_counter.percpu_count_ptr.
Kernel version: 4.9.52. I also briefly checked 5.10.80, and it looks similar.
super_block 0xffff9a563820e430 "vdb"
q=(struct request_queue *) 0xffff9a563b948920,q->q_usage_counter.percpu_count_ptr=(unsigned long *) 0x39dbc000b2b8
0:0xffffd431ffc0b2b8,0xffffffffffffdaf1,-9487
1:0xffffd431ffc8b2b8,0x0,0
2:0xffffd431ffd0b2b8,0x2510,9488
This is the output of a gdb script run inside the crash command. The format is <cpu>:<address>,<value in hex>,<value as signed long>.
The per-CPU values of percpu_count_ptr show a large negative value on cpu0 and a positive value on cpu2.
Summing over all CPUs gives a total of 1, which means 1 request remained in /dev/vdb at the time of the kdump.
It is easy to see that the CPU that counted up and the CPU that counted down were different.
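For reference, here is a minimal user-space sketch (my own illustration, not kernel code) of that sum: the per-CPU values are unsigned long, so they only make sense when added modulo 2^64, and the three values above add up to 1.

#include <stdio.h>

int main(void)
{
	/* per-CPU values of q_usage_counter taken from the kdump above */
	unsigned long percpu[3] = { 0xffffffffffffdaf1UL, 0x0UL, 0x2510UL };
	unsigned long sum = 0;

	/* the sum wraps modulo 2^64, like summing the per-CPU counters */
	for (int i = 0; i < 3; i++)
		sum += percpu[i];

	printf("total = %lu\n", sum);	/* prints 1 on a 64-bit build */
	return 0;
}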
I think q_usage_counter cannot work as a reference counter that guards against disposing of the request queue while it is still in use; however, I have not found where this counter value is actually consumed.
The system looks fine, but I wonder whether this can cause trouble such as a resource being disposed of while still in use.
Could you tell me whether this is really not a problem at all?
---
As we know, this counter is a reference counter for request queue access. For example:
generic_make_request
blk_queue_enter(q, false) -> percpu_ref_tryget_live(&q->q_usage_counter) : count up
blk_queue_exit(q) -> percpu_ref_put(&q->q_usage_counter) : count down
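For context, roughly how the caller side looks in 4.9 (a simplified sketch from my reading of the source, with the bio_list plugging and error handling omitted):

/* Simplified sketch of generic_make_request() in 4.9; plugging via
 * current->bio_list and most error handling are omitted. */
blk_qc_t generic_make_request(struct bio *bio)
{
	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
	blk_qc_t ret = BLK_QC_T_NONE;

	/* count up: percpu_ref_tryget_live(&q->q_usage_counter) */
	if (likely(blk_queue_enter(q, false) == 0)) {
		ret = q->make_request_fn(q, bio);

		/* count down: percpu_ref_put(&q->q_usage_counter) */
		blk_queue_exit(q);
	}
	return ret;
}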
If the count up happens on cpu2 and the count down happens on cpu0, the per-CPU values of this counter look invalid.
I found 2 cases:
case-1: the normal case of counting an actually requested I/O
blk_mq_map_request() maps the bio to a request for the block device and counts up in blk_queue_enter_live(q).
blk_mq_end_request() is called in IRQ context when the I/O completes, and counts down via
blk_mq_free_request() -> blk_queue_exit(q).
On my system, IRQ context normally runs on cpu0, so if an application issues file I/O on cpu2, this situation is reproduced (see the small model sketched after case-2).
case-2: preemption
generic_make_request does not run with preemption disabled, so the CPU may change between
blk_queue_enter and blk_queue_exit.
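To make the imbalance in case-1 concrete, here is a small user-space model (my own illustration, not kernel code) in which every I/O counts up on cpu2 at submission and counts down on cpu0 in the completion IRQ; the individual per-CPU values drift apart, while their sum still reflects the number of in-flight requests:

#include <stdio.h>

int main(void)
{
	/* toy per-CPU usage counters, unsigned long like the real ones */
	unsigned long cnt[3] = { 0, 0, 0 };

	/* 10000 I/Os: submitted on cpu2, completed in IRQ context on cpu0 */
	for (int i = 0; i < 10000; i++) {
		cnt[2]++;	/* blk_queue_enter() on the submitting CPU */
		cnt[0]--;	/* blk_queue_exit() from the IRQ on cpu0    */
	}

	/* each per-CPU value looks strange, but the sum over all CPUs is 0 */
	printf("cpu0=%ld cpu1=%ld cpu2=%ld total=%ld\n",
	       (long)cnt[0], (long)cnt[1], (long)cnt[2],
	       (long)(cnt[0] + cnt[1] + cnt[2]));
	return 0;
}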
Now I think q_usage_counter should be a simple atomic_t or kref instead of a percpu_ref.
The system looks fine for now, and I have not yet made any patch to correct this.
If I get a chance to make a patch, I will post it here.
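Just to make the direction concrete, here is a rough illustration (untested, not a patch) of how a plain atomic_t enter/exit pair could look; the names below are placeholders of my own, not real kernel symbols:

/* Rough illustration only, untested: a single shared atomic_t instead of
 * a per-CPU reference.  Names are placeholders, not real kernel symbols. */
#include <linux/atomic.h>
#include <linux/errno.h>

struct example_queue {
	atomic_t	usage;	/* one shared counter, no per-CPU state */
	bool		dying;
};

static inline int example_queue_enter(struct example_queue *q)
{
	if (q->dying)
		return -ENODEV;
	atomic_inc(&q->usage);	/* count up; which CPU runs this does not matter */
	return 0;
}

static inline void example_queue_exit(struct example_queue *q)
{
	atomic_dec(&q->usage);	/* count down; which CPU runs this does not matter */
}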
Best regards,
Shuichi Ogawa, NTT-Data MSE Corporation