lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:	Mon, 25 Jun 2012 17:18:50 -0400
From:	Vivek Goyal <vgoyal@...hat.com>
To:	Josh Hunt <joshhunt00@...il.com>
Cc:	Jens Axboe <axboe@...nel.dk>, linux-kernel@...r.kernel.org,
	tj@...nel.org
Subject: Re: multi-second application stall in open()

On Mon, Jun 25, 2012 at 11:22:32AM -0500, Josh Hunt wrote:
> On Mon, Jun 25, 2012 at 8:30 AM, Vivek Goyal <vgoyal@...hat.com> wrote:
> > On Fri, Jun 22, 2012 at 04:34:07PM -0500, Josh Hunt wrote:
> >
> > [..]
> >> Shouldn't the queue stay on the RR list until it is empty?
> >
> > This does look odd. cfqq should stay on service tree as long as it has
> > requests.
> >
> > Can you attach the full log again. Also make sure that blktrace is not
> > dropping any trace events.
> >
> > In slice_expire() we check following.
> >
> >        if (cfq_cfqq_on_rr(cfqq) && RB_EMPTY_ROOT(&cfqq->sort_list))
> >                cfq_del_cfqq_rr(cfqd, cfqq);
> >
> > So for some reason RB_EMPTY_ROOT() is returning true. But we must have
> > added the request and it should not have been empty.
> >
> > cfq_insert_request()
> >  cfq_add_rq_rb()
> >    elv_rb_add()
> >
> > So may be little more tracing after request addition will help. Just check
> > that RB_EMPTY_ROOT() is not true after addition of request and also print
> > number of requests queued.
> >
> > In slice_expired() we can probably put a BUG_ON() which checks following.
> >
> > BUG_ON(RB_EMPTY_ROOT(&cfqq->sort_list) && (cfqq->queued[0] || cfqq->queued[1]));
> >
> > Thanks
> > Vivek
> 
> Vivek
> 
> First off thanks for all the time you've spent helping me on this :)
> 
> I'm attaching the log. I will add more instrumentation based on your
> mail and let you know the results.

I suspect that cfq5917 are actually two queues. One sync and other async.
In latest kernels, trace messages append "A" or "S" in queue name to
represent async/sync queue. Why do I think so.

- I see that cfq5917 has dispatched both "W" and "R" requests. "R" don't
  get queued in sync queues and vice a versa.

- Once cfq5917 has preempted cfq5917. A queue can't preempt itself. 

  8,0    1   296615  4847.030318652  5917  I   R 1572951 + 8 (    2793)
[syslogd]
  8,0    1        0  4847.030321446     0  m   N cfq5917 insert_request
  8,0    1        0  4847.030323122     0  m   N cfq5917 add_to_rr
  8,0    1        0  4847.030325636     0  m   N cfq5917 preempt
  8,0    1        0  4847.030327033     0  m   N cfq5917 slice expired t=1
  8,0    1        0  4847.030328710     0  m   N cfq5917 resid=39
  8,0    1        0  4847.030331783     0  m   N cfq5917 sl_used=1 disp=7
charge=1 iops=0 sect=96

Hence I think that it is just a chance that both sync and async queues
have got same cfqq->pid. If you just append anohter character to represent
sync and async property of queue, that can cofirm it.

Thanks
Vivek
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ