Message-ID: <4C07A8E9.30608@panasas.com>
Date: Thu, 03 Jun 2010 16:06:49 +0300
From: Boaz Harrosh <bharrosh@...asas.com>
To: Vladislav Bolkhovitin <vst@...b.net>
CC: James Bottomley <James.Bottomley@...e.de>,
Christof Schmitt <christof.schmitt@...ibm.com>,
"Martin K. Petersen" <martin.petersen@...cle.com>,
linux-scsi@...r.kernel.org, linux-kernel@...r.kernel.org,
linux-fsdevel@...r.kernel.org,
Chris Mason <chris.mason@...cle.com>,
Gennadiy Nerubayev <parakie@...il.com>
Subject: Re: Wrong DIF guard tag on ext2 write
On 06/03/2010 03:41 PM, Vladislav Bolkhovitin wrote:
> Boaz Harrosh, on 06/03/2010 04:07 PM wrote:
>> On 06/03/2010 02:20 PM, Vladislav Bolkhovitin wrote:
>>> There's one interesting problem here, at least theoretically, with SCSI
>>> or similar transports which allow a command queue depth >1 and are
>>> allowed to internally reorder queued requests. I don't know the FS/block
>>> layers sufficiently well to tell whether sending several requests for the
>>> same page is really possible or not, but we can see a real-life problem,
>>> which can be well explained if it is possible.
>>>
>>> The problem could arise if the second (rewrite) request (SCSI command) for
>>> the same page is queued to the corresponding device before the original
>>> request has finished. Since the device is allowed to freely reorder requests,
>>> there's a probability that the original write request would hit the
>>> permanent storage *AFTER* the retry request, hence the data changes it's
>>> carrying would be lost, hence welcome data corruption.
>>>
>>
>> I might be totally wrong here, but I think NCQ can reorder sectors but
>> not writes. That is, if the sector is cached in device memory and a later
>> write comes to modify the same sector, then the original should be
>> replaced, rather than two values of the same sector being kept in the
>> device cache at the same time.
>>
>> Failing to do so is a scsi device problem.
>
> SCSI devices supporting the Full task management model (almost all) and
> having the QUEUE ALGORITHM MODIFIER bits in the Control mode page set to 1
> are allowed to freely reorder any commands with the SIMPLE task attribute.
> If an application wants to maintain the order of some commands for such
> devices, it must issue them with the ORDERED task attribute and over a
> _single_ MPIO path to the device.
>
> Linux neither uses the ORDERED attribute, nor honors or enforces the
> QUEUE ALGORITHM MODIFIER bits in any way, nor takes care to send commands
> with order dependencies (overlapping writes in our case) over a single
> MPIO path.
>
OK, I take your word for it. But that sounds stupid to me. I would think
that sectors can be ordered, not commands per se. What happens with reads
then? Do they get ordered too? I mean, for a read in between the two writes,
which value is read? It gets so complicated that only a sector model makes
sense to me.
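
To make the concern concrete, here is a toy model of two queued writes to
the same sector. It is only a sketch, not real SCSI or NCQ behaviour: the
function apply_writes, the sector number and the data values are made up
for illustration, and the "device" is just a dict that applies writes in
whatever order it completes them.

```python
from itertools import permutations


def apply_writes(writes, completion_order):
    """Apply queued writes to a sector map in the order the device completes them."""
    disk = {}
    for idx in completion_order:
        sector, data = writes[idx]
        disk[sector] = data  # the last write to complete wins
    return disk


# Two writes to the same (hypothetical) sector, listed in submission order.
writes = [(100, "old"), (100, "new")]

# If the device may complete commands in any order, collect every value
# that sector 100 can end up holding.
outcomes = {
    apply_writes(writes, order)[100]
    for order in permutations(range(len(writes)))
}
```

With free reordering of commands, outcomes contains both values, so the
rewrite can be lost. Under a sector model only the value from the later
submission would be acceptable.
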
>> Please note that the page-to-sector mapping is not necessarily constant,
>> and the same page might get written to a different sector next time. But
>> FSs will have to barrier in this case.
>>
>>> For single parallel SCSI or SAS devices such a race may look practically
>>> impossible, but for sophisticated clusters where many nodes pretend to
>>> be a single SCSI device in a load-balancing configuration, it becomes
>>> very real.
>>>
>>> We can see the real-life problem in an active-active DRBD setup. In this
>>> configuration 2 nodes act as a single SCST-powered SCSI device and they
>>> both run DRBD to keep their backstorage in-sync. The initiator uses them
>>> as a single multipath device in an active-active round-robin
>>> load-balancing configuration, i.e. sends requests to both nodes in
>>> parallel, then DRBD takes care to replicate the requests to the other node.
>>>
>>> The problem is that sometimes DRBD complains about concurrent local
>>> writes, like:
>>>
>>> kernel: drbd0: scsi_tgt0[12503] Concurrent local write detected!
>>> [DISCARD L] new: 144072784s +8192; pending: 144072784s +8192
>>>
>>> This message means that DRBD detected that both nodes received
>>> overlapping writes on the same block(s) and DRBD can't figure out which
>>> one to store. This is possible only if the initiator sent the second
>>> write request before the first one completed.
>>
>> It is totally possible in today's code.
>>
>> DRBD should store the original command_sn of the write and discard
>> the sector with the lower SN. It should appear as a single device
>> to the initiator.
>
> How can it find the SN? The commands were sent over _different_ MPIO
> paths to the device, so at the moment of the sending all the order
> information was lost.
>
I'm not sure of the specifics here. But I think the initiator has either
set the same SN on the two paths, or has incremented it between paths.
You said:
> The initiator uses them as a single multipath device in an active-active
> round-robin load-balancing configuration, i.e. sends requests to both nodes
> in parallel.
So what was the SN sent to each side? Is there a relationship between them,
or do they each advance independently?
If there is a relationship, then the targets on the two sides should store
the SN for later comparison. (Life is hard)
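
A minimal sketch of what I mean by storing the SN for later comparison. It
assumes the very thing that is in question above, namely that the initiator
stamps writes on both paths from one shared, increasing sequence. ReplicaNode
and its fields are hypothetical names and this is not how DRBD or SCST work
today.

```python
class ReplicaNode:
    """One side of the pair; remembers the SN of the last write per sector."""

    def __init__(self):
        self.sectors = {}  # sector -> (sn, data)

    def write(self, sector, sn, data):
        """Apply the write only if its SN is newer than the stored one."""
        stored = self.sectors.get(sector)
        if stored is not None and stored[0] >= sn:
            return False  # lower (or equal) SN: discard the stale write
        self.sectors[sector] = (sn, data)
        return True


# The same two writes reach the two nodes in opposite orders.
node_a, node_b = ReplicaNode(), ReplicaNode()
node_a.write(100, 1, "old")
node_a.write(100, 2, "new")
node_b.write(100, 2, "new")
stale_applied = node_b.write(100, 1, "old")  # arrives late on this node
```

Both nodes end up holding the write with the higher SN no matter which one
arrived first. If the SNs advance independently per path, this comparison
means nothing.
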
> Until SCSI generally allows preserving ordering information between
> MPIO paths in such configurations, the only way to maintain command
> order would be queue draining. Hence, for safety, all initiators working
> with such devices must do it.
>
> But it looks like Linux doesn't do it, so is it unsafe with MPIO clusters?
>
> Vlad
>
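
For illustration, a narrower variant of draining could look like the sketch
below: the initiator holds back only a write that overlaps one still in
flight, instead of draining the whole queue. This is an assumption about
what an initiator could do, not what Linux does. OverlapSerializer and its
methods are hypothetical, and the start sector and 16-sector length are
example values only.

```python
from collections import deque


class OverlapSerializer:
    """Hold back a write while an overlapping earlier write is outstanding."""

    def __init__(self):
        self.in_flight = []     # (start, length) ranges sent to the device
        self.waiting = deque()  # ranges held back, in submission order

    @staticmethod
    def _overlaps(a, b):
        return a[0] < b[0] + b[1] and b[0] < a[0] + a[1]

    def _blocked(self, req, earlier):
        return any(self._overlaps(req, other) for other in earlier)

    def submit(self, start, length):
        """Return True if the write may be sent now, False if it is held."""
        req = (start, length)
        if self._blocked(req, self.in_flight + list(self.waiting)):
            self.waiting.append(req)
            return False
        self.in_flight.append(req)
        return True

    def complete(self, start, length):
        """Mark a write as finished and return the held writes now released."""
        self.in_flight.remove((start, length))
        released, still_waiting = [], deque()
        for req in self.waiting:
            if self._blocked(req, self.in_flight + list(still_waiting)):
                still_waiting.append(req)
            else:
                self.in_flight.append(req)
                released.append(req)
        self.waiting = still_waiting
        return released


q = OverlapSerializer()
first = q.submit(144072784, 16)    # original write
second = q.submit(144072784, 16)   # rewrite of the same range: held
other = q.submit(0, 8)             # unrelated range: goes straight out
released = q.complete(144072784, 16)
```

The rewrite is only sent after the original completes, so the two never sit
in the device queue (or on two paths) at the same time.
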
Thanks
Boaz