Herein lies madness

Completion deadlines is part of my Master's thesis ("Native command queuing and latency"). The full paper is available in the Download section below.

Background

With the advances in disk technology, seek time has been reduced at a higher rate than rotational latency. And thus rotational latency accounts for a higher and higher percentage of the mechanical overhead in disk drives. Most modern disk drives have internal command queues that allows for multiple commands to be outstanding within a drive at the same time. Command queuing can provide a substantial improvement in performance as the requests gets serviced in an order that results in the lowest overall mechanical overhead. At the same time, it can result in unacceptable high worst-case request response times.

Once requests have been dispatched to a disk drive, the operating system have no control over the order in which they are serviced. The disk drive may choose to ignore one or more requests for an extended period of time. Disk drives that supports command queuing typically contain a starvation timeout to ensure that individual requests cannot be starved indefinitely.

A simple experiment will illustrate the problem. Consider a synthetic workload consisting of two asynchronous random readers, each accessing different areas on the disk. One reader will constantly maintain three outstanding requests while the other will only issue one request at a time. As the disk drive schedules the requests in the order that results in the lowest overall mechanical overhead, the disk drive will only service the single outstanding request when it triggers the starvation timeout.

Benchmarking the maximum time a request spends queued in the command queue (device latency) for three different disk drive models yields the following results:

Disk drive Maximum device latency
Western digital RE3
WD5002ABYS (3.5")
2032 ms
Seagate Momentus
7200.3 (2.5")
1548 ms
Hitachi Travelstar
7K320 (2.5")
528 ms

Clearly requests are being ignored, resulting in maximum device latencies of up to 2 seconds, and the starvation timeouts of 2, 1.5 and 0.5 seconds for the disk drives are easily identified.

Deadlines

To achieve high disk throughput, operating systems generally rely on reordering of requests. This, however, can result in high worst-case OS latency as the scheduler may ignore some requests for extended period of time. Many scheduling policies use some form of request deadlines to limit OS latency. With the exception of the noop scheduler, all of the scheduling policies in Linux incorporate request deadlines. These deadlines could more accurately be called dispatch deadlines, as they are only concerned with the time until a request is dispatched to the disk drive.

Enforcement of additional completion deadlines can provide an upper bound on worst-case device latency and still allow the system to benefit from command queuing. By tracking how long individual requests has been queued in a disk drive, additional dispatch of requests can be blocked if the queuing time of any request exceed a specified completion deadline.

Completion deadlines can be implemented in the Linux elevator layer, by maintaining two red-black trees of dispatched requests. One tree for synchronous requests and one for synchronous requests, both sorted by request dispatch time. As with the disk schedulers in Linux, all read requests are regarded as synchronous I/O. Asynchronous write requests are generally issued by the page cache and may already have been delayed for several seconds, so longer completion deadlines can be allowed for these or they can be ignored altogether.

The strategy routine will keep asking the elevator layer for more requests until the dispatch queue is empty (or blocked), or until the device is no longer able to accept any more requests. If the device is too busy to receive more requests, it will plug the dispatch queue and the last request will be reinserted into the dispatch queue. In this case, the request is removed from the corresponding dispatch tree and the dispatch time is reset.

A patch for adding completion deadlines to the Linux kernel is available in Download section below. The patch adds two new tunables to the request queue sysfs:

/sys/block/sdX/queue/device_expire_sync
/sys/block/sdX/queue/device_expire_async

that specify the maximum time in milliseconds requests can spend in a block device before additional dispatch of requests is blocked.

Experiments

Evaluation of completion deadlines with a random I/O workload shows that worse-case response time can be significantly reduced with overall disk performance still benefitting from command queuing. Experiments and results are available in the thesis paper below.

Download

$Date: 2009-12-15 13:33:33 +0100 (Tue, 15 Dec 2009) $ - lp@core.dk