
Direct I/O: Bypassing the OS Page Cache

April 22, 2026·10 min read

Database storage engines use direct I/O to eliminate double buffering, gain explicit control over write ordering, and manage memory without interference from the OS page cache.

Introduction

Most database storage engines face a fundamental tension with the operating system's memory management.
The OS page cache exists to transparently accelerate file I/O by keeping frequently accessed disk blocks in RAM.
For general-purpose applications, this is beneficial.
For database systems that implement their own buffer pool management, the OS page cache becomes a redundant, wasteful, and sometimes harmful intermediary.
Direct I/O is the mechanism by which a database bypasses this layer entirely, issuing reads and writes straight to the storage device.

Understanding when and why to use direct I/O is essential for anyone building or tuning storage engines.
The tradeoffs are real: you gain precise control over memory and write ordering, but you accept full responsibility for buffering, alignment, and caching yourself.

The OS Page Cache and Its Limitations for Databases

[Diagram: the database engine's buffer pool, the OS page cache, the kernel I/O path, and the block device, showing double buffering (duplicate pages in the buffer pool and page cache), eviction-policy mismatch, and where write-ordering ambiguity arises.]

When a process calls read() or write() on a file, the Linux kernel routes the operation through the page cache.
On a read, the kernel checks whether the requested pages are already cached in memory.
If not, it reads them from disk into the page cache, then copies them into the user-space buffer.
On a write, data flows from user space into the page cache, and the kernel flushes dirty pages to disk asynchronously (or on fsync()).

This design works well for most workloads, but it creates several problems for database systems:

Double buffering. A database with its own buffer pool (InnoDB, PostgreSQL's shared buffers, RocksDB's block cache) already maintains a carefully managed in-memory copy of hot pages.
When the OS also caches these pages, the same data occupies memory twice.
On a machine with 128 GB of RAM, you might configure a 96 GB buffer pool only to find the OS consuming another 30+ GB caching the same pages.
This is pure waste.

Eviction policy mismatch. The OS page cache typically uses a variant of LRU (or the more sophisticated CLOCK-Pro in some kernels).
Database buffer pools use specialized replacement policies tuned for database access patterns: LRU-K, ARC, CLOCK-sweep, or application-specific heuristics.
A full table scan can pollute the OS page cache with cold data, evicting pages that the database buffer pool considers hot; the kernel knows nothing about the database's notion of hotness.

Write ordering ambiguity. Correct crash recovery depends on controlling the order in which writes reach stable storage.
The page cache introduces a layer of indirection that complicates write ordering: the kernel is free to reorder writeback of dirty pages between fsync() calls.
With direct I/O, writes submitted before an fdatasync() call go directly to the device without passing through an intermediate cache layer, making it easier to reason about what is on stable storage at any point.
Note that ordering between concurrent writes still depends on the engine's own serialization logic and correct use of fdatasync() or write barriers — direct I/O alone does not eliminate all ordering concerns.

Memory pressure and unpredictable latency. Under memory pressure, the kernel may evict page cache entries, triggering sudden I/O latency spikes for operations the database expected to be memory-resident.
The database has no visibility into or control over these eviction decisions.

How Direct I/O Works

On Linux, direct I/O is enabled by opening a file with the O_DIRECT flag:

int fd = open("/data/tablespace.ibd", O_RDWR | O_DIRECT);
if (fd < 0) {
    /* handle error; EINVAL often means the filesystem rejects O_DIRECT */
}

When O_DIRECT is set, the kernel attempts to transfer data directly between the user-space buffer and the block device, bypassing the page cache entirely.
The key word is "attempts": the kernel does not guarantee that all caching is eliminated, but in practice on Linux with local filesystems (ext4, XFS), the page cache is fully bypassed.

Note: O_SYNC or O_DSYNC can be added at open time to make every write() call synchronous, but most database engines prefer to open with O_DIRECT alone and call fdatasync() explicitly after batching multiple writes.
This gives the engine more control over when flush overhead is incurred.

Alignment Requirements

[Diagram: buffer address, file offset, and transfer size relative to the device logical block size (e.g., 4096 B), contrasting an aligned case that succeeds with misaligned cases that return EINVAL.]

Direct I/O imposes strict alignment constraints.
The user-space buffer, the file offset, and the transfer size must all be aligned to the logical block size of the underlying device (typically 512 bytes, increasingly 4096 bytes for modern drives with 4K sectors).
Failure to meet these constraints results in EINVAL.

Database engines typically handle this by allocating I/O buffers with posix_memalign() or aligned_alloc():

void *buf;
int ret = posix_memalign(&buf, 4096, buffer_size);
if (ret != 0) {
    /* allocation failed; posix_memalign returns the error code directly */
}

Page-sized allocation (4096 bytes) satisfies alignment requirements on virtually all modern hardware.

Interaction with fsync

O_DIRECT alone does not guarantee durability.
It bypasses the page cache, but data may still reside in the drive's volatile write cache.
To ensure data has reached stable storage, the database must also call fdatasync() or use O_DSYNC/O_SYNC at open time.
The combination of O_DIRECT and fdatasync() is the standard pattern for database write paths.

Walkthrough

The following walkthrough illustrates how a storage engine performs a page write using direct I/O, contrasted with the buffered I/O path.

Buffered I/O Write Path

[Diagram: the buffered write path from page modification to fsync() return, showing data copies (user buffer to page cache to device), dirty marking, asynchronous writeback, and the non-deterministic ordering of device writes.]
1. Engine modifies page in buffer pool.
2. Engine calls write(fd, page_buf, PAGE_SIZE) at target offset.
3. Kernel copies page_buf into a page cache page (may allocate one).
4. Kernel marks page cache page as dirty.
5. write() returns to user space.
6. Engine calls fsync(fd).
7. Kernel iterates dirty page cache pages for this file.
8. Kernel issues device I/O for dirty pages (all dirty pages are
   flushed before fsync() returns, but the order in which they
   are issued to the device is not guaranteed).
9. Kernel waits for device acknowledgment.
10. fsync() returns.

Total copies: user buffer -> page cache -> device.
Memory consumed: buffer pool page + page cache page.

Direct I/O Write Path

1. Engine modifies page in buffer pool (buffer must be aligned).
2. Engine calls pwrite(fd, aligned_buf, PAGE_SIZE, aligned_offset).
3. Kernel builds I/O request directly from aligned_buf.
4. Kernel submits I/O to block device (bypasses page cache entirely).
5. Write completes; pwrite() returns.
6. Engine calls fdatasync(fd).
7. Kernel issues flush/FUA command to device.
8. Device acknowledges data on stable storage.
9. fdatasync() returns.

Total copies: user buffer -> device.
Memory consumed: buffer pool page only.
The engine has full control over which pages are written and in what order, provided it serializes writes correctly before calling fdatasync().

Note: The walkthrough uses pwrite() (offset-based) rather than write(). For direct I/O, pwrite() is preferred because it allows the engine to specify the file offset explicitly without relying on the file descriptor's current position, which simplifies concurrent I/O.

Pseudocode: Direct I/O Page Flush

function flush_dirty_page(page):
    assert page.buffer is aligned to BLOCK_SIZE
    assert page.offset is aligned to BLOCK_SIZE

    # Write the page
    bytes_written = pwrite(fd, page.buffer, PAGE_SIZE, page.offset)
    if bytes_written != PAGE_SIZE:
        raise IOError("short write")

    # Ensure durability
    ret = fdatasync(fd)
    if ret != 0:
        raise IOError("fdatasync failed")

    page.mark_clean()

In practice, engines batch multiple dirty pages before calling fdatasync() once, amortizing the flush cost.
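In pseudocode, a batched variant might look like this (using the same assumed page and fd structures as the flush above):

```
function flush_dirty_pages(pages):
    for page in pages:
        assert page.buffer is aligned to BLOCK_SIZE
        assert page.offset is aligned to BLOCK_SIZE
        bytes_written = pwrite(fd, page.buffer, PAGE_SIZE, page.offset)
        if bytes_written != PAGE_SIZE:
            raise IOError("short write")

    # One flush amortized across the whole batch
    if fdatasync(fd) != 0:
        raise IOError("fdatasync failed")

    for page in pages:
        page.mark_clean()
```

Pages are marked clean only after the single fdatasync() succeeds; marking them clean per-write would let a crash lose pages the engine believed durable.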

Who Uses Direct I/O (and Who Doesn't)

MySQL/InnoDB supports direct I/O via the innodb_flush_method=O_DIRECT setting, which is the recommended configuration for production.
InnoDB manages its own buffer pool and benefits from eliminating double caching.

PostgreSQL historically did not use direct I/O, relying on the OS page cache as a secondary cache below shared buffers.
Direct I/O support has been under active development across recent major versions (the work was in progress through PostgreSQL 16 and 17), motivated by the desire for better control over write ordering and the ability to use larger shared buffer configurations without double caching.
As of this writing, it has not yet shipped as a stable, production-ready feature.

RocksDB supports direct I/O for both reads and writes via options (use_direct_reads, use_direct_io_for_flush_and_compaction); LevelDB offers no equivalent setting.
Since compaction generates large sequential writes that would pollute the page cache, direct I/O is particularly valuable here.

SQLite generally does not use direct I/O, since it targets embedded/lightweight use cases where the OS page cache is the primary caching layer and the overhead of managing alignment and buffering is not justified.

Tradeoffs and Pitfalls

Direct I/O is not universally superior.
Several considerations apply:

Read-ahead disappears. The OS page cache implements automatic read-ahead for sequential access patterns.
With direct I/O, the engine must implement its own prefetching, or sequential scan performance will degrade significantly.
Many engines compensate with explicit asynchronous read-ahead (via io_uring or libaio); posix_fadvise() hints only help files accessed through buffered I/O, since they operate on the page cache that O_DIRECT bypasses.
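An engine-side read-ahead loop for a sequential scan can be sketched in pseudocode (submit_async_read and wait_for_read are hypothetical helpers wrapping io_uring or libaio; the window size of 8 is arbitrary):

```
function sequential_scan(file, num_pages, window = 8):
    # Keep `window` direct-I/O reads in flight ahead of the consumer
    for i in 0 .. min(window, num_pages) - 1:
        submit_async_read(file, i * PAGE_SIZE)

    for i in 0 .. num_pages - 1:
        page = wait_for_read(file, i * PAGE_SIZE)
        process(page)
        next = i + window
        if next < num_pages:
            submit_async_read(file, next * PAGE_SIZE)
```

This reproduces, in user space, the pipelining the kernel's read-ahead would otherwise have provided for free.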

Small I/O amplification. Because of alignment requirements, reading a 100-byte record may require reading an entire 4096-byte block.
With buffered I/O, repeated small reads to the same block would hit the page cache after the first access.
With direct I/O, each read goes to disk unless the engine's own cache handles it.

Filesystem metadata still goes through the page cache. Even with O_DIRECT, filesystem metadata operations (directory updates, inode modifications, extent allocation) are cached and flushed normally.
Direct I/O only affects data blocks.

Portability concerns. O_DIRECT semantics vary across operating systems and filesystems.
On macOS, the equivalent is fcntl(fd, F_NOCACHE, 1).
Some network filesystems do not support O_DIRECT at all.
OpenZFS on Linux has historically had incomplete O_DIRECT support, though this has improved significantly in recent releases.
Always verify behavior on the specific OS, filesystem, and kernel version in use.

Error handling complexity increases. When the page cache is involved, some transient device errors can be masked or retried by the kernel.
With direct I/O, errors propagate directly to user space, and the engine must handle partial writes, EINVAL from misalignment, and device-level errors explicitly.

Key Points

  • The OS page cache creates double buffering when a database maintains its own buffer pool, wasting memory that could be used for useful caching.
  • Direct I/O, enabled via O_DIRECT on Linux, bypasses the page cache so that data transfers occur directly between user-space buffers and the storage device.
  • Alignment of buffers, file offsets, and transfer sizes to the device block size is mandatory for direct I/O; violations produce EINVAL.
  • O_DIRECT does not guarantee durability on its own. It must be paired with fdatasync(), O_DSYNC, or equivalent mechanisms to ensure data reaches stable storage.
  • The engine becomes responsible for functionality the page cache previously provided: read-ahead, caching of small or repeated reads, and coalescing of writes.
  • Production database systems (InnoDB, RocksDB) use direct I/O by default or as a recommended setting; systems without their own buffer pool (SQLite) typically do not.
  • Write ordering control, a critical requirement for crash recovery, is improved with direct I/O because the kernel's asynchronous writeback of cached pages is removed from the equation — though the engine must still serialize its own writes correctly.

