
Direct I/O: Bypassing the OS Page Cache

April 22, 2026·10 min read

Database storage engines use direct I/O to eliminate double buffering, gain explicit control over write ordering, and manage memory without interference from the OS page cache.

Introduction

Most database storage engines face a fundamental tension with the operating system's memory management.
The OS page cache exists to transparently accelerate file I/O by keeping frequently accessed disk blocks in RAM.
For general-purpose applications, this is beneficial.
For database systems that implement their own buffer pool management, the OS page cache becomes a redundant, wasteful, and sometimes harmful intermediary.
Direct I/O is the mechanism by which a database bypasses this layer entirely, issuing reads and writes straight to the storage device.

Understanding when and why to use direct I/O is essential for anyone building or tuning storage engines.
The tradeoffs are real: you gain precise control over memory and write ordering, but you accept full responsibility for buffering, alignment, and caching yourself.

The OS Page Cache and Its Limitations for Databases

[Diagram: the database engine's buffer pool, the OS page cache, the kernel I/O path, and the block device, showing double buffering (duplicate pages in the buffer pool and page cache), eviction-policy mismatch, and where write-ordering ambiguity arises.]

When a process calls read() or write() on a file, the Linux kernel routes the operation through the page cache.
On a read, the kernel checks whether the requested pages are already cached in memory.
If not, it reads them from disk into the page cache, then copies them into the user-space buffer.
On a write, data flows from user space into the page cache, and the kernel flushes dirty pages to disk asynchronously (or on fsync()).

This design works well for most workloads, but it creates several problems for database systems:

Double buffering. A database with its own buffer pool (InnoDB, PostgreSQL's shared buffers, RocksDB's block cache) already maintains a carefully managed in-memory copy of hot pages.
When the OS also caches these pages, the same data occupies memory twice.
On a machine with 128 GB of RAM, you might configure a 96 GB buffer pool only to find the OS consuming another 30+ GB caching the same pages.
This is pure waste.

Eviction policy mismatch. The OS page cache typically uses a variant of LRU (or the more sophisticated CLOCK-Pro in some kernels).
Database buffer pools use specialized replacement policies tuned for database access patterns: LRU-K, ARC, CLOCK-sweep, or application-specific heuristics.
A full table scan can pollute the OS page cache with cold data, evicting pages that the database buffer pool considers hot; the kernel knows nothing about the database's notion of hotness.

Write ordering ambiguity. Correct crash recovery depends on controlling the order in which writes reach stable storage.
The page cache introduces a layer of indirection that complicates write ordering: the kernel is free to reorder writeback of dirty pages between fsync() calls.
With direct I/O, writes submitted before an fdatasync() call go directly to the device without passing through an intermediate cache layer, making it easier to reason about what is on stable storage at any point.
Note that ordering between concurrent writes still depends on the engine's own serialization logic and correct use of fdatasync() or write barriers — direct I/O alone does not eliminate all ordering concerns.

Memory pressure and unpredictable latency. Under memory pressure, the kernel may evict page cache entries, triggering sudden I/O latency spikes for operations the database expected to be memory-resident.
The database has no visibility into or control over these eviction decisions.

How Direct I/O Works

On Linux, direct I/O is enabled by opening a file with the O_DIRECT flag:

int fd = open("/data/tablespace.ibd", O_RDWR | O_DIRECT);
if (fd < 0) {
    /* handle error; EINVAL often means the filesystem rejects O_DIRECT */
}

When O_DIRECT is set, the kernel attempts to transfer data directly between the user-space buffer and the block device, bypassing the page cache entirely.
The key word is "attempts": the kernel does not guarantee that all caching is eliminated, but in practice on Linux with local filesystems (ext4, XFS), the page cache is fully bypassed.

Note: O_SYNC or O_DSYNC can be added at open time to make every write() call synchronous, but most database engines prefer to open with O_DIRECT alone and call fdatasync() explicitly after batching multiple writes.
This gives the engine more control over when flush overhead is incurred.

Alignment Requirements

[Diagram: buffer address, file offset, and transfer size relative to the device logical block size (e.g., 4096 B), contrasting an aligned case that succeeds with misaligned cases that return EINVAL.]

Direct I/O imposes strict alignment constraints.
The user-space buffer, the file offset, and the transfer size must all be aligned to the logical block size of the underlying device (typically 512 bytes, increasingly 4096 bytes for modern drives with 4K sectors).
Failure to meet these constraints results in EINVAL.

Database engines typically handle this by allocating I/O buffers with posix_memalign() or aligned_alloc():

void *buf;
int ret = posix_memalign(&buf, 4096, buffer_size);
if (ret != 0) {
    /* allocation failed; posix_memalign returns the error code directly */
}

Page-sized allocation (4096 bytes) satisfies alignment requirements on virtually all modern hardware.

Interaction with fsync

O_DIRECT alone does not guarantee durability.
It bypasses the page cache, but data may still reside in the drive's volatile write cache.
To ensure data has reached stable storage, the database must also call fdatasync() or use O_DSYNC/O_SYNC at open time.
The combination of O_DIRECT and fdatasync() is the standard pattern for database write paths.

Walkthrough

The following walkthrough illustrates how a storage engine performs a page write using direct I/O, contrasted with the buffered I/O path.

Buffered I/O Write Path

[Diagram: the buffered write path from page modification to fsync() return, showing data copies (user buffer to page cache to device), dirty marking, asynchronous writeback, and the non-deterministic ordering of device writes.]
1. Engine modifies page in buffer pool.
2. Engine calls write(fd, page_buf, PAGE_SIZE) at target offset.
3. Kernel copies page_buf into a page cache page (may allocate one).
4. Kernel marks page cache page as dirty.
5. write() returns to user space.
6. Engine calls fsync(fd).
7. Kernel iterates dirty page cache pages for this file.
8. Kernel issues device I/O for dirty pages (all dirty pages are
   flushed before fsync() returns, but the order in which they
   are issued to the device is not guaranteed).
9. Kernel waits for device acknowledgment.
10. fsync() returns.

Total copies: user buffer -> page cache -> device.
Memory consumed: buffer pool page + page cache page.

Direct I/O Write Path

1. Engine modifies page in buffer pool (buffer must be aligned).
2. Engine calls pwrite(fd, aligned_buf, PAGE_SIZE, aligned_offset).
3. Kernel builds I/O request directly from aligned_buf.
4. Kernel submits I/O to block device (bypasses page cache entirely).
5. Write completes; pwrite() returns.
6. Engine calls fdatasync(fd).
7. Kernel issues flush/FUA command to device.
8. Device acknowledges data on stable storage.
9. fdatasync() returns.

Total copies: user buffer -> device.
Memory consumed: buffer pool page only.
The engine has full control over which pages are written and in what order, provided it serializes writes correctly before calling fdatasync().

Note: The walkthrough uses pwrite() (offset-based) rather than write(). For direct I/O, pwrite() is preferred because it allows the engine to specify the file offset explicitly without relying on the file descriptor's current position, which simplifies concurrent I/O.

Pseudocode: Direct I/O Page Flush

function flush_dirty_page(page):
    assert page.buffer is aligned to BLOCK_SIZE
    assert page.offset is aligned to BLOCK_SIZE

    # Write the page
    bytes_written = pwrite(fd, page.buffer, PAGE_SIZE, page.offset)
    if bytes_written != PAGE_SIZE:
        raise IOError("short write")

    # Ensure durability
    ret = fdatasync(fd)
    if ret != 0:
        raise IOError("fdatasync failed")

    page.mark_clean()

In practice, engines batch multiple dirty pages before calling fdatasync() once, amortizing the flush cost.
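In pseudocode, a batched variant might look like this (using the same assumed page and fd structures as the flush above):

```
function flush_dirty_pages(pages):
    for page in pages:
        assert page.buffer is aligned to BLOCK_SIZE
        assert page.offset is aligned to BLOCK_SIZE
        bytes_written = pwrite(fd, page.buffer, PAGE_SIZE, page.offset)
        if bytes_written != PAGE_SIZE:
            raise IOError("short write")

    # One flush amortized across the whole batch
    if fdatasync(fd) != 0:
        raise IOError("fdatasync failed")

    for page in pages:
        page.mark_clean()
```

Pages are marked clean only after the single fdatasync() succeeds; marking them clean per-write would let a crash lose pages the engine believed durable.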

Who Uses Direct I/O (and Who Doesn't)

MySQL/InnoDB supports direct I/O via the innodb_flush_method=O_DIRECT setting, which is the recommended configuration for production.
InnoDB manages its own buffer pool and benefits from eliminating double caching.

PostgreSQL historically did not use direct I/O, relying on the OS page cache as a secondary cache below shared buffers.
Direct I/O support has been under active development across recent major versions (the work was in progress through PostgreSQL 16 and 17), motivated by the desire for better control over write ordering and the ability to use larger shared buffer configurations without double caching.
As of this writing, it has not yet shipped as a stable, production-ready feature.

RocksDB supports direct I/O for both reads and writes via options (use_direct_reads, use_direct_io_for_flush_and_compaction); LevelDB offers no equivalent setting.
Since compaction generates large sequential writes that would pollute the page cache, direct I/O is particularly valuable here.

SQLite generally does not use direct I/O, since it targets embedded/lightweight use cases where the OS page cache is the primary caching layer and the overhead of managing alignment and buffering is not justified.

Tradeoffs and Pitfalls

Direct I/O is not universally superior.
Several considerations apply:

Read-ahead disappears. The OS page cache implements automatic read-ahead for sequential access patterns.
With direct I/O, the engine must implement its own prefetching, or sequential scan performance will degrade significantly.
Many engines compensate with explicit asynchronous read-ahead (via io_uring or libaio); posix_fadvise() hints only help files accessed through buffered I/O, since they operate on the page cache that O_DIRECT bypasses.
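An engine-side read-ahead loop for a sequential scan can be sketched in pseudocode (submit_async_read and wait_for_read are hypothetical helpers wrapping io_uring or libaio; the window size of 8 is arbitrary):

```
function sequential_scan(file, num_pages, window = 8):
    # Keep `window` direct-I/O reads in flight ahead of the consumer
    for i in 0 .. min(window, num_pages) - 1:
        submit_async_read(file, i * PAGE_SIZE)

    for i in 0 .. num_pages - 1:
        page = wait_for_read(file, i * PAGE_SIZE)
        process(page)
        next = i + window
        if next < num_pages:
            submit_async_read(file, next * PAGE_SIZE)
```

This reproduces, in user space, the pipelining the kernel's read-ahead would otherwise have provided for free.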

Small I/O amplification. Because of alignment requirements, reading a 100-byte record may require reading an entire 4096-byte block.
With buffered I/O, repeated small reads to the same block would hit the page cache after the first access.
With direct I/O, each read goes to disk unless the engine's own cache handles it.

Filesystem metadata still goes through the page cache. Even with O_DIRECT, filesystem metadata operations (directory updates, inode modifications, extent allocation) are cached and flushed normally.
Direct I/O only affects data blocks.

Portability concerns. O_DIRECT semantics vary across operating systems and filesystems.
On macOS, the equivalent is fcntl(fd, F_NOCACHE, 1).
Some network filesystems do not support O_DIRECT at all.
OpenZFS on Linux has historically had incomplete O_DIRECT support, though this has improved significantly in recent releases.
Always verify behavior on the specific OS, filesystem, and kernel version in use.

Error handling complexity increases. When the page cache is involved, some transient device errors can be masked or retried by the kernel.
With direct I/O, errors propagate directly to user space, and the engine must handle partial writes, EINVAL from misalignment, and device-level errors explicitly.

Key Points

  • The OS page cache creates double buffering when a database maintains its own buffer pool, wasting memory that could be used for useful caching.
  • Direct I/O, enabled via O_DIRECT on Linux, bypasses the page cache so that data transfers occur directly between user-space buffers and the storage device.
  • Alignment of buffers, file offsets, and transfer sizes to the device block size is mandatory for direct I/O; violations produce EINVAL.
  • O_DIRECT does not guarantee durability on its own. It must be paired with fdatasync(), O_DSYNC, or equivalent mechanisms to ensure data reaches stable storage.
  • The engine becomes responsible for functionality the page cache previously provided: read-ahead, caching of small or repeated reads, and coalescing of writes.
  • Production database systems (InnoDB, RocksDB) use direct I/O by default or as a recommended setting; systems without their own buffer pool (SQLite) typically do not.
  • Write ordering control, a critical requirement for crash recovery, is improved with direct I/O because the kernel's asynchronous writeback of cached pages is removed from the equation — though the engine must still serialize its own writes correctly.

