
In-Memory OLTP: How Durability is Achieved for Memory-Optimized Tables

In-Memory OLTP provides full durability for memory-optimized tables. When a transaction that has made changes to a memory-optimized table commits, SQL Server guarantees, just as it does for disk-based tables, that the changes are permanent and can survive a database restart, provided the underlying storage is available.

Durability has two key components: transaction logging and persisting data changes to on-disk storage. Let us look at each of these in the context of disk-based and memory-optimized tables.

Transaction Logging: All changes made to disk-based tables or durable memory-optimized tables are captured in one or more transaction log records. When a transaction commits, SQL Server flushes all the log records associated with the transaction to disk before communicating to the application or user session that the transaction has committed. This guarantees that the changes made on behalf of the transaction are durable. Transaction logging for memory-optimized tables differs from that for disk-based tables in the following key ways.

  • Transaction log records are generated only after a transaction begins the process of committing. This makes logging efficient, since nothing needs to be logged in case the transaction has to be rolled back. In other words, no UNDO information is logged for memory-optimized tables.
  • Changes to indexes on memory-optimized tables are not logged. These indexes exist only in memory; they are not persisted and are re-generated when the database is restarted. This can reduce the transaction log overhead significantly.
  • Multiple changes are grouped into one large log record (maximum size 24 KB at this time, subject to change in the future). For example, if a transaction has inserted 1000 rows and 50 rows fit into one large log record, a total of 20 log records are generated for the memory-optimized table. Contrast this with a disk-based table, which would have generated at least 1000 log records for the data rows alone, plus additional records for index maintenance. Fewer log records minimize the log-header overhead and effectively reduce contention on the log buffer.

These differences make logging for memory-optimized tables more efficient while still guaranteeing full durability.
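
As a rough illustration, the effect of this logging scheme can be observed with the undocumented sys.fn_dblog function. The sketch below assumes a transaction against a durable memory-optimized table has just committed; the exact operation names vary by build, but In-Memory OLTP log records typically show up under their own operation type (for example LOP_HK), and there are far fewer of them than for an equivalent change to a disk-based table.

-- Group the records in the active portion of the log by operation type
SELECT [Operation], COUNT(*) AS log_record_count
FROM sys.fn_dblog(NULL, NULL)
GROUP BY [Operation]
ORDER BY log_record_count DESC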

Persisting changes to storage: The transaction log records all the changes that have ever happened in a database since it was created, and could in principle be used to re-construct the database, assuming the transaction log has never been truncated. However, this would be very inefficient, because SQL Server would need to apply all the transaction log records sequentially since the database was created, making the recovery time (RTO) unacceptably high. To avoid this, SQL Server and other database systems take periodic checkpoints that flush changes to durable storage and allow the transaction log to be truncated, which reduces the amount of log that must be replayed after a crash. Before describing how data is persisted for memory-optimized tables, let us first look at how data is persisted for disk-based tables and what the performance implications are.

  • Disk-Based Tables: Traditional relational database servers in use today were designed at a time when the size of memory was much smaller than the size of the database. In this architecture, the data is organized into pages, and these pages are the unit of IO. SQL Server uses a page size of 8 KB. Each table has its own set of data and index pages, and these pages are not shared with other tables. When a row is changed, the change is first logged and then the data and/or index page is updated. SQL Server enforces write-ahead logging (WAL) to make sure that the log record is persisted before the data or index page; without WAL, there would be no way to UNDO changes to a page if needed. Over time, as transactions commit, the list of dirty pages in the buffer pool grows, depending on the memory available. Periodically, an automatic checkpoint operation flushes all the dirty pages to disk. The checkpoint ensures that the data/index pages containing changes from all transactions committed before the checkpoint started are persisted to the data storage disk. The checkpoint also trims the 'active' portion of the transaction log, the part of the log that needs to be applied when the database is restarted.

Performance Bottleneck: Access to data/index pages generates random IO, which reduces the IOPS available from rotating media. The checkpoint operation can also cause a burst of high IO activity that negatively impacts the production workload. With the indirect checkpoint feature available in SQL Server 2012, the amount of IO done as part of a checkpoint is reduced, but the IO is still random.
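
For reference, indirect checkpoint is enabled per database by setting a target recovery time (the database name below is only a placeholder):

-- Spread dirty-page writes out over time instead of bursting them at checkpoint
ALTER DATABASE SalesDB SET TARGET_RECOVERY_TIME = 60 SECONDS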

  • Memory-Optimized Tables: The data in memory-optimized tables is stored as free-form data rows that are linked through one or more in-memory indexes. There are no page structures for data rows like there are for disk-based tables. When rows are changed as part of a transaction, new row versions are created, but no transaction log records are generated at this time. WAL is not an issue here, since the in-memory data is simply lost if the database is restarted. When the application signals to commit the transaction, the In-Memory OLTP engine validates that the transaction can be committed (i.e., there are no validation failures) and then generates the log records for the transaction. At this point, the changes for the committed transaction exist only in the transaction log records.
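
A minimal sketch of this commit-time logging is shown below. The table dbo.SessionState is hypothetical, and the SNAPSHOT table hint is used because interpreted T-SQL in an explicit transaction needs an isolation hint (or the MEMORY_OPTIMIZED_ELEVATE_TO_SNAPSHOT database option) when accessing memory-optimized tables.

BEGIN TRANSACTION
-- The insert below only creates new in-memory row versions;
-- nothing has been written to the transaction log yet
INSERT INTO dbo.SessionState WITH (SNAPSHOT) (SessionId, LastSeen)
VALUES (42, SYSUTCDATETIME())
-- Validation happens at commit; only then are log records generated and flushed
COMMIT TRANSACTION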

The persistence of memory-optimized tables is done with a set of data and delta files written by a background thread (described later). These files are located in one or more containers, leveraging the same mechanism as used for FILESTREAM data. These containers are mapped to a new type of filegroup, called a memory-optimized filegroup. For example, the following commands create the database ContosoOLTP and then add a memory-optimized filegroup with one container to it.

CREATE DATABASE ContosoOLTP
ON PRIMARY (NAME = [contoso_data], FILENAME = 'C:\data\contoso_data.mdf', SIZE = 100MB)
LOG ON (NAME = [contoso_log], FILENAME = 'C:\data\contoso_log.ldf', SIZE = 100MB)

-- Enable the database for memory-optimized tables:
-- add a memory_optimized_data filegroup
ALTER DATABASE ContosoOLTP
ADD FILEGROUP contoso_mod CONTAINS MEMORY_OPTIMIZED_DATA

-- add a container to the filegroup
ALTER DATABASE ContosoOLTP
ADD FILE (NAME = 'contoso_mod', FILENAME = 'c:\data\contoso_mod')
TO FILEGROUP contoso_mod
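
Once the container is in place, durable memory-optimized tables can be created in the database. The following is an illustrative sketch (table and column names are made up); DURABILITY = SCHEMA_AND_DATA is what makes the table fully durable, while SCHEMA_ONLY would skip both logging and on-disk persistence.

USE ContosoOLTP

-- A durable memory-optimized table with an in-memory hash index on the primary key
CREATE TABLE dbo.ShoppingCart
(
    CartId INT NOT NULL
        PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 1000000),
    UserId INT NOT NULL,
    CreatedDate DATETIME2 NOT NULL
)
WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA)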

As indicated earlier, there are two kinds of files: data files and delta files.

Data File

A data file contains rows from one or more memory-optimized tables, inserted by multiple transactions as part of INSERT or UPDATE operations. For example, one row can be from memory-optimized table T1 and the next row from table T2. Each data file is approximately 128 MB in size. Rows are only ever appended to the data file, exploiting sequential IO. Once a data file is full, the rows inserted by new transactions are stored in another data file. Over time, the rows from durable memory-optimized tables are stored across one or more data files, each containing rows from a disjoint but contiguous range of transactions. For example, a data file with a transaction commit timestamp range of (100, 200) contains all the rows inserted by transactions with a commit timestamp in this range. The commit timestamp is a monotonically increasing number assigned to a transaction when it is ready to commit; each transaction has a unique commit timestamp.

When a row is deleted or updated by a later transaction, the row is not removed or changed in place in the data file; instead, the deleted row is tracked in another type of file, the 'delta' file. Update operations are processed as a delete followed by an insert of the row. This eliminates random IO on the data file.

Delta File

Each data file is paired with a delta file that covers the same transaction range and tracks the deleted rows that were inserted by transactions in that range. For example, the delta file corresponding to transaction range (100, 200) stores deleted rows that were originally inserted by transactions in the range (100, 200). Like data files, delta files are accessed sequentially.
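
The data and delta files backing a database can be inspected from T-SQL through the sys.dm_db_xtp_checkpoint_files DMV; treat the exact column names below as an assumption, since they can differ between releases.

-- List the checkpoint files, showing whether each is a DATA or DELTA file and its state
SELECT file_type_desc, state_desc, file_size_in_bytes
FROM sys.dm_db_xtp_checkpoint_files
ORDER BY file_type_desc, state_desc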

Populating the Data and Delta Files

User transactions don't actually append to the data or delta files. This is done by a background thread called the offline checkpoint worker. This thread reads the transaction log records generated by committed transactions on memory-optimized tables and appends the inserted and deleted rows to the appropriate data and delta files. Unlike disk-based tables, where data/index pages are flushed with random IO when a checkpoint is done, the persistence of memory-optimized tables is a continuous background operation.

Checkpoint for Memory-Optimized Tables

The checkpoint for memory-optimized tables is done when the transaction log grows larger than an internal threshold (currently set to 1 GB) since the last checkpoint. The checkpoint operation is completed with the following steps:

    • All buffered writes are flushed to the data and delta files.
    • A checkpoint inventory is constructed that includes descriptors for all files from the previous checkpoint plus any files added by the current checkpoint. The inventory is hardened to durable storage.
    • The location of the inventory is stored in the transaction log so that it is available at recovery time.
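
This checkpoint normally happens automatically based on the log-growth threshold described above, but when experimenting it can also be triggered on demand with the standard T-SQL CHECKPOINT statement:

-- Force a checkpoint in the current database; for a database with durable
-- memory-optimized tables this also drives the checkpoint described above
CHECKPOINT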

During crash recovery, SQL Server locates the last completed checkpoint from the transaction log, loads the data and delta files, and then applies the active part of the transaction log to bring the memory-optimized tables to the current point in time.

For more information, download SQL Server CTP1 and get started today, or see more blogs in the series introduction and index here.