
Practical Tips on Reducing SQL Server Database Table Size

In this article, we will take a closer look at improving performance when working with database tables. Most likely, you have already come across this topic in various resources on database development.


Nevertheless, it tends to become a major problem under continuous data growth: tables become too large, which leads to performance loss.

This happens because of a poorly designed database schema that was not originally intended to handle large volumes of data.

To avoid performance loss under constant data growth, you should stick to certain rules when designing a database schema.

Rule # 1 — Minimize Data Type Redundancy

The basic unit of data storage in SQL Server is the page. The disk space allocated to a data file in a database is logically divided into pages numbered contiguously from 0 to n. The page size in SQL Server is 8 KB, which means SQL Server databases have 128 pages per megabyte.

Disk I/O operations are performed at the page level, that is, SQL Server reads or writes whole data pages. The more compact the data types, the fewer pages are needed to store the data, and consequently fewer I/O operations are required.

The buffer pool in SQL Server significantly improves I/O throughput. Its main purpose is to reduce the number of I/O operations against the database files and to shorten the response time for data retrieval.
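If you want to see how the buffer pool is currently used, a query like the one below (a sketch of mine, not part of the original article) shows how many cached pages each database holds:

SELECT
database_name = DB_NAME(database_id)
, cached_pages = COUNT(*)
, cached_size_mb = CAST(COUNT(*) * 8. / 1024 AS DECIMAL(10,2))
FROM sys.dm_os_buffer_descriptors -- one row per 8 KB page in the buffer pool
GROUP BY database_id
ORDER BY cached_pages DESC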

Thus, when compact data types are used, the buffer pool holds a larger amount of data on the same number of pages. As a result, you do not waste memory, and the number of logical reads goes down.

Consider the following example: a table that stores the working days of employees.

CREATE TABLE dbo.WorkOut1 (
DateOut DATETIME
, EmployeeID BIGINT
, WorkShiftCD NVARCHAR(10)
, WorkHours DECIMAL(24,2)
, CONSTRAINT PK_WorkOut1 PRIMARY KEY (DateOut, EmployeeID)
)

Are the chosen data types correct? Most likely, no. It is hardly possible that an enterprise employs 2^63 - 1 people, so BIGINT is an unsuitable data type in this case.

We can remove this redundancy and measure the query execution time.

CREATE TABLE dbo.WorkOut2 (
DateOut SMALLDATETIME
, EmployeeID INT
, WorkShiftCD VARCHAR(10)
, WorkHours DECIMAL(8,2)
, CONSTRAINT PK_WorkOut2 PRIMARY KEY (DateOut, EmployeeID)
)
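For a rough estimate of the gain, compare the maximum per-row size of the two tables (fixed-length types plus the variable-length maximums, ignoring row overhead): WorkOut1 takes up to 8 (DATETIME) + 8 (BIGINT) + 20 (NVARCHAR(10)) + 13 (DECIMAL(24,2)) = 49 bytes per row, while WorkOut2 takes up to 4 (SMALLDATETIME) + 4 (INT) + 10 (VARCHAR(10)) + 5 (DECIMAL(8,2)) = 23 bytes, that is, less than half.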

The following execution plan demonstrates the difference in cost, which depends on the row size and the expected number of rows.

[execution plan screenshot]

The less data you need to retrieve, the faster the query runs.

(3492294 row(s) affected)

SQL Server Execution Times:
CPU time = 1919 ms, elapsed time = 33606 ms.

(3492294 row(s) affected)

SQL Server Execution Times:
CPU time = 1420 ms, elapsed time = 29694 ms.

As you can see, using non-redundant data types is a keystone of good query performance, and it also allows you to reduce the size of problem tables. By the way, you can execute the following query to measure a table's size:

SELECT
table_name = SCHEMA_NAME(o.[schema_id]) + '.' + o.name
, data_size_mb = CAST(do.pages * 8. / 1024 AS DECIMAL(8,4))
FROM sys.objects o
JOIN (
SELECT
p.[object_id]
, total_rows = SUM(p.[rows])
, total_pages = SUM(a.total_pages)
, used_pages = SUM(a.used_pages)
, pages = SUM(
CASE
WHEN it.internal_type IN (202, 204, 207, 211, 212, 213, 214, 215, 216, 221, 222) THEN 0
WHEN a.[type] != 1 AND p.index_id < 2 THEN a.used_pages
WHEN p.index_id < 2 THEN a.data_pages ELSE 0
END
)
FROM sys.partitions p
JOIN sys.allocation_units a ON p.[partition_id] = a.container_id
LEFT JOIN sys.internal_tables it ON p.[object_id] = it.[object_id]
GROUP BY p.[object_id]
) do ON o.[object_id] = do.[object_id]
WHERE o.[type] = 'U'

For the tables considered above, the query returns the following results:

table_name           data_size_mb
-------------------- ------------
dbo.WorkOut1         167.2578
dbo.WorkOut2         97.1250

Rule # 2 — Use Database Normalization and Avoid Data Duplication

Recently, I analyzed the database of a free web service for formatting T-SQL code. The server-side part there is quite simple and consists of a single table:

CREATE TABLE dbo.format_history (
session_id BIGINT
, format_date DATETIME
, format_options XML
)

Every time SQL code was formatted, the following parameters were saved to the database: the current session ID, the server time, and the settings applied while formatting the user's SQL code.

This data was subsequently used to determine the most popular formatting styles, which were planned to be added to the default formatting styles of SQL Complete.

However, the rise in the service's popularity led to a significant increase in table rows, and processing the profiles became slow. The settings had the following XML structure:

<FormatProfile>
  <FormatOptions>
    <PropertyValue Name="Select_SelectList_IndentColumnList">true</PropertyValue>
    <PropertyValue Name="Select_SelectList_SingleLineColumns">false</PropertyValue>
    <PropertyValue Name="Select_SelectList_StackColumns">true</PropertyValue>
    <PropertyValue Name="Select_SelectList_StackColumnsMode">1</PropertyValue>
    <PropertyValue Name="Select_Into_LineBreakBeforeInto">true</PropertyValue>
    ...
    <PropertyValue Name="UnionExceptIntersect_LineBreakBeforeUnion">true</PropertyValue>
    <PropertyValue Name="UnionExceptIntersect_LineBreakAfterUnion">true</PropertyValue>
    <PropertyValue Name="UnionExceptIntersect_IndentKeyword">true</PropertyValue>
    <PropertyValue Name="UnionExceptIntersect_IndentSubquery">false</PropertyValue>
    ...
  </FormatOptions>
</FormatProfile>

There are 450 formatting options in total, and each row takes about 33 KB in the table. The daily data growth exceeds 100 MB. As an obvious outcome, the database was expanding day by day, making data analysis ever more complicated.

Surprisingly, the solution turned out to be quite simple: all unique profiles were placed into a separate table, with a hash defined for every set of options. Starting with SQL Server 2008, you can use the sys.fn_repl_hash_binary function for this.

So the database schema was normalized:

CREATE TABLE dbo.format_profile (
format_hash BINARY(16) PRIMARY KEY
, format_profile XML NOT NULL
)
CREATE TABLE dbo.format_history (
session_id BIGINT
, format_date SMALLDATETIME
, format_hash BINARY(16) NOT NULL
, CONSTRAINT PK_format_history PRIMARY KEY CLUSTERED (session_id, format_date)
)
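The article does not show the data migration itself. A possible sketch (the statements below are my assumption, not the original code) is to deduplicate the profiles by hash and then rewrite the history:

-- 1. Store every distinct profile once, keyed by its hash.
;WITH src AS (
SELECT
fh.format_options
, hsh = CAST(sys.fn_repl_hash_binary(CAST(fh.format_options AS VARBINARY(MAX))) AS BINARY(16))
FROM SQLF.dbo.format_history fh
)
INSERT INTO SQLF_v2.dbo.format_profile (format_hash, format_profile)
SELECT hsh, format_options
FROM (
SELECT
hsh
, format_options
, rn = ROW_NUMBER() OVER (PARTITION BY hsh ORDER BY (SELECT NULL))
FROM src
) t
WHERE t.rn = 1

-- 2. Rewrite the history with the 16-byte hash instead of the XML blob.
-- Note: rows collapsing to the same (session_id, format_date) pair after
-- the SMALLDATETIME conversion would violate the primary key and need extra handling.
INSERT INTO SQLF_v2.dbo.format_history (session_id, format_date, format_hash)
SELECT
fh.session_id
, CAST(fh.format_date AS SMALLDATETIME)
, CAST(sys.fn_repl_hash_binary(CAST(fh.format_options AS VARBINARY(MAX))) AS BINARY(16))
FROM SQLF.dbo.format_history fh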

The previous query looked like this:

SELECT fh.session_id, fh.format_date, fh.format_options
FROM SQLF.dbo.format_history fh

The new schema requires a JOIN to retrieve the same data:

SELECT fh.session_id, fh.format_date, fp.format_profile
FROM SQLF_v2.dbo.format_history fh
JOIN SQLF_v2.dbo.format_profile fp ON fh.format_hash = fp.format_hash

If we compare the execution times of the two queries, we can hardly see any advantage from the schema change.

(3090 row(s) affected)

SQL Server Execution Times:
CPU time = 203 ms, elapsed time = 4698 ms.

(3090 row(s) affected)

SQL Server Execution Times:
CPU time = 125 ms, elapsed time = 4479 ms.

But in this case, the goal was to reduce the analysis time. Previously, we had to write an intricate query to get the list of the most popular formatting profiles:

;WITH cte AS (
SELECT
fh.format_options
, hsh = sys.fn_repl_hash_binary(CAST(fh.format_options AS VARBINARY(MAX)))
, rn = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM SQLF.dbo.format_history fh
)
SELECT c2.format_options, c1.cnt
FROM (
SELECT TOP (10) hsh, rn = MIN(rn), cnt = COUNT(1)
FROM cte
GROUP BY hsh
ORDER BY cnt DESC
) c1
JOIN cte c2 ON c1.rn = c2.rn
ORDER BY c1.cnt DESC

Now, thanks to the data normalization, the query became much simpler:

SELECT
fp.format_profile
, t.cnt
FROM (
SELECT TOP (10)
fh.format_hash
, cnt = COUNT(1)
FROM SQLF_v2.dbo.format_history fh
GROUP BY fh.format_hash
ORDER BY cnt DESC
) t
JOIN SQLF_v2.dbo.format_profile fp ON t.format_hash = fp.format_hash

The query execution time decreased as well:

(10 row(s) affected)

SQL Server Execution Times:
CPU time = 2684 ms, elapsed time = 2774 ms.

(10 row(s) affected)

SQL Server Execution Times:
CPU time = 15 ms, elapsed time = 379 ms.

In addition, the database size has decreased:

database_name    row_size_mb
---------------- -----------
SQLF             123.50
SQLF_v2          7.88

To retrieve the data file sizes, the following query can be used:

SELECT
database_name = DB_NAME(database_id)
, row_size_mb = CAST(SUM(CASE WHEN type_desc = 'ROWS' THEN size END) * 8. / 1024 AS DECIMAL(8,2))
FROM sys.master_files
WHERE database_id IN (DB_ID('SQLF'), DB_ID('SQLF_v2'))
GROUP BY database_id

I hope I have managed to demonstrate the importance of data normalization.

Rule # 3 — Be Careful While Selecting Indexed Columns

An index is an on-disk structure associated with a table or view that speeds up the retrieval of rows. Indexes are stored on pages, so the fewer pages an index requires, the faster the search is. It is extremely important to be careful when selecting clustered index columns, because all the clustered index key columns are included in every non-clustered index. Because of this, the database size can increase dramatically.
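As a hypothetical illustration (these tables are not from the article), compare a wide clustered key with a narrow one; every non-clustered index row silently carries the entire clustered key:

-- Wide clustered key: IX_OrdersWide_Date also stores the
-- UNIQUEIDENTIFIER + NVARCHAR(100) key in every index row.
CREATE TABLE dbo.OrdersWide (
OrderGUID UNIQUEIDENTIFIER NOT NULL
, CustomerName NVARCHAR(100) NOT NULL
, OrderDate DATETIME NOT NULL
, CONSTRAINT PK_OrdersWide PRIMARY KEY CLUSTERED (OrderGUID, CustomerName)
)
CREATE NONCLUSTERED INDEX IX_OrdersWide_Date ON dbo.OrdersWide (OrderDate)

-- Narrow clustered key: the same non-clustered index carries only a 4-byte INT.
CREATE TABLE dbo.OrdersNarrow (
OrderID INT NOT NULL
, CustomerName NVARCHAR(100) NOT NULL
, OrderDate DATETIME NOT NULL
, CONSTRAINT PK_OrdersNarrow PRIMARY KEY CLUSTERED (OrderID)
)
CREATE NONCLUSTERED INDEX IX_OrdersNarrow_Date ON dbo.OrdersNarrow (OrderDate)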

Rule # 4 — Use Consolidated Tables

There is no need to execute a complex query against a large table every time. Instead, you can execute a simple query against a small, pre-aggregated table.

For instance, suppose we have the following consolidation query:

SELECT
WorkOutID
, CE = SUM(CASE WHEN WorkKeyCD = 'CE' THEN Value END)
, DE = SUM(CASE WHEN WorkKeyCD = 'DE' THEN Value END)
, RE = SUM(CASE WHEN WorkKeyCD = 'RE' THEN Value END)
, FD = SUM(CASE WHEN WorkKeyCD = 'FD' THEN Value END)
, TR = SUM(CASE WHEN WorkKeyCD = 'TR' THEN Value END)
, FF = SUM(CASE WHEN WorkKeyCD = 'FF' THEN Value END)
, PF = SUM(CASE WHEN WorkKeyCD = 'PF' THEN Value END)
, QW = SUM(CASE WHEN WorkKeyCD = 'QW' THEN Value END)
, FH = SUM(CASE WHEN WorkKeyCD = 'FH' THEN Value END)
, UH = SUM(CASE WHEN WorkKeyCD = 'UH' THEN Value END)
, NU = SUM(CASE WHEN WorkKeyCD = 'NU' THEN Value END)
, CS = SUM(CASE WHEN WorkKeyCD = 'CS' THEN Value END)
FROM dbo.WorkOutFactor
WHERE Value > 0
GROUP BY WorkOutID

If there is no need to change the table data often, we can save the aggregation result into a separate table and query it directly:

SELECT *
FROM dbo.WorkOutFactorCache
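The article does not show how dbo.WorkOutFactorCache is built. One possible sketch (my assumption) is to materialize the consolidation query with SELECT ... INTO and refresh the table on a schedule:

-- Materialize the aggregation once instead of computing it on every query.
SELECT
WorkOutID
, CE = SUM(CASE WHEN WorkKeyCD = 'CE' THEN Value END)
, DE = SUM(CASE WHEN WorkKeyCD = 'DE' THEN Value END)
, RE = SUM(CASE WHEN WorkKeyCD = 'RE' THEN Value END)
, FD = SUM(CASE WHEN WorkKeyCD = 'FD' THEN Value END)
, TR = SUM(CASE WHEN WorkKeyCD = 'TR' THEN Value END)
, FF = SUM(CASE WHEN WorkKeyCD = 'FF' THEN Value END)
, PF = SUM(CASE WHEN WorkKeyCD = 'PF' THEN Value END)
, QW = SUM(CASE WHEN WorkKeyCD = 'QW' THEN Value END)
, FH = SUM(CASE WHEN WorkKeyCD = 'FH' THEN Value END)
, UH = SUM(CASE WHEN WorkKeyCD = 'UH' THEN Value END)
, NU = SUM(CASE WHEN WorkKeyCD = 'NU' THEN Value END)
, CS = SUM(CASE WHEN WorkKeyCD = 'CS' THEN Value END)
INTO dbo.WorkOutFactorCache
FROM dbo.WorkOutFactor
WHERE Value > 0
GROUP BY WorkOutID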

Retrieving data from such a consolidated table is much faster:

(185916 row(s) affected)

SQL Server Execution Times:

CPU time = 3448 ms, elapsed time = 3116 ms.

(185916 row(s) affected)

SQL Server Execution Times:

CPU time = 1410 ms, elapsed time = 1202 ms.

Rule # 5 — Every Rule Has an Exception

I have shown a couple of examples demonstrating how eliminating redundant data types shortens query execution time. But it does not always work that way.

For instance, the BIT data type has a peculiarity: SQL Server optimizes the on-disk storage of a group of such columns. If a table contains 8 or fewer BIT columns, they are stored in the page as 1 byte; if it contains up to 16 BIT columns, they are stored as 2 bytes, and so on. The good news is that the table takes up little space, which reduces disk I/O.

The bad news is that implicit decoding takes place when the data is retrieved, and this process is demanding in terms of CPU resources.

Here is an example. Assume we have three identical tables containing information about employees' work schedules (31 data columns plus 2 primary key columns). The only difference between the tables is the data type of the consolidated values (1 for presence, 0 for absence):

SELECT * FROM dbo.E31_INT
SELECT * FROM dbo.E32_TINYINT
SELECT * FROM dbo.E33_BIT

When less redundant data types are used, the table size decreases considerably (especially for the last table):

table_name           data_size_mb
-------------------- ------------
dbo.E31_INT          150.2578
dbo.E32_TINYINT      50.4141
dbo.E33_BIT          24.1953

However, there is no significant speed gain from using the BIT type:

(1000000 row(s) affected)

Table ‘E31_INT’. Scan count 1, logical reads 19296, physical reads 1, read-ahead reads 19260, …

SQL Server Execution Times:

CPU time = 1607 ms,  elapsed time = 19962 ms.

(1000000 row(s) affected)

Table ‘E32_TINYINT’. Scan count 1, logical reads 6471, physical reads 1, read-ahead reads 6477, …

SQL Server Execution Times:

CPU time = 1029 ms,  elapsed time = 16533 ms.

(1000000 row(s) affected)

Table ‘E33_BIT’. Scan count 1, logical reads 3109, physical reads 1, read-ahead reads 3096, …

SQL Server Execution Times:

CPU time = 1820 ms,  elapsed time = 17121 ms.

But the execution plan will show the opposite.

[execution plan screenshot]

So the negative effect of the decoding does not appear if a table contains fewer than 8 BIT columns. It is worth noting that the BIT data type is hardly ever used in SQL Server's own metadata; the BINARY data type is used more often, although it requires manual manipulation to extract specific values.
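For example, a sketch of that manual manipulation (the variable and the flag layout are my assumptions) could look like this:

-- Extracting individual flags from a BINARY value with bitwise AND;
-- here 0x00000005 means that flags 0 and 2 are set.
DECLARE @status BINARY(4) = 0x00000005

SELECT
flag0 = CASE WHEN CAST(@status AS INT) & 1 > 0 THEN 1 ELSE 0 END
, flag1 = CASE WHEN CAST(@status AS INT) & 2 > 0 THEN 1 ELSE 0 END
, flag2 = CASE WHEN CAST(@status AS INT) & 4 > 0 THEN 1 ELSE 0 END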

Rule # 6 — Delete Data That Is No Longer Required

SQL Server supports a performance optimization mechanism called read-ahead. This mechanism anticipates the data and index pages needed to fulfill a query execution plan and brings those pages into the buffer cache before they are actually used by the query.
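Read-ahead activity can be observed for a particular query with STATISTICS IO; this is where the "read-ahead reads" numbers shown earlier come from:

SET STATISTICS IO ON
SELECT * FROM dbo.WorkOut1
-- The Messages tab then reports scan count, logical reads,
-- physical reads, and read-ahead reads for each table.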

If a table contains a lot of unneeded data, read-ahead can thus cause unnecessary disk I/O. Besides, getting rid of unneeded data reduces the number of logical reads from the buffer pool.
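As a sketch (the archive table and the retention period are my assumptions), old rows can be moved to an archive table and removed from the hot table in a single atomic statement; the example below is executed in the SQLF_v2 database:

-- Assumed archive table for history older than one year.
CREATE TABLE dbo.format_history_archive (
session_id BIGINT
, format_date SMALLDATETIME
, format_hash BINARY(16) NOT NULL
)

-- DELETE ... OUTPUT INTO archives and removes the rows in one statement.
DELETE fh
OUTPUT DELETED.session_id, DELETED.format_date, DELETED.format_hash
INTO dbo.format_history_archive (session_id, format_date, format_hash)
FROM dbo.format_history fh
WHERE fh.format_date < DATEADD(YEAR, -1, GETDATE())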

In conclusion, my advice is to be extremely careful when selecting data types for columns and to try to predict future data loads.