Статьи

‘Индексирование’ документов JSON для эффективных запросов MySQL к данным JSON

Первоначально Написал Джастин Суонхарт

MySQL встречает NoSQL с JSON UDF

Я недавно вернулся из FOSDEM , в Брюсселе, Бельгия. Пока я был там, я увидел отличную беседу Светы Смирновой о том, что ее MySQL 5.7 Labs выпускает функции JSON UDF. Важно отметить, что, хотя UDF входит в версию 5.7, абсолютно возможно скомпилировать и использовать UDF с более ранними версиями MySQL, потому что интерфейс UDF не менялся в течение длительного времени. Тем не менее, UDF все равно следует рассматривать как уровень качества альфа / предварительного просмотра и пока не должен использоваться в производстве! Для этого примера я использую Percona Server 5.6 с UDF.

При этом для подтверждения концепции, которую я собираюсь представить здесь, используется только одна функция JSON (JSON_EXTRACT), и она достаточно хорошо сработала в моем тестировании, чтобы представить мою идею здесь. Функции JSON, скорее всего, скоро станут GA, и это полезный тест функции JSON_EXTRACT.

UDF позволяет вам анализировать, искать и манипулировать данными JSON внутри MySQL, приближая MySQL к возможностям хранилища документов.

Поскольку я использую Percona Server 5.6, мне нужно было скомпилировать и установить UDF. Вот шаги, которые я предпринял, чтобы скомпилировать плагин:

  1. $ cd mysql-json-udfs-0.3.3-labs-json-udfs-src
  2. $ cmake -DMYSQL_DIR = / usr / local / mysql.
  3. $ sudo make install
  4. $ sudo cp * .so / usr / local / mysql / lib / plugin

JSON UDF великолепны, но в чем проблема

The JSON functions work very well for manipulating individual JSON objects, but like all other functions, using JSON_EXTRACT in the WHERE clause will result in a full table scan. This means the functions are virtually useless for searching through large volumes of JSON data.  If you want to use MySQL as a document store, this is going to limit the usefulness in the extreme as the ability to extract key/value pairs from JSON documents is powerful, but without indexing it can’t scale well.

What can be done to index JSON in MySQL for efficient access?

The JSON UDF provides a JSON_EXTRACT function which can pull data out of a JSON document. There are two ways we can use this function to “index” the JSON data.

  1. Add extra columns to the table (or use a separate table, or tables) containing the JSON and populate the columns using JSON_EXTRACT in a trigger. The downside is that this slows down inserts and modifications of the documents significantly.
  2. Use Flexviews materialized views to maintain an index table separately and asynchronously. The upside is that insertion/modification speed is not affected, but there is slight delay before index is populated. This is similar to eventual consistency in a document store.

Writing triggers is an exercise I’ll leave up to the user. The rest of this post will discuss using Flexviews materialized views to create a JSON index.

What is Flexviews?

Flexviews can create ‘incrementally refreshable’ materialized views. This means that the views are able to be refreshed efficiently using changes captured by FlexCDC, the change data capture tool that ships with Flexviews. Since the view can be refreshed fast, it is possible to refresh it frequently and have a low latency index, but not one perfectly in sync with the base table at all times.

The materialized view is a real table that is indexed to provide fast access. Flexviews includes a SQL_API, or a set of stored procedures for defining and maintaining materialized views.

See this set of slides for an overview of Flexviews: http://www.slideshare.net/MySQLGeek/flexviews-materialized-views-for-my-sql

Demo/POC using materialized view as an index

Первым шагом к созданию постепенно обновляемого материализованного представления с помощью Flexviews является создание журнала изменений материализованного представления для всех таблиц, используемых в представлении. Функция CREATE_MVLOG ($ schema, $ table) создает журнал, и FlexCDC немедленно собирает изменения в нем.

mysql> call flexviews.create_mvlog('ssb_json','json');
Query OK, 1 row affected (0.01 sec)

Далее, материализованное имя представления и тип обновления должны быть зарегистрированы с помощью функции CREATE ($ schema, $ mvname, $ refreshtype):

mysql> call flexviews.create('ssb_json','json_idx','INCREMENTAL');
Query OK, 0 rows affected (0.00 sec)
-- save the just generated identifier for the view.  You can use GET_ID($schema,$mvname) later.
mysql> set @mvid := last_insert_id();
Query OK, 0 rows affected (0.00 sec)

Now one or more tables have to be added to the view using the ADD_TABLE($mvid, $schema, $table, $alias,$joinclause) function. This example will use only one table, but Flexviews supports joins too.

mysql> call flexviews.add_table(@mvid, 'ssb_json','json','json',null);
Query OK, 1 row affected (0.00 sec)

Expressions must be added to the view next. Since aggregation is not used in this example, the expressions should be ‘COLUMN’ type expressions. The function ADD_EXPR($mvid, $expression_type, $expression, $alias) is used to add expressions. Note that JSON_EXTRACT returns a TEXT column, so I’ve CAST the function to integer so that it can be indexed. Flexviews does not currently have a way to define prefix indexes.

mysql> call flexviews.add_expr(@mvid, 'COLUMN', "cast(json_extract(doc,'D_DateKey') as date)", 'D_DateKey');
Query OK, 1 row affected (0.00 sec)
mysql> call flexviews.add_expr(@mvid, 'COLUMN',"cast(json_extract(doc,'C_CustomerKey') as unsigned)", 'C_CustomerKey');
Query OK, 1 row affected (0.00 sec)
mysql> call flexviews.add_expr(@mvid, 'COLUMN',"cast(json_extract(doc,'S_SuppKey') as unsigned)", 'S_SuppKey');
Query OK, 1 row affected (0.01 sec)
mysql> call flexviews.add_expr(@mvid, 'COLUMN',"cast(json_extract(doc,'P_PartKey') as unsigned)", 'P_PartKey');
Query OK, 1 row affected (0.00 sec)
mysql> call flexviews.add_expr(@mvid, 'COLUMN',"cast(json_extract(doc,'LO_OrderKey') as unsigned)", 'LO_OrderKey');
Query OK, 1 row affected (0.00 sec)
mysql> call flexviews.add_expr(@mvid, 'COLUMN',"cast(json_extract(doc,'LO_LineNumber') as unsigned)", 'LO_LineNumber');
Query OK, 1 row affected (0.00 sec)

I’ve also projected out the ‘id’ column from the table, which is the primary key. This ties the index entries to the original row, so that the original document can be retrieved.

mysql> call flexviews.add_expr(@mvid, 'COLUMN', 'id', 'id');
Query OK, 1 row affected (0.00 sec)

Since we want to use the materialized view as an index, we need to index the columns we’ve added to it.

mysql> call flexviews.add_expr(@mvid, 'KEY',"LO_LineNumber", 'LO_LineNumber_Idx');
Query OK, 1 row affected (0.00 sec)
mysql> call flexviews.add_expr(@mvid, 'KEY',"LO_OrderKey", 'LO_OrderKey_Idx');
Query OK, 1 row affected (0.00 sec)
mysql> call flexviews.add_expr(@mvid, 'KEY',"P_PartKey", 'P_PartKey_Idx');
Query OK, 1 row affected (0.00 sec)
mysql> call flexviews.add_expr(@mvid, 'KEY',"S_SuppKey", 'S_SuppKey_Idx');
Query OK, 1 row affected (0.00 sec)
mysql> call flexviews.add_expr(@mvid, 'KEY',"D_DateKey", 'D_DateKey_Idx');
Query OK, 1 row affected (0.00 sec)
mysql> call flexviews.add_expr(@mvid, 'KEY',"C_CustomerKey", 'C_CustomerKey_Idx');
Query OK, 1 row affected (0.00 sec)

Finally, the view has to be created. There are 6 million rows in my table, the JSON functions are UDF so they are not as fast as built in functions, and I indexed a lot of things (six different indexes are being populated at once) so it takes some time to build the index:

mysql> call flexviews.enable(@mvid);
Query OK, 2 rows affected (35 min 53.17 sec)

After the materialized view is built, you can see it in the schema. Note there is also a delta table, which I will explain a bit later.

mysql> show tables;
+--------------------+
| Tables_in_ssb_json |
+--------------------+
| json               |
| json_idx           |
| json_idx_delta     |
+--------------------+
3 rows in set (0.00 sec)

Here is the table definition of json_idx, our materialized view. You can see it is indexed:

CREATE TABLE `json_idx` (
  `mview$pk` bigint(20) NOT NULL AUTO_INCREMENT,
  `D_DateKey` date DEFAULT NULL,
  `C_CustomerKey` bigint(21) unsigned DEFAULT NULL,
  `S_SuppKey` bigint(21) unsigned DEFAULT NULL,
  `P_PartKey` bigint(21) unsigned DEFAULT NULL,
  `LO_OrderKey` bigint(21) unsigned DEFAULT NULL,
  `LO_LineNumber` bigint(21) unsigned DEFAULT NULL,
  `id` bigint(20) NOT NULL DEFAULT '0',
  `mview$hash` int(10) unsigned DEFAULT NULL,
  PRIMARY KEY (`mview$pk`),
  KEY `LO_LineNumber_Idx` (`LO_LineNumber`),
  KEY `LO_OrderKey_Idx` (`LO_OrderKey`),
  KEY `P_PartKey_Idx` (`P_PartKey`),
  KEY `S_SuppKey_Idx` (`S_SuppKey`),
  KEY `D_DateKey_Idx` (`D_DateKey`),
  KEY `C_CustomerKey_Idx` (`C_CustomerKey`),
  KEY `mview$hash_key` (`mview$hash`)
) ENGINE=InnoDB AUTO_INCREMENT=6029221 DEFAULT CHARSET=latin1;

Here are some sample contents. You can see the integer values extracted out of the JSON:

mysql> select * from json_idx limit 10;
+----------+------------+---------------+-----------+-----------+-------------+---------------+----+------------+
| mview$pk | D_DateKey  | C_CustomerKey | S_SuppKey | P_PartKey | LO_OrderKey | LO_LineNumber | id | mview$hash |
+----------+------------+---------------+-----------+-----------+-------------+---------------+----+------------+
|        1 | 1996-08-08 |          6748 |         1 |    178778 |       35620 |             2 |  1 | 3151656687 |
|        2 | 1994-05-20 |          5272 |         1 |     52362 |      102790 |             4 |  2 | 2181615425 |
|        3 | 1995-05-04 |         22870 |         1 |    136407 |      146757 |             3 |  3 |  544130577 |
|        4 | 1996-06-16 |         12217 |         1 |    129103 |      151200 |             1 |  4 | 2969697839 |
|        5 | 1992-07-20 |         21757 |         1 |     35243 |      151745 |             1 |  5 | 1438921571 |
|        6 | 1997-08-16 |         18760 |         1 |    150766 |      159232 |             6 |  6 | 3941775529 |
|        7 | 1994-03-04 |           757 |         1 |     15750 |      188902 |             3 |  7 | 2142628088 |
|        8 | 1993-11-04 |         17830 |         1 |    192023 |      253828 |             5 |  8 | 3480147565 |
|        9 | 1993-07-12 |         16933 |         1 |     59997 |      269062 |             5 |  9 | 3572286145 |
|       10 | 1998-06-16 |         26660 |         1 |     30023 |      298272 |             3 | 10 | 1971966244 |
+----------+------------+---------------+-----------+-----------+-------------+---------------+----+------------+
10 rows in set (0.00 sec)

Now, there needs to be an easy way to use this index in a select statement. Since a JOIN is needed between the materialized view and the base table, a regular VIEW makes sense to access the data. We’ll call this the index view:

mysql> create view json_idx_v as select * from json natural join json_idx;
Query OK, 0 rows affected (0.00 sec)

And just for completeness, here is the contents of a row from our new index view:

mysql> select * from json_idx_v limit 1G
*************************** 1. row ***************************
           id: 1
          doc: {"LO_OrderKey":"35620","LO_LineNumber":"2","LO_CustKey":"6748","LO_PartKey":"178778","LO_SuppKey":"1","LO_OrderDateKey":"19960808","LO_OrderPriority":"3-MEDIUM","LO_ShipPriority":"0","LO_Quantity":"38","LO_ExtendedPrice":"7055726","LO_OrdTotalPrice":"14691804","LO_Discount":"8","LO_Revenue":"6491267","LO_SupplyCost":"111406","LO_Tax":"1","LO_CommitDateKey":"19960909","LO_ShipMode":"REG AIR","C_CustomerKey":"6748","C_Name":"Customer#000006748","C_Address":"RSPYBRlR7RX6 f7J8","C_City":"PERU     5","C_Nation":"PERU","C_Region":"AMERICA","C_Phone":"27-580-967-4556","C_MktSegment":"AUTOMOBILE","S_SuppKey":"1","S_Name":"Supplier#000000001","S_Address":"sdrGnXCDRcfriBvY0KL,i","S_City":"PERU     0","S_Nation":"PERU","S_Region":"AMERICA","S_Phone":"27-989-741-2988","D_DateKey":"19960808","D_Date":"Augest 8, 1996","D_DayOfWeek":"Friday","D_Month":"Augest","D_Year":"1996","D_YearMonthNum":"199608","D_YearMonth":"Aug1996","D_DayNumInWeek":"6","D_DayNumInMonth":"8","D_DayNumInYear":"221","D_MonthNumInYear":"8","D_WeekNumInYear":"32","D_SellingSeason":"Summer","D_LastDayInWeekFl":"0","D_LastDayInMonthFl":"1","D_HolidayFl":"0","D_WeekDayFl":"1","P_PartKey":"178778","P_Name":"turquoise powder","P_MFGR":"MFGR#1","P_Category":"MFGR#11","P_Brand":"MFGR#1125","P_Colour":"beige","P_Type":"STANDARD POLISHED NICKEL","P_Size":"25","P_Container":"JUMBO BAG"}
     mview$pk: 1
    D_DateKey: 1996-08-08
C_CustomerKey: 6748
    S_SuppKey: 1
    P_PartKey: 178778
  LO_OrderKey: 35620
LO_LineNumber: 2
   mview$hash: 3151656687
1 row in set (0.00 sec)

Using the UDF to find a document

The UDF does a full table scan, parsing all six million documents (TWICE!) as it goes along. Unsurprisingly, this is slow:

mysql> select * from json where json_extract(doc,'LO_OrderKey') = 35620 and json_extract(doc,'LO_LineNumber') = 2G
*************************** 1. row ***************************
id: 1
doc: {"LO_OrderKey":"35620","LO_LineNumber":"2","LO_CustKey":"6748","LO_PartKey":"178778","LO_SuppKey":"1","LO_OrderDateKey":"19960808","LO_OrderPriority":"3-MEDIUM","LO_ShipPriority":"0","LO_Quantity":"38","LO_ExtendedPrice":"7055726","LO_OrdTotalPrice":"14691804","LO_Discount":"8","LO_Revenue":"6491267","LO_SupplyCost":"111406","LO_Tax":"1","LO_CommitDateKey":"19960909","LO_ShipMode":"REG AIR","C_CustomerKey":"6748","C_Name":"Customer#000006748","C_Address":"RSPYBRlR7RX6 f7J8","C_City":"PERU 5","C_Nation":"PERU","C_Region":"AMERICA","C_Phone":"27-580-967-4556","C_MktSegment":"AUTOMOBILE","S_SuppKey":"1","S_Name":"Supplier#000000001","S_Address":"sdrGnXCDRcfriBvY0KL,i","S_City":"PERU 0","S_Nation":"PERU","S_Region":"AMERICA","S_Phone":"27-989-741-2988","D_DateKey":"19960808","D_Date":"Augest 8, 1996","D_DayOfWeek":"Friday","D_Month":"Augest","D_Year":"1996","D_YearMonthNum":"199608","D_YearMonth":"Aug1996","D_DayNumInWeek":"6","D_DayNumInMonth":"8","D_DayNumInYear":"221","D_MonthNumInYear":"8","D_WeekNumInYear":"32","D_SellingSeason":"Summer","D_LastDayInWeekFl":"0","D_LastDayInMonthFl":"1","D_HolidayFl":"0","D_WeekDayFl":"1","P_PartKey":"178778","P_Name":"turquoise powder","P_MFGR":"MFGR#1","P_Category":"MFGR#11","P_Brand":"MFGR#1125","P_Colour":"beige","P_Type":"STANDARD POLISHED NICKEL","P_Size":"25","P_Container":"JUMBO BAG"}
1 row in set (54.49 sec)
mysql> explain select * from json
                where json_extract(doc,'LO_OrderKey') = 35620
                  and json_extract(doc,'LO_LineNumber') = 2G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: json
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 5233236
        Extra: Using where
1 row in set (0.00 sec)

Using the index view to find a document

mysql> select id, doc from json_idx_v where LO_OrderKey = 35620 and LO_LineNumber = 2G
*************************** 1. row ***************************
id: 1
doc: {"LO_OrderKey":"35620","LO_LineNumber":"2","LO_CustKey":"6748","LO_PartKey":"178778","LO_SuppKey":"1","LO_OrderDateKey":"19960808","LO_OrderPriority":"3-MEDIUM","LO_ShipPriority":"0","LO_Quantity":"38","LO_ExtendedPrice":"7055726","LO_OrdTotalPrice":"14691804","LO_Discount":"8","LO_Revenue":"6491267","LO_SupplyCost":"111406","LO_Tax":"1","LO_CommitDateKey":"19960909","LO_ShipMode":"REG AIR","C_CustomerKey":"6748","C_Name":"Customer#000006748","C_Address":"RSPYBRlR7RX6 f7J8","C_City":"PERU 5","C_Nation":"PERU","C_Region":"AMERICA","C_Phone":"27-580-967-4556","C_MktSegment":"AUTOMOBILE","S_SuppKey":"1","S_Name":"Supplier#000000001","S_Address":"sdrGnXCDRcfriBvY0KL,i","S_City":"PERU 0","S_Nation":"PERU","S_Region":"AMERICA","S_Phone":"27-989-741-2988","D_DateKey":"19960808","D_Date":"Augest 8, 1996","D_DayOfWeek":"Friday","D_Month":"Augest","D_Year":"1996","D_YearMonthNum":"199608","D_YearMonth":"Aug1996","D_DayNumInWeek":"6","D_DayNumInMonth":"8","D_DayNumInYear":"221","D_MonthNumInYear":"8","D_WeekNumInYear":"32","D_SellingSeason":"Summer","D_LastDayInWeekFl":"0","D_LastDayInMonthFl":"1","D_HolidayFl":"0","D_WeekDayFl":"1","P_PartKey":"178778","P_Name":"turquoise powder","P_MFGR":"MFGR#1","P_Category":"MFGR#11","P_Brand":"MFGR#1125","P_Colour":"beige","P_Type":"STANDARD POLISHED NICKEL","P_Size":"25","P_Container":"JUMBO BAG"}
1 row in set (0.00 sec)
mysql> explain select id, doc from json_idx_v where LO_OrderKey = 35620 and LO_LineNumber = 2G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: json_idx
         type: ref
possible_keys: LO_LineNumber_Idx,LO_OrderKey_Idx
          key: LO_OrderKey_Idx
      key_len: 9
          ref: const
         rows: 4
        Extra: Using index condition; Using where
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: json
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 8
          ref: ssb_json.json_idx.id
         rows: 1
        Extra: NULL
2 rows in set (0.00 sec)

Keeping the index in sync

Flexviews materialized views need to be refreshed when the underlying table changes. Flexviews includes a REFRESH($mvid, $mode, $transaction_id) function.

I am going to remove one document from the table:

mysql> delete from json where id = 10000;
Query OK, 1 row affected (0.01 sec)

Note there is now one row in the materialized view change log. dml_type is -1 because it is a delete:

mysql> select * from flexviews.mvlog_f1673fac9814a93508a1c917566ecd4dG
*************************** 1. row ***************************
    dml_type: -1
      uow_id: 113
fv$server_id: 33
      fv$gsn: 1083
          id: 10000
         doc: {"LO_OrderKey":"3359521","LO_LineNumber":"2","LO_CustKey":"10306","LO_PartKey":"77997","LO_SuppKey":"4","LO_OrderDateKey":"19951010","LO_OrderPriority":"2-HIGH","LO_ShipPriority":"0","LO_Quantity":"43","LO_ExtendedPrice":"8492457","LO_OrdTotalPrice":"27032802","LO_Discount":"2","LO_Revenue":"8322607","LO_SupplyCost":"118499","LO_Tax":"4","LO_CommitDateKey":"19951228","LO_ShipMode":"FOB","C_CustomerKey":"10306","C_Name":"Customer#000010306","C_Address":"4UR9tz8","C_City":"ROMANIA  5","C_Nation":"ROMANIA","C_Region":"EUROPE","C_Phone":"29-361-986-3513","C_MktSegment":"BUILDING","S_SuppKey":"4","S_Name":"Supplier#000000004","S_Address":"qGTQJXogS83a7MB","S_City":"MOROCCO  4","S_Nation":"MOROCCO","S_Region":"AFRICA","S_Phone":"25-128-190-5944","D_DateKey":"19951010","D_Date":"Octorber 10, 1995","D_DayOfWeek":"Wednesday","D_Month":"Octorber","D_Year":"1995","D_YearMonthNum":"199510","D_YearMonth":"Oct1995","D_DayNumInWeek":"4","D_DayNumInMonth":"10","D_DayNumInYear":"283","D_MonthNumInYear":"10","D_WeekNumInYear":"41","D_SellingSeason":"Fall","D_LastDayInWeekFl":"0","D_LastDayInMonthFl":"1","D_HolidayFl":"0","D_WeekDayFl":"1","P_PartKey":"77997","P_Name":"burnished olive","P_MFGR":"MFGR#2","P_Category":"MFGR#24","P_Brand":"MFGR#2426","P_Colour":"orchid","P_Type":"MEDIUM PLATED TIN","P_Size":"16","P_Container":"WRAP PKG"}
1 row in set (0.01 sec)

Now we can verify the materialized view is out of date:

mysql> select * from json_idx where id = 10000;
+----------+------------+---------------+-----------+-----------+-------------+---------------+-------+------------+
| mview$pk | D_DateKey  | C_CustomerKey | S_SuppKey | P_PartKey | LO_OrderKey | LO_LineNumber | id    | mview$hash |
+----------+------------+---------------+-----------+-----------+-------------+---------------+-------+------------+
|    10000 | 1995-10-10 |         10306 |         4 |     77997 |     3359521 |             2 | 10000 | 2937185172 |
+----------+------------+---------------+-----------+-----------+-------------+---------------+-------+------------+
1 row in set (2.60 sec)

To bring the index up to date we must refresh it. Usually you will use the ‘BOTH’ mode to ‘COMPUTE’ and ‘APPLY’ the changes at the same time, but I am going to use COMPUTE mode to show you what ends up in the delta table:

mysql> select * from json_idx_deltaG
*************************** 1. row ***************************
     dml_type: -1
       uow_id: 113
       fv$gsn: 1083
    D_DateKey: 1995-10-10
C_CustomerKey: 10306
    S_SuppKey: 4
    P_PartKey: 77997
  LO_OrderKey: 3359521
LO_LineNumber: 2
           id: 10000
   mview$hash: 2937185172
1 row in set (0.00 sec)

Delta tables are similar to materialized view change log tables, except they contain insertions and deletions to the view contents. In this case, you can see dml_type is -1 and id = 10000, so the row from the view corresponding to the row we deleted, will be deleted when the change is applied.

Finally the change can be applied:

mysql> call flexviews.refresh(flexviews.get_id('ssb_json','json_idx'), 'APPLY',NULL);
Query OK, 2 rows affected (0.47 sec)
mysql> select * from json_idx where id = 10000; -- note, we didn't index id in the MV
Empty set (2.61 sec)

Finally, it makes sense to try to keep the index in sync as quickly as possible using a MySQL event:

DELIMITER ;;
CREATE EVENT IF NOT EXISTS flexviews.refresh_json_idx
ON SCHEDULE EVERY 1 SECOND
DO
BEGIN
  DECLARE v_got_lock tinyint default 0;
  SELECT GET_LOCK('JSON_IDX_LOCK', 0) INTO v_got_lock;
  IF v_got_lock = 1 THEN
    CALL flexviews.refresh(flexviews.get_id('ssb_json','json_idx'),'BOTH',NULL);
    SELECT RELEASE_LOCK('JSON_IDX_LOCK') INTO @discard;
  END IF;
END;;
DELIMITER ;

So there you have it. A way to index and quickly search through JSON documents and keep the index in sync automatically.