Первоначально Написал Джастин Суонхарт
MySQL встречает NoSQL с JSON UDF
Я недавно вернулся из FOSDEM , в Брюсселе, Бельгия. Пока я был там, я увидел отличную беседу Светы Смирновой о том, что ее MySQL 5.7 Labs выпускает функции JSON UDF. Важно отметить, что, хотя UDF входит в версию 5.7, абсолютно возможно скомпилировать и использовать UDF с более ранними версиями MySQL, потому что интерфейс UDF не менялся в течение длительного времени. Тем не менее, UDF все равно следует рассматривать как уровень качества альфа / предварительного просмотра и пока не должен использоваться в производстве! Для этого примера я использую Percona Server 5.6 с UDF.
При этом для подтверждения концепции, которую я собираюсь представить здесь, используется только одна функция JSON (JSON_EXTRACT), и она достаточно хорошо сработала в моем тестировании, чтобы представить мою идею здесь. Функции JSON, скорее всего, скоро станут GA, и это полезный тест функции JSON_EXTRACT.
UDF позволяет вам анализировать, искать и манипулировать данными JSON внутри MySQL, приближая MySQL к возможностям хранилища документов.
Поскольку я использую Percona Server 5.6, мне нужно было скомпилировать и установить UDF. Вот шаги, которые я предпринял, чтобы скомпилировать плагин:
- $ cd mysql-json-udfs-0.3.3-labs-json-udfs-src
- $ cmake -DMYSQL_DIR = / usr / local / mysql.
- $ sudo make install
- $ sudo cp * .so / usr / local / mysql / lib / plugin
JSON UDF великолепны, но в чем проблема
The JSON functions work very well for manipulating individual JSON objects, but like all other functions, using JSON_EXTRACT in the WHERE clause will result in a full table scan. This means the functions are virtually useless for searching through large volumes of JSON data. If you want to use MySQL as a document store, this is going to limit the usefulness in the extreme as the ability to extract key/value pairs from JSON documents is powerful, but without indexing it can’t scale well.
What can be done to index JSON in MySQL for efficient access?
The JSON UDF provides a JSON_EXTRACT function which can pull data out of a JSON document. There are two ways we can use this function to “index” the JSON data.
- Add extra columns to the table (or use a separate table, or tables) containing the JSON and populate the columns using JSON_EXTRACT in a trigger. The downside is that this slows down inserts and modifications of the documents significantly.
- Use Flexviews materialized views to maintain an index table separately and asynchronously. The upside is that insertion/modification speed is not affected, but there is slight delay before index is populated. This is similar to eventual consistency in a document store.
Writing triggers is an exercise I’ll leave up to the user. The rest of this post will discuss using Flexviews materialized views to create a JSON index.
What is Flexviews?
Flexviews can create ‘incrementally refreshable’ materialized views. This means that the views are able to be refreshed efficiently using changes captured by FlexCDC, the change data capture tool that ships with Flexviews. Since the view can be refreshed fast, it is possible to refresh it frequently and have a low latency index, but not one perfectly in sync with the base table at all times.
The materialized view is a real table that is indexed to provide fast access. Flexviews includes a SQL_API, or a set of stored procedures for defining and maintaining materialized views.
See this set of slides for an overview of Flexviews: http://www.slideshare.net/MySQLGeek/flexviews-materialized-views-for-my-sql
Demo/POC using materialized view as an index
Первым шагом к созданию постепенно обновляемого материализованного представления с помощью Flexviews является создание журнала изменений материализованного представления для всех таблиц, используемых в представлении. Функция CREATE_MVLOG ($ schema, $ table) создает журнал, и FlexCDC немедленно собирает изменения в нем.
mysql> call flexviews.create_mvlog('ssb_json','json'); Query OK, 1 row affected (0.01 sec)
Далее, материализованное имя представления и тип обновления должны быть зарегистрированы с помощью функции CREATE ($ schema, $ mvname, $ refreshtype):
mysql> call flexviews.create('ssb_json','json_idx','INCREMENTAL'); Query OK, 0 rows affected (0.00 sec) -- save the just generated identifier for the view. You can use GET_ID($schema,$mvname) later. mysql> set @mvid := last_insert_id(); Query OK, 0 rows affected (0.00 sec)
Now one or more tables have to be added to the view using the ADD_TABLE($mvid, $schema, $table, $alias,$joinclause) function. This example will use only one table, but Flexviews supports joins too.
mysql> call flexviews.add_table(@mvid, 'ssb_json','json','json',null); Query OK, 1 row affected (0.00 sec)
Expressions must be added to the view next. Since aggregation is not used in this example, the expressions should be ‘COLUMN’ type expressions. The function ADD_EXPR($mvid, $expression_type, $expression, $alias) is used to add expressions. Note that JSON_EXTRACT returns a TEXT column, so I’ve CAST the function to integer so that it can be indexed. Flexviews does not currently have a way to define prefix indexes.
mysql> call flexviews.add_expr(@mvid, 'COLUMN', "cast(json_extract(doc,'D_DateKey') as date)", 'D_DateKey'); Query OK, 1 row affected (0.00 sec) mysql> call flexviews.add_expr(@mvid, 'COLUMN',"cast(json_extract(doc,'C_CustomerKey') as unsigned)", 'C_CustomerKey'); Query OK, 1 row affected (0.00 sec) mysql> call flexviews.add_expr(@mvid, 'COLUMN',"cast(json_extract(doc,'S_SuppKey') as unsigned)", 'S_SuppKey'); Query OK, 1 row affected (0.01 sec) mysql> call flexviews.add_expr(@mvid, 'COLUMN',"cast(json_extract(doc,'P_PartKey') as unsigned)", 'P_PartKey'); Query OK, 1 row affected (0.00 sec) mysql> call flexviews.add_expr(@mvid, 'COLUMN',"cast(json_extract(doc,'LO_OrderKey') as unsigned)", 'LO_OrderKey'); Query OK, 1 row affected (0.00 sec) mysql> call flexviews.add_expr(@mvid, 'COLUMN',"cast(json_extract(doc,'LO_LineNumber') as unsigned)", 'LO_LineNumber'); Query OK, 1 row affected (0.00 sec)
I’ve also projected out the ‘id’ column from the table, which is the primary key. This ties the index entries to the original row, so that the original document can be retrieved.
mysql> call flexviews.add_expr(@mvid, 'COLUMN', 'id', 'id'); Query OK, 1 row affected (0.00 sec)
Since we want to use the materialized view as an index, we need to index the columns we’ve added to it.
mysql> call flexviews.add_expr(@mvid, 'KEY',"LO_LineNumber", 'LO_LineNumber_Idx'); Query OK, 1 row affected (0.00 sec) mysql> call flexviews.add_expr(@mvid, 'KEY',"LO_OrderKey", 'LO_OrderKey_Idx'); Query OK, 1 row affected (0.00 sec) mysql> call flexviews.add_expr(@mvid, 'KEY',"P_PartKey", 'P_PartKey_Idx'); Query OK, 1 row affected (0.00 sec) mysql> call flexviews.add_expr(@mvid, 'KEY',"S_SuppKey", 'S_SuppKey_Idx'); Query OK, 1 row affected (0.00 sec) mysql> call flexviews.add_expr(@mvid, 'KEY',"D_DateKey", 'D_DateKey_Idx'); Query OK, 1 row affected (0.00 sec) mysql> call flexviews.add_expr(@mvid, 'KEY',"C_CustomerKey", 'C_CustomerKey_Idx'); Query OK, 1 row affected (0.00 sec)
Finally, the view has to be created. There are 6 million rows in my table, the JSON functions are UDF so they are not as fast as built in functions, and I indexed a lot of things (six different indexes are being populated at once) so it takes some time to build the index:
mysql> call flexviews.enable(@mvid); Query OK, 2 rows affected (35 min 53.17 sec)
After the materialized view is built, you can see it in the schema. Note there is also a delta table, which I will explain a bit later.
mysql> show tables; +--------------------+ | Tables_in_ssb_json | +--------------------+ | json | | json_idx | | json_idx_delta | +--------------------+ 3 rows in set (0.00 sec)
Here is the table definition of json_idx, our materialized view. You can see it is indexed:
CREATE TABLE `json_idx` ( `mview$pk` bigint(20) NOT NULL AUTO_INCREMENT, `D_DateKey` date DEFAULT NULL, `C_CustomerKey` bigint(21) unsigned DEFAULT NULL, `S_SuppKey` bigint(21) unsigned DEFAULT NULL, `P_PartKey` bigint(21) unsigned DEFAULT NULL, `LO_OrderKey` bigint(21) unsigned DEFAULT NULL, `LO_LineNumber` bigint(21) unsigned DEFAULT NULL, `id` bigint(20) NOT NULL DEFAULT '0', `mview$hash` int(10) unsigned DEFAULT NULL, PRIMARY KEY (`mview$pk`), KEY `LO_LineNumber_Idx` (`LO_LineNumber`), KEY `LO_OrderKey_Idx` (`LO_OrderKey`), KEY `P_PartKey_Idx` (`P_PartKey`), KEY `S_SuppKey_Idx` (`S_SuppKey`), KEY `D_DateKey_Idx` (`D_DateKey`), KEY `C_CustomerKey_Idx` (`C_CustomerKey`), KEY `mview$hash_key` (`mview$hash`) ) ENGINE=InnoDB AUTO_INCREMENT=6029221 DEFAULT CHARSET=latin1;
Here are some sample contents. You can see the integer values extracted out of the JSON:
mysql> select * from json_idx limit 10; +----------+------------+---------------+-----------+-----------+-------------+---------------+----+------------+ | mview$pk | D_DateKey | C_CustomerKey | S_SuppKey | P_PartKey | LO_OrderKey | LO_LineNumber | id | mview$hash | +----------+------------+---------------+-----------+-----------+-------------+---------------+----+------------+ | 1 | 1996-08-08 | 6748 | 1 | 178778 | 35620 | 2 | 1 | 3151656687 | | 2 | 1994-05-20 | 5272 | 1 | 52362 | 102790 | 4 | 2 | 2181615425 | | 3 | 1995-05-04 | 22870 | 1 | 136407 | 146757 | 3 | 3 | 544130577 | | 4 | 1996-06-16 | 12217 | 1 | 129103 | 151200 | 1 | 4 | 2969697839 | | 5 | 1992-07-20 | 21757 | 1 | 35243 | 151745 | 1 | 5 | 1438921571 | | 6 | 1997-08-16 | 18760 | 1 | 150766 | 159232 | 6 | 6 | 3941775529 | | 7 | 1994-03-04 | 757 | 1 | 15750 | 188902 | 3 | 7 | 2142628088 | | 8 | 1993-11-04 | 17830 | 1 | 192023 | 253828 | 5 | 8 | 3480147565 | | 9 | 1993-07-12 | 16933 | 1 | 59997 | 269062 | 5 | 9 | 3572286145 | | 10 | 1998-06-16 | 26660 | 1 | 30023 | 298272 | 3 | 10 | 1971966244 | +----------+------------+---------------+-----------+-----------+-------------+---------------+----+------------+ 10 rows in set (0.00 sec)
Now, there needs to be an easy way to use this index in a select statement. Since a JOIN is needed between the materialized view and the base table, a regular VIEW makes sense to access the data. We’ll call this the index view:
mysql> create view json_idx_v as select * from json natural join json_idx; Query OK, 0 rows affected (0.00 sec)
And just for completeness, here is the contents of a row from our new index view:
mysql> select * from json_idx_v limit 1G *************************** 1. row *************************** id: 1 doc: {"LO_OrderKey":"35620","LO_LineNumber":"2","LO_CustKey":"6748","LO_PartKey":"178778","LO_SuppKey":"1","LO_OrderDateKey":"19960808","LO_OrderPriority":"3-MEDIUM","LO_ShipPriority":"0","LO_Quantity":"38","LO_ExtendedPrice":"7055726","LO_OrdTotalPrice":"14691804","LO_Discount":"8","LO_Revenue":"6491267","LO_SupplyCost":"111406","LO_Tax":"1","LO_CommitDateKey":"19960909","LO_ShipMode":"REG AIR","C_CustomerKey":"6748","C_Name":"Customer#000006748","C_Address":"RSPYBRlR7RX6 f7J8","C_City":"PERU 5","C_Nation":"PERU","C_Region":"AMERICA","C_Phone":"27-580-967-4556","C_MktSegment":"AUTOMOBILE","S_SuppKey":"1","S_Name":"Supplier#000000001","S_Address":"sdrGnXCDRcfriBvY0KL,i","S_City":"PERU 0","S_Nation":"PERU","S_Region":"AMERICA","S_Phone":"27-989-741-2988","D_DateKey":"19960808","D_Date":"Augest 8, 1996","D_DayOfWeek":"Friday","D_Month":"Augest","D_Year":"1996","D_YearMonthNum":"199608","D_YearMonth":"Aug1996","D_DayNumInWeek":"6","D_DayNumInMonth":"8","D_DayNumInYear":"221","D_MonthNumInYear":"8","D_WeekNumInYear":"32","D_SellingSeason":"Summer","D_LastDayInWeekFl":"0","D_LastDayInMonthFl":"1","D_HolidayFl":"0","D_WeekDayFl":"1","P_PartKey":"178778","P_Name":"turquoise powder","P_MFGR":"MFGR#1","P_Category":"MFGR#11","P_Brand":"MFGR#1125","P_Colour":"beige","P_Type":"STANDARD POLISHED NICKEL","P_Size":"25","P_Container":"JUMBO BAG"} mview$pk: 1 D_DateKey: 1996-08-08 C_CustomerKey: 6748 S_SuppKey: 1 P_PartKey: 178778 LO_OrderKey: 35620 LO_LineNumber: 2 mview$hash: 3151656687 1 row in set (0.00 sec)
Using the UDF to find a document
The UDF does a full table scan, parsing all six million documents (TWICE!) as it goes along. Unsurprisingly, this is slow:
mysql> select * from json where json_extract(doc,'LO_OrderKey') = 35620 and json_extract(doc,'LO_LineNumber') = 2G *************************** 1. row *************************** id: 1 doc: {"LO_OrderKey":"35620","LO_LineNumber":"2","LO_CustKey":"6748","LO_PartKey":"178778","LO_SuppKey":"1","LO_OrderDateKey":"19960808","LO_OrderPriority":"3-MEDIUM","LO_ShipPriority":"0","LO_Quantity":"38","LO_ExtendedPrice":"7055726","LO_OrdTotalPrice":"14691804","LO_Discount":"8","LO_Revenue":"6491267","LO_SupplyCost":"111406","LO_Tax":"1","LO_CommitDateKey":"19960909","LO_ShipMode":"REG AIR","C_CustomerKey":"6748","C_Name":"Customer#000006748","C_Address":"RSPYBRlR7RX6 f7J8","C_City":"PERU 5","C_Nation":"PERU","C_Region":"AMERICA","C_Phone":"27-580-967-4556","C_MktSegment":"AUTOMOBILE","S_SuppKey":"1","S_Name":"Supplier#000000001","S_Address":"sdrGnXCDRcfriBvY0KL,i","S_City":"PERU 0","S_Nation":"PERU","S_Region":"AMERICA","S_Phone":"27-989-741-2988","D_DateKey":"19960808","D_Date":"Augest 8, 1996","D_DayOfWeek":"Friday","D_Month":"Augest","D_Year":"1996","D_YearMonthNum":"199608","D_YearMonth":"Aug1996","D_DayNumInWeek":"6","D_DayNumInMonth":"8","D_DayNumInYear":"221","D_MonthNumInYear":"8","D_WeekNumInYear":"32","D_SellingSeason":"Summer","D_LastDayInWeekFl":"0","D_LastDayInMonthFl":"1","D_HolidayFl":"0","D_WeekDayFl":"1","P_PartKey":"178778","P_Name":"turquoise powder","P_MFGR":"MFGR#1","P_Category":"MFGR#11","P_Brand":"MFGR#1125","P_Colour":"beige","P_Type":"STANDARD POLISHED NICKEL","P_Size":"25","P_Container":"JUMBO BAG"} 1 row in set (54.49 sec) mysql> explain select * from json where json_extract(doc,'LO_OrderKey') = 35620 and json_extract(doc,'LO_LineNumber') = 2G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: json type: ALL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows: 5233236 Extra: Using where 1 row in set (0.00 sec)
Using the index view to find a document
mysql> select id, doc from json_idx_v where LO_OrderKey = 35620 and LO_LineNumber = 2G *************************** 1. row *************************** id: 1 doc: {"LO_OrderKey":"35620","LO_LineNumber":"2","LO_CustKey":"6748","LO_PartKey":"178778","LO_SuppKey":"1","LO_OrderDateKey":"19960808","LO_OrderPriority":"3-MEDIUM","LO_ShipPriority":"0","LO_Quantity":"38","LO_ExtendedPrice":"7055726","LO_OrdTotalPrice":"14691804","LO_Discount":"8","LO_Revenue":"6491267","LO_SupplyCost":"111406","LO_Tax":"1","LO_CommitDateKey":"19960909","LO_ShipMode":"REG AIR","C_CustomerKey":"6748","C_Name":"Customer#000006748","C_Address":"RSPYBRlR7RX6 f7J8","C_City":"PERU 5","C_Nation":"PERU","C_Region":"AMERICA","C_Phone":"27-580-967-4556","C_MktSegment":"AUTOMOBILE","S_SuppKey":"1","S_Name":"Supplier#000000001","S_Address":"sdrGnXCDRcfriBvY0KL,i","S_City":"PERU 0","S_Nation":"PERU","S_Region":"AMERICA","S_Phone":"27-989-741-2988","D_DateKey":"19960808","D_Date":"Augest 8, 1996","D_DayOfWeek":"Friday","D_Month":"Augest","D_Year":"1996","D_YearMonthNum":"199608","D_YearMonth":"Aug1996","D_DayNumInWeek":"6","D_DayNumInMonth":"8","D_DayNumInYear":"221","D_MonthNumInYear":"8","D_WeekNumInYear":"32","D_SellingSeason":"Summer","D_LastDayInWeekFl":"0","D_LastDayInMonthFl":"1","D_HolidayFl":"0","D_WeekDayFl":"1","P_PartKey":"178778","P_Name":"turquoise powder","P_MFGR":"MFGR#1","P_Category":"MFGR#11","P_Brand":"MFGR#1125","P_Colour":"beige","P_Type":"STANDARD POLISHED NICKEL","P_Size":"25","P_Container":"JUMBO BAG"} 1 row in set (0.00 sec) mysql> explain select id, doc from json_idx_v where LO_OrderKey = 35620 and LO_LineNumber = 2G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: json_idx type: ref possible_keys: LO_LineNumber_Idx,LO_OrderKey_Idx key: LO_OrderKey_Idx key_len: 9 ref: const rows: 4 Extra: Using index condition; Using where *************************** 2. row *************************** id: 1 select_type: SIMPLE table: json type: eq_ref possible_keys: PRIMARY key: PRIMARY key_len: 8 ref: ssb_json.json_idx.id rows: 1 Extra: NULL 2 rows in set (0.00 sec)
Keeping the index in sync
Flexviews materialized views need to be refreshed when the underlying table changes. Flexviews includes a REFRESH($mvid, $mode, $transaction_id) function.
I am going to remove one document from the table:
mysql> delete from json where id = 10000; Query OK, 1 row affected (0.01 sec)
Note there is now one row in the materialized view change log. dml_type is -1 because it is a delete:
mysql> select * from flexviews.mvlog_f1673fac9814a93508a1c917566ecd4dG *************************** 1. row *************************** dml_type: -1 uow_id: 113 fv$server_id: 33 fv$gsn: 1083 id: 10000 doc: {"LO_OrderKey":"3359521","LO_LineNumber":"2","LO_CustKey":"10306","LO_PartKey":"77997","LO_SuppKey":"4","LO_OrderDateKey":"19951010","LO_OrderPriority":"2-HIGH","LO_ShipPriority":"0","LO_Quantity":"43","LO_ExtendedPrice":"8492457","LO_OrdTotalPrice":"27032802","LO_Discount":"2","LO_Revenue":"8322607","LO_SupplyCost":"118499","LO_Tax":"4","LO_CommitDateKey":"19951228","LO_ShipMode":"FOB","C_CustomerKey":"10306","C_Name":"Customer#000010306","C_Address":"4UR9tz8","C_City":"ROMANIA 5","C_Nation":"ROMANIA","C_Region":"EUROPE","C_Phone":"29-361-986-3513","C_MktSegment":"BUILDING","S_SuppKey":"4","S_Name":"Supplier#000000004","S_Address":"qGTQJXogS83a7MB","S_City":"MOROCCO 4","S_Nation":"MOROCCO","S_Region":"AFRICA","S_Phone":"25-128-190-5944","D_DateKey":"19951010","D_Date":"Octorber 10, 1995","D_DayOfWeek":"Wednesday","D_Month":"Octorber","D_Year":"1995","D_YearMonthNum":"199510","D_YearMonth":"Oct1995","D_DayNumInWeek":"4","D_DayNumInMonth":"10","D_DayNumInYear":"283","D_MonthNumInYear":"10","D_WeekNumInYear":"41","D_SellingSeason":"Fall","D_LastDayInWeekFl":"0","D_LastDayInMonthFl":"1","D_HolidayFl":"0","D_WeekDayFl":"1","P_PartKey":"77997","P_Name":"burnished olive","P_MFGR":"MFGR#2","P_Category":"MFGR#24","P_Brand":"MFGR#2426","P_Colour":"orchid","P_Type":"MEDIUM PLATED TIN","P_Size":"16","P_Container":"WRAP PKG"} 1 row in set (0.01 sec)
Now we can verify the materialized view is out of date:
mysql> select * from json_idx where id = 10000; +----------+------------+---------------+-----------+-----------+-------------+---------------+-------+------------+ | mview$pk | D_DateKey | C_CustomerKey | S_SuppKey | P_PartKey | LO_OrderKey | LO_LineNumber | id | mview$hash | +----------+------------+---------------+-----------+-----------+-------------+---------------+-------+------------+ | 10000 | 1995-10-10 | 10306 | 4 | 77997 | 3359521 | 2 | 10000 | 2937185172 | +----------+------------+---------------+-----------+-----------+-------------+---------------+-------+------------+ 1 row in set (2.60 sec)
To bring the index up to date we must refresh it. Usually you will use the ‘BOTH’ mode to ‘COMPUTE’ and ‘APPLY’ the changes at the same time, but I am going to use COMPUTE mode to show you what ends up in the delta table:
mysql> select * from json_idx_deltaG *************************** 1. row *************************** dml_type: -1 uow_id: 113 fv$gsn: 1083 D_DateKey: 1995-10-10 C_CustomerKey: 10306 S_SuppKey: 4 P_PartKey: 77997 LO_OrderKey: 3359521 LO_LineNumber: 2 id: 10000 mview$hash: 2937185172 1 row in set (0.00 sec)
Delta tables are similar to materialized view change log tables, except they contain insertions and deletions to the view contents. In this case, you can see dml_type is -1 and id = 10000, so the row from the view corresponding to the row we deleted, will be deleted when the change is applied.
Finally the change can be applied:
mysql> call flexviews.refresh(flexviews.get_id('ssb_json','json_idx'), 'APPLY',NULL); Query OK, 2 rows affected (0.47 sec) mysql> select * from json_idx where id = 10000; -- note, we didn't index id in the MV Empty set (2.61 sec)
Finally, it makes sense to try to keep the index in sync as quickly as possible using a MySQL event:
DELIMITER ;; CREATE EVENT IF NOT EXISTS flexviews.refresh_json_idx ON SCHEDULE EVERY 1 SECOND DO BEGIN DECLARE v_got_lock tinyint default 0; SELECT GET_LOCK('JSON_IDX_LOCK', 0) INTO v_got_lock; IF v_got_lock = 1 THEN CALL flexviews.refresh(flexviews.get_id('ssb_json','json_idx'),'BOTH',NULL); SELECT RELEASE_LOCK('JSON_IDX_LOCK') INTO @discard; END IF; END;; DELIMITER ;
So there you have it. A way to index and quickly search through JSON documents and keep the index in sync automatically.