Sometimes you can't see the forest for the trees. But if you can, you're probably using a graph database.
Trees are among the simpler graph data structures; they are directed acyclic graphs (DAGs).
For our example we use a time tree that we want to import into the database.
Data Volume
A quick solver script later (thanks Mark), we know how many nodes and rels (nodes - 1) we will have to import to represent a full year down to the second level.
1 year = 12 months = 365 days = 8,760 hours = 525,600 minutes = 31,536,000 seconds
So we will have to import around 32M nodes and 32M relationships. Sounds good enough.
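If you don't have a solver at hand, the same arithmetic can be done with a throwaway Cypher statement (just a sketch, not part of the original import):

// per-level node counts for a single (non-leap) year
RETURN 12                 AS months,
       365                AS days,
       365 * 24           AS hours,
       365 * 24 * 60      AS minutes,
       365 * 24 * 60 * 60 AS seconds;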
These numbers also let us configure the memory mapping, so that we can match it to the expected store file sizes and get a smooth import.
Let's ask the solver again:
32M nodes x 14 bytes in MB = 448 MB
32M rels x 38 bytes in MB = 1,216 MB
For the properties I allow 500 MB for good measure. So our neo4j.properties will look like this:
neostore.nodestore.db.mapped_memory=500M
neostore.relationshipstore.db.mapped_memory=1200M
neostore.propertystore.db.mapped_memory=500M
Start the import with a 4 GB heap on Unix/Mac and a 6.5 GB heap on Windows (remember that on Windows the memory mapping lives inside the heap), either against a running server by simply calling bin/neo4j-shell, or offline against a standalone test database:
bin/neo4j-shell -path time.db -config conf/neo4j.properties
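The command above is the offline variant against a standalone store directory. To target an already running server instead (assuming its remote shell is enabled), you would start the shell without the -path option, something like:

# connect to a locally running Neo4j server (remote shell enabled)
bin/neo4j-shell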
Root, Stem and the First Branches
My friend and colleague Rik Van Bruggen started this experiment and created the following Cypher queries to import the data.
We import the data using the Neo4j-Shell, but you can also do it one statement at a time with the Neo4j-Browser.
The first part of the tree (8k nodes) can be imported quickly in one go; it takes about 5 seconds to parse and execute.
BEGIN

//Let's create the top of the tree, the Century
CREATE (century:Century {id:21});

//Let's create 100 children of the Century, Years
MATCH (century:Century {id:21})
WITH range(2000,2099) as YEARS, century
FOREACH (year in YEARS | CREATE (:Year {id:year})-[:PART_OF]->(century) );

//for every Year, let's connect 12 children at a 3rd level, the Months
MATCH (year:Year {id:2014})
WITH range(1,12) as MONTHS, year
FOREACH (month in MONTHS | CREATE (:Month {id:month})-[:PART_OF]->(year) );

//for every Month, connect 30 Days, and add one day for the longer months
MATCH (month:Month)
WITH range(1,30) as DAYS, month
FOREACH (day in DAYS | CREATE (:Day {id:day})-[:PART_OF]->(month))
WITH month WHERE month.id in [1,3,5,7,8,10,12]
CREATE (:Day {id:31})-[:PART_OF]->(month);

// remove 2 days from February
MATCH (day:Day)-[r:PART_OF]->(month:Month)
WHERE month.id=2 and day.id in [29,30]
DELETE r,day;

//for every Day, connect 24 different Hours (0-23)
MATCH (day:Day)
WITH range(0,23) as HOURS, day
FOREACH (hour in HOURS | CREATE (:Hour {id:hour})-[:PART_OF]->(day) );

COMMIT
First levels of the graph
For the 500k minute nodes our heap (4GB) is also large enough, so let’s try it:
//for every Hour, connect 60 minutes
MATCH (hour:Hour)
FOREACH (minute in range(0,59) |
  CREATE (:Minute {id:minute})-[:PART_OF]->(hour));
Nodes created: 525600
Relationships created: 525600
Properties set: 525600
Labels added: 525600
14044 ms
The import takes 14 seconds. As we have 60 times as many seconds as minutes, we can estimate that the import of the seconds would take 60 x 14 s = 840 s, i.e. about 14 minutes.
Raking in the Foliage
Next up is the real challenge of this exercise. Importing 32M nodes (and as many relationships) into the transactional graph takes a while, so you might grab a coffee or watch two episodes of our family's favorite show, Tom and the strawberry jam bread with honey.
We’re using PERIODIC COMMIT here to internally commit every 50k updates, so that we don’t overflow our heap with one huge transaction. We discuss its fate a bit further down.
//for every Minute, connect 60 seconds
USING PERIODIC COMMIT 50000
WITH range(0,59) as SECONDS
MATCH (minute:Minute)
FOREACH (second in SECONDS |
  CREATE (:Second {id:second})-[:PART_OF]->(minute));
Nodes created: 31536000
Relationships created: 31536000
Properties set: 31536000
Labels added: 31536000
742248 ms
In reality it was a bit faster and took only about 740 s = 12 minutes. That gives us 12 minutes per 32M nodes and 32M rels, i.e. roughly 45k pairs or 90k elements per second. Not bad for transactional inserts with Cypher.
Chunking Inserts
After Milestone 1 this gets a bit more tedious: PERIODIC COMMIT is now tied to LOAD CSV. As we can't use it on its own anymore, we have to create individual statements, each of which should insert about 50k pairs. For 32M pairs to insert that means 640 statements. If we have enough heap, we can increase the size per transaction to about 500k, which leaves us with 64 statements to execute.
We could do that by inserting six days at a time. You have to run this repeatedly and increment the day range by 6 for every run.
//for every Minute, connect 60 seconds
WITH range(0,59) as SECONDS, range(1,6) as DAYS
MATCH (:Month {id:1})<-[:PART_OF]-(d:Day)<-[:PART_OF]-(:Hour)<-[:PART_OF]-(minute:Minute)
WHERE d.id IN DAYS
FOREACH (second in SECONDS |
  CREATE (:Second {id:second})-[:PART_OF]->(minute));
Nodes created: 518400
Relationships created: 518400
Properties set: 518400
Labels added: 518400
16623 ms
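The next run would then just move the window; a sketch for days 7 to 12 of the same month (once a month is done, you move on to the next Month id):

//for every Minute of days 7-12 of January, connect 60 seconds
WITH range(0,59) as SECONDS, range(7,12) as DAYS
MATCH (:Month {id:1})<-[:PART_OF]-(d:Day)<-[:PART_OF]-(:Hour)<-[:PART_OF]-(minute:Minute)
WHERE d.id IN DAYS
FOREACH (second in SECONDS |
  CREATE (:Second {id:second})-[:PART_OF]->(minute));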
Cheating, Handle with Care
Or we can cheat and use PERIODIC COMMIT with LOAD CSV without actually loading any data, by pointing it at a CSV file that contains only a single row. That file can be something hosted in a GitHub Gist or a single-row file on your filesystem.
Important: Please only use this if you know what you're doing. PERIODIC COMMIT ignores the query structure and simply commits internally after N updated elements, be they nodes, relationships, properties or labels. So the commit will very probably happen somewhere in the middle of your statement. If something fails in between, you will have little chance to recover from that state unless you use additional flags (labels or properties) to track your progress.
//for every Minute, connect 60 seconds
USING PERIODIC COMMIT 50000
LOAD CSV FROM "http://gist.github.com/jexp/xxx/one-row.csv" AS dummy
WITH range(0,59) as SECONDS
MATCH (minute:Minute)
FOREACH (second IN SECONDS |
  CREATE (:Second {id:second})-[:PART_OF]->(minute));
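If you want the progress tracking mentioned in the warning above, one possible sketch (the :Done label is made up for illustration) is to flag every minute that has already been processed, so a failed run can simply be re-executed:

//only process minutes that have not been flagged yet
USING PERIODIC COMMIT 50000
LOAD CSV FROM "http://gist.github.com/jexp/xxx/one-row.csv" AS dummy
WITH range(0,59) as SECONDS
MATCH (minute:Minute) WHERE NOT minute:Done
SET minute:Done
FOREACH (second IN SECONDS |
  CREATE (:Second {id:second})-[:PART_OF]->(minute));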
Queries
Ok, now we’ve imported this giant of a tree, what can we do with it?
Here are a few graph queries that navigate along the tree.
Drill down to hours:
MATCH (y:Year {id:2014})<-[:PART_OF*3]-(hr)
RETURN y,count(*);

+-------------------------------+
| y                  | count(*) |
+-------------------------------+
| Node[128]{id:2014} | 8760     |
+-------------------------------+
1 row
63 ms
How many minutes do the odd months have?
MATCH (y:Year {id:2014})<-[:PART_OF]-(m)<-[:PART_OF]-(d)<-[:PART_OF]-(hr)<-[:PART_OF]-(min)
WHERE m.id % 2 = 1
RETURN y.id,m.id,count(*)
ORDER BY y.id, m.id;

+------------------------+
| y.id | m.id | count(*) |
+------------------------+
| 2014 | 1    | 44640    |
| 2014 | 3    | 44640    |
| 2014 | 5    | 44640    |
| 2014 | 7    | 44640    |
| 2014 | 9    | 43200    |
| 2014 | 11   | 43200    |
+------------------------+
6 rows
1169 ms
Find the path to a certain day in the tree
MATCH (y:Year {id:2014})<-[:PART_OF]-(m {id:12})<-[:PART_OF]-(d {id:24})
RETURN y.id,m.id,d.id;

+--------------------+
| y.id | m.id | d.id |
+--------------------+
| 2014 | 12   | 24   |
+--------------------+
1 row
3 ms
Grow a branch at a time: On-demand creation
This blog post showed how to generate a huge tree in the graph. In many cases pre-generating is not needed though, as you can create the tree on the fly while you are inserting data that’s connected to the leaves of the tree.
The MERGE clause helps a lot here. Besides uniquely creating individual nodes and relationships between existing nodes, it can also create unique subgraphs.
That means only the starting point of your subgraph is taken into consideration for the lookup; the remainder of the path is created on demand.
For the whole time-tree it would look like this:
MERGE (y:Year {id:2014})
MERGE (y)<-[:PART_OF]-(m:Month {id:1})
MERGE (m)<-[:PART_OF]-(d:Day {id:13})
MERGE (d)<-[:PART_OF]-(h:Hour {id:18})
MERGE (h)<-[:PART_OF]-(min:Minute {id:35})
MERGE (min)<-[:PART_OF]-(s:Second {id:59})
RETURN y.id,m.id,d.id,h.id,min.id,s.id;

+---------------------------------------------+
| y.id | m.id | d.id | h.id | min.id | s.id |
+---------------------------------------------+
| 2014 | 1    | 13   | 18   | 35     | 59   |
+---------------------------------------------+
1 row
Nodes created: 6
Relationships created: 5
Properties set: 6
Labels added: 6
7 ms

// executed a second time

+---------------------------------------------+
| y.id | m.id | d.id | h.id | min.id | s.id |
+---------------------------------------------+
| 2014 | 1    | 13   | 18   | 35     | 59   |
+---------------------------------------------+
1 row
4 ms
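This is also where you would attach your actual data. As a sketch (the Event label and HAPPENED_AT relationship type are made up for illustration), the same statement can hook an event onto the second it belongs to:

//create (or find) the path on demand and attach a hypothetical event to its second
MERGE (y:Year {id:2014})
MERGE (y)<-[:PART_OF]-(m:Month {id:1})
MERGE (m)<-[:PART_OF]-(d:Day {id:13})
MERGE (d)<-[:PART_OF]-(h:Hour {id:18})
MERGE (h)<-[:PART_OF]-(min:Minute {id:35})
MERGE (min)<-[:PART_OF]-(s:Second {id:59})
CREATE (e:Event {name:"something happened"})-[:HAPPENED_AT]->(s)
RETURN e;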
This allows us to handle much more sparsely filled trees, as they only contain the paths from the root to the leaves that are actually needed.
So we can handle even more levels, down to milli-, nano- or microseconds.
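One more MERGE per level is all it takes; a sketch for a hypothetical Millisecond level below Second could look like this:

//extend the on-demand path one level further down (Millisecond is a made-up label)
MERGE (y:Year {id:2014})
MERGE (y)<-[:PART_OF]-(m:Month {id:1})
MERGE (m)<-[:PART_OF]-(d:Day {id:13})
MERGE (d)<-[:PART_OF]-(h:Hour {id:18})
MERGE (h)<-[:PART_OF]-(min:Minute {id:35})
MERGE (min)<-[:PART_OF]-(s:Second {id:59})
MERGE (s)<-[:PART_OF]-(ms:Millisecond {id:123})
RETURN ms;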