LOAD CSV into Neo4j Quickly and Successfully

Since version 2.1, Neo4j provides out-of-the-box support for CSV ingestion. The LOAD CSV command added to the Cypher query language is a versatile and powerful ETL tool.

It allows you to ingest CSV data from any URL as a friendly stream of parameters for your simple or complex graph update operation, that is, your conversion.

But hear my advice first before you jump right in and use it. There are some settings and configuration aspects you should know about to be successful on your first run.

Data volume: LOAD CSV was built to support around 1M rows per import. It still works with 10M rows, but you will have to wait a bit; at 100M it will try your patience.

Except for tiny datasets, never run it without periodic commits, which prevent large transactions from overflowing the available database memory (JVM heap).
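
In practice that means prefixing your statement with USING PERIODIC COMMIT, which is covered in detail below. A minimal sketch (the file URL and label here are placeholders):

USING PERIODIC COMMIT 1000
LOAD CSV FROM "file:///data.csv" AS line
// commit the imported data after every 1000 rows
CREATE (:Row {value: line[0]});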

The CSV file used in this example is rather simple, but enough to show some of the issues; it lists people and the companies they work for.

PersonName,"Company Name",year
"Kenny Bastani","Neo Technology",2013
"Michael Hunger","Neo Technology",2010
"James Ward","Heroku",2011
"Someone",,
"John","Doe.com","ninetynine"

OS, RAM and DISK

First make sure the machine you want to import on is fit for importing data. That means you should have some RAM available: if your 8 GB machine has only 1 GB of free RAM and is already using 3 GB of swap, that is a sure sign you shouldn't even try. The more RAM the better; I'd recommend reserving at least 4 GB for importing your data.

If you're running on an SSD you're lucky; on spinning disks it will definitely take longer.

Note: If you're on Windows, take extra care. On Windows the database memory (JVM heap) is used BOTH for Neo4j's objects and caches AND for the memory-mapping of the store files. On Unix and OSX the memory-mapping lives outside the heap. So make sure that on Windows enough Java memory is allocated to cover both aspects.

Memory Configuration

For your memory configuration, a sensible starting point is to give Neo4j 3-4 GB of RAM as Java heap (including the memory-mapping size on Windows) and a reasonable amount (e.g. 1 GB+) as memory-mapping.

The more memory you have, the faster it will import your data.

Make sure to edit conf/neo4j-wrapper.conf and set:

# Uncomment these lines to set specific initial and maximum
# heap size in MB.
wrapper.java.initmemory=4096
wrapper.java.maxmemory=4096

And in conf/neo4j.properties set:

# Default values for the low-level graph engine
neostore.nodestore.db.mapped_memory=50M
neostore.relationshipstore.db.mapped_memory=500M
neostore.propertystore.db.mapped_memory=100M
neostore.propertystore.db.strings.mapped_memory=100M
neostore.propertystore.db.arrays.mapped_memory=0M

Data Loading

If you load your CSV file over the network, make sure the connection is fast enough to sustain the ingestion rate you’d like to have. Otherwise, download it if possible and use a file:// URL.

Note: File URLs are tricky. On OSX and Unix use file:///path/to/data.csv; on Windows, please use file:c:/path/to/data.csv. Beware of spaces in file names and of relative paths. HTTP URLs are much easier. As far as I know, LOAD CSV supports cookies, redirects and https.
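
For example, a quick sample from a local file (the paths are placeholders):

// on OSX and Unix
LOAD CSV FROM "file:///path/to/data.csv" AS line
RETURN line LIMIT 3;

// on Windows
LOAD CSV FROM "file:c:/path/to/data.csv" AS line
RETURN line LIMIT 3;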

Data Quality

There are some challenges when loading CSV data from “organic” files.

  1. You might not load the right file. On a local filesystem Cypher will complain if it can’t find it, but will happily ingest your latest movie-trilogy while trying to find the first comma. For remote URLs it will fail if the server returns a 404, but otherwise try to load HTML error pages, which might be fun.

  2. If you have the right file, make sure that it loads correctly, i.e. all the columns and lines are separated by delimiters to your liking. You can easily check that by sampling the first few lines and seeing that they are returned correctly.

Initial check of the first few lines

Does everything look ok, anything that looks dubious? Fix it in your input data!

LOAD CSV FROM "https://gist.githubusercontent.com/jexp/d788e117129c3730a042/raw/a147631215456d86a77edc7ec08c128b9ef05e3b/people_leading_empty.csv"
  AS line
WITH line LIMIT 4
RETURN line
+--------------------------------------------+
| line                                       |
+--------------------------------------------+
| [""]                                       |
| ["PersonName","Company Name","year"]       |
| ["Kenny Bastani","Neo Technology","2013"]  |
| ["Michael Hunger","Neo Technology","2010"] |
+--------------------------------------------+

Caveats are: a leading blank line, wrong or duplicated line breaks.

Check for correct columns

LOAD CSV WITH HEADERS FROM "https://gist.githubusercontent.com/jexp/d788e117129c3730a042/raw/a147631215456d86a77edc7ec08c128b9ef05e3b/people_leading_empty.csv"
  AS line
WITH line LIMIT 5
RETURN line.PersonName, line.Company, line.company, line.`Company Name`
+---------------------------------------------------------------------+
| line.PersonName | line.Company | line.company | line.`Company Name` |
+---------------------------------------------------------------------+
| <null>          | <null>       | <null>       | <null>              |
| <null>          | <null>       | <null>       | <null>              |
| <null>          | <null>       | <null>       | <null>              |
| <null>          | <null>       | <null>       | <null>              |
| <null>          | <null>       | <null>       | <null>              |
+---------------------------------------------------------------------+

A leading empty line will break WITH HEADERS.

LOAD CSV WITH HEADERS FROM "https://gist.githubusercontent.com/jexp/d788e117129c3730a042/raw/c5f98f269b4a219ebb88bd3eb83d331f439a8201/people.csv"
  AS line
WITH line LIMIT 5
RETURN line.PersonName, line.Company, line.company, line.`Company Name`
+--------------------------------------------------------------------------+
| line.PersonName      | line.Company | line.company | line.`Company Name` |
+--------------------------------------------------------------------------+
| "Kenny Bastani"      | <null>       | <null>       | "Neo Technology"    |
| "Michael Hunger"     | <null>       | <null>       | "Neo Technology"    |
| "James Ward"         | <null>       | <null>       | "Heroku"            |
| "Matthew McCullough" | <null>       | <null>       | "GitHub"            |
| "Someone"            | <null>       | <null>       | ""                  |
+--------------------------------------------------------------------------+

This is better, still some misspelled columns.

Note: misspelled column names (the lookup is case-sensitive) and empty columns both yield null.

You can skip null values by adding

WHERE line.value IS NOT NULL

after the LOAD CSV .. WITH line. If you don’t, you might end up with errors in Neo4j’s indexing subsystem when it tries to index null values.

Alternatively, you can provide defaults with

coalesce(value_or_null, value2_or_null, …, default_value).

Filter out NULL values

LOAD CSV WITH HEADERS FROM "https://gist.githubusercontent.com/jexp/d788e117129c3730a042/raw/c5f98f269b4a219ebb88bd3eb83d331f439a8201/people.csv"
  AS line
WITH line

WHERE line.PersonName IS NOT NULL

RETURN line.PersonName, coalesce(line.Company,"None")
+------------------------------------------------------+
| line.PersonName      | coalesce(line.Company,"None") |
+------------------------------------------------------+
| "Kenny Bastani"      | "None"                        |
| "Michael Hunger"     | "None"                        |
| "James Ward"         | "None"                        |
| "Matthew McCullough" | "None"                        |
| "Someone"            | "None"                        |
| "John"               | "None"                        |
+------------------------------------------------------+

Data Conversion

Cypher reads all CSV columns as strings by default. You can use conversion functions like toInt(str), toFloat(str), boolean expressions, split(str, " ") and substring(str, start, count) (e.g. for extracting day, month and year from a date string). Note that the conversion functions return null if they can’t convert, so use the null handling from above to deal with unclean data.

Note: Make sure to use the same conversion for the same value in all places, both when creating nodes and relationships and when using the CSV data to look them up again.
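
A sketch of what consistent conversion looks like (the file URLs and column names are made up for illustration); note the identical toInt call in both the create and the lookup pass:

// first pass: create people, converting the id once
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///people.csv" AS line
CREATE (:Person {id: toInt(line.id), name: line.name});

// second pass: look them up with the SAME conversion
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///friendships.csv" AS line
MATCH (a:Person {id: toInt(line.id1)})
MATCH (b:Person {id: toInt(line.id2)})
CREATE (a)-[:KNOWS]->(b);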

A helpful tip is to do the conversion upfront with WITH toInt(line.number) as number

LOAD CSV WITH HEADERS FROM "https://gist.githubusercontent.com/jexp/d788e117129c3730a042/raw/c5f98f269b4a219ebb88bd3eb83d331f439a8201/people.csv"
  AS line

WITH line, toInt(line.year) as year
RETURN line.PersonName, year
+-------------------------------+
| line.PersonName      | year   |
+-------------------------------+
| "Kenny Bastani"      | 2013   |
| "Michael Hunger"     | 2010   |
| "James Ward"         | 2011   |
| "Matthew McCullough" | 2010   |
| "Someone"            | <null> |
| "John"               | <null> |
+-------------------------------+

Field Separation

You can specify alternative field separators for your data. For a tab-separated file (.tsv) you can use … AS line FIELDTERMINATOR '\t', for semicolons … AS line FIELDTERMINATOR ';'.
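
For example, a quick check of a tab-separated file (the file name is a placeholder):

LOAD CSV WITH HEADERS FROM "file:///people.tsv" AS line FIELDTERMINATOR '\t'
RETURN line LIMIT 3;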

Batch Your Transactions

This is really important

If you import more than 100k elements in a single transaction, it is very likely (depending on your available memory) that you’ll fail. And it might not be a quick death of your operation; it can drag on for quite a while, desperately trying to recover memory to store its intermediate transaction data.

So make sure to ALWAYS prefix your LOAD CSV with USING PERIODIC COMMIT 1000. The number given is the number of input rows after which a commit of the imported data happens. Depending on the complexity of your import operation, you might create from 100 elements per 1000 rows (if you have a lot of duplicates) up to 100,000 when you have complex operations that generate up to 100 nodes and relationships per row of input.

That’s why a commit size of 1000 might be a safe bet (or not).

There is also an issue within Neo4j’s index operations that makes it work better with smaller commit-sizes.

If you use LOAD CSV without any create or update operation, you cannot use PERIODIC COMMIT. And if you use it from within an embedded Java application, make sure to not start a manual transaction beforehand.
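
For instance, a read-only statement like the following sketch would be rejected, because there is nothing to commit periodically:

// fails: PERIODIC COMMIT only applies to updating queries
USING PERIODIC COMMIT 1000
LOAD CSV FROM "file:///people.csv" AS line
RETURN count(*);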

Batch Transactions after every 1000 Rows

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "https://gist.githubusercontent.com/jexp/d788e117129c3730a042/raw/c5f98f269b4a219ebb88bd3eb83d331f439a8201/people.csv"
  AS line
CREATE (:Person {name:line.PersonName})
Nodes created: 6
Properties set: 6

Indexing

Usually you import your nodes and relationships in one step, creating connections as you process each line. As most CSV files represent denormalized tables, you’ll get a lot of duplication in them, especially for the joined entity tables (which will probably be converted to nodes). So you want to use either MERGE or MATCH to look up nodes in your graph database, in order to connect or update them in a later step.

Make sure you create the necessary indexes and constraints upfront, so that both operations can utilize them and look up values really quickly.

Creating Indexes and Constraints Upfront

This example shows a multi-step import where people are created upfront from one source, and later on only looked up in order to connect them to merged Companies.

CREATE CONSTRAINT ON (c:Company) ASSERT c.name IS UNIQUE;
CREATE INDEX ON :Person(name);

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "https://gist.githubusercontent.com/jexp/d788e117129c3730a042/raw/c5f98f269b4a219ebb88bd3eb83d331f439a8201/people.csv"
  AS line
CREATE (p:Person {name:line.PersonName});

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "https://gist.githubusercontent.com/jexp/d788e117129c3730a042/raw/c5f98f269b4a219ebb88bd3eb83d331f439a8201/people.csv"
  AS line
MATCH (p:Person {name:line.PersonName})
MERGE (c:Company {name:line.`Company Name`})
CREATE (p)-[:WORKS_FOR]->(c);

Test First

Use the same sampling approach as before, but now only take the first 10k or 100k lines and import them. Try importing into a throwaway test database with the neo4j-shell (see below).
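
Such a test run over just the first 10k rows of the sample file from above might look like this:

LOAD CSV WITH HEADERS FROM "https://gist.githubusercontent.com/jexp/d788e117129c3730a042/raw/c5f98f269b4a219ebb88bd3eb83d331f439a8201/people.csv"
  AS line
WITH line LIMIT 10000
CREATE (:Person {name:line.PersonName});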

If that goes well, remove the added data or clean the database (by deleting the db-directory).

Use the Neo4j-Shell for larger Imports

The Neo4j Browser is great for quick query and playing around with your import data, but if you really want to import millions of rows, go with the Neo4j shell.

If you downloaded the zip distribution of Neo4j, the shell can be found in bin/neo4j-shell (bin\Neo4jShell.bat on Windows). By default it connects to a running server but you can also specify a dedicated database directory with the -path people.db parameter. With -config conf/neo4j.properties you’d use the same config as the Neo4j server for that db-directory.

For importing lots of data you should probably edit the shell script and change the line EXTRA_JVM_ARGUMENTS="-Xmx4G -Xms4G -Xmn1G" to provide sensible Java heap settings.

You can add your import statements (including index creation) to a Cypher script and execute it with -file import.cql. Don’t forget the semicolons at the end of your statements.

If you run into errors, you might try export STACKTRACES=true and re-run the command to have a good error message to share when asking on StackOverflow.

The Import Query in all its Beauty

Clean out the database for my final import.

MATCH (n)
WITH n LIMIT 10000
OPTIONAL MATCH (n)-[r]->()
DELETE n,r;

CREATE CONSTRAINT ON (c:Company) ASSERT c.name IS UNIQUE;
CREATE INDEX ON :Person(name);

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "https://gist.githubusercontent.com/jexp/d788e117129c3730a042/raw/1bd8c19bf8b49d9eb7149918cc11a34faf996dd8/people.tsv"
  AS line FIELDTERMINATOR '\t'

WITH line, toInt(line.year) as year, coalesce(line.`Company Name`,"None") as company

WHERE year IS NOT NULL

MERGE (p:Person {name:line.PersonName})
MERGE (c:Company {name:company})
CREATE (p)-[r:WORKS_FOR {since:year}]->(c)
RETURN p,r,c

Ready to go?

That’s it. If you hit any issue I haven’t covered here, please don’t hesitate to reach out to me, either by commenting below or by dropping me an email to michael at neo4j.org.

Real World Example

I just went to http://www.mockaroo.com to generate a CSV sample. I chose a last-name and country tuple, generated 1M rows of them (18MB of data), and put the file in my public Dropbox folder.

The data looks like this.

last_name,country
Fuller,Saint Vincent and the Grenadines
Lynch,Israel
Crawford,Iceland
Fowler,Belgium
... 1M lines ...
Walker,Mali
Wilson,Turkey
Hart,Saint Vincent and the Grenadines
Fowler,Nigeria

Then I checked the CSV as outlined above, and used bin/neo4j-shell -path people.db -config conf/neo4j.properties -file import-names.cql

CREATE CONSTRAINT ON (c:Country) ASSERT c.name IS UNIQUE;
CREATE INDEX ON :Person(name);

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "https://dl.dropboxusercontent.com/u/14493611/name_country.csv"
  AS line

WITH line
WHERE line.last_name IS NOT NULL and line.country IS NOT NULL

MERGE (p:Person {name:line.last_name})
MERGE (c:Country {name:line.country})
CREATE (p)-[r:LIVES_IN]->(c);

The output is shown here; you can also look at the full log.

+-------------------+
| No data returned. |
+-------------------+
Nodes created: 499
Relationships created: 1000000
Properties set: 499
Labels added: 499
119200 ms
