
uniVocity-parsers: A Powerful CSV/TSV/Fixed-width File Parser Library for Java

uniVocity-parsers is an open-source CSV/TSV/fixed-width file parser library for Java. It offers a simplified API for reading and writing files, along with the powerful features described below.

Unlike other libraries out there, uniVocity-parsers built its own architecture for parsing text files, which
focuses on maximum performance and flexibility while making it easy to extend and build new parsers.

Contents

1. Overview
2. Installation
3. Features Overview
4. Reading CSV/TSV/Fixed-width Files
5. Writing CSV/TSV/Fixed-width Files
6. Performance and Flexibility
7. Design and Implementations

1. Overview

I'm a Java developer working on a web-based system that evaluates telecommunication carriers' networks and generates reports. The system relies heavily on CSV for network-related data, such as the real-time network status (online/offline) of broadband subscribers and the real-time traffic of each subscriber. A single CSV file generally exceeds 1 GB and contains millions of rows. We were using the JavaCSV library as our CSV parser. As the capacity of the carriers' networks grew and the period our system monitors lengthened, the volume of CSV data increased sharply. My team and I had to find a solution that offered better performance (ideally processing CSV files within seconds) and better extensibility for more customized functionality.

After a lot of testing and analysis we settled on uniVocity-parsers, and we found it great. In addition to better performance and extensibility, the library provides developers with simplified APIs, detailed documentation and tutorials, and commercial support for highly customized functionality. The project is hosted on GitHub, with 62 stars and 8 forks at the time of writing. Extensive documentation and tutorials are provided here and here, and you can find more examples and news here as well. In addition, the well-known open-source project Apache Camel integrates uniVocity-parsers for reading and writing CSV/TSV/fixed-width files. Find more details here.

2. Installation

I'm using version 1.5.1, but check the official download page to see whether a more recent version is available. The project is also available in the Maven Central repository, so you can add this to your pom.xml:
<dependency>
    <groupId>com.univocity</groupId>
    <artifactId>univocity-parsers</artifactId>
    <version>1.5.1</version>
</dependency>

3. Features Overview

uniVocity-parsers provides a broad set of features intended to cover most requirements you might have when processing tabular data.
The following overview chart summarizes them:

4. Reading CSV/TSV/Fixed-width Files

Read all rows of a CSV file:

CsvParser parser = new CsvParser(new CsvParserSettings());
List<String[]> allRows = parser.parseAll(getReader("/examples/example.csv"));

For the full list of examples of the reading features, refer to: https://github.com/uniVocity/univocity-parsers#reading-csv

5. Writing CSV/TSV/Fixed-width Files

Write data in CSV format with just 2 lines of code:

List<String[]> rows = someMethodToCreateRows();

CsvWriter writer = new CsvWriter(outputWriter, new CsvWriterSettings());
writer.writeRowsAndClose(rows);

For the full list of examples of the writing features, refer to: https://github.com/uniVocity/univocity-parsers/blob/master/README.md#writing

6. Performance and Flexibility

Here is the performance comparison between uniVocity-parsers and JavaCSV that we measured in our system:

File size                 JavaCSV parsing time   uniVocity-parsers parsing time
10 MB, 145453 rows        1138 ms                836 ms
100 MB, 809008 rows       23 s                   6 s
434 MB, 4499959 rows      91 s                   28 s
1 GB, 23803502 rows       245 s                  70 s
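
To make the gap in the table easier to read, the timings can be converted into speedup factors (JavaCSV duration divided by uniVocity-parsers duration). The following is only a minimal sketch: the class and method names are made up for illustration, and the numbers are simply the table values converted to milliseconds.

```java
import java.util.Locale;

// Derives speedup factors from the timings reported in the table above.
final class SpeedupTable {

    // Speedup = JavaCSV duration / uniVocity-parsers duration (both in the same unit).
    static double speedup(double javaCsvMillis, double univocityMillis) {
        return javaCsvMillis / univocityMillis;
    }

    public static void main(String[] args) {
        String[] files = {"10 MB", "100 MB", "434 MB", "1 GB"};
        double[] javaCsv = {1138, 23_000, 91_000, 245_000};  // milliseconds
        double[] univocity = {836, 6_000, 28_000, 70_000};   // milliseconds
        for (int i = 0; i < files.length; i++) {
            System.out.printf(Locale.ROOT, "%-6s %.2fx%n",
                    files[i], speedup(javaCsv[i], univocity[i]));
        }
    }
}
```

With these inputs the factor is about 1.36x for the 10 MB file and between 3.25x and 3.83x for the three larger files, so in our measurements the advantage was much more visible on large inputs.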

Here are some performance comparison tables covering almost all existing CSV parser libraries.
They show that uniVocity-parsers is significantly ahead of the other libraries in performance.


uniVocity-parsers achieves its performance and flexibility goals through the following mechanisms:


  • Reading the input on a separate thread (enable it by invoking CsvParserSettings.setReadInputOnSeparateThread())
  • A concurrent row processor (see ConcurrentRowProcessor, which implements RowProcessor)
  • Extending ColumnProcessor to process columns with your own business logic
  • Extending RowProcessor to read rows with your own business logic
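
To illustrate why reading the input on a separate thread can help, here is a rough, library-independent sketch of the general producer/consumer idea. It is not how uniVocity-parsers is implemented internally: the class name, the queue size, and the naive comma split are all assumptions made for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Illustrative sketch only; this is NOT the library's internal implementation.
final class SeparateThreadReaderSketch {

    static List<String[]> parse(Reader input) {
        // Bounded queue: the reader thread blocks when the parser falls behind.
        BlockingQueue<String> lines = new ArrayBlockingQueue<>(1024);
        AtomicReference<IOException> failure = new AtomicReference<>();

        // Producer: does nothing but read lines and hand them over.
        Thread producer = new Thread(() -> {
            try (BufferedReader reader = new BufferedReader(input)) {
                String line;
                while ((line = reader.readLine()) != null) {
                    lines.put(line);
                }
            } catch (IOException e) {
                failure.set(e);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "input-reader");
        producer.start();

        // Consumer (calling thread): splits lines while the producer keeps reading.
        List<String[]> rows = new ArrayList<>();
        try {
            while (true) {
                String line = lines.poll(10, TimeUnit.MILLISECONDS);
                if (line != null) {
                    rows.add(line.split(",", -1)); // naive split: no quoting or escaping
                } else if (!producer.isAlive() && lines.isEmpty()) {
                    break; // producer finished and every queued line was consumed
                }
            }
        } catch (InterruptedException e) {
            producer.interrupt();
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while parsing", e);
        }
        if (failure.get() != null) {
            throw new UncheckedIOException(failure.get());
        }
        return rows;
    }
}
```

The point of the design is that I/O and parsing overlap instead of alternating. With the library itself you would not write this code; you would simply enable the setting named in the first bullet.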

7. Design and Implementations

A set of processors forms the core modules of uniVocity-parsers. They are responsible for reading and writing data in
rows and columns, and for performing data conversions.

Here is the diagram of the processors:



You can easily create your own processors by implementing the RowProcessor interface or by extending the provided implementations.
In the following example I simply used an anonymous class:

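import com.univocity.parsers.common.ParsingContext;
import com.univocity.parsers.common.processor.RowProcessor;
import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;
import java.io.FileReader;
import java.util.List;

CsvParserSettings settings = new CsvParserSettings();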

settings.setRowProcessor(new RowProcessor() {

    /**
    * initialize whatever you need before processing the first row, with your own business logic
    **/
    @Override
    public void processStarted(ParsingContext context) {
        System.out.println("Started to process rows of data.");
    }

    // Accumulates the tab-separated values of every row until parsing ends
    StringBuilder stringBuilder = new StringBuilder();

    /**
    * process the row with your own business logic
    **/
    @Override
    public void rowProcessed(String[] row, ParsingContext context) {
        System.out.println("The row in line #" + context.currentLine() + ": ");
        for (String col : row) {
            stringBuilder.append(col).append("\t");
        }
        stringBuilder.append("\n"); // keep each row on its own line in the final output
    }

    /**
    * After all rows were processed, perform any cleanup you need
    **/
    @Override
    public void processEnded(ParsingContext context) {
        System.out.println("Finished processing rows of data.");
        System.out.println(stringBuilder);
    }
});

CsvParser parser = new CsvParser(settings);
List<String[]> allRows = parser.parseAll(new FileReader("/myFile.csv"));
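
The same callback design also supports column-oriented processing, where values are grouped by column rather than handled row by row. The sketch below shows that idea in plain Java. Note that RowHandler and ColumnCollector are hypothetical names invented for this illustration and are not part of the uniVocity-parsers API; the sketch also assumes that the first row holds unique header names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a row callback; not part of uniVocity-parsers.
interface RowHandler {
    void rowProcessed(String[] row);
}

// Treats the first row as headers and groups every later value under its column name.
// Assumption: header names are unique.
final class ColumnCollector implements RowHandler {

    private String[] headers;
    private final Map<String, List<String>> columns = new LinkedHashMap<>();

    @Override
    public void rowProcessed(String[] row) {
        if (headers == null) {
            // First row: remember the headers and create one empty list per column.
            headers = row.clone();
            for (String header : headers) {
                columns.put(header, new ArrayList<>());
            }
            return;
        }
        // Rows shorter than the header contribute null for the missing columns.
        for (int i = 0; i < headers.length; i++) {
            columns.get(headers[i]).add(i < row.length ? row[i] : null);
        }
    }

    Map<String, List<String>> getColumns() {
        return columns;
    }
}
```

Feeding it the rows {"id", "status"}, {"1", "online"} and {"2", "offline"} leaves the "status" column holding "online" and "offline", in that order. In a real project you would more likely extend the library's own column processor, mentioned in the previous section, than write this by hand.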

The library offers many more features. I recommend taking a look at it, as it really made a difference in our project.