
How to retrieve/extract metadata information from audio files using Java and the Apache Tika API?

It has been a long time since my last post. This time I'm writing about the Apache Tika API, which I have been working closely with while trying to retrieve/extract metadata from the audio formats it supports: .mp3, .aiff, .au, .midi, and .wav.

To make things clear, here is a screenshot of the information Windows Vista presents about an audio file:

We wanted to extract this information using Java, and some googling showed that Apache Tika would help. We needed the metadata to index audio files so that they can be searched in a search application we are building with Apache Lucene.

Here is a sample Java program that extracts metadata from an mp3 file:

package singz.samples.search.audio.metadata;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.mp3.Mp3Parser;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/**
 * Extract metadata of an audio file using Apache Tika API
 *
 * @author Singaram Subramanian
 */
public class AudioMetadataExtractorDemo {

    public static void main(String[] args) {

        // This audio file has metadata embedded in XMP (Extensible Metadata Platform) standard
        // created by Adobe Systems Inc. XMP standardizes the definition, creation, and
        // processing of extensible metadata.

        String audioFileLoc = "C:\\Pop\\BackstreetBoys_ShowMeTheMeaningOfBeingLonely.mp3";

        try {

            InputStream input = new FileInputStream(new File(audioFileLoc));
            ContentHandler handler = new DefaultHandler();
            Metadata metadata = new Metadata();
            Parser parser = new Mp3Parser();
            ParseContext parseCtx = new ParseContext();
            parser.parse(input, handler, metadata, parseCtx);
            input.close();

            // List all metadata
            String[] metadataNames = metadata.names();

            for (String name : metadataNames) {
                System.out.println(name + ": " + metadata.get(name));
            }

            // Retrieve the necessary info from metadata
            // Names - title, xmpDM:artist etc. - mentioned below may differ based
            // on the standard used for processing and storing standardized and/or
            // proprietary information relating to the contents of a file.

            System.out.println("Title: " + metadata.get("title"));
            System.out.println("Artists: " + metadata.get("xmpDM:artist"));
            System.out.println("Genre: " + metadata.get("xmpDM:genre"));

        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } catch (SAXException e) {
            e.printStackTrace();
        } catch (TikaException e) {
            e.printStackTrace();
        } 
    }
}

Maven POM XML

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
	<modelVersion>4.0.0</modelVersion>

	<groupId>singz.samples.search.audio</groupId>
	<artifactId>AudioMetadataExtractor</artifactId>
	<version>0.0.1</version>
	<packaging>jar</packaging>

	<name>AudioMetadataExtractor</name>
	<url>http://maven.apache.org</url>

	<properties>
		<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
	</properties>

	<dependencies>
		<dependency>
			<groupId>org.apache.tika</groupId>
			<artifactId>tika-core</artifactId>
			<version>0.10</version>
		</dependency>

		<dependency>
			<groupId>org.apache.tika</groupId>
			<artifactId>tika-parsers</artifactId>
			<version>0.10</version>
		</dependency>
	</dependencies>
</project>

Output

xmpDM:releaseDate: 2001
xmpDM:audioChannelType: Stereo
xmpDM:album: Top 100 Pop
Author: Backstreet Boys
xmpDM:artist: Backstreet Boys
channels: 2
xmpDM:audioSampleRate: 44100
xmpDM:logComment: eng
xmpDM:trackNumber: 04
version: MPEG 3 Layer III Version 1
xmpDM:composer: null
xmpDM:audioCompressor: MP3
title: Show Me The Meaning Of Being Lonely
samplerate: 44100
xmpDM:genre: Pop
Content-Type: audio/mpeg
Title: Show Me The Meaning Of Being Lonely
Artists: Backstreet Boys
Genre: Pop
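
As the comments in the program point out, names such as title and xmpDM:artist may differ from file to file, and a missing value comes back as null (see xmpDM:composer in the output). Before indexing, it can help to look up a field under several candidate names and fall back to a default. Here is a minimal sketch of that idea. It is not part of Tika: the class MetadataFieldPicker and the method firstAvailable are hypothetical names, and a plain Map stands in for the Tika Metadata object so the snippet runs without any dependency.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MetadataFieldPicker {

    /**
     * Returns the trimmed value of the first candidate key that maps to a
     * non-blank value, or the fallback when none of the keys is usable.
     */
    public static String firstAvailable(Map<String, String> metadata,
                                        String fallback,
                                        String... candidateKeys) {
        for (String key : candidateKeys) {
            String value = metadata.get(key);
            if (value != null && !value.trim().isEmpty()) {
                return value.trim();
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        // Stand-in for the Tika Metadata object (an assumption made for this
        // sketch). It holds a subset of the name/value pairs from the output
        // above; xmpDM:artist is left out on purpose to exercise the fallback.
        Map<String, String> metadata = new LinkedHashMap<String, String>();
        metadata.put("title", "Show Me The Meaning Of Being Lonely");
        metadata.put("Author", "Backstreet Boys");
        metadata.put("xmpDM:genre", "Pop");
        metadata.put("xmpDM:composer", null);

        System.out.println("Title: " + firstAvailable(metadata, "Unknown", "title"));
        System.out.println("Artist: " + firstAvailable(metadata, "Unknown", "xmpDM:artist", "Author"));
        System.out.println("Composer: " + firstAvailable(metadata, "Unknown", "xmpDM:composer"));
    }
}
```

With real Tika output, the map could be filled by looping over metadata.names() and calling metadata.get(name), exactly as the listing loop in the program does. Which candidate names to try, and in what order, depends on the files you index, so treat the lists above as placeholders.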

About Apache Tika

http://tika.apache.org/index.html

“The Apache Tika™ toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries.”

http://www.lucidimagination.com/devzone/technical-articles/content-extraction-tika#article.tika

“Apache Tika is a content type detection and content extraction framework. Tika provides a general application programming interface that can be used to detect the content type of a document and also parse textual content and metadata from several document formats. Tika does not try to understand the full variety of different document formats by itself but instead delegates the real work to various existing parser libraries such as Apache POI for Microsoft formats, PDFBox for Adobe PDF, Neko HTML for HTML etc.

The grand idea behind Tika is that it offers a generic interface for parsing multiple formats. The Tika API hides the technical differences of the various parser implementations. This means that you don’t have to learn and consume one API for every format you use but can instead use a single API – The Tika API. Internally Tika usually delegates the parsing work to existing parsing libraries and adapts the parse result so that client applications can easily manage variety of formats.

Tika aims to be efficient in using available resources (mainly RAM) while parsing. The Tika API is stream oriented so that the parsed source document does not need to be loaded into memory all at once but only as it is needed. Ultimately, however, the amount of resources consumed is mandated by the parser libraries that Tika uses.

At the time of writing this, Tika directly supports around 30 document formats. See the list of supported document formats. The list of supported document formats is not limited by Tika in any way. In the simplest case you can add support for new document formats by implementing a thin adapter that implements the Parser interface for the new document format.”

About XMP standard

http://en.wikipedia.org/wiki/Extensible_Metadata_Platform

“The Adobe Extensible Metadata Platform (XMP) is a standard, created by Adobe Systems Inc., for processing and storing standardized and proprietary information relating to the contents of a file.

XMP standardizes the definition, creation, and processing of extensible metadata. Serialized XMP can be embedded into a significant number of popular file formats, without breaking their readability by non-XMP-aware applications. Embedding metadata avoids many problems that occur when metadata is stored separately. XMP is used in PDF, photography and photo editing applications.

XMP can be used in several file formats such as PDF, JPEG, JPEG 2000, JPEG XR, GIF, PNG, HTML, TIFF, Adobe Illustrator, PSD, MP3, MP4, Audio Video Interleave, WAV, RF64, Audio Interchange File Format, PostScript, Encapsulated PostScript, and proposed for DjVu. In a typical edited JPEG file, XMP information is typically included alongside Exif and IPTC Information Interchange Model data.”

 

From http://singztechmusings.wordpress.com/2011/10/17/how-to-retrieveextract-metadata-information-from-audio-files-using-java-and-apache-tika-api/