jsoup HTMLParser и анализ ссылок Dzone с использованием селекторов CSS в Java

Я работал над задачей разбора некоторых веб-сервисов Amazon. Есть много способов разобрать его, используя DOM / SAX / Stax. Все они требуют некоторого количества кодирования. Мне нужно было быстрое решение, и я наконец-то нашел JSoup с открытым исходным кодом HTML Parser (другой HTML-парсер, который мне нравится, это HTMLParser ). В этой статье я собираюсь объяснить, как я собираюсь анализировать HTML-ссылки DZone в Java.

Я буду получать описания всех ссылок в Dzone, используя код

Примечание. Это не лучший способ чтения ссылок из Dzone (вместо этого вы можете использовать RSS-каналы). Этот учебник проведет вас через селекторы CSS для Java

Все запросы на нумерацию страниц DZone выглядят так

http://www.dzone.com/links/?type=html&p=2

я использовал библиотеку Java с открытым исходным кодом, чтобы разобрать это и извлечь текстовое описание ссылки ( суп )

Вот примеры тегов, которые мы имеем в ответе dzone

<a name="link-613399">
</a>

<div class="linkblock frontpage " id="link-613399">
	<div id="thumb_613399" class="thumb">
		<a onmouseup="track(this, 'twitter4j_oauth_on_android', ''); "
		 href="http://www.xoriant.com/blog/mobile-application-development/twitter4j-oauth-on-android.html">
			<img width="120" height="90"
			 src="http://cdn.dzone.com/links/images/thumbs/120x90/613399-1307624607000.jpg"
				class="thumbnail" alt="Link 613399 thumbnail"
				onmouseover="return OLgetAJAX('/links/themes/reader/jsps/nodecoration/thumb-load.jsp?linkId=613399', OLcmdExT1,
 300, 'bigThumbBody');"
				onmouseout="OLclearAJAX(); nd(100);" />
		</a>
	</div>
	<div id="hidden_thumb_613399">

	</div>
	<div class="tools">
	</div>
	<div class="details">
		<div class="vwidget" id="vwidget-613399">
			<a id="upcount-613399" href="#" class="upcount"
				onclick="showLoginDialog(613399, null); return false">7</a>

			<a id="downcount-613399" href="#"
				onclick="showLoginDialog(613399, null); return false;" class="downcount">0</a>
		</div>
		<h3>
			<a onmouseup="track(this, 'twitter4j_oauth_on_android', ''); "
			 href="http://www.xoriant.com/blog/mobile-application-development/twitter4j-oauth-on-android.html"
				rel="bookmark"> Twitter4j OAuth on Android</a>
		</h3>
		<p class="voteblock">
			<a href="/links/users/profile/811805.html">
				<img width="24" height="24"
				 src="http://cdn.dzone.com/links/images/std/avatars/default_24.gif"
					class="avatar" alt="User 811805 avatar" />
			</a>
		</p>
		<p class="fineprint byline">
			<a href="/links/users/profile/811805.html">RituR</a>
			via
			<a href="/links/search.html?query=domain%3Axoriant.com">xoriant.com</a>
		</p>
		<p class="fineprint byline">
			<b>Promoted: </b>
			Jun 08 / 17:27. Views:
			520, Clicks: 266
		</p>
		<p class="description">
			OAuth is an open protocol
			which allows the users to share their private information and assets
			like photos, videos etc. with another site...&nbsp;
			<a href='/links/twitter4j_oauth_on_android.html'>more&nbsp;&raquo;
			</a>
		</p>
		<p class="fineprint stats">
			<a
			 href="http://twitter.com/home?status=RT+%40DZone+%22Twitter4j+OAuth+on+Android%22+http%3A%2F%2Fdzone.com%2FTBxR"
				class="twitter">Tweet</a>
			<a href="/links/twitter4j_oauth_on_android.html" class="comment">0
				Comments</a>
			<span class="linkUnsaved" id="save-link-613399"
				onclick="showLoginDialog(613399); return false;">Save</span>
			<span class="linkUnshared" id="share-link-613399"
				onclick="showLoginDialog(613399); return false;">Share</span>
			Tags:
			<a href="/links/tag/mobile.html" class="tags" rel="tag">mobile</a>
			,
			<a href="/links/tag/standards.html" class="tags" rel="tag">standards</a>
		</p>

	</div>
</div>

Чтобы получить описание, мы должны получить данные из элемента «P» с классом «description», который фактически присутствует в DIV с классом «details». Вот как мы можем сделать это в Java

/**
 *
 */
package com.linkwithweb.parser;

import java.io.File;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

/****************************************************************
 * Description
 * jsoup elements support a CSS (or jquery) like selector syntax to find matching elements, that allows very powerful and robust queries.
 *
 * The select method is available in a Document, Element, or in Elements. It is contextual, so you can filter by selecting from a specific element, or
 * by chaining select calls.
 *
 * Select returns a list of Elements (as Elements), which provides a range of methods to extract and manipulate the results.
 *
 * Selector overview
 * tagname: find elements by tag, e.g. a
 * ns|tag: find elements by tag in a namespace, e.g. fb|name finds <fb:name> elements
 * #id: find elements by ID, e.g. #logo
 * .class: find elements by class name, e.g. .masthead
 * [attribute]: elements with attribute, e.g. [href]
 * [^attr]: elements with an attribute name prefix, e.g. [^data-] finds elements with HTML5 dataset attributes
 * [attr=value]: elements with attribute value, e.g. [width=500]
 * [attr^=value], [attr$=value], [attr*=value]: elements with attributes that start with, end with, or contain the value, e.g. [href*=/path/]
 * [attr~=regex]: elements with attribute values that match the regular expression; e.g. img[src~=(?i)\.(png|jpe?g)]
 * : all elements, e.g. *
 * Selector combinations
 * el#id: elements with ID, e.g. div#logo
 * el.class: elements with class, e.g. div.masthead
 * el[attr]: elements with attribute, e.g. a[href]
 * Any combination, e.g. a[href].highlight
 * ancestor child: child elements that descend from ancestor, e.g. .body p finds p elements anywhere under a block with class "body"
 * parent > child: child elements that descend directly from parent, e.g. div.content > p finds p elements; and body > * finds the direct children of
 * the body tag
 * siblingA + siblingB: finds sibling B element immediately preceded by sibling A, e.g. div.head + div
 * siblingA ~ siblingX: finds sibling X element preceded by sibling A, e.g. h1 ~ p
 * el, el, el: group multiple selectors, find unique elements that match any of the selectors; e.g. div.masthead, div.logo
 * Pseudo selectors
 * :lt(n): find elements whose sibling index (i.e. its position in the DOM tree relative to its parent) is less than n; e.g. td:lt(3)
 * :gt(n): find elements whose sibling index is greater than n; e.g. div p:gt(2)
 * :eq(n): find elements whose sibling index is equal to n; e.g. form input:eq(1)
 * :has(seletor): find elements that contain elements matching the selector; e.g. div:has(p)
 * :not(selector): find elements that do not match the selector; e.g. div:not(.logo)
 * :contains(text): find elements that contain the given text. The search is case-insensitive; e.g. p:contains(jsoup)
 * :containsOwn(text): find elements that directly contain the given text
 * :matches(regex): find elements whose text matches the specified regular expression; e.g. div:matches((?i)login)
 * :matchesOwn(regex): find elements whose own text matches the specified regular expression
 * Note that the above indexed pseudo-selectors are 0-based, that is, the first element is at index 0, the second at 1, etc
 * See the Selector API reference for the full supported list and details.
 *
 * @author Ashwin Kumar
 *
 */
public class HTMLParser {

	/**
	 * @param args
	 */
	public static void main(String[] args) {
		try {
			File input = new File("input/dZoneLinks.xml");
			Document doc = Jsoup.parse(input, "UTF-8",
					"http://www.dzone.com/links/?type=html&p=2");

			Elements descriptions = doc.select("div.details > p.description"); // get all description elements in this HTML file
			/*
			 * Elements pngs = doc.select("img[src$=.png]");
			 * // img with src ending .png
			 *
			 * Element masthead = doc.select("div.masthead").first();
			 */
			// div with

			// Elements resultLinks = doc.select("h3.r > a"); // direct a after h3
			/**
			 * Iterate over all descriptions and display them
			 */
			for (Element element : descriptions) {
				System.out.println(element.ownText());
				System.out.println("--------------");
			}

		} catch (Exception e) {
			e.printStackTrace();
		}
	}

}

Mavenized код был зарегистрирован в SVN в следующем месте

http://code.google.com/p/linkwithweb/source/browse/trunk/Utilities/HTMLParser

Njoy разбор легко с помощью jsoup

От http://ashwinrayaprolu.wordpress.com/2011/06/09/jsoup-htmlparser-and-parsing-dzone-links-using-css-selectors-in-java/

jsoup HTMLParser и анализ ссылок Dzone с использованием селекторов CSS в Java

Категории

Последние статьи

Рефакторинг Hudson God Class

Альтернативы синтаксиса Java лямбда

Morphia и MongoDB: развивающиеся структуры документов

OpenShift Express: развертывание приложения Java EE (с поддержкой AS7)

Интеграция jqGrid, REST, AJAX и Spring MVC