Статьи

Pipes: структура потока данных для Gremlin — GraphDB Traversal

Pipes — это структура потока данных, разработанная TinkerPop . Обхода графа язык Гремлин является Groovy основанное предметно-ориентированный язык для обработки BluePrints -Enabled баз данных графа с трубами. С момента выпуска Pipes 0.7 1 августа 2011 года большая часть функциональности в Gremlin была обобщена и доступна через Pipes. Это открыло двери для других языков JVM (например, JRuby , Jython , Clojureи т. д.) в качестве основных языков для DSL с обходом графа. Чтобы продвинуть это направление, этот пост расскажет о Трубе с точки зрения Гремлин.

Пример набора данных графика

График представляет собой структуру данных из вершин и ребер. Свойство граф является популярным термином для модели данных графа , который имеет меченые края и произвольные пары ключа / значение , ассоциированные с обеими вершинами и ребрами (далее здесь ). На языке теории графов граф свойств известен как направленный, атрибутивный, многореляционный граф.

Все примеры обхода графа, представленные в этом посте, используют диаграмму, представленную выше. Минимально расширенная версия этого набора данных распространяется с Gremlin и доступна в этом месте (представлено в GraphML ). Для читателей, которые загрузили Gremlin 1.2 , можно загрузить этот график и просмотреть каждый из следующих примеров с помощью консоли Gremlin . База данных графиков в памяти TinkerGraph используется повсеместно. Однако обратите внимание, что Gremlin, Pipes и Blueprints в настоящее время работают для следующих поставщиков баз данных графов: Neo4j , OrientDB , DEX и хранилища RDF на основе Sail (например,Стардог ).

	gremlin$ ./gremlin.sh
	 
	         \,,,/
	         (o o)
	-----oOOo-(_)-oOOo-----
	gremlin> g = new TinkerGraph()
	==>tinkergraph[vertices:0 edges:0]
	gremlin> g.loadGraphML('data/graph-example-1.xml')
	==>null
	gremlin> 

Три типа труб

Трубы API представляет собой набор атомных «труб.» Труба — это простая функция, которая отображает входные данные в выходные данные. Вход в канал называется «начало», а выход называется «конец». Общая подпись трубы — Pipe <S, E> . Объединяя трубы в трубу, можно выполнять сложные вычисления. Существует 3 основных типа труб.

  1. transform: Given an object of type S, emit an object of type E (see JavaDoc).
  2. filter: Given an object of type S, emit it or not (see JavaDoc).
  3. sideEffect: Given an object of type S, emit it, but yield some computational side-effect in the process (see JavaDoc).

The sections to follow will explain these general forms with specific examples over the example dataset diagrammed previous.

Transform Pipes

During a transformation, an input of type S is mapped/transformed to an output of type E. Below is a basic example where all the steps/pipes used are transformational. The term “step” is used in Gremlin (see documentation). The term “pipe” is used in Pipes. The two terms are nearly analogous.

	gremlin> g.v(1).out.name
	==>vadas
	==>lop
	==>josh

In Gremlin, g.v(1) is a method call that returns vertex 1 from the graph. That vertex is passed to the out step (OutPipe) which yields vertices 2, 3, and 4. Finally, the name step (PropertyPipe(“name”)) yields the string name properties of those 2, 3, and 4 vertices.

	gremlin> g.v(1).out
	==>v[2]
	==>v[3]
	==>v[4]
	gremlin> g.v(1).out.name
	==>vadas
	==>lop
	==>josh

It is important to note that Pipes is not “set-based” in the sense that vertex 2, 3, and 4 are generated before passing those vertices off to the name step (the next stage in the pipeline). On the contrary, vertex 2 is generated, then vadas is generated. Next vertex 3 is generated and then lop. Finally, vertex 4 is generated and then josh. Pipes is a lazy evaluation framework in that as the pipeline is iterated for results, the mappings are computing as needed.

One of the core methods of a pipe is List Pipe.getPath(). This method returns the transformations that an object underwent through a pipeline. This method is demonstrated using the Gremlin step paths (PathPipe).

	gremlin> g.v(1).out.name.paths
	==>[v[1], v[2], vadas]
	==>[v[1], v[3], lop]
	==>[v[1], v[4], josh]

Note that any Gremlin expression has a string representation that exposes the underlying pipes being used to represent the expression.

	gremlin> g.v(1).out.name.paths.toString()
	==>[OutPipe, PropertyPipe(name), PathPipe] 

Filter Pipes

Filter steps/pipes are used to remove particular objects from the flow of a computation. The example below demonstrates one of the significant new features of Pipes 0.7. In Pipes 0.7, there is a PipeClosure (see JavaDoc) interface which can be used with closure-based pipes to provide dynamic functionality. In the examples below filter and transform are the most generic type of filter- and transformation-based pipes, where the filtering and transformation algorithm are specified with a Groovy/Java code snippet. In essence, the provided closure “closes” the meaning/function of the pipe.

Closures typically appear in languages in which functions are first-class values—in other words, such languages allow functions to be passed as arguments, returned from function calls, bound to variable names, etc., just like simpler types such as strings and integers. — Wikipedia

gremlin> g.v(1).out('knows')
	==>v[2]
	==>v[4]
	gremlin> g.v(1).out('knows').filter{it.age < 30}
	==>v[2]
	gremlin> g.v(1).out('knows').filter{it.age < 30}.name
	==>vadas
	gremlin> g.v(1).out('knows').filter{it.age < 30}.name.transform{it.length()}
	==>5
	gremlin> g.v(1).out('knows').filter{it.age < 30}.name.transform{it.length()}.toString()
	==>[OutPipe([knows]), FilterClosurePipe, PropertyPipe(name), TransformClosurePipe]

This example demonstrates how other JVM languages like JRuby (which support closures) and Clojure (which supports anonymous functions) can take advantage of the Pipes framework. In Gremlin, a Groovy closure is wrapped in a GroovyPipeClosure which makes the Groovy closure behave according to the PipeClosure interface and thus, pluggable into Pipes. Any other JVM language with similar features can capitalize on this same pattern.

Side-Effect Pipes

The third type of pipe is a side-effect pipe. A side-effect pipe implements SideEffectPipe<S,T> and takes an object of type S and emits that object. However, in the process, it yields some computational side-effect that can, for instance, be used later in the traversal computation. That side-effect object is of type T. The example below makes use of the most generic side-effect pipe: SideEffectClosurePipe. The example below calculates vertex 1′s co-developers — that is, those vertices that have worked with vertex 1 on a project. Note that a vertex can not be their own co-developer and thus, sideEffect and filter are used appropriately to express this abstract co-developer relationship.

	gremlin> g.v(1).sideEffect{x=it}
	==>v[1]
	gremlin> g.v(1).sideEffect{x=it}.out('created')
	==>v[3]
	gremlin> g.v(1).sideEffect{x=it}.out('created').in('created')
	==>v[1]
	==>v[4]
	==>v[6]
	gremlin> g.v(1).sideEffect{x=it}.out('created').in('created').filter{it != x}
	==>v[4]
	==>v[6]
	gremlin> g.v(1).sideEffect{x=it}.out('created').in('created').filter{it != x}.toString()
	==>[SideEffectClosurePipe, OutPipe([created]), InPipe([created]), FilterClosurePipe]

In the example above, the variable x stores the current vertex being pushed through the SideEffectClosurePipe. This variable is later accessed by the FilterClosurePipe to filter out vertices that match the source of the traversal. In other words, path traversals that make use of history can be effected using such a pattern (i.e. non-regular graph traversals).

Branch and Meta Pipes

There are two pipe sub-types that should be discussed: branch and meta pipes. A branch pipe is any pipe that alters the flow of a computation based on some decision criteria (see JavaDoc). The ifThenElse step of Gremlin demonstrates the behavior of the closure-based IfThenElsePipe.

	gremlin> g.v(1).out('knows').ifThenElse{it.age < 30}{it.name}{it.out('created').name}          
	==>vadas
	==>ripple
	==>lop

The ifThenElse step takes 3 closures: a condition-closure, a then-closure, and finally, an else-closure. Other branch pipes include copySplit and the respective merging pipes which take multiple branches and yield a single flow: fairMerge and exhaustMerge.

The meta-pipes are perhaps the most bewildering yet most powerful of the pipes in Pipes (see JavaDoc). A meta-pipe is a pipe that maintains pipes internal to it. As such, the meta-pipe can reason on the results of its internal pipes and then effect an appropriate behavior at the meta-pipe level. To better explain the concept of a MetaPipe, backtracking and looping are demonstrated.

	gremlin> g.v(1).out('knows').name
	==>vadas
	==>josh
	gremlin> g.v(1).out('knows').name.filter{it[0]=='v'}
	==>vadas
	gremlin> g.v(1).out('knows').name.filter{it[0]=='v'}.back(2)
	==>v[2]

The above Gremlin code snippet and diagram demonstrate the behavior of back (BackFilterPipe). If an object reaches back, emit the object that it was n steps ago. In the example above, n is 2 and thus, two steps ago from the string vadas was vertex 2. While this linear sequence articulates backtracking, the diagram below better explains how this computation is implemented in Pipes.

 

	gremlin> g.v(1).out('knows').name.filter{it[0]=='v'}.back(2).toString()
	==>[OutPipe([knows]), BackFilterPipe[[PropertyPipe(name), FilterClosurePipe]]]

 While not diagrammed, it is possible to use named-steps instead of integers (e.g. 2) to denote how many steps to back up to. As an expression becomes more complex, using named-steps makes the expression more readable and manageable.

	gremlin> g.v(1).out('knows').as('here').name.filter{it[0]=='v'}.back('here')
	==>v[2]

Finally, to conclude this section, the meta-pipe LoopPipe is explained. LoopPipe is both a meta-pipe and a branch pipe in that is can control the flow of objects and, like BackPipe makes use of internal pipes during its operation.

	gremlin> g.v(1).out.loop(1){it.loops < 3}
	==>v[5]
	==>v[3]

In the example above, the loop step passes the current object to the provided “while”-based condition-closure. If the closure yields true, the object is inserted into the pipe from n steps ago. Moreoever, metadata is appended to the object. In the example above the loops metadata denotes how many times the object path has gone through the loop pipe. Thus, when the object path has gone through 3 times, the condition-closure fails and the object is passed on to the next stage of the computation/pipeline.

	gremlin> g.v(1).out.loop(1){it.loops < 3}.name
	==>ripple
	==>lop

While a linear interpretation is possible, underneath the hood, loop is a meta-pipe. The diagram below better represents this.

 

	gremlin> g.v(1).out.loop(1){it.loops < 3}.name.toString()
	==>[LoopPipe[[OutPipe]], PropertyPipe(name)]

Conclusion

There are numerous pipes that come with the Pipes distribution. These pipes can be mixed and matched to yield lazy graph traversals. A description of all the pipes in Pipes 0.7 is provided in the associated JavaDoc. Note that it is easy to create a pipe by simply extending AbstractPipe and implementing the protected method AbstractPipe.processNextStart(). This is handy for developers wanting to create domain-specific pipes and use them in Pipes-based DSLs like Gremlin. Two trivial examples are provided below.

public class NumCharsPipe extends AbstractPipe<String,Integer> {
	  public Integer processNextStart() {
	    return this.starts.next().length();
	  }
	}
	 
	public class MoreThanFourCharsPipe extends AbstractPipe<String,String> implements FilterPipe<String> {
	  public String processNextStart() {
	    while(true) {
	      String s = this.starts.next();
	      if(s.length() > 4)
	        return s;
	    }
	  }
	}
	gremlin> Gremlin.defineStep('numChars', [Pipe], {new NumCharsPipe()})
	==>null
	gremlin> Gremlin.defineStep('moreThan4Chars', [Pipe], {new MoreThanFourCharsPipe()})
	==>null
	gremlin> g.v(1).out.name.numChars                                               
	==>5
	==>3
	==>4
	gremlin> g.v(1).out.name.moreThan4Chars
	==>vadas

 

NOTE: Gremlin provides an easier way to implement user-defined steps. The example above is provided to demonstrate the boundary between the Java-based Pipes and the Groovy-based Gremlin.

The TinkerPop crew plans to further open up the Pipes data flow model to other JVM languages (for example, see Pacer). Pipes 0.7 introduced the flexible PipeClosure concept which will make it easy for DSL designers to make use of Pipes as the backend engine of their graph traversal language. Moreover, with this model, there is a central point for optimization and feature development which will hopefully ensure a bright future for lazy graph traversal languages implemented on the JVM.

 Source: http://markorodriguez.com/2011/08/03/on-the-nature-of-pipes/