Vertex-centric graph processing: the what and why

17 Dec 2017, 01:41

Vertex-centric graph processing is a new programming abstraction/model for writing graph algorithms. I got to know vertex-centric graph processing because my final year project is centered around this topic. I find it to be an interesting idea that enables large-scale graph processing with ease. In this post, I would like to talk about what vertex-centric graph processing is and why it is useful.

What is vertex-centric graph processing?

Vertex-centric graph processing, as the name suggests, is a new programming abstraction for processing graphs that is centered around the vertices. Traditionally, when we implement a graph algorithm, we can freely iterate through all vertices, all edges, and maybe create some additional data structures (e.g. the predecessor and distance arrays for a shortest path algorithm, a priority queue for Dijkstra’s algorithm). Basically, you have a global view of the graph and everything else. In the vertex-centric model, however, we write algorithms from a single vertex’s point of view (some people call it thinking-like-a-vertex). For any vertex, the only information it has is its neighbor list and its own properties, which varies by algorithm depending on what needs to be stored.

A vertex-centric algorithm consists of a so-called vertex program. This program is executed by each vertex. It can manipulate any local information, send messages to neighboring vertices, and receive messages from them. An algorithm runs iteratively. In each iteration, the vertex program is executed by the vertices, and messages are exchanged between vertices. The algorithm terminates when no messages are sent from any vertex, indicating a halt.

An example with single-source shortest path

Let’s take a look at how the single-source shortest path (SSSP) problem can be solved in the vertex-centric model. First, the property stored on each vertex is its distance to the source vertex. Initially, this value is positive infinity for all vertices except for the source, which is 0. Next, let’s define the vertex program (in pseudocode).

function VertexProgram
    // receive and merge incoming messages
    incoming_msgs <- ReceiveMessages()
    merged_msg <- Reduce(incoming_msgs, MIN)
    
    // update vertex property
    if dist > merged_msg then
        dist <- merged_msg
    end if
    
    // send messages to neighbors
    for each neighbor
        if dist + edge_weight < neighbor_dist then
            SendMessage(neighbor, dist + edge_weight)
        end if
    end for
end function

In this vertex program, each vertex sends its own distance plus the edge weight (for the edge connecting to a particular neighbor) to its neighbors. You can imagine this as telling its neighbors: “hey, checkout this value, it may be lower than what you have”. In the next iteration, each vertex receives a bunch of messages from its neighbors and checks whether any of those values are lower than what it has. If there is a chance to update its current distance, it takes the lowest from the messages and update its distance with that.

The algorithm keeps running and shorter distances from the source keep propagating to more nodes. It will terminate when the shortest distance from source reaches the furthest node from source. The number of iterations equals the number of edges on that longest shortest path.

Why vertex-centric?

After all those explanations, you may be wondering why we need a new programming model, which seems to be more restrictive. You are right, it is indeed more restrictive. However, it makes distributed graph processing much much easier.

First of all, the algorithm can be parallelized because all vertices can execute the vertex program in parallel (in the same iteration).

Second, with a limited set of operations available to the vertex program, distributed graph processing frameworks can be developed. This would enable all kinds of features you want to see in a distributed system, like fault-tolerance. This also improves programmer’s productivity because you only need to write a few lines of code (see the example above) and you don’t have to reinvent the wheels to achieve fault-tolerance and etc. Imagine how many lines of code you have to write if you want to implement a distributed graph algorithm from scratch that has all the properties a distributed system should have!

Third, as a bonus, this can enable some additional optimizations because the computation and communication patterns are constrained.

In conclusion, I think this is a very nice attempt to make distributed graph processing easier. Companies like Facebook and Google are already using such frameworks in production to process large-scale graphs. However, there are still certain problems in these frameworks and much research has to be done to further improve the idea of vertex-centric graph processing.