GraphFrames: A Graph Processing Library on Spark (English)
Published: 2019-06-22


An overview of Spark's new GraphFrames, a graph processing library based on DataFrames, built in a collaboration between Databricks, UC Berkeley's AMPLab, and MIT.

By Joseph Bradley, Tim Hunter, Ankur Dave*, Xiangrui Meng, Databricks, *UC Berkeley AMPLab.

Databricks is excited to announce the release of GraphFrames, a graph processing library for Apache Spark. Collaborating with UC Berkeley and MIT, we have built a graph library based on DataFrames. GraphFrames benefit from the scalability and high performance of DataFrames, and they provide a uniform API for graph processing available from Scala, Java, and Python.

What are GraphFrames?

GraphFrames support general graph processing, similar to Apache Spark’s GraphX library. However, GraphFrames are built on top of Spark DataFrames, resulting in some key advantages:

  • Python, Java & Scala APIs: GraphFrames provide uniform APIs for all 3 languages. For the first time, all algorithms in GraphX are available from Python & Java.

  • Powerful queries: GraphFrames allow users to phrase queries in the familiar, powerful APIs of Spark SQL and DataFrames.

  • Saving & loading graphs: GraphFrames fully support DataFrame data sources, allowing graphs to be written and read in many formats such as Parquet, JSON, and CSV.

In GraphFrames, vertices and edges are represented as DataFrames, allowing us to store arbitrary data with each vertex and edge.

An Example Social Network

Say we have a social network with users connected by relationships. We can represent the network as a graph, which is a set of vertices (users) and edges (connections between users). A toy example is shown below.

[Figure: toy social network graph, from the full example notebook]

We might then ask questions such as “Which users are most influential?” or “Users A and B do not know each other, but should they be introduced?” These types of questions can be answered using graph queries and algorithms.

GraphFrames can store data with each vertex and edge. In a social network, each user might have an age and name, and each connection might have a relationship type.
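To make the data model concrete, here is a minimal plain-Python sketch of such a network. The seven users and eight connections are invented for illustration and are not taken from the example notebook. The column names `id`, `src`, and `dst` follow the GraphFrames convention for vertex and edge DataFrames; `relationship` is an assumed attribute name, and `dangling_endpoints` is a hypothetical helper.

```python
# Hypothetical toy data, invented for illustration only.
vertices = [
    {"id": "a", "name": "Alice", "age": 34},
    {"id": "b", "name": "Bob", "age": 36},
    {"id": "c", "name": "Charlie", "age": 30},
    {"id": "d", "name": "David", "age": 29},
    {"id": "e", "name": "Esther", "age": 32},
    {"id": "f", "name": "Fanny", "age": 36},
    {"id": "g", "name": "Gabby", "age": 60},
]
edges = [
    {"src": "a", "dst": "b", "relationship": "friend"},
    {"src": "b", "dst": "c", "relationship": "follow"},
    {"src": "c", "dst": "b", "relationship": "follow"},
    {"src": "f", "dst": "c", "relationship": "follow"},
    {"src": "e", "dst": "f", "relationship": "follow"},
    {"src": "e", "dst": "d", "relationship": "friend"},
    {"src": "d", "dst": "a", "relationship": "friend"},
    {"src": "a", "dst": "e", "relationship": "friend"},
]


def dangling_endpoints(vertices, edges):
    """Return edge endpoints that do not match any vertex id."""
    ids = {v["id"] for v in vertices}
    endpoints = {e[key] for e in edges for key in ("src", "dst")}
    return endpoints - ids


# An empty set means every edge connects two known users.
unknown = dangling_endpoints(vertices, edges)
```

With Spark available, two DataFrames carrying these columns would typically be combined with `GraphFrame(vertices_df, edges_df)`.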


Simple Queries are Simple

GraphFrames make it easy to express queries over graphs. Since GraphFrame vertices and edges are stored as DataFrames, many queries are just DataFrame (or SQL) queries.

Example:

How many users in our social network have “age” > 35?
We can query the vertices DataFrame:
g.vertices.filter("age > 35")
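Note that `filter` returns the matching rows rather than a number; appending the standard DataFrame `.count()` action would give the count itself. The same logic can be sketched in plain Python. The vertex records below are invented for illustration, and `count_older_than` is a hypothetical helper.

```python
# Hypothetical vertex records (invented ages), standing in for g.vertices.
vertices = [
    {"id": "a", "name": "Alice", "age": 34},
    {"id": "b", "name": "Bob", "age": 36},
    {"id": "c", "name": "Charlie", "age": 30},
    {"id": "d", "name": "David", "age": 29},
    {"id": "e", "name": "Esther", "age": 32},
    {"id": "f", "name": "Fanny", "age": 36},
    {"id": "g", "name": "Gabby", "age": 60},
]


def count_older_than(vertices, min_age):
    """Count vertices whose age is strictly greater than min_age."""
    return sum(1 for v in vertices if v["age"] > min_age)


older = count_older_than(vertices, 35)
```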

Example:

How many users have at least 2 followers?
We can combine the built-in inDegrees method with a DataFrame query.
g.inDegrees.filter("inDegree >= 2")
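What this query computes can be sketched without Spark: count incoming edges per vertex, then keep the vertices with at least two. The edge list is invented for illustration. As with `inDegrees`, vertices with no incoming edges simply do not appear in the result.

```python
from collections import Counter

# Hypothetical (src, dst) pairs, invented for illustration.
edges = [
    ("a", "b"), ("b", "c"), ("c", "b"), ("f", "c"),
    ("e", "f"), ("e", "d"), ("d", "a"), ("a", "e"),
]


def in_degrees(edges):
    """Map each vertex with at least one incoming edge to its in-degree."""
    return Counter(dst for _src, dst in edges)


# Users with at least 2 followers.
popular = sorted(v for v, degree in in_degrees(edges).items() if degree >= 2)
```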

Graph Algorithms Support Complex Workflows

GraphFrames support the full set of algorithms available in GraphX, in all 3 language APIs. Results from graph algorithms are either DataFrames or GraphFrames. For example, what are the most important users? We can run PageRank:

results = g.pageRank(resetProbability=0.15, maxIter=10)
display(results.vertices)
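To illustrate what the two parameters control, the following is a textbook power-iteration PageRank in plain Python. It is a sketch under stated assumptions, not the GraphX implementation: ranks here are normalized to sum to 1 and rank held by vertices without out-edges is spread evenly, which may differ from how GraphX scales and handles such vertices. The graph data is invented for illustration.

```python
# Hypothetical toy graph, invented for illustration.
vertex_ids = ["a", "b", "c", "d", "e", "f", "g"]
edges = [
    ("a", "b"), ("b", "c"), ("c", "b"), ("f", "c"),
    ("e", "f"), ("e", "d"), ("d", "a"), ("a", "e"),
]


def pagerank(vertex_ids, edges, reset_probability=0.15, max_iter=10):
    """Textbook PageRank by power iteration; returns {vertex: rank}."""
    n = len(vertex_ids)
    out_neighbors = {v: [] for v in vertex_ids}
    for src, dst in edges:
        out_neighbors[src].append(dst)

    ranks = {v: 1.0 / n for v in vertex_ids}
    for _ in range(max_iter):
        # Rank sitting on vertices without out-edges is spread evenly.
        dangling = sum(ranks[v] for v in vertex_ids if not out_neighbors[v])
        base = reset_probability / n + (1 - reset_probability) * dangling / n
        new_ranks = {v: base for v in vertex_ids}
        # Every other vertex splits its rank equally among its out-neighbors.
        for src, targets in out_neighbors.items():
            for dst in targets:
                new_ranks[dst] += (1 - reset_probability) * ranks[src] / len(targets)
        ranks = new_ranks
    return ranks


ranks = pagerank(vertex_ids, edges, reset_probability=0.15, max_iter=10)
most_important = max(ranks, key=ranks.get)
```

A higher reset probability pulls all ranks toward the uniform value, while more iterations let rank accumulate on vertices that many paths lead to.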


GraphFrames also support new algorithms:

  • Breadth-first search (BFS): Find shortest paths from one set of vertices to another

  • Motif finding: Search for structural patterns in a graph
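As a rough illustration of the BFS idea, the sketch below finds one shortest directed path from a start vertex to the nearest vertex satisfying a condition. It is plain Python over an invented edge list; `bfs_path` is a hypothetical helper, and a predicate function stands in for the expressions GraphFrames uses to select start and end vertices.

```python
from collections import deque

# Hypothetical (src, dst) pairs, invented for illustration.
edges = [
    ("a", "b"), ("b", "c"), ("c", "b"), ("f", "c"),
    ("e", "f"), ("e", "d"), ("d", "a"), ("a", "e"),
]


def bfs_path(edges, start, is_target):
    """Return a shortest directed path from start to a target, or None."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, []).append(dst)

    parents = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if is_target(current):
            # Walk the parent links back to the start.
            path = []
            while current is not None:
                path.append(current)
                current = parents[current]
            return path[::-1]
        for neighbor in adjacency.get(current, []):
            if neighbor not in parents:
                parents[neighbor] = current
                queue.append(neighbor)
    return None


path_a_to_c = bfs_path(edges, "a", lambda v: v == "c")
path_c_to_a = bfs_path(edges, "c", lambda v: v == "a")
```

Because vertices are explored in order of distance from the start, the first target reached is guaranteed to be at minimum hop count.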

Motif finding lets us make powerful queries. For example, to recommend whom to follow, we might search for triplets of users A,B,C where A follows B and B follows C, but A does not follow C.

# Motif: A->B->C but not A->C
results = g.find("(A)-[]->(B); (B)-[]->(C); !(A)-[]->(C)")
# Filter out loops (with DataFrame operation)
results = results.filter("A.id != C.id")
# Select recommendations for A to follow C
results = results.select("A", "C")
display(results)
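The semantics of this motif can be checked on a small example in plain Python. The sketch below is a brute-force equivalent over an invented edge list, not how GraphFrames evaluates motifs; `recommend_follows` is a hypothetical helper. It also shows why the loop filter is needed: in a mutual-follow pair such as b and c, the pattern would otherwise match with A and C being the same user.

```python
# Hypothetical (src, dst) "follows" pairs, invented for illustration.
edges = [
    ("a", "b"), ("b", "c"), ("c", "b"), ("f", "c"),
    ("e", "f"), ("e", "d"), ("d", "a"), ("a", "e"),
]


def recommend_follows(edges):
    """Return (A, C) pairs where A->B and B->C exist but A->C does not."""
    edge_set = set(edges)
    recommendations = set()
    for a, b in edge_set:
        for b2, c in edge_set:
            # Same middle vertex, no self-recommendation, no existing edge.
            if b2 == b and a != c and (a, c) not in edge_set:
                recommendations.add((a, c))
    return recommendations


recommendations = recommend_follows(edges)
```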


The full set of GraphX algorithms supported by GraphFrames is:

  • PageRank: Identify important vertices in a graph

  • Shortest paths: Find shortest paths from each vertex to landmark vertices

  • Connected components: Group vertices into connected subgraphs

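  • Strongly connected components: Group vertices into subgraphs in which every vertex can reach every other along directed edges (a stricter, direction-aware version of connected components)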

  • Triangle count: Count the number of triangles each vertex is part of

  • Label Propagation Algorithm (LPA): Detect communities in a graph
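The difference between connected and strongly connected components is easy to miss, so here is a small plain-Python sketch contrasting the two on an invented graph. It uses a simple reachability search per vertex, which is fine for a toy example but is not how GraphX computes components at scale; `reachable` and `components` are hypothetical helpers.

```python
# Hypothetical toy graph, invented for illustration.
vertex_ids = ["a", "b", "c", "d", "e", "f", "g"]
edges = [
    ("a", "b"), ("b", "c"), ("c", "b"), ("f", "c"),
    ("e", "f"), ("e", "d"), ("d", "a"), ("a", "e"),
]


def reachable(adjacency, start):
    """Return every vertex reachable from start, including start itself."""
    seen = {start}
    stack = [start]
    while stack:
        current = stack.pop()
        for neighbor in adjacency.get(current, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return seen


def components(vertex_ids, edges, strongly=False):
    """Return the set of components, each as a frozenset of vertex ids."""
    forward, backward = {}, {}
    for src, dst in edges:
        forward.setdefault(src, set()).add(dst)
        backward.setdefault(dst, set()).add(src)

    groups = set()
    for v in vertex_ids:
        if strongly:
            # v and w share a component only if each can reach the other.
            group = reachable(forward, v) & reachable(backward, v)
        else:
            # Edge direction is ignored.
            undirected = {
                u: forward.get(u, set()) | backward.get(u, set())
                for u in vertex_ids
            }
            group = reachable(undirected, v)
        groups.add(frozenset(group))
    return groups


weak = components(vertex_ids, edges)
strong = components(vertex_ids, edges, strongly=True)
```

On this data the six linked users form a single connected component, but they split into several strongly connected ones because some links cannot be followed back.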

GraphFrames Integrate with GraphX

GraphFrames fully integrate with GraphX via conversions between the two representations, without any data loss. We can convert our social network to a GraphX graph and back to a GraphFrame.

val gx: Graph[Row, Row] = g.toGraphX()
val g2: GraphFrame = GraphFrame.fromGraphX(gx)

See the GraphFrames API documentation for more details on these conversions.

What's Next?

Graph-specific optimizations for DataFrames are under active research and development. Watch Ankur Dave’s talk to learn more. We plan to include some of these optimizations in the next GraphFrames release!

Get started with these tutorial notebooks in and in the . If you do not have access to the beta yet, .

Download the GraphFrames package from the Spark Packages website. GraphFrames are compatible with Spark 1.4, 1.5, and 1.6.

Learn more in the GraphFrames user guide.

The code is available on GitHub under the Apache 2.0 license. We welcome contributions! Check the GitHub issues for ideas to work on.

About: Databricks was founded by the team at UC Berkeley AMPLab that created and continues to drive Apache Spark. Their vision is to make big data simple for data scientists, engineers, developers, and business users alike.

Reposted with permission.


Reposted from: https://my.oschina.net/u/2306127/blog/634220
