In the previous section we looked at RDD transformations, which turn one RDD into another. A transformation does not execute immediately: Spark only records the logical operation on the dataset, and the job is actually triggered, and the operators computed, only when an action is invoked.
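To see this laziness concretely, here is a minimal sketch (the `time.sleep` inside the mapper is only there to make the cost visible): `map` returns instantly, and the work happens only when the `count` action runs.

```python
import time
from pyspark import SparkContext

def slow_identity(x):
    time.sleep(1)  # simulate expensive per-element work
    return x

if __name__ == '__main__':
    sc = SparkContext(appName="lazyDemo", master="local[*]")
    rdd = sc.parallelize([1, 2, 3, 4]).map(slow_identity)
    # map() returned immediately: nothing has been computed yet
    start = time.time()
    print(rdd.count())  # the action triggers the actual computation
    print("elapsed:", time.time() - start)
    sc.stop()
```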
The actions covered here are:
- reduce(func)
- collect()
- count()
- first()
- take(n)
- takeSample(withReplacement, num, [seed])
- takeOrdered(n, [ordering])
- saveAsTextFile(path)
- countByKey()
- foreach(func)
reduce: aggregates the elements of the dataset with the function func and returns the result. The function takes two arguments and returns one, and it should be commutative and associative so that partial results from different partitions can be combined correctly in parallel (see the sketch after the example below).
```python
from pyspark import SparkContext

if __name__ == '__main__':
    sc = SparkContext(appName="rddAction", master="local[*]")
    data = [1, 2, 3, 4, 5]
    rdd = sc.parallelize(data)
    print(rdd.reduce(lambda x, y: x + y))
    # 15
    sc.stop()
```
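Why commutativity and associativity matter, as a hedged sketch: with a function such as subtraction, the result depends on how elements are grouped into partitions, so it can change with the partition count.

```python
from pyspark import SparkContext

if __name__ == '__main__':
    sc = SparkContext(appName="reducePitfall", master="local[*]")
    data = [1, 2, 3, 4, 5]
    # Subtraction is neither commutative nor associative, so the result
    # depends on partitioning; never use a function like this with reduce.
    print(sc.parallelize(data, 1).reduce(lambda x, y: x - y))
    print(sc.parallelize(data, 3).reduce(lambda x, y: x - y))  # may differ from the line above
    sc.stop()
```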
collect: returns the entire dataset to the driver. When the data is large this can cause an OOM error on the driver, since everything must fit in its memory.
```python
from pyspark import SparkContext

if __name__ == '__main__':
    sc = SparkContext(appName="rddAction", master="local[*]")
    data = [1, 2, 3, 4, 5]
    rdd = sc.parallelize(data)
    print(rdd.collect())
    # [1, 2, 3, 4, 5]
    sc.stop()
```
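When the full result is too large to hold in driver memory at once, one alternative (a minimal sketch, not part of the original example) is `RDD.toLocalIterator()`, which streams the data back one partition at a time:

```python
from pyspark import SparkContext

if __name__ == '__main__':
    sc = SparkContext(appName="collectAlternative", master="local[*]")
    rdd = sc.parallelize(range(10))
    # toLocalIterator() fetches one partition at a time, so the driver only
    # needs enough memory for the largest single partition, not the whole RDD.
    for x in rdd.toLocalIterator():
        print(x)
    sc.stop()
```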
count: returns the number of elements in the dataset. The data is not pulled back to the driver: each partition is counted on the executors, and only the small per-partition totals are sent to the driver to be summed.
```python
from pyspark import SparkContext

if __name__ == '__main__':
    sc = SparkContext(appName="rddAction", master="local[*]")
    data = [1, 2, 3, 4, 5]
    rdd = sc.parallelize(data)
    print(rdd.count())
    # 5
    sc.stop()
```
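A rough sketch of the idea behind count (an illustration of the distributed counting, not Spark's actual implementation): count each partition locally, then combine only the small per-partition totals on the driver.

```python
from pyspark import SparkContext

if __name__ == '__main__':
    sc = SparkContext(appName="countIdea", master="local[*]")
    rdd = sc.parallelize([1, 2, 3, 4, 5], 2)
    # Count each partition on the executors; only the totals travel back.
    per_partition = rdd.mapPartitions(lambda it: [sum(1 for _ in it)])
    print(sum(per_partition.collect()))
    # 5
    sc.stop()
```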
first: returns the first element of the dataset, similar to take(1).
```python
from pyspark import SparkContext

if __name__ == '__main__':
    sc = SparkContext(appName="rddAction", master="local[*]")
    data = [1, 2, 3, 4, 5]
    rdd = sc.parallelize(data)
    print(rdd.first())
    # 1
    sc.stop()
```
take: returns an array with the first n elements of the dataset.
```python
from pyspark import SparkContext

if __name__ == '__main__':
    sc = SparkContext(appName="rddAction", master="local[*]")
    data = [1, 2, 3, 4, 5]
    rdd = sc.parallelize(data)
    print(rdd.take(3))
    # [1, 2, 3]
    sc.stop()
```
takeSample: returns an array with a random sample of num elements of the dataset, sampled with or without replacement (withReplacement); seed optionally fixes the random number generator seed. The sample is loaded into driver memory, so num should be small.
```python
from pyspark import SparkContext

if __name__ == '__main__':
    sc = SparkContext(appName="rddAction", master="local[*]")
    data = [1, 2, 3, 4, 5]
    rdd = sc.parallelize(data)
    print(rdd.takeSample(True, 3, 1314))
    # [5, 2, 3]
    sc.stop()
```
takeOrdered: returns the first n elements of the dataset, using either their natural order or a custom key function.
```python
from pyspark import SparkContext

if __name__ == '__main__':
    sc = SparkContext(appName="rddAction", master="local[*]")
    data = [5, 1, 4, 2, 3]
    rdd = sc.parallelize(data)
    print(rdd.takeOrdered(3))
    # [1, 2, 3]
    print(rdd.takeOrdered(3, key=lambda x: -x))
    # [5, 4, 3]
    sc.stop()
```
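A closely related action not in the list above is top(n), which returns the n largest elements, i.e. the descending counterpart of takeOrdered(n); a quick sketch:

```python
from pyspark import SparkContext

if __name__ == '__main__':
    sc = SparkContext(appName="topDemo", master="local[*]")
    rdd = sc.parallelize([5, 1, 4, 2, 3])
    # top(3) sorts descending, equivalent to takeOrdered(3, key=lambda x: -x)
    print(rdd.top(3))
    # [5, 4, 3]
    sc.stop()
```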
saveAsTextFile: writes the elements of the dataset as text files to the given path in a file system such as the local file system or HDFS. The path is created as a directory containing one part file per partition, and the call fails if the directory already exists.
```python
from pyspark import SparkContext

if __name__ == '__main__':
    sc = SparkContext(appName="rddAction", master="local[*]")
    data = [1, 2, 3, 4, 5]
    rdd = sc.parallelize(data)
    rdd.saveAsTextFile("file:///home/data")
    sc.stop()
```
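Since each partition produces its own part file, a common trick (a sketch, assuming the path file:///tmp/data is writable and does not already exist) is to coalesce to one partition first when a single output file is needed; sc.textFile can then read the directory back:

```python
from pyspark import SparkContext

if __name__ == '__main__':
    sc = SparkContext(appName="saveSingleFile", master="local[*]")
    rdd = sc.parallelize([1, 2, 3, 4, 5])
    # coalesce(1) funnels everything into one partition -> one part-00000 file;
    # only do this when the data fits comfortably in a single task.
    rdd.coalesce(1).saveAsTextFile("file:///tmp/data")
    # textFile() accepts the output directory and reads the part files back.
    print(sc.textFile("file:///tmp/data").collect())
    sc.stop()
```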
countByKey: counts the occurrences of each key in an RDD of (K, V) pairs and returns the result to the driver as a map of (K, count) pairs, so it should only be used when the number of distinct keys is small.
```python
from pyspark import SparkContext

if __name__ == '__main__':
    sc = SparkContext(appName="rddAction", master="local[*]")
    data = [('a', 1), ('b', 2), ('a', 3)]
    rdd = sc.parallelize(data)
    print(sorted(rdd.countByKey().items()))
    # [('a', 2), ('b', 1)]
    sc.stop()
```
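Because countByKey materializes the whole map on the driver, a sketch of an alternative for RDDs with many distinct keys is to keep the counting distributed with the reduceByKey transformation, and only call an action on the result you actually need:

```python
from operator import add
from pyspark import SparkContext

if __name__ == '__main__':
    sc = SparkContext(appName="countByKeyScalable", master="local[*]")
    rdd = sc.parallelize([('a', 1), ('b', 2), ('a', 3)])
    # Map each pair to (key, 1), then sum per key on the executors; the
    # counts stay distributed until an action is called on them.
    counts = rdd.map(lambda kv: (kv[0], 1)).reduceByKey(add)
    print(sorted(counts.collect()))
    # [('a', 2), ('b', 1)]
    sc.stop()
```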
foreach: runs the function func on each element of the dataset. Note that func executes on the executors: on a cluster, anything it prints goes to the executors' stdout rather than the driver console, and the order of the output is not guaranteed. In local[*] mode, as here, the output is visible locally.
```python
from pyspark import SparkContext

def f(x):
    print(x)

if __name__ == '__main__':
    sc = SparkContext(appName="rddAction", master="local[*]")
    data = [1, 2, 3]
    rdd = sc.parallelize(data)
    rdd.foreach(f)
    # 1 2 3 (order not guaranteed)
    sc.stop()
```
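When func needs expensive per-call setup (for example, opening a database connection), a common pattern (a sketch with a hypothetical handle_partition helper, not part of the original example) is foreachPartition, which invokes the function once per partition with an iterator of its elements:

```python
from pyspark import SparkContext

def handle_partition(iterator):
    # Hypothetical pattern: do one-time setup here (e.g. open a connection),
    # process all elements of the partition, then tear down.
    batch = list(iterator)
    print("processing batch:", batch)

if __name__ == '__main__':
    sc = SparkContext(appName="foreachPartitionDemo", master="local[*]")
    sc.parallelize([1, 2, 3, 4, 5, 6], 3).foreachPartition(handle_partition)
    sc.stop()
```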
That completes our tour of the actions.
Spark learning series:
- Spark Learning Example 1 (Python): Word Count
- Spark Learning Example 2 (Python): Load Data Source
- Spark Learning Example 3 (Python): Save Data
- Spark Learning Example 4 (Python): RDD Transformations
- Spark Learning Example 5 (Python): RDD Actions
- Spark Learning Example 6 (Python): Shared Variables
- Spark Learning Example 7 (Python): Converting Between RDD, DataFrame, and DataSet
- Spark Learning Example 8 (Python): Input Sources Streaming
- Spark Learning Example 9 (Python): Window Operations