Spark Operator Runtime
Use the Spark operator to run Spark jobs on Kubernetes (k8s).
The spark-on-k8s-operator lets you define Spark applications declaratively. It supports one-time Spark applications through the SparkApplication resource, as well as cron-scheduled applications.
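To make the declarative model concrete, here is a minimal sketch of what a SparkApplication resource might look like, written as a Python dictionary and printed as JSON. The field names follow the operator's v1beta2 API; the application name, namespace, image, file path, Spark version, and resource values are hypothetical placeholders, and the manifest that MLRun actually generates may differ.

```python
import json

# Hypothetical values throughout: name, namespace, image, path, and
# Spark version are placeholders chosen only for illustration.
spark_application = {
    "apiVersion": "sparkoperator.k8s.io/v1beta2",
    "kind": "SparkApplication",
    "metadata": {"name": "sparkreadcsv", "namespace": "default"},
    "spec": {
        "type": "Python",
        "mode": "cluster",
        "image": "example.registry/spark-py:latest",  # placeholder image
        "mainApplicationFile": "local:///app/spark_read_csv.py",  # placeholder path
        "sparkVersion": "3.2.3",  # assumed version
        "restartPolicy": {"type": "Never"},
        # Driver and executor sizing is declared, not scripted.
        "driver": {"cores": 1, "memory": "512m"},
        "executor": {"cores": 1, "instances": 2, "memory": "512m"},
    },
}

# Serialize the resource; the operator reads the same structure as YAML or JSON.
manifest_json = json.dumps(spark_application, indent=2)
print(manifest_json)
```

The point of the sketch is that the whole job (code location, image, and sizing) is described as data, which the operator then reconciles into driver and executor pods.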
When MLRun sends a request to the Spark operator, the request contains your full application configuration: the code and dependencies to run (packaged as a Docker image or specified via URIs), the infrastructure parameters (for example, the memory, CPU, and storage volume specs to allocate to each Spark executor), and the Spark configuration.
Kubernetes takes this request and starts the Spark driver in a Kubernetes pod (a k8s abstraction; in this case, just a Docker container). The Spark driver can then talk directly to the Kubernetes master to request executor pods and, if dynamic allocation is enabled, scale them up and down at runtime according to the load. Kubernetes takes care of bin-packing the pods onto Kubernetes nodes (the underlying VMs) and dynamically scales the various node pools to meet the requirements.
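Dynamic allocation is controlled by standard Spark properties. The sketch below collects a typical set in a dictionary and renders them as spark-submit style `--conf` arguments. The minimum and maximum executor counts are illustrative assumptions, and `to_submit_args` is a hypothetical helper; how these properties are passed to the operator through MLRun depends on the MLRun version and is not shown here.

```python
# Standard Spark properties; the min/max values are illustrative assumptions.
dynamic_allocation_conf = {
    "spark.dynamicAllocation.enabled": "true",
    # Kubernetes has no external shuffle service by default, so shuffle
    # tracking (Spark 3.0+) is typically enabled alongside dynamic allocation.
    "spark.dynamicAllocation.shuffleTracking.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "1",
    "spark.dynamicAllocation.maxExecutors": "4",
}


def to_submit_args(conf):
    """Render a properties dict as spark-submit style --conf arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args.extend(["--conf", f"{key}={value}"])
    return args


submit_args = to_submit_args(dynamic_allocation_conf)
print(" ".join(submit_args))
```

With settings like these, the driver requests additional executor pods when tasks queue up and releases idle ones, staying within the configured bounds.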
With the Spark operator, resources are allocated per task, which means they scale down to zero when the task is done.
import mlrun
import os

# set up a new Spark function with the Spark operator
# command points to our Spark code, which must be located on our file system
# the name param must not contain capital letters (k8s convention)
read_csv_filepath = os.path.join(os.path.abspath('.'), 'spark_read_csv.py')
sj = mlrun.new_function(kind='spark', command=read_csv_filepath, name='sparkreadcsv')

# set Spark driver config (gpu_type & gpus=<number_of_gpus> are supported too)
sj.with_driver_limits(cpu="1300m")
sj.with_driver_requests(cpu=1, mem="512m")

# set Spark executor config (gpu_type & gpus=<number_of_gpus> are supported too)
sj.with_executor_limits(cpu="1400m")
sj.with_executor_requests(cpu=1, mem="512m")

# adds fuse, daemon & iguazio's jars support
sj.with_igz_spark()

# args are also supported
sj.spec.args = ['-spark.eventLog.enabled','true']

# add python module
sj.spec.build.commands = ['pip install matplotlib']

# number of executors
sj.spec.replicas = 2

# rebuild the image with MLRun - needed in order to support artifact logging etc.
sj.deploy()

# run the task while setting the artifact path where the run artifacts (if any) are saved
sj.run(artifact_path='/User')
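The driver and executor settings above mix two CPU notations: a plain core count (`1`) for requests and Kubernetes millicores (`"1300m"`, that is, 1.3 cores) for limits. The following sketch, which is independent of MLRun, shows how the two notations compare and checks that a request does not exceed its limit, as Kubernetes requires. `parse_cpu` and `check_cpu` are hypothetical helpers written for this illustration.

```python
from decimal import Decimal


def parse_cpu(quantity):
    """Convert a Kubernetes CPU quantity (e.g. "1300m" or 1) to cores."""
    text = str(quantity).strip()
    if text.endswith("m"):
        # The "m" suffix means millicores: 1000m equals one core.
        return Decimal(text[:-1]) / Decimal(1000)
    return Decimal(text)


def check_cpu(request, limit):
    """Return (request_cores, limit_cores), raising if request > limit."""
    request_cores, limit_cores = parse_cpu(request), parse_cpu(limit)
    if request_cores > limit_cores:
        raise ValueError(f"CPU request {request} exceeds limit {limit}")
    return request_cores, limit_cores


# Same values as in the function configuration above.
driver = check_cpu(request=1, limit="1300m")
executor = check_cpu(request=1, limit="1400m")
print(driver, executor)
```

Setting the limit somewhat above the request, as the example does, guarantees each pod one core while letting it burst slightly when the node has spare capacity.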
Spark Code (spark_read_csv.py)
from pyspark.sql import SparkSession
from mlrun import get_or_create_ctx

context = get_or_create_ctx("spark-function")

# build spark session
spark = SparkSession.builder.appName("Spark job").getOrCreate()

# read csv
df = spark.read.load('iris.csv', format="csv", sep=",", header="true")

# sample for logging
df_to_log = df.describe().toPandas()

# log final report
context.log_dataset("df_sample", df=df_to_log, format="csv")

spark.stop()