Last updated on 16th July 2024
This page explains how to set up and use the Simudyne SDK on top of Spark in order to distribute your models.
The first requirement is to install Spark, running either standalone or on top of Hadoop YARN. The required version is Spark 2.2.
We recommend using the version of Spark shipped with Cloudera products: https://www.cloudera.com/products/open-source/apache-hadoop/apache-spark.html
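If you manage your own standalone cluster, you can start it with the scripts shipped with Spark. This is only a sketch: the master host name below is a placeholder, and 7077 is the default port of a standalone master.
Shell
# On the master node
./sbin/start-master.sh
# On each worker node, pointing at the master
./sbin/start-slave.sh spark://<master-host>:7077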
Once Spark is installed, you can check that it is running correctly by launching the Spark shell in a terminal:
./bin/spark-shell
You need to identify your Spark master URL, which points to the master node of your cluster. When the shell is launched as above, the master URL indicates that Spark is running locally (master = local[*]). On a standalone cluster, the master URL is generally of the form spark://host:port.
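To check connectivity against a standalone cluster rather than a local master, you can pass the master URL explicitly to the shell. The host below is a placeholder; 7077 is the default standalone master port.
Shell
./bin/spark-shell --master spark://<master-host>:7077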
To run your model, you will need to build a fat JAR containing your model, the Simudyne SDK and all the necessary dependencies. You will then need to upload it to the master node of your cluster, from where you can submit your Spark jobs.
Your project needs some additional dependencies and shading configuration, while other dependencies are provided by Spark and are therefore excluded from the fat JAR.
pom.xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>simudyne</groupId>
  <artifactId>simudyne-maven-java</artifactId>
  <version>1.0-SNAPSHOT</version>

  <properties>
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
    <simudyne.version>2.2.0</simudyne.version>
  </properties>

  <repositories>
    <repository>
      <id>simudyne.jfrog.io</id>
      <name>simudyne.jfrog.io</name>
      <url>https://simudyne.jfrog.io/simudyne/releases</url>
    </repository>
  </repositories>

  <dependencies>
    <dependency>
      <groupId>simudyne</groupId>
      <artifactId>simudyne-nexus-server_2.11</artifactId>
      <version>${simudyne.version}</version>
    </dependency>
    <dependency>
      <groupId>simudyne</groupId>
      <artifactId>simudyne-core_2.11</artifactId>
      <version>${simudyne.version}</version>
    </dependency>
    <dependency>
      <groupId>simudyne</groupId>
      <artifactId>simudyne-core-abm_2.11</artifactId>
      <version>${simudyne.version}</version>
    </dependency>
  </dependencies>

  <build>
    <plugins>
      <plugin>
        <groupId>org.codehaus.mojo</groupId>
        <artifactId>exec-maven-plugin</artifactId>
        <version>1.2.1</version>
        <configuration>
          <mainClass>Main</mainClass>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>3.1.0</version>
        <configuration>
          <artifactSet>
            <excludes>
              <exclude>META-INF/*</exclude>
              <exclude>com.fasterxml.jackson.core:jackson-databind</exclude>
              <exclude>com.fasterxml.jackson.module:jackson-module-scala_2.11</exclude>
              <exclude>com.twitter:chill_2.11</exclude>
              <exclude>io.netty:netty-all</exclude>
              <exclude>org.apache.commons:commons-math3</exclude>
              <exclude>org.apache.commons:commons-lang3</exclude>
              <exclude>commons-net</exclude>
              <exclude>org.slf4j:slf4j-api</exclude>
              <exclude>org.xerial.snappy:snappy-java</exclude>
              <exclude>org.scala-lang:scala-library</exclude>
            </excludes>
          </artifactSet>
          <transformers>
            <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
              <resource>reference.conf</resource>
            </transformer>
          </transformers>
          <relocations>
            <relocation>
              <pattern>org.json4s</pattern>
              <shadedPattern>shaded.json4s</shadedPattern>
            </relocation>
          </relocations>
        </configuration>
        <executions>
          <execution>
            <phase>package</phase>
            <goals>
              <goal>shade</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>
Here is the command to build your fat JAR file:
Maven
mvn -s settings.xml compile package
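The -s settings.xml option points Maven at a settings file. If you do not already have one, a minimal sketch could look like the following; whether credentials are required, and their values, depend on your Simudyne account, so treat the username and password entries as placeholders. The server id must match the repository id declared in pom.xml.
settings.xml
<?xml version="1.0" encoding="UTF-8"?>
<settings xmlns="http://maven.apache.org/SETTINGS/1.0.0">
  <servers>
    <server>
      <!-- Must match the <id> of the repository in pom.xml -->
      <id>simudyne.jfrog.io</id>
      <!-- Placeholder credentials supplied with your Simudyne licence -->
      <username>YOUR_USERNAME</username>
      <password>YOUR_PASSWORD</password>
    </server>
  </servers>
</settings>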
You can then upload this JAR file to your master node, together with the simudyneSDK.properties file and your license file.
By default, the license file is looked for in the directory you called spark-submit from; you can change this behaviour using the core.license-file configuration.
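For example, to point the SDK at a license stored at a fixed location on the master node, you could set the following in simudyneSDK.properties (the path is a placeholder):
simudyneSDK.properties
core.license-file = /path/to/mylicense.license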
The Spark model runner allows you to run the same simulation multiple times on a Spark cluster, creating a distributed multirun.
Running a distributed multirun simulation depends on the core-runner-spark package, which needs to be imported into your project:
pom.xml
<dependency>
  <groupId>simudyne</groupId>
  <artifactId>simudyne-core-runner-spark_2.11</artifactId>
  <version>${simudyne.version}</version>
</dependency>
To enable the Simudyne SDK to use the Spark runner, you need to uncomment the following line in your properties file:
simudyneSDK.properties
### CORE-RUNNER ###
# core-runner.runner-backend = simudyne.core.exec.runner.spark.SparkRunnerBackend
Then you need to configure the properties related to core-runner-spark. There are two ways to configure them:
- through the spark-submit command (recommended)
- through core-runner-spark properties in the simudyneSDK.properties file
Some properties are already listed with default values in simudyneSDK.properties:
### CORE-RUNNER-SPARK ###
core-runner-spark.master-url = local[*]
core-runner-spark.log-level = WARN
# core-runner-spark.executor.memory = 2g
# core-runner-spark.partitions = 24
# core-runner-spark.task.cpus = 1
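As an illustration of the spark-submit based approach, the equivalent resources can be requested directly on the command line with standard Spark options. The master URL and resource sizes below are placeholders, and the exact mapping between core-runner-spark.* properties and Spark settings is an assumption here, so check it against your cluster's configuration:
spark-submit
spark-submit --class Main --master spark://<master-host>:7077 --executor-memory 2g --conf spark.task.cpus=1 --files simudyneSDK.properties,mylicense.license path-to-fat-jar.jar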
Be aware that a property set in the simudyneSDK.properties file will override the one passed to spark-submit. For this reason, we recommend using properties-file based configuration only for testing, and spark-submit based configuration whenever possible.
You can then submit your job using spark-submit. Here is an example with some configuration options:
spark-submit --class Main --master <sparkMasterURL> --deploy-mode client --files simudyneSDK.properties,mylicense.license path-to-fat-jar.jar
This command will run the main function of the class Main and distribute it on Spark. You can then access the console through the config parameters nexus-server.hostname and nexus-server.port, which default to localhost and 8080. You can also interact with the server through the REST API.
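If the defaults do not suit your deployment, these parameters can be overridden in simudyneSDK.properties. The values below are placeholders; binding to 0.0.0.0 is one way to make the console reachable from outside the master node, assuming your network policy allows it:
simudyneSDK.properties
nexus-server.hostname = 0.0.0.0
nexus-server.port = 8080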
spark-submit allows you to configure Spark. You need to choose a configuration that best suits your cluster. To learn more about Spark configuration, refer to the official documentation.
Some useful resources can be found on Cloudera's website.