Spark Setup

Last updated on 16th July 2024

Deploying from Development to Spark

This page explains how to set up and use the Simudyne SDK on top of Spark to distribute your models.

Specific Versions

Spark 2 - use version 2.3.x of Simudyne SDK
Spark 3 - use version 2.5.2+ of Simudyne SDK

Please note that you should not use versions 2.4.0 to 2.5.0 of the SDK if you wish to use Spark. This is because version 2.4 uses Scala 2.12, which is only supported by version 3 of Spark, and the libraries that complement this Spark 3 support are included in version 2.5.2+.
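The version rule above can be sketched as a small shell helper. The function name sdk_line_for_spark is hypothetical and not part of the Simudyne SDK. It only encodes the two pairings listed on this page, and it assumes you pass the Spark version yourself.

```shell
# Hypothetical helper: print the SDK line this page recommends
# for a given Spark version string (for example "3.1.2").
sdk_line_for_spark() (
  case "$1" in
    2.*) echo "2.3.x" ;;     # Spark 2 pairs with SDK 2.3.x
    3.*) echo "2.5.2+" ;;    # Spark 3 pairs with SDK 2.5.2 or later
    *)   echo "unsupported"  # anything else is not covered on this page
         exit 1 ;;
  esac
)
```

For example, sdk_line_for_spark 3.1.2 prints 2.5.2+, and any version outside Spark 2 or 3 returns a non-zero status.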

To run your model, you will need to build a fatJar file which will carry your model, the Simudyne SDK and all the necessary dependencies. You will then need to upload it to the master node of your cluster where you can submit your Spark jobs.

Your project needs some additional dependencies and shading. Other dependencies will be provided by Spark.

pom.xml

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>simudyne</groupId>
    <artifactId>simudyne-maven-java</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
        <simudyne.version>2.5.2</simudyne.version>
    </properties>

    <repositories>
        <repository>
            <id>simudyne.jfrog.io</id>
            <name>simudyne.jfrog.io</name>
            <url>https://simudyne.jfrog.io/simudyne/releases</url>
        </repository>
    </repositories>

    <dependencies>
        <dependency>
            <groupId>simudyne</groupId>
            <artifactId>simudyne-nexus-server_2.12</artifactId>
            <version>${simudyne.version}</version>
        </dependency>
        <dependency>
            <groupId>simudyne</groupId>
            <artifactId>simudyne-core_2.12</artifactId>
            <version>${simudyne.version}</version>
        </dependency>
        <dependency>
            <groupId>simudyne</groupId>
            <artifactId>simudyne-core-abm_2.12</artifactId>
            <version>${simudyne.version}</version>
        </dependency>
        <dependency>
            <groupId>simudyne</groupId>
            <artifactId>simudyne-core-graph_2.12</artifactId>
            <version>${simudyne.version}</version>
        </dependency>
        <dependency>
            <groupId>simudyne</groupId>
            <artifactId>simudyne-core-graph-spark_2.12</artifactId>
            <version>${simudyne.version}</version>
        </dependency>
        <dependency>
            <groupId>simudyne</groupId>
            <artifactId>simudyne-core-runner-spark_2.12</artifactId>
            <version>${simudyne.version}</version>
        </dependency>
        <dependency>
            <groupId>simudyne</groupId>
            <artifactId>simudyne-core-abm-testkit_2.12</artifactId>
            <version>${simudyne.version}</version>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
            <version>1.7.25</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.7.5</version>
        </dependency>
        <dependency>
            <groupId>org.apache.avro</groupId>
            <artifactId>avro</artifactId>
            <version>1.8.2</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.codehaus.mojo</groupId>
                <artifactId>exec-maven-plugin</artifactId>
                <version>1.2.1</version>
                <configuration>
                    <mainClass>Main</mainClass>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>3.2.1</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <outputDirectory>target</outputDirectory>
                            <shadedArtifactAttached>true</shadedArtifactAttached>
                            <shadedClassifierName>allinone</shadedClassifierName>
                            <filters>
                                <filter>
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>META-INF/*.SF</exclude>
                                        <exclude>META-INF/*.DSA</exclude>
                                        <exclude>META-INF/*.RSA</exclude>
                                    </excludes>
                                </filter>
                            </filters>
                            <transformers>
                                <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
                                    <resource>reference.conf</resource>
                                </transformer>
                                <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                    <mainClass>Main</mainClass>
                                </transformer>
                            </transformers>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>

Specific dependencies will be needed depending on your usage of Spark. Please read the related sections below: Spark Model Runner and Spark Distributed Graph.

Here is the command to build your fatJar file:

Maven

mvn -s settings.xml compile package
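With the shade configuration shown in the pom.xml above (shadedArtifactAttached set to true and the classifier allinone), Maven normally writes the fat jar next to the regular jar as artifactId-version-classifier.jar. The sketch below only assembles that expected path. fat_jar_path is a hypothetical helper, and the naming assumes you have not overridden finalName in your build.

```shell
# Hypothetical helper: print the expected location of the shaded jar.
# Arguments: artifactId, version, shaded classifier name.
fat_jar_path() {
  printf 'target/%s-%s-%s.jar\n' "$1" "$2" "$3"
}

# With the values from the pom.xml above this prints:
#   target/simudyne-maven-java-1.0-SNAPSHOT-allinone.jar
# fat_jar_path simudyne-maven-java 1.0-SNAPSHOT allinone
```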

You can then upload this jar file to your master node, together with the simudyneSDK.properties file and your license file. By default, the license file is looked for in the directory you called spark3-submit from. You can change this behaviour using the configuration core.license-file.
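Before uploading, it can help to confirm that all three files actually exist. The following is a minimal sketch. check_submit_files is a hypothetical helper, and the file names in the usage comment are placeholders for your own paths.

```shell
# Hypothetical helper: report every argument that is not a regular file.
# Returns 0 when all files exist, 1 otherwise.
check_submit_files() (
  missing=0
  for f in "$@"; do
    if [ ! -f "$f" ]; then
      echo "missing: $f" >&2
      missing=1
    fi
  done
  exit "$missing"
)

# Example (placeholder paths):
# check_submit_files path-to-fat-jar.jar simudyneSDK.properties mylicense.license
```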

Spark Model Runner

The Spark model runner allows you to run the same simulation multiple times on a Spark cluster to create a distributed multirun. Running a distributed multirun simulation depends on the package core-runner-spark, which needs to be imported in your project:

pom.xml

<dependency>
    <groupId>simudyne</groupId>
    <artifactId>simudyne-core-runner-spark_2.12</artifactId>
    <version>${simudyne.version}</version>
</dependency>

To enable Simudyne SDK to use the Spark runner, you need to uncomment the following line in your properties file:

simudyneSDK.properties

### CORE-RUNNER ###
# core-runner.runner-backend = simudyne.core.exec.runner.spark.SparkRunnerBackend
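If you prefer to script this change, the line can be uncommented with sed. This is a sketch only. enable_spark_runner is a hypothetical helper that writes the modified file to standard output and leaves the original untouched, so you can review the result before replacing anything.

```shell
# Hypothetical helper: print the properties file with the
# core-runner.runner-backend line uncommented. Other lines are unchanged.
enable_spark_runner() {
  sed 's/^#[[:space:]]*\(core-runner\.runner-backend[[:space:]]*=\)/\1/' "$1"
}

# Example: write the result to a new file for review.
# enable_spark_runner simudyneSDK.properties > simudyneSDK.spark.properties
```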

Then you need to configure the properties related to core-runner-spark. You have two possibilities to configure them:

Set configuration parameters as command parameters when using spark3-submit command (recommended)
Modify core-runner-spark properties in the simudyneSDK.properties file

Some properties are already listed with default values in simudyneSDK.properties:

### CORE-RUNNER-SPARK ###
core-runner-spark.master-url = local[*]
core-runner-spark.log-level = WARN
# core-runner-spark.executor.memory = 2g
# core-runner-spark.partitions = 24
# core-runner-spark.task.cpus = 1

Be aware that a property set in the simudyneSDK.properties file overrides the same property passed to spark3-submit. For this reason, we recommend using file-based properties only for testing, and passing configuration through spark3-submit whenever possible.
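Because file-based properties win, it is worth listing which core-runner-spark properties are currently active (not commented out) before you submit a job. The sketch below does this with grep. active_spark_runner_props is a hypothetical helper, not an SDK command.

```shell
# Hypothetical helper: print every uncommented core-runner-spark property
# in the given file. These are the ones that would override spark3-submit.
active_spark_runner_props() {
  grep -E '^[[:space:]]*core-runner-spark\.' "$1"
}

# Example:
# active_spark_runner_props simudyneSDK.properties
```

With the default file shown above, this would list only the master-url and log-level entries, since the other properties are commented out.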

You can then submit your job using spark3-submit. Here is an example with some configuration options:

spark3-submit --class Main --master <sparkMasterURL> --deploy-mode client --files simudyneSDK.properties,mylicense.license path-to-fat-jar.jar

Including the Simudyne SDK properties and license

The Simudyne SDK cannot run without the config file and license on every node. To achieve this, we pass the files to every node using the parameter `--files` (as in the command above). Edit the command to pass the full paths to the actual properties file and license file so that they can be found and copied onto the cluster's nodes.
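One way to build the `--files` value with full paths is sketched below. abs_path and files_arg are hypothetical helpers written for illustration. They resolve each file relative to the current directory and join the results with commas, which is the separator `--files` expects.

```shell
# Hypothetical helper: print the absolute path of one file.
abs_path() (
  cd "$(dirname "$1")" && printf '%s/%s\n' "$(pwd -P)" "$(basename "$1")"
)

# Hypothetical helper: join the absolute paths of all arguments with commas.
files_arg() (
  out=""
  for f in "$@"; do
    p=$(abs_path "$f") || exit 1
    out="${out:+$out,}$p"
  done
  printf '%s\n' "$out"
)

# Example (placeholder file names):
# spark3-submit --class Main --master <sparkMasterURL> --deploy-mode client \
#   --files "$(files_arg simudyneSDK.properties mylicense.license)" path-to-fat-jar.jar
```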

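You can then access the console through the config parameters nexus-server.hostname and nexus-server.port, which default to localhost and 8080. You can also interact with the server through the REST API.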

spark3-submit allows you to configure Spark. You need to choose a configuration that best suits your cluster. To learn more about Spark configuration, refer to the official documentation.

Some useful resources can be found on Cloudera's website.

If you wish to use the runner directly, without the console or the REST API, you can use the SparkModelRunner.

Spark Distributed Graph

The distributed graph backend allows you to run large graphs on a cluster of machines. Running a distributed graph simulation depends on the core-graph-distributed package which needs to be imported in your project:

pom.xml

<dependency>
    <groupId>simudyne</groupId>
    <artifactId>simudyne-core-graph-distributed_2.12</artifactId>
    <version>${simudyne.version}</version>
</dependency>

To let Spark manage nodes of the Distributed Graph implementation, add the following:

simudyneSDK.properties

core-abm.backend-implementation=simudyne.core.graph.spark.SparkGraphBackend
feature.immutable-schema=false

Please be aware that properties set in the simudyneSDK.properties file take precedence over options passed to spark3-submit.

You can then submit your job using spark3-submit. Here is an example:

spark3-submit --class Main --master yarn --deploy-mode client --files simudyneSDK.properties,licenseKey hdfs://path/name-of-the-fat-jar.jar

This command will run the main function of the class Main and distribute it on Spark. You can then access the console through the config parameters nexus-server.hostname and nexus-server.port, which default to localhost and 8080. You can also interact with the server through the REST API.

spark3-submit allows you to configure Spark. You need to choose a configuration that best suits your cluster. To learn more about Spark configuration, refer to the official documentation.

Some useful resources can be found on Cloudera's website.

Limitations

The distributed graph comes with limitations, the most notable being:

no support for Immutable Schema, please set feature.immutable-schema=false

no support for dynamic agent creation/stopping

no support for SystemMessages

You also might want to disable the health check for models by setting nexus-server.health-check to false in order to avoid their distribution.
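The two property settings required for the distributed graph can be checked before submitting a job. The sketch below is illustrative only. check_distributed_graph_props is a hypothetical helper that looks for the two lines given earlier on this page and reports anything missing.

```shell
# Hypothetical helper: verify that a properties file contains the two
# settings this page requires for the Spark distributed graph.
# Returns 0 when both are present, 1 otherwise.
check_distributed_graph_props() (
  file="$1"
  status=0
  if ! grep -Eq '^[[:space:]]*feature\.immutable-schema[[:space:]]*=[[:space:]]*false[[:space:]]*$' "$file"; then
    echo "feature.immutable-schema must be set to false" >&2
    status=1
  fi
  if ! grep -Eq '^[[:space:]]*core-abm\.backend-implementation[[:space:]]*=[[:space:]]*simudyne\.core\.graph\.spark\.SparkGraphBackend[[:space:]]*$' "$file"; then
    echo "core-abm.backend-implementation is not set to SparkGraphBackend" >&2
    status=1
  fi
  exit "$status"
)

# Example:
# check_distributed_graph_props simudyneSDK.properties
```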

For more information, see Spark Tuning & Model Debugging.
