Distributed Requirements

Last updated on 6th May 2024

This is an experimental feature!

Experimental features are early versions for users to test before final release. We work hard to ensure that every available Simudyne SDK feature is thoroughly tested, but these experimental features may have minor bugs we still need to work on.

If you have any comments or find any bugs, please share with support@simudyne.com.

Running on Hadoop

The Simudyne SDK only requires a Hadoop-based environment with HDFS, YARN, and Spark 3.2+ installed on your cluster. Optionally, if you wish to use Parquet with Hive, Hive will also need to be installed; see here for how to configure it.

Specific Versions
  • Spark 2 - use version 2.3.x of Simudyne SDK
  • Spark 3 - use version 2.5.2+ of Simudyne SDK

Please note that you should not use versions 2.4.0 to 2.5.0 of the SDK if you wish to use Spark. Version 2.4 of the SDK uses Scala 2.12, which is only supported by Spark 3, and the libraries that complement this Spark 3 support are only included in version 2.5.2+.

Because the SDK runs on both the driver and worker nodes, Java 8+ is required on every node; most recent Hadoop installations should satisfy this.
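As a quick sanity check before deploying, you can confirm the Java runtime on each node meets the requirement. This is a generic environment check, not a Simudyne-specific command:

```shell
# Print the installed Java version on this node (Java 8+ is required).
# java -version writes to stderr, so redirect it to stdout for inspection.
java -version 2>&1 | head -n 1
```

Run this on every driver and worker node, or rely on your cluster manager's host inspection to verify the JVM version fleet-wide.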

Our recommended setup is Cloudera CDP for connecting to your existing on-premises or Azure/AWS environments. A Data Engineering template that includes Spark is recommended for use with the Simudyne SDK.

However, as long as you have a valid Hadoop cluster with Spark, Spark on YARN, and HDFS, you should be able to work with the SDK. The core requirement is being able to submit a Spark job consisting of a single packaged JAR file of your simulation, along with any configuration or data sources it needs.
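A submission along those lines might look like the following sketch. The JAR name, main class, and resource settings are hypothetical placeholders; substitute the values from your own build and cluster:

```shell
# Hedged sketch: submit a single packaged simulation JAR to Spark on YARN.
# "my-simulation-assembly.jar" and "org.example.MySimulationMain" are
# placeholder names, not Simudyne-provided artifacts.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class org.example.MySimulationMain \
  --num-executors 4 \
  --executor-memory 4G \
  my-simulation-assembly.jar
```

The `--files` option of `spark-submit` can be used to ship configuration files alongside the JAR if your simulation reads them at runtime.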

Setting up Spark

The first requirement is to install Spark, running either standalone or on top of Hadoop YARN. **The required version is Spark 3.2+.**

We recommend using the version of Spark shipped with Cloudera products: https://www.cloudera.com/products/open-source/apache-hadoop/apache-spark.html

Once Spark is installed, you can check that it is running correctly by launching the Spark shell in a terminal:

```bash
spark-shell
```

You need to identify your Spark master URL, which points to the master node of your cluster. When Spark is running locally, the shell reports the master as `local[*]`. On a standalone cluster, the master URL is generally of the form `spark://host:port`.
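To verify connectivity against a specific master rather than the default, you can pass the master URL explicitly when launching the shell. The host and port below are placeholders for your own cluster's values:

```shell
# Hedged sketch: connect the Spark shell to a standalone cluster master.
# "master-host:7077" is a placeholder; 7077 is the conventional default
# port for a standalone Spark master.
spark-shell --master spark://master-host:7077

# When running on YARN instead, the master is simply "yarn":
# spark-shell --master yarn
```

If the shell starts and its banner reports the expected master URL, the cluster is reachable and jobs can be submitted against it.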