Parquet

Last updated on 26th March 2024

The Simudyne SDK can export all simulation data to Parquet files for further analysis. For example, these Parquet files can be used as an input into a Jupyter Notebook to allow the user to perform further exploratory analysis of their simulation data.
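As a rough illustration of the notebook workflow, the sketch below collects the exported files and loads them with pandas. The helper names find_parquet_files and load_export are hypothetical and not part of the Simudyne SDK. It assumes that pandas and a Parquet engine such as pyarrow are installed, and it makes no assumption about how the SDK lays out files inside the export directory beyond the .parquet extension.

```python
from pathlib import Path


def find_parquet_files(export_dir):
    """Return every .parquet file under export_dir (searched recursively)."""
    return sorted(Path(export_dir).rglob("*.parquet"))


def load_export(export_dir):
    """Load each Parquet file into a pandas DataFrame, keyed by relative path."""
    # Imported lazily so the file search works without pandas installed.
    # Reading also needs a Parquet engine such as pyarrow or fastparquet.
    import pandas as pd

    root = Path(export_dir)
    return {
        str(path.relative_to(root)): pd.read_parquet(path)
        for path in find_parquet_files(root)
    }
```

In a notebook cell, calling load_export with the directory used for the export gives one DataFrame per file, which can then be inspected with the usual pandas tools.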

Working with Parquet on Windows

You will need a file named winutils.exe to use Parquet on Windows.
It is included in the hadoop-winutils directory, which you can download by pasting the following URL into your browser: http://content.simudyne.com/$web/hadoop-winutils-master.zip.

Once you have downloaded hadoop-winutils, run the Winutils_setup.bat batch file to set your environment variable accordingly.

If you already have Hadoop installed and only lack winutils.exe, you can add it to your C:\hadoop-x.x.x\bin directory manually.

When using Parquet on Windows, the system looks for winutils.exe in ...\hadoop-winutils\bin (or in ...\hadoop-x.x.x\bin if you already had Hadoop installed). If you see an error message such as "Shell Failed to locate the winutils binary in the hadoop binary path", check that your HADOOP_HOME environment variable is set and that winutils.exe is located in the bin directory inside the directory that HADOOP_HOME points to. For instance, if your hadoop-winutils directory is C:\hadoop-winutils, then HADOOP_HOME must be C:\hadoop-winutils.
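The two checks described above can be automated with a short script. This is a minimal sketch, not an SDK utility: the function name check_winutils is hypothetical, and it only verifies that HADOOP_HOME is set and that bin\winutils.exe exists beneath it.

```python
import os
from pathlib import Path


def check_winutils(env=None):
    """Return a list of problems with the winutils setup (empty if none found)."""
    env = os.environ if env is None else env

    hadoop_home = env.get("HADOOP_HOME")
    if not hadoop_home:
        return ["HADOOP_HOME is not set"]

    # winutils.exe must sit in the bin directory directly under HADOOP_HOME.
    winutils = Path(hadoop_home) / "bin" / "winutils.exe"
    if not winutils.is_file():
        return [f"winutils.exe not found at {winutils}"]

    return []


if __name__ == "__main__":
    for problem in check_winutils():
        print(problem)
```

Run it from the same terminal or IDE session that launches the simulation, so that it sees the same environment variables. No output means both checks passed.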

Enabling Parquet Output

The Simudyne SDK does not export files to Parquet by default. To enable Parquet export, set the config field core.parquet-export.enabled to true in the simudyneSDK.properties file. (More about Model Config.)

Specify the directory in which the Parquet files should be created in the config field core.export-path. This can be an HDFS path or a local file system path. If no value is specified for core.export-path, the Parquet files are written to a tmp directory, or to the HDFS home directory if running with Spark.

Two further settings, which apply to both local Parquet and Hive output, can also be changed: core.data-export.generic-flush and core.data-export.values-flush. They control how many records are written to a single file or, in the case of Hive, how many entries are sent in a single query. The two would typically be set to the same value; they are kept separate so that users can adjust the default export and custom channels.
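To make the four fields concrete, the sketch below embeds an example of how they might appear in simudyneSDK.properties and reads them back with a deliberately simple parser. The export path and flush sizes are placeholders chosen for illustration, not SDK defaults, and the helpers parse_properties and summarise_export are hypothetical. The parser handles only plain key=value lines and comments, which is an assumption about how the file is written.

```python
SAMPLE_PROPERTIES = """\
# Illustrative values only: the path and flush sizes are placeholders, not SDK defaults.
core.parquet-export.enabled=true
core.export-path=/tmp/simudyne-output
core.data-export.generic-flush=10000
core.data-export.values-flush=10000
"""


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # or ! comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line[0] in "#!":
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def summarise_export(settings):
    """Pull the four export-related fields out of the parsed settings."""
    def as_int(key):
        value = settings.get(key)
        return int(value) if value is not None else None

    return {
        # Parquet export is off unless explicitly enabled.
        "enabled": settings.get("core.parquet-export.enabled", "false").lower() == "true",
        "export_path": settings.get("core.export-path"),
        "generic_flush": as_int("core.data-export.generic-flush"),
        "values_flush": as_int("core.data-export.values-flush"),
    }


if __name__ == "__main__":
    print(summarise_export(parse_properties(SAMPLE_PROPERTIES)))
```

A check like this can be pointed at the real file's contents before a long run to confirm that export is switched on and where the files will be written.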