Last updated on 16th July 2024
The Simudyne SDK can export all simulation data to Parquet files for further analysis. For example, these Parquet files can be used as an input into a Jupyter Notebook to allow the user to perform further exploratory analysis of their simulation data.
You will need a file named winutils.exe to be able to use Parquet on Windows. You can find it in the hadoop-winutils directory, which you can download by copy-pasting the following URL into your browser: http://content.simudyne.com/$web/hadoop-winutils-master.zip
Once you have downloaded hadoop-winutils, run the Winutils_setup.bat batch file to set your environment variable accordingly. If you already have an installed version of Hadoop and just lack winutils.exe, you can add it to your C:\hadoop-x.x.x\bin directory manually.
When using Parquet on Windows, the system will try to access ...\hadoop-winutils\bin (or ...\hadoop-x.x.x\bin if you already had Hadoop installed) to find the file winutils.exe.
If you are getting error messages like "Failed to locate the winutils binary in the hadoop binary path", check that your HADOOP_HOME environment variable is set and that winutils.exe is located in the bin directory inside the HADOOP_HOME destination. For instance, if the location of your hadoop-winutils directory is C:\hadoop-winutils, then HADOOP_HOME must be C:\hadoop-winutils.
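If the Winutils_setup.bat script did not set the variable for you, it can be set manually from a Command Prompt. This is a sketch assuming the archive was extracted to C:\hadoop-winutils; adjust the path to your actual location.

```
:: Set HADOOP_HOME persistently for the current user (takes effect in new shells).
setx HADOOP_HOME "C:\hadoop-winutils"

:: Verify from a fresh Command Prompt that the binary is where Hadoop expects it.
dir "%HADOOP_HOME%\bin\winutils.exe"
```

Note that setx does not update the current shell's environment, so open a new Command Prompt before re-running your model.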
The Simudyne SDK will not export files to Parquet by default. To enable it, set the config field core.parquet-export.enabled in the simudyneSDK.properties file to true. (More about Model Config.)
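In simudyneSDK.properties this is a single line; any other entries in the file are left as they are:

```
# simudyneSDK.properties (fragment) -- enable Parquet export
core.parquet-export.enabled = true
```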
The path in which to create the Parquet files should be provided in the config field core.export-path. This can be an HDFS path or a local file system path. If no value is specified for core.export-path, the Parquet files will be written to a temporary directory, or to the HDFS home directory if running with Spark.
Furthermore, there are two additional settings, for both local Parquet and Hive output, that a user may wish to change: core.data-export.generic-flush and core.data-export.values-flush. These would typically be set to the same value (the option to set them separately is left to the user for altering the default export or custom channels). They determine how many records are written to a single file, or, in the case of Hive, how many entries are sent in a single query.