Data about the current state of the simulation can be retrieved as a JSON via the REST API. The Simudyne SDK can also export all simulation data to Parquet files for further analysis. For example, these Parquet files can be used as an input into a Jupyter Notebook to allow the user to perform further exploratory analysis of their simulation data.
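For example, once a run has exported its files, a notebook session might load them with pandas. This is a minimal sketch, assuming pandas and pyarrow are installed; the path is illustrative and follows the folder structures described later in this section.

import pandas as pd

# Load the root Parquet table for a single run. The exact path depends on
# core.parquet-export-path and the folder structure (both covered below).
df = pd.read_parquet("/exportFolder/my_simulation_id/runs/root/run000.parquet")
df.head()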
Once you have downloaded the hadoop-winutils, run the Winutils_setup.bat batch file to set your environment variable accordingly. If you already have an installed version of Hadoop and just lack winutils.exe, you can add it to your C:\hadoop-x.x.x\bin directory manually.
When using Parquet on Windows, the system will try to access ...\hadoop-winutils\bin (or ...\hadoop-x.x.x\bin if you already had Hadoop installed) to find the file winutils.exe.
If you are getting error messages like Shell Failed to locate the winutils binary in the hadoop binary path, check that your HADOOP_HOME environment variable is set and that winutils.exe is located in the bin directory inside the HADOOP_HOME destination. For instance, if your hadoop-winutils directory is located at C:\hadoop-winutils, then HADOOP_HOME must be C:\hadoop-winutils.
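For example, you can set the variable from a Command Prompt (a sketch; adjust the path to match where you placed hadoop-winutils), then open a new terminal so the change is picked up:

setx HADOOP_HOME "C:\hadoop-winutils"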
The Simudyne SDK will not export files to Parquet by default. To enable it, set the value of the config field core.parquet-export.enabled in the SimudyneSDK.properties file to true. (More about Model Config.)
The path in which to create the Parquet files should be provided in the config field core.parquet-export-path. This can be an HDFS path, or a local file system path. If no value is specified for core.parquet-export-path, the Parquet files will be written to a temporary directory, or to the HDFS home if running with Spark.
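For example, the relevant lines in SimudyneSDK.properties might look like this (the export path is illustrative):

# Enable Parquet export and choose where the files are written.
core.parquet-export.enabled=true
core.parquet-export-path=/data/simulation-output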
Furthermore, there are two additional settings, for both local Parquet and Hive output, that a user may wish to change: core.data-export.generic-flush and core.data-export.values-flush. These are typically set to the same value (two separate options are provided so that the default export and custom channels can be tuned independently) and control how many records are written to a single file, or, in the case of Hive, how many entries are sent in a single query.
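For example, in SimudyneSDK.properties (the values are illustrative, not documented defaults):

# Records written per Parquet file (or entries sent per Hive query).
core.data-export.generic-flush=1000
core.data-export.values-flush=1000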
By default, when running a batch run, Agent and Link data is not serialised, and so is not output to Parquet. This is to reduce the amount of data being held in memory when sending the batch results to the console. If the data is being output to Parquet and does not need to be viewed on the console, the in-memory data storage can be turned off, allowing the Simudyne SDK to export Agent and Link data to Parquet as well as the general Model data. This is done by setting the config field core.return-data to false.
For large model runs that produce a lot of data, setting this config field to false will also reduce the amount of memory being held by the simulation, which can help avoid potential OutOfMemory exceptions and improve the efficiency of the model.
If the data does not need to be displayed on the console and Agent and Link data is not needed either, the config fields core-abm.serialize.agents and core-abm.serialize.links should be set to false, to avoid generating unnecessary data. (A sketch of both approaches follows below.)
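# Export Agent and Link data to Parquet instead of holding results
# in memory for the console.
core.return-data=false

# Alternatively, if Agent and Link data is not needed at all:
core-abm.serialize.agents=false
core-abm.serialize.links=false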
Scenario runs do not hold the data in memory because they are not managed by the console, and the data cannot be viewed on the console. This means that Agent and Link data is serialised by default, and so should be explicitly turned off if not needed. (Use the config fields core-abm.serialize.agents and core-abm.serialize.links to control this.)
Data export format for scenario runs is controlled via the POST request sent to start the scenario run. (See the scenario REST specification for more details on the POST request.)
By default the scenario will output data as JSON files. To specify the output format as Parquet, set the 'format' field in the 'output' section of the POST request.
{
    //Other scenario json fields
    "output": {"uri": "/path/to/export/to", "format": "parquet"}
}
The model sampler will always output data to Parquet. As with scenarios, the data is not held in memory, so Agent and Link data is serialised by default and should be explicitly turned off if not needed, using the config fields core-abm.serialize.agents and core-abm.serialize.links.
In most cases, it will be unnecessary to output Parquet data when running interactive runs. Therefore, by default Parquet data will not be exported for interactive runs, even if the config field core.parquet-export.enabled is true. If Parquet output is required for interactive runs, the config field feature.interactive-parquet-output should be set to true, in addition to the config fields core.parquet-export.enabled and core.parquet-export-path.
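For instance, the combined settings in SimudyneSDK.properties might look like this (the path is illustrative):

# Required for any Parquet export.
core.parquet-export.enabled=true
core.parquet-export-path=/data/simulation-output

# Additionally required to export Parquet from interactive runs.
feature.interactive-parquet-output=true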
When running an interactive run, the Parquet files will be closed (and ready for reading) when the interactive run is deleted or restarted.
Multiple Parquet files may be created for each simulation run. The root Parquet file will contain all output fields related to the model, including the values of global fields and accumulators. All complex objects that are nested in the JSON output of the simulation are flattened. For example, if a model's JSON contained nested fields as follows:
Model JSON output
{
  "someValue": 23,
  "system": {
    "aglobal": 24,
    "anAccumulator": {
      "count": 25,
      "value": 26
    }
  }
}
The Parquet root file created for this would contain the following fields:
| someValue | system__aglobal | system__anAccumulator__count | system__anAccumulator__value |
The field name is made up of the path to the JSON field, where every element is separated by a double underscore (__).
If a model's JSON output contains arrays of objects, such as Agents or Links, these are exported to separate Parquet files (one file per Agent or Link type). The name of the Parquet file will be the path to the agent.
Model JSON output with agents and links
{
  "someValue": 23,
  "system": {
    "Agents": {
      "Cell": [
        {
          "alive": false,
          "_id": 0
        },
        {
          "alive": true,
          "_id": 1
        }
      ]
    },
    "Links": {
      "Neighbour": [
        {
          "_to": 123,
          "_from": 256
        },
        {
          "_to": 256,
          "_from": 123
        }
      ]
    }
  }
}
Parquet files:
root
| someValue |

root__system__Agents__Cell
| alive | _id |

root__system__Links__Neighbour
| _to | _from |
Every Parquet table will also include a tick field, which tells you which tick this data was produced for, and a seed field, which tells you the random number generator seed used for this run of the simulation.
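For example, the tick and seed columns make per-tick aggregation straightforward in pandas. This is a sketch; the path is illustrative and depends on the folder structure described below.

import pandas as pd

cells = pd.read_parquet(
    "/exportFolder/my_simulation_id/runs/root__system__Agents__Cell/run000.parquet"
)

# Count living cells at each tick of this run.
alive_per_tick = cells.groupby("tick")["alive"].sum()
print(cells["seed"].unique())  # the RNG seed(s) recorded for this run
print(alive_per_tick)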
When exporting data to Parquet, the folder layout can be specified in the config using the config field core.parquet-export.folder-structure. There are two options supported for this field: group-by-type and group-by-run. If no value is specified, it will default to group-by-type.
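For example, to switch to the per-run layout in SimudyneSDK.properties:

# Either group-by-type (the default) or group-by-run.
core.parquet-export.folder-structure=group-by-run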
When the folder structure is group-by-type, folders are created for each Parquet table type, and a Parquet file for each run is created inside these folders.
For this example, the root export directory passed through the config field core.parquet-export-path is /exportFolder.
Group by type batch output folders
/exportFolder/
    {simulation_id}/
        runs/
            root/
                run000.parquet
                run001.parquet
                run002.parquet
            root__system__Agents__Cell/
                run000.parquet
                run001.parquet
                run002.parquet
        metadata.json
        finished.json
Group by type scenario output folders
/exportFolder/
    {simulation_id}/
        runs/
            root/
                scenario0run0001.parquet
                scenario0run0002.parquet
                scenario0run0003.parquet
            root__system__Agents__Cell/
                scenario0run0001.parquet
                scenario0run0002.parquet
                scenario0run0003.parquet
        metadata.json
        finished.json
The model sampler output folders will match the scenario output folders.
When the folder structure is group-by-run, folders are created for each simulation run, and a Parquet file for each table type is created inside these folders.
For this example, the root export directory passed through the config field core.parquet-export-path is /exportFolder.
Group by run batch output folders
/exportFolder/
    {simulation_id}/
        runs/
            run000/
                root001.parquet
                root__system__Agents__Cell001.parquet
            run001/
                root001.parquet
                root__system__Agents__Cell001.parquet
            run002/
                root001.parquet
                root__system__Agents__Cell001.parquet
        metadata.json
        finished.json
Group by run scenario output folders
/exportFolder/
    {simulation_id}/
        runs/
            scenario0.run0/
                root001.parquet
                root__system__Agents__Cell001.parquet
        metadata.json
        finished.json
The model sampler output folders will match the scenario output folders.
A metadata file, metadata.json, is added to the data export giving details about the data.
finished.json is an empty file created at the end of a run to let you know that no new Parquet files will be created in this directory.
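Since finished.json only appears once the export is complete, a downstream job can poll for it before reading any Parquet output. A minimal sketch, assuming pandas is installed; the paths are illustrative:

import time
from pathlib import Path

import pandas as pd

export_dir = Path("/exportFolder/my_simulation_id")  # illustrative path

# Wait until the SDK signals that no more Parquet files will be written.
while not (export_dir / "finished.json").exists():
    time.sleep(5)

df = pd.read_parquet(export_dir / "runs" / "root" / "run000.parquet")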