Parquet data export

Last updated on 29th April 2024

Data about the current state of the simulation can be retrieved as JSON via the REST API. The Simudyne SDK can also export all simulation data to Parquet files for further analysis.

Exporting the data to Parquet

The Simudyne SDK does not export Parquet files by default. To enable the export, set the config field core.parquet-export in the SimudyneSDK.properties file to true. (See Model Config for more detail.)

The path under which the Parquet files are created should be provided in the config field core.parquet-export-path. This can be an HDFS path or a local file system path. If no value is specified for core.parquet-export-path, the Parquet files are written to a temporary directory, or to the HDFS home directory if running with Spark.
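
For example, a SimudyneSDK.properties file enabling the export might contain entries like the following (the export path here is illustrative):

# Enable export of simulation data to Parquet files
core.parquet-export = true
# Root directory for the exported files (HDFS or local file system path)
core.parquet-export-path = /data/simudyne/exports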

Note for Windows users
You will need a file named `winutils.exe` to be able to use Parquet on Windows.

You can find it in the hadoop-winutils archive; copy-paste the following URL into your browser to download it: http://content.simudyne.com/$web/hadoop-winutils-master.zip.

Once you have downloaded the hadoop-winutils archive, run the Winutils_setup.bat batch file to set your environment variable accordingly.

If you already have a version of Hadoop installed and only lack winutils.exe, you can add it to your C:\hadoop-x.x.x\bin directory manually.

When using Parquet on Windows, the system looks for winutils.exe in ...\hadoop-winutils\bin (or ...\hadoop-x.x.x\bin if you already had Hadoop installed). If you are getting error messages like Shell Failed to locate the winutils binary in the hadoop binary path, check that your HADOOP_HOME environment variable is set and that winutils.exe is located in the bin directory under HADOOP_HOME. For instance, if your hadoop-winutils directory is located at C:\hadoop-winutils, then HADOOP_HOME must be C:\hadoop-winutils.
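
As a quick sanity check, a short script can confirm the setup. The sketch below (plain Python, nothing Simudyne-specific) verifies that HADOOP_HOME is set and that winutils.exe is in the expected bin directory:

import os
from pathlib import Path

# HADOOP_HOME should point at the directory containing bin\winutils.exe
hadoop_home = os.environ.get("HADOOP_HOME")
if hadoop_home is None:
    print("HADOOP_HOME is not set")
else:
    winutils = Path(hadoop_home) / "bin" / "winutils.exe"
    print(winutils, "found" if winutils.exists() else "missing")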

Files created

Multiple Parquet files may be created for each simulation run. The root Parquet file, named root.parquet, contains all output fields related to the model, including the values of global fields and accumulators. All complex objects that are nested in the JSON output of the simulation are flattened. For example, if a model's JSON contained nested fields as follows:

Model JSON output

{
  "someValue": 23,
  "system":{
    "aglobal" : 24,
    "anAccumulator": {
      "count": 25,
      "value": 26
    }
  }
}  

The root Parquet file created for this would contain the following fields:

| someValue | system__aglobal | system__anAccumulator__count | system__anAccumulator__value |

The field name is made up of the path to the JSON field, where every element is separated by a double underscore (__).
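
Downstream, these flattened columns can be read back with any Parquet reader. As a minimal sketch, assuming pandas with a Parquet engine (e.g. pyarrow) is installed:

import pandas as pd

# Load the root Parquet file produced by the export
df = pd.read_parquet("root.parquet")

# Column names mirror the flattened JSON paths
print(list(df.columns))
# e.g. ['someValue', 'system__aglobal',
#       'system__anAccumulator__count', 'system__anAccumulator__value']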

If a model's JSON output contains arrays of objects, such as Agents or Links, these are exported to separate Parquet files (one file per Agent or Link type). The name of each Parquet file is the path to the agent or link type.

Model JSON output with agents and links

{
  "someValue": 23,
  "system":{
    "Agents" : {
      "Cell": [
        {
          "alive": false,
          "_id": 0
        },
        {
          "alive": true,
          "_id": 1
        }
      ]
    },
    "Links": {
      "Neighbour" : [
        {
          "_to" :123,
          "_from": 256
        },
        {
          "_to": 256,
          "_from": 123
        }
      ]
    }
  }
}  

Parquet files:

root

| someValue |

root__system__Agents__Cell

| alive | _id |

root__system__Links__Neighbour

| _to | _from |
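
Each of these files can be loaded independently. A minimal sketch, reusing the file names from the example above and assuming pandas is available:

import pandas as pd

# Agent and link data live in their own files, named after their JSON path
cells = pd.read_parquet("root__system__Agents__Cell.parquet")
links = pd.read_parquet("root__system__Links__Neighbour.parquet")
print(cells[["alive", "_id"]].head())
print(links[["_to", "_from"]].head())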

Fields added

Every Parquet table also includes a field tick, which tells you the tick this data was produced for, and a field seed, which gives the random number generator seed used for this run of the simulation.
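
These two fields make it easy to slice the exported data by time step or to confirm reproducibility. For example, a brief pandas sketch:

import pandas as pd

df = pd.read_parquet("root.parquet")

# Select the output produced for the final tick of the run
final = df[df["tick"] == df["tick"].max()]
print(final)
print("seed used for this run:", df["seed"].iloc[0])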

Output directory structure

Folders are created in the root export directory, passed through the config field core.parquet-export-path, to separate and identify the data for different runs.

Scenario run folders

exportFolder/
    {simulation_id}/
        runs/
            scenario0.run0/
                root.parquet
                root__system.parquet
            metadata.json
            DONE
  • exportFolder -> This is the root export directory
  • {simulation_id} -> The UUID created for every run of the simulation (this is the ID used with the REST API)
  • runs -> The root folder for all Parquet run data
  • scenario0.run0 -> The data for each scenario and run will be in its own folder
  • root.parquet, root__system.parquet -> The Parquet files created.
  • metadata.json -> A file containing some metadata about the data produced.
  • DONE -> An empty file created to signal that no new data will be added to this directory.

Batch/Interactive run folders

exportFolder/
    {simulation_id}/
        runs/
            run0/
                root.parquet
                root__system.parquet
            metadata.json
            DONE
  • exportFolder -> This is the root export directory
  • {simulation_id} -> The UUID created for every run of the simulation (this is the ID used with the REST API)
  • runs -> The root folder for all Parquet run data
  • run0 -> The data for each run of the simulation will be in its own folder
  • root.parquet, root__system.parquet -> The Parquet files created.
  • metadata.json -> A file containing some metadata about the data produced.
  • DONE -> An empty file created to signal that no new data will be added to this directory.
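
Because the DONE marker signals completeness, a downstream job can poll for it before consuming a run's data. A minimal sketch of such a scan, assuming the layout above (the export root path is illustrative):

from pathlib import Path

export_root = Path("/data/simudyne/exports")  # core.parquet-export-path

# A DONE marker under runs/ signals that no new data will be added
for done in export_root.glob("*/runs/DONE"):
    runs_dir = done.parent
    print("complete run directory:", runs_dir)
    for parquet_file in sorted(runs_dir.rglob("*.parquet")):
        print("  ", parquet_file.name)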

metadata.json

A metadata file is added to the data export, giving details about the data. The metadata contains:

  • model_name -> The name of the model that we can use to query the API
  • source -> Simudyne
  • source_version -> The version of The Simudyne SDK that produced this data
  • format -> Parquet
  • creation_date -> The date this data was produced
  • schema -> The nested schema that matches this data output
  • custom -> Custom data that can be passed through in the create simulation request
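
The file can be consumed like any other JSON document. A minimal sketch, with an illustrative path in place of the real simulation_id directory:

import json
from pathlib import Path

# Illustrative path; substitute the actual {simulation_id} directory
runs_dir = Path("/data/simudyne/exports") / "my-simulation-id" / "runs"
meta = json.loads((runs_dir / "metadata.json").read_text())

# Fields as documented above
print(meta["model_name"], meta["source_version"], meta["creation_date"])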