Last updated on 16th July 2024
Apache Hive provides a SQL-like interface to query data built on top of Apache Hadoop, allowing Hadoop users to extract this data for further analysis with ease and at scale. The Simudyne SDK (as of version 2.4) allows users to specify that their output data be stored in Hive tables in Parquet format.
The Simudyne SDK will not export to Hive by default. To enable it, set the config field core.hive-export.enabled in the SimudyneSDK.properties file to true.

You must also set core.hive-export-path=hive2://localhost:10000/default, changing the host and port as required. Here default refers to the database the data will be populated to.
Also required are the fields core.export.username and core.export.password, so that authentication to the Hive server can be completed.
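Putting these together, a minimal Hive export configuration in SimudyneSDK.properties might look like the following sketch; the host, port, and credentials are placeholders to substitute with your own:

```
# Enable Hive export and point it at your HiveServer2 instance
core.hive-export.enabled=true
core.hive-export-path=hive2://localhost:10000/default

# Credentials used to authenticate against the Hive server
# (placeholder values - replace with your own)
core.export.username=hive-user
core.export.password=hive-password
```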
Furthermore, there are two additional settings, applying to both local Parquet and Hive output, that a user may wish to change: core.data-export.generic-flush and core.data-export.values-flush. These would typically be set to the same value (the option to set them separately is left to the user for altering the default export or custom channels) and determine how many records are written to a single file, or, in the case of Hive, how many entries are sent in a single query.
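For example, to buffer 1,000 records per file (or per Hive query), both fields could be set as follows; the value here is purely illustrative, not a recommended default:

```
# Number of records buffered before being flushed
# to a single file, or sent in a single Hive query
core.data-export.generic-flush=1000
core.data-export.values-flush=1000
```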
As the underlying format of the Hive output is the same as the local Parquet output, please refer to Parquet Data Export for more information on the difference between Batch and Scenario runs and on how to group by different structures. Note that this will create Parquet output in your Hive table in the same manner.
By default, Agent and Link data is not serialised, and so is not output to Parquet. This reduces the amount of data held in memory when sending the batch results to the console. If the data is being output to Parquet and does not need to be viewed on the console, the in-memory data storage can be turned off, allowing the Simudyne SDK to export Agent and Link data to Parquet as well as the general Model data. This is done by setting the config field core.return-data to false.
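For instance, a run that exports Agent and Link data to Parquet without viewing results on the console would add:

```
# Disable in-memory storage of results for the console,
# allowing Agent and Link data to be exported to Parquet
core.return-data=false
```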
For large model runs that produce a lot of data, setting this config field to false will also reduce the amount of memory being held by the simulation, which can help avoid potential OutOfMemory exceptions and improve the efficiency of the model.
If the data does not need to be displayed on the console and Agent and Link data is not needed in the export either, the config fields core-abm.serialize.agents and core-abm.serialize.links should be set to false to avoid generating unnecessary data, as shown below.
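```
# Skip serialising Agent and Link data altogether
core-abm.serialize.agents=false
core-abm.serialize.links=false
```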
While default values exist for core.data-export.generic-flush and core.data-export.values-flush, control of these values is handed to users, as model output and machine performance can differ vastly. Effectively, the flush value is the number of records held in memory before being written to a file or sent in a Hive query.
While increasing this value will result in fewer files (subsequent files are created with the same name/run structure but with an _n suffix appended, allowing further commands or scripts to parse or combine these files), it will directly affect memory usage.
As memory tends to be a bottleneck for larger-scale simulations, you should lower this value if you are seeing failed batch runs or hitting GC overhead limit errors.
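As a rough illustration of this trade-off, assuming a hypothetical channel producing 100,000 records (the figures here are for illustration only):

```
# With 100,000 records, a flush of 1,000 yields 100 output files
# (suffixed _0 through _99) but holds at most 1,000 records in memory.
core.data-export.values-flush=1000

# A flush of 50,000 yields only two files (_0 and _1) but holds up to
# fifty times as many records in memory before each write.
# core.data-export.values-flush=50000
```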