Hive via Parquet

Last updated on 22nd April 2024

Apache Hive provides a SQL-like interface for querying data built on top of Apache Hadoop, allowing Hadoop users to extract that data for further analysis with ease and at scale. The Simudyne SDK (as of version 2.4) allows users to direct their output data to Hive tables stored in Parquet format.

Hive Parquet Export

The Simudyne SDK will not export to Hive by default. To enable it, set the config field `core.hive-export.enabled` in the `simudyneSDK.properties` file to true.

You must also set `core.hive-export-path=hive2://localhost:10000/default`, changing the localhost and port as required. The `default` segment refers to the table that the data will be populated into.

Also required are the fields `core.export.username` and `core.export.password`, so that authentication with the Hive server can be completed.
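Taken together, a minimal Hive export configuration in `simudyneSDK.properties` might look like the following sketch. The property names are as described above; the host, port, and credential values are placeholders to replace with your own:

```properties
# Enable export of model output to Hive
core.hive-export.enabled=true

# Hive connection string - change the host, port, and target as required
core.hive-export-path=hive2://localhost:10000/default

# Credentials used to authenticate with the Hive server (placeholders)
core.export.username=your-username
core.export.password=your-password
```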

Furthermore, there are two additional settings, applying to both local Parquet and Hive output, that a user may wish to change: `core.data-export.generic-flush` and `core.data-export.values-flush`. These would typically be set to the same value (the option to change them separately is left to the user, for altering the default export or custom channels) and determine how many records are written to a single file or, in the case of Hive, how many entries are sent in a single query.
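For example, both flush settings could be set to the same value, as is typical. The figure used below is purely illustrative, not a recommended default:

```properties
# Number of records held in memory before being written to a single
# Parquet file, or sent to Hive in a single query (illustrative value)
core.data-export.generic-flush=1000
core.data-export.values-flush=1000
```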

(More about Model Config.)

Required Config

The Hive parameters are included in the tutorials and in the default `simudyneSDK.properties` files, and the export cannot function without the export path, username, and password also being set. If you are using 2.4, please make sure these parameters exist in your properties file; they are required lookups, so if they are not in use they should be set to false.
Also, because the same username/password fields are used, you will not be able to output to both SQL and Hive by default.
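As a sketch of the disabled case, assuming Hive export is not in use, the parameters would still be present as lookups, with the export flag switched off (the placeholder values here are not prescribed):

```properties
# Hive export not in use - the lookups must still exist in the file
core.hive-export.enabled=false
core.hive-export-path=hive2://localhost:10000/default
core.export.username=placeholder
core.export.password=placeholder
```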

Hive Output Structure

As the underlying format of the Hive output is the same as a local Parquet output, please refer to Parquet Data Export for more information on the difference between Batch and Scenario runs, and on how to group output by different structures. Note that this will create Parquet output in your Hive table in the same manner.

By default, Agent and Link data is not serialised, and so is not output to Parquet. This reduces the amount of data held in memory when sending batch results to the console. If the data is being output to Parquet and does not need to be viewed on the console, the in-memory data storage can be turned off, allowing the Simudyne SDK to export Agent and Link data to Parquet alongside the general Model data. This is done by setting the config field `core.return-data` to false.

For large model runs that produce a lot of data, setting this config field to false will also reduce the amount of memory being held by the simulation, which can help avoid potential OutOfMemory exceptions and improve the efficiency of the model.

If the data does not need to be displayed on the console and Agent and Link data is not needed either, the config fields `core-abm.serialize.agents` and `core-abm.serialize.links` should be set to false to avoid generating unnecessary data.
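As a sketch, the two scenarios above translate into the following property combinations; apply whichever matches your needs:

```properties
# Option 1: export Agent and Link data to Parquet, but do not hold
# results in memory for display on the console
core.return-data=false

# Option 2: Agent and Link data is not needed at all, so skip
# serialising it to avoid generating unnecessary data
core-abm.serialize.agents=false
core-abm.serialize.links=false
```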

Controlling Flush Size

While default values exist for `core.data-export.generic-flush` and `core.data-export.values-flush`, control of these values is handed to users because model output and machine performance can differ vastly. Effectively, the flush size is the number of records held in memory before being written to a file or sent in a Hive query.

While increasing this value will result in fewer files (subsequent files are created with the same name/run structure but with an `_n` suffix appended, allowing further commands or scripts to parse or combine these files), it will directly increase memory usage.

As memory tends to be a bottleneck for larger-scale simulations, you should adjust this value if you are experiencing failed batch runs or GC overhead limit errors.