Avro Schema

Last updated on 16th July 2024

The model schema is used to describe the model's data output. Before reading output from a model, it's often helpful to read the Avro model schema first to know which tables and fields to expect.

What will a schema contain

The schema object contains a list of fields describing the model in its current state. Some of these fields can be used to input data into the model, some are used only to output data from the model, and some can be used for both input and output.

Avro descriptor fields

  • Name → the name of the field in the output.
  • Type → the kind of data you will see in this field. The type can be a primitive type (long, double, float, boolean, int, string) or a complex type (record, array, enum).
  • Fields → if the type is a record, the data this field describes is an object with sub-fields, whose types are described in a list under 'fields'.
  • Symbols → enum types have a list of the possible enum values under 'symbols'.
  • Items → array types have a field 'items', which describes the type of the items in the array.
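As a rough illustration of how these descriptor fields can be read, the sketch below parses a small schema and prints one line per field. The schema itself is hypothetical (the names Simulation, tick, state, prices and summary are invented), and it assumes the standard Avro JSON layout in which a complex type is a nested object under 'type'; a real model schema may be laid out differently.

```python
import json

# Hypothetical schema fragment in standard Avro JSON layout (not taken from a real model).
SCHEMA_JSON = """
{
  "type": "record",
  "name": "Simulation",
  "fields": [
    {"name": "tick", "type": "long"},
    {"name": "state", "type": {"type": "enum", "name": "State",
                               "symbols": ["IDLE", "ACTIVE"]}},
    {"name": "prices", "type": {"type": "array", "items": "double"}},
    {"name": "summary", "type": {"type": "record", "name": "Summary",
                                 "fields": [{"name": "total", "type": "double"}]}}
  ]
}
"""


def describe(field):
    """Return a one-line description of a field built from its descriptor."""
    ftype = field["type"]
    if isinstance(ftype, str):  # primitive type such as "long" or "double"
        return f"{field['name']}: {ftype}"
    complex_type = ftype["type"]
    if complex_type == "enum":
        return f"{field['name']}: enum of {ftype['symbols']}"
    if complex_type == "array":
        return f"{field['name']}: array of {ftype['items']}"
    if complex_type == "record":
        sub_names = [sub["name"] for sub in ftype["fields"]]
        return f"{field['name']}: record with fields {sub_names}"
    return f"{field['name']}: {complex_type}"


schema = json.loads(SCHEMA_JSON)
descriptions = [describe(f) for f in schema["fields"]]
for line in descriptions:
    print(line)
```

Reading the schema this way before opening any output gives a quick list of which fields are plain values and which need further unpacking.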

More information on the general schema format can be found on the Avro website: https://avro.apache.org/docs/1.8.1/spec.html

Simudyne specific descriptor fields

Every field record may contain descriptor fields that are used to pass extra information about the fields.

exportDataset

This will be present for all primitive and enum fields, as well as arrays. It tells you the name of the dataset this field will be exported to (in the case of Parquet, the name of the Parquet file).

exportField

This will be present for all primitive and enum fields. This tells you the name this field will be given in the dataset after export.
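One way to use exportDataset and exportField together is to work out, ahead of time, which columns to expect in each exported dataset. The sketch below assumes both attributes sit directly on the field record; the field names and dataset names are hypothetical.

```python
from collections import defaultdict

# Hypothetical field descriptors; names and datasets are invented for illustration.
fields = [
    {"name": "gdp", "type": "double",
     "exportDataset": "economy", "exportField": "gdp"},
    {"name": "inflation", "type": "double",
     "exportDataset": "economy", "exportField": "inflation"},
    {"name": "tickCount", "type": "long",
     "exportDataset": "run", "exportField": "tick_count"},
    # Arrays carry exportDataset but no exportField of their own.
    {"name": "history", "type": {"type": "array", "items": "double"},
     "exportDataset": "history"},
]


def columns_by_dataset(fields):
    """Map each dataset name to the list of column names exported into it."""
    datasets = defaultdict(list)
    for field in fields:
        dataset = field.get("exportDataset")
        if dataset is None:
            continue  # field is not exported as its own column or dataset
        columns = datasets[dataset]  # registers the dataset even without columns
        if "exportField" in field:
            columns.append(field["exportField"])
    return dict(datasets)


layout = columns_by_dataset(fields)
print(layout)
```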

displayName

The user might specify a display name for this output field. This is a user-friendly name that might contain spaces or special characters, and so cannot be used as the field name.

displayGroup

Input and output fields are given a display group so they can be sorted or grouped. A lower display group should appear higher on the list of inputs.
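A minimal sketch of how displayName and displayGroup might be combined when listing inputs is shown below. It assumes displayGroup is a number and that a field without a displayName falls back to its name; both are assumptions, and the descriptors are hypothetical.

```python
# Hypothetical input descriptors; displayGroup is assumed to be an integer.
inputs = [
    {"name": "interestRate", "displayName": "Interest rate (%)", "displayGroup": 2},
    {"name": "nbAgents", "displayName": "Number of agents", "displayGroup": 1},
    {"name": "seed", "displayGroup": 3},  # no displayName supplied
]


def display_order(fields):
    """Sort fields by displayGroup (lowest first) and return a label for each."""
    ordered = sorted(fields, key=lambda f: f.get("displayGroup", float("inf")))
    # Fall back to the field name when no displayName is present (an assumption).
    return [f.get("displayName", f["name"]) for f in ordered]


labels = display_order(inputs)
print(labels)
```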

kind

The kind of a field is the annotation specified for the field in the model.

  • Input
  • Constant
  • Output
  • Variable → This field is an output field. It is used to report some data from the model. A field that is of kind Variable can also be an input field so that the field value can be set before setup to some starting value. A variable field that can be set with some starting value will also have "initializable": true present in the record.
  • Custom → This field will be a complex object, with its own nested fields. Some of these nested fields will have their own kinds.

All top-level fields in the Simulation object will have a kind, but if some of these fields are records, the fields they contain might not have a kind. We will discuss later what sort of object might contain fields without a kind and how to read them.

initializable

This may be used where kind → Variable. It shows that the field can be given a starting value before setup.
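To illustrate how kind and initializable can be read together, the sketch below groups some field descriptors by kind and picks out the Variable fields that accept a starting value. The descriptors are hypothetical, and the exact spelling of the kind values ("Input", "Variable", and so on) is assumed to match the list above.

```python
from collections import defaultdict

# Hypothetical top-level field descriptors.
fields = [
    {"name": "rate", "kind": "Input", "type": "double"},
    {"name": "gravity", "kind": "Constant", "type": "double"},
    {"name": "wealth", "kind": "Variable", "type": "double", "initializable": True},
    {"name": "defaults", "kind": "Variable", "type": "long"},
]


def names_by_kind(fields):
    """Group field names by their kind; fields without a kind are skipped."""
    groups = defaultdict(list)
    for field in fields:
        if "kind" in field:
            groups[field["kind"]].append(field["name"])
    return dict(groups)


def initializable_variables(fields):
    """Return names of Variable fields that can be given a starting value."""
    return [
        f["name"]
        for f in fields
        if f.get("kind") == "Variable" and f.get("initializable") is True
    ]


by_kind = names_by_kind(fields)
settable = initializable_variables(fields)
print(by_kind)
print(settable)
```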

subtype

This might be used for a field with type record or array. The subtype lets you know what type of object this field will be, and the fields it will contain.

Possible subtypes and the fields these objects will contain

SDLongAccumulator / SDDoubleAccumulator → Accumulators are used to report data. Accumulators will always have two fields:

  • value → The value of the accumulator at this point. This will either be of type 'long', or 'double'.
  • count → The number of individual values that the accumulator has had added to it. This will always be of type 'long'.
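The accumulator shape described above can be checked mechanically. The sketch below is only an illustration: it assumes subtype sits on the field record, that the nested fields live under 'type' in standard Avro layout, and that the type of value follows the accumulator's name (long for SDLongAccumulator, double for SDDoubleAccumulator). The field name trades is hypothetical.

```python
# Hypothetical descriptor for a long accumulator.
accumulator_field = {
    "name": "trades",
    "subtype": "SDLongAccumulator",
    "type": {
        "type": "record",
        "name": "trades",
        "fields": [
            {"name": "value", "type": "long"},
            {"name": "count", "type": "long"},
        ],
    },
}

# Assumed mapping from accumulator subtype to the type of its 'value' field.
EXPECTED_VALUE_TYPE = {
    "SDLongAccumulator": "long",
    "SDDoubleAccumulator": "double",
}


def is_valid_accumulator(field):
    """Check that a field has exactly the 'value' and 'count' sub-fields expected."""
    expected = EXPECTED_VALUE_TYPE.get(field.get("subtype"))
    if expected is None:
        return False  # not an accumulator subtype
    ftype = field.get("type")
    if not isinstance(ftype, dict):
        return False
    sub_types = {sub["name"]: sub["type"] for sub in ftype.get("fields", [])}
    return sub_types == {"value": expected, "count": "long"}


print(is_valid_accumulator(accumulator_field))
```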

AgentStatistics → Agent statistics are used to report data about specific fields inside agents. Agent statistics will contain an array of fields, each of which is an object with the following fields:

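  • mean → the mean of this field.
  • max → the max value of this field.
  • min → the min value of this field.
  • stddev → the standard deviation of this field.
  • sum → the sum of this field.
  • variance → the variance of this field.
  • count → the number of values of this field.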

All of these fields will be of type 'double'.

SDAgent → This will be of type array and will contain a list of records, one for each agent in the system. Each record in this array will contain a field of name _id with the unique id of this agent and might contain some fields with kinds, used to report data about specific agents.

SDLink → This will be of type array and will contain a list of records, one for each link in the system. Each record in this array will contain two fields:

  • _to → a field of type 'long' giving the id of the agent this link connects to.
  • _from → a field of type 'long' giving the id of the agent this link is from.
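Once link records have been read from the output, the _from and _to ids can be turned into a simple adjacency map. The sketch below uses hypothetical link records held as plain dictionaries; how the records are loaded from the exported data is outside its scope.

```python
from collections import defaultdict

# Hypothetical link records as they might appear after reading the output.
links = [
    {"_from": 1, "_to": 2},
    {"_from": 1, "_to": 3},
    {"_from": 2, "_to": 3},
]


def adjacency(links):
    """Map each source agent id to the list of agent ids it links to."""
    neighbours = defaultdict(list)
    for link in links:
        neighbours[link["_from"]].append(link["_to"])
    return dict(neighbours)


graph = adjacency(links)
print(graph)
```

The ids here correspond to the _id values reported in the SDAgent records, so the map can be joined back to per-agent data.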

SDAgentSystem → This may contain:

  • Some fields with kinds. These are fields that are unique to this agent system. In the case where there are multiple agent systems, multiple agent systems could contain fields with the same names.
  • A field with the name Agents of type record and subtype SDAgentList. Every sub-record in this record will be of subtype SDAgent.
  • A field with the name Links of type record and subtype SDLinkList. Every sub-record in this record will be of subtype SDLink.
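The nesting described above can be explored with a small recursive walk that lists every field path along with its subtype or kind. The schema fragment below is hypothetical (the names system, population, Trader, wealth and Trade are invented), and it assumes standard Avro nesting with subtype and kind stored on the field record; a real schema may differ in layout.

```python
# Hypothetical agent system descriptor in standard Avro layout.
agent_system = {
    "name": "system", "subtype": "SDAgentSystem",
    "type": {"type": "record", "name": "System", "fields": [
        {"name": "population", "kind": "Variable", "type": "long"},
        {"name": "Agents", "subtype": "SDAgentList",
         "type": {"type": "record", "name": "Agents", "fields": [
             {"name": "Trader", "subtype": "SDAgent",
              "type": {"type": "array", "items": {
                  "type": "record", "name": "Trader", "fields": [
                      {"name": "_id", "type": "long"},
                      {"name": "wealth", "kind": "Variable", "type": "double"},
                  ]}}},
         ]}},
        {"name": "Links", "subtype": "SDLinkList",
         "type": {"type": "record", "name": "Links", "fields": [
             {"name": "Trade", "subtype": "SDLink",
              "type": {"type": "array", "items": {
                  "type": "record", "name": "Trade", "fields": [
                      {"name": "_to", "type": "long"},
                      {"name": "_from", "type": "long"},
                  ]}}},
         ]}},
    ]},
}


def nested_fields(field):
    """Return the sub-fields of a record, or of the records inside an array."""
    ftype = field.get("type")
    if not isinstance(ftype, dict):
        return []  # primitive type: nothing nested
    if ftype.get("type") == "record":
        return ftype.get("fields", [])
    if ftype.get("type") == "array" and isinstance(ftype.get("items"), dict):
        return ftype["items"].get("fields", [])
    return []


def walk(field, prefix=""):
    """List (path, label) pairs, where label is the subtype, the kind, or '-'."""
    path = f"{prefix}{field['name']}"
    label = field.get("subtype") or field.get("kind") or "-"
    rows = [(path, label)]
    for sub in nested_fields(field):
        rows.extend(walk(sub, prefix=path + "."))
    return rows


rows = walk(agent_system)
for path, label in rows:
    print(f"{path}: {label}")
```

Fields labelled '-' are those without a kind or subtype, such as _id, _to and _from.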