Generate Elixir typedstructs and various useful helper functions from AVRO schemas at compile time.
While there exists a handful of libraries to encode and decode AVRO messages in Elixir, all of them consume schemas at runtime, which has the advantage of flexibilty e.g. this approach can be used with a schema registry, but you lose the any compile time type safety for your types.
Avrogen generates Elixir code from AVRO schemas, turning each record into module
containing a typedstruct
and a bunch of helper functions to encode and decode
the struct to and from AVRO binary format.
For example, the following schema...
{
"type": "record",
"namespace": "foo",
"name": "Bar",
"fields": [
{ "name": "baz", "type": ["null", "string"] },
{ "name": "qux", "type": "int" }
]
}
... generates a module foo/Bar.ex
which (with documentation and various bits
of implementation omitted for the sake of brevity) looks like this:
defmodule Avro.Foo.Bar do
use TypedStruct
@expected_keys MapSet.new(["baz", "qux"])
@pii_fields MapSet.new([])
typedstruct do
field :baz, nil | String.t()
field :qux, integer(), enforce: true
end
def avro_fqn(), do: "foo.Bar"
def to_avro_map(...) do ... end
def from_avro_map(...) do ... end
def pii_fields(), do: @pii_fields
def drop_pii(...) do ... end
def random_instance(rand_state) do ... end
end
The main feature here is the typedstruct
, which allows us to initialize this
module using the struct syntax:
%Avro.Foo.Bar{
baz: "quux",
qux: 12
}
The other helper functions provide extra functionality which are explained below.
If available in Hex, the package can be installed
by adding avrogen
to your list of dependencies in mix.exs
:
def deps do
[
{:avrogen, "~> 0.4.3", organization: "prima"}
]
end
The easiest way to use avrogen is to add :avro_code_generator
to your list of
compilers in your mix.exs
file, making sure to place it before the other
compilers so all Elixir code is in place before the Elixir compiler runs.
compilers: [:avro_code_generator | Mix.compilers()]
You'll also need to tell the elixir compiler to build the generated code, which
can be acheived by adding the generated
dir (the default destination
directory) to your elixirc_paths
.
elixirc_paths: ["lib", "generated"]
Now, you can create a new directory called schemas
at the root of your project
and put some .avsc
files in there. They will be built and compiled whenever
things need to get recompiled, so just run your mix commands as usual.
While the defaults might be OK for some folks, you can configure the generator
task from your mix.exs file, using the avro_code_generator_opts
key.
E.g.
avro_code_generator_opts: [
paths: ["schemas/*.avsc"],
dest: "generated",
schema_root: "schema",
module_prefix: "Avro",
scoped_embed_paths: ["priv/schema/events.*.avsc"],
schema_resolution_mode: :flat
]
The options are:
-
paths
- an array of file paths or wildcards to locate schema files. Defaults to"schemas/*.avsc"
. -
dest
- A directory of where to put the generated elixir code. Defaults to"generated"
. -
schema_root
- The root of the schema directory, this is the root dir that will be used to resolve schemas located in other files. Defaults to"schemas"
-
module_prefix
- String to place at the front of generated elixir modules. Defaults to"Avro"
. -
schema_resolution_mode
- Tells the code generator how to resolve external schemas to a filename. Defaults to:flat
. -
scope_embed_paths
- the glob patterns of the files where any embedded scopes should have the generated module path contain the encompasing types.For example, for the following schema
{ "name": "Event", "namespace": "events", "type": "record", "fields": [ { "name": "details", "type": { "name": "Subtype", "type": "record", "fields": [ ... ] } } ] }
If this file is included in the scoped_embed_paths, then the generated module for
Subtype
would be calledEvents.Event.Subtype
otherwise it would beEvents.Subtype
. This option is useful when you have naming clashes in embedded schema subtypes, or if you simply want to namespace subtypes to avoid potential future clashes
Firstly, you'll need to start the Schema Registry process by adding the following entry to your Application file:
@impl true
def start(_type, _args) do
children = [
...
# Start a schema registry
{Avrogen.Schema.SchemaRegistry, Application.get_application(__MODULE__)},
...
]
...
end
Now you can create new records in code using the full module name, which is comprised of your prefix + the namespace + name of the record.
E.g. the record:
{
"namespace": "foo",
"name": "Bar",
"fields": [
{"name": "quz", "type": ...}
]
}
... will result in a module called Avro.Foo.Bar
, which can be used like any
other normal struct:
message = %Avro.Foo.Bar{quz: ...}
Encode this module to a binary using the Avrogen.encode_schemaless/1
function:
{:ok, bytes} = Avrogen.encode_schemaless(message)
You can decode it back into a struct using the Avrogen.decode_schemaless/2
function, mind that you'll need to pass in the module name as avro binaries
don't encode their type.
{:ok, message} = Avrogen.decode_schemaless(Avro.Foo.Bar, bytes)
Each generated module comes with the @derive Jason.Encoder
attribute, which
tells JSON that the struct can be encoded by simply serializing everything other
than the __struct__
field. See https://hexdocs.pm/jason/Jason.Encoder.html for
more details.
Thus, converting a message to JSON is as simple as:
iex> message = %Avro.Test.Person{name: "Dave", age: 37}
iex> Jason.encode(message)
{:ok, "{\"age\":37,\"name\":\"Dave\"}"}
Note: This is not the same as the official AVRO JSON encoding spec, and is mainly used for debugging / making messages human readable.
Schemas commonly depend on other schemas, which can be located in a different file.
The code generator has two different modes for file resolution: tree mode and flat mode.
Both modes work from a root directory passed in via the schema_root
code
generator option (see above).
In :flat
mode, schemas are expected to be a flat list of files in the root dir
like so:
name.space.SchemaName -> root/name.space.SchemaName.avsc
This is how libraries like python's fastavro
expect schemas to be laid out.
In :tree
mode, the namespace is split into directories like so:
name.space.SchemaName -> root/name/space/SchemaName.avsc
This is how libraries like avrora
expect schemas to be laid out.
The table below lists supported primitive AVRO types, and their corresponding Elixir type:
AVRO Type | Elixir Type |
---|---|
null |
nil |
int |
integer |
double |
float |
string |
String.t() |
bytes |
binary |
array |
list |
map |
%{ String.t() => Value.t() } |
The following logical types are also supported:
AVRO Logical Type | AVRO Underlying Type | Elixir Type |
---|---|---|
uuid |
string |
String |
big_decimal |
string |
Decimal |
decimal |
string |
Decimal |
decimal |
bytes |
Decimal |
date |
int |
Date |
date |
string |
Date |
iso_date |
string |
Date |
datetime |
string |
DateTime |
iso_datetime |
string |
DateTime |
time-millis |
int |
Time |
time-micros |
long |
Time |
timestamp-millis |
long |
DateTime |
timestamp-micros |
long |
DateTime |
local-timestamp-millis |
long |
NaiveDateTime |
local-timestamp-micros |
long |
NaiveDateTime |
The following AVRO types are not supported (yet):
float
(usedouble
)long
(useint
)fixed
Avrogen introduces an unofficial extension to AVRO schema specification which
can be used to mark record's fields as PII (Personally Identifiable
Information). Each generated record module gets a drop_pii/1
function which
recursively strips away all fields marked as PII in the record, and any records
contained within.
Mark a field as PII by adding pii: true
option to the field. For example
imagine you are storing names and ages of people, and the name is PII (but the
age isn't).
{
"type": "record",
"name": "Person",
"namespace": "example",
"fields": [
{ "name": "name", "type": ["null", "string"], "pii": true },
{ "name": "age", "type": "int" }
]
}
Then you can simply call drop_pii/1
on your record to replace all the PII
fields with nil
like so:
ex> person = %Avro.Example.Person{name: "John Smith", age: 38}
%Avro.Example.Person{age: 38, name: "John Smith"}
ex> Avro.Example.Person.drop_pii(person)
%Avro.Example.Person{age: 38, name: nil}
Note: Fields marked as PII must be of a union type containing a null.
The AVRO spec specifies that any extra fields in schemas are ignored, so schemas containing this extension are backwards compatible with other AVRO parsers, as they will just ignore this field.
Each generated module contains a function to create a random instance of the record/enum. This can be useful for fuzz testing, among other things.
E.g. Using the Person
example above, the generated module will contain the
following function:
def random_instance(rand_state) do
# ...
end
The function expects to be given an erlang random state type object, which can be seeded in one of many ways depending on what you want to do with it. The simplest way to create this random state is to generate it with the default generator -
:rand.seed(:default)
, as demonstrated below.
You can use this random_instance/1
function to generate random instances of
the module's struct, for example:
iex> state = :rand.seed(:default)
iex> {state, person} = Avro.Example.Person.random_instance(state)
iex> person
%Avro.Foo.Bar{
name: <<29, 120, 54, 75, 84, 54, 70, 29, 48, 68, 87, 87>>,
age: 1812334491
}
In this example, person
is a random instance of the Avro.Example.Person
record, and state
is the mutated state which can be used again to pass to the
next call to random_instance/1
.
The various types produce random values according to the following rules:
Type | Rule |
---|---|
null |
Always produces nil . |
union |
Random instance of any of the types within the union, where each type is equally likely to be chosen. |
string |
Random utf8 binary of up to 1000 codepoints, where each codepoint lies in the range 0 <= codepoint < 10,000 . |
int , double , big_decimal (logical type) |
Random value in the range -2,147,483,648 <= value < 2,147,483,648 . |
iso_date and iso_datetime (logical types) |
Random value in the range 1970-01-01T00:00:00 <= value < 2045-01-01T00:00:00 . |
enum |
One of the symbols selected at random, with each symbol having an equal probability of showing up. |
array |
Random list of up to 10 elements, where the value of each element is a random instance of the array's element type. |
record |
Random instance of that record, where each field of the record is generated randomly according to the above rules. |
You can control how random you really want these random instances to be using some more unofficial extensions to the avro spec. For example, you can specify the max and min values of int type fields using the "range" specifier like so:
{
"name": "age",
"type": "int",
"range": {
"int": {
"max": 80,
"min": 16
}
}
}
Now when you call random_instance/1
, the age field will be limited to the
range 16 <= age < 80
.
Strings can be formatted according to semantic formatting. Currently the only supported type is "postcode", but support for more types may well be added in the future.
E.g.
{
"name": "postcode",
"type": "string",
"range": {
"string": {
"semantic_type": "postcode"
}
}
}
Now, the postcode field will be limited to random postcodes (e.g. BS23 7SX
),
rather than completely random strings.
AVRO's avsc format is not always the easiest format to maintain. Because it uses
JSON, variables and comments are not allowed. Thus, avrogen
comes with a
schema generator tool to help ease the process.
This tool can be optionally used by adding it to the list of compilers for your
project's mix.exs
file like so:
compilers: [:avro_schema_generator | Mix.compilers()]
Note: If you are also using the avro_code_generator
, then you will need to put
the schema generator before the code generator, as the code generator requires
schemas in order to do its job.
Then, also in your mix.exs
file, configure the code generator using the
following options:
avro_schema_generator_opts: [
paths: ["exs_schemas/**/*.exs"],
dest: "schemas",
schema_resolution_mode: :flat
],
Where the options are as follows:
paths
: A wildcard expression which matches the location of your schema definition files. Defaults toexs_schemas/**/*.exs
.dest
: Where to put generated schema files. Defaults toschemas
.schema_resolution_mode
: How to structure the dest dir (see the schema resolution section above). Options are:flat
or:tree
. Defaults to:flat
.
So what goes in these schema definition files? All files should contain a single
module which implements the Avrogen.Schema.SchemaModule
behaviour. For
example:
defmodule Person do
alias Avrogen.Schema.SchemaModule
@behaviour SchemaModule
@impl SchemaModule
def schema_name(), do: "application_data.v2"
@impl SchemaModule
def avsc(), do: avro_schema()
# "type": "record",
# "name": "Person",
# "namespace": "example",
# "fields": [
# {"name": "name", "type": ["null", "string"], "pii": true},
# {"name": "age", "type": "int"}
# ]
@person %{
type: :record,
name: "Person ",
namespace: "example",
doc: "Describes a person.",
fields: [
%{
name: :name,
type: [:null, :string],
doc: """
The name of the person.
"""
},
%{
name: :age,
type: :int,
doc: """
The age of the person.
"""
}
]
}
@avro_spec [
@person
]
def avro_schema() do
Jason.encode!(@avro_spec)
end
@impl SchemaModule
def avro_schema_elixir() do
@avro_spec
end
end
When you next compile the code with e.g. mix compile
, the following avsc
schema will be generated:
{
"doc": "Describes a person.",
"fields": [
{
"doc": "The name of the person.\n",
"name": "name",
"type": [
"null",
"string"
]
},
{
"doc": "The age of the person.\n",
"name": "age",
"type": "int"
}
],
"name": "Person ",
"namespace": "example",
"type": "record"
}
There's not much magic here, but it should be evident how elixir variables and
constructs can be used to reduce repetition in the definitions of the schemas.
It's worth noting that these schema definitions are used to define lists of
schemas in one go. Each individual schema is pulled out and placed into the
target destination, and the file name is structured like so:
<dest>/<namespace>.<name>.avsc
, which is the appropriate format for the avro
code generator to use later down the line.
Note that once you enable this tool, it completely takes over the dest
directory, so any other files found in here will most likely be removed.
Avrora is an Elixir library for encoding/decoding avro messages, with options to integrate with a schema registry.
Avrora can work in conjunction with avrogen quite nicely, with avrogen generating the elixir code, and avrora handling communication with the schema registry and encoding/decoding of messages.
Avrogen expects schemas to be stored in the filesystem in a "tree" style format,
so make sure to set the option schema_resolution_mode
to :tree
for both
generators. Once you have configured the avrora cache (see docs on their
README), you can then use
avrogen's typedstructs to create the messages and do some basic type/key
checking, and avrora to encode/decode the messages.
For example, to encode...
%module{} = message = %Avro.Test.Person{name: "John Smith", age: 38}
name = module.avro_fqn()
map = module.to_avro_map(message)
{:ok, bytes} = Avrora.encode(map, schema_name: name)
# Do something with the bytes
... and then to decode ...
{:ok, [decoded]} = Avrora.decode(bytes, schema_name: Avro.Test.Person.avro_fqn())
{:ok, message} = Avro.Test.Person.from_avro_map(decoded)
# Do something with the message
Note: Of course, this assumes you know the decoded message type ahead of time. You can ask Avrora to infer the message type using magic headers by calling
decode/1
rather thandecode/2
(omitting theschema_name
option), but it doesn't disclose the inferred schema name to the caller, which is not particularly useful.