Add JSON schema descriptions for BQ and ES #18

cdzombak · 2018-04-06T16:27:50Z

This PR adds two new output formats, docs-es and docs-bq. These formats are a JSON description of the schema, with a similar format between the two schemas. See https://gist.github.com/cdzombak/1d47457fd2b82f6e7bbe95508786c4a7 for example output based on the current ZTag schema.

This includes…

The possible values for Enum leaves
The doc field for records and leaves
A new examples field, available on records and leaves
The "detail type" for leaves. This is the type in our schema, and it'll be useful in cases where, e.g., BigQuery represents something as a string but we internally represent it as a type that communicates more info (like an enum).
A new cascading category property on records and leaves

The category field is intended to be a human-usable categorization of a record and its subrecords. This value cascades, so if you set it on a record it'll apply that category to all subrecords & leaves. A subrecord or leaf can also set a different category for itself and its children.
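A minimal sketch of the cascading behaviour described above, written against a hypothetical `Node` class rather than zschema's real record and leaf classes. The `docs()` method, the `children` argument, and the `fields` output key are assumptions made for illustration only.

```python
class Node(object):
    """Hypothetical stand-in for a zschema record or leaf."""

    def __init__(self, doc=None, category=None, children=None):
        self.doc = doc
        self.category = category
        self.children = children or {}

    def docs(self, parent_category=None):
        # A node's own category wins; otherwise it inherits the parent's.
        category = self.category or parent_category
        out = {"doc": self.doc, "category": category}
        if self.children:
            # Children receive the resolved category, so it keeps cascading.
            out["fields"] = {
                name: child.docs(parent_category=category)
                for name, child in sorted(self.children.items())
            }
        return out


schema = Node(category="TLS", children={
    "version": Node(doc="Negotiated version"),
    "certificate": Node(category="Certificates", children={
        "subject": Node(doc="Subject DN"),
    }),
})
described = schema.docs()
```

Here `version` ends up in the "TLS" category inherited from the root, while `certificate` and everything beneath it report "Certificates".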

I have also removed some unused code/ivars/arguments, unimplemented methods, and soon-to-be-unused output formats (we'll use these annotated outputs to build better documentation tools).

closes #21

zakird · 2018-04-06T17:53:17Z

zschema/__main__.py

@@ -27,26 +28,15 @@ def main():
        print json.dumps(record.to_bigquery())
    elif command == "elasticsearch":
        print json.dumps(record.to_es(recname))
+    elif command == "es-annotated":
+        print json.dumps(record.to_es(recname, annotated=True))


What's es-annotated mean? Why would we output annotated data to elasticsearch? What is annotated data?

cdzombak · 2018-04-09T20:17:42Z

Test failure is fixed in #22.

justinbastress · 2018-04-10T15:16:35Z

zschema/compounds.py

@@ -10,9 +10,10 @@ def _is_valid_object(name, object_):

 class ListOf(Keyable):

-    def __init__(self, object_, max_items=10):
+    def __init__(self, object_, max_items=10, category=None):


It looks like ListOf doesn't recognize doc; should that be pushed up to Keyable?

Maybe I've been looking at this wrong; for the primitive types at least, everything works because e.g. String() returns a new instance each time -- "this field is an instance of the String type with these options".

So maybe the problem is using SubRecord(), which returns an instance of the type instead of returning a constructor that can be used like String(); i.e., we need something like Certificate = SubRecordType({ ... definition ...}), then "IssuerCert": Certificate(doc="The server certificate's issuer"). That might fit better with the current model than trying to introduce separate docs for fields and types.

To circle back, ListOf() already works like this (unless you wanted to create an "alias" of a list type -- in which case you would need e.g. CertPool = ListOfType(Certificate(doc="A certificate pool entry."))).

This actually dovetails with a broader issue I've come across, where we are attaching things like doc to the types, where they should really be attached to fields.

An easy example is for certificates; you would probably want different docs on the issuer certificate and the subject certificate (though the docs for the fields inside would obviously be the same between them).

We could hack around this by making different "types" for each field, but that doesn't seem right.

Another example -- the SSH logs include both the client EndpointId and the server EndpointId, which are the same struct at the go level, but in the schema, one contains AnalyzedString() types, while the other has String() types.
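The SubRecordType idea floated in this comment could look roughly like the sketch below. Nothing here is zschema's actual API: `SubRecordType` is only a proposal in this thread, and the simplified `SubRecord` class and the definition contents are hypothetical.

```python
class SubRecord(object):
    """Simplified, hypothetical stand-in for zschema's SubRecord."""

    def __init__(self, definition, doc=None):
        self.definition = definition
        self.doc = doc


def SubRecordType(definition, doc=None):
    """Return a constructor; each call yields a fresh field-level instance."""
    type_doc = doc

    def construct(doc=None):
        # A field-level doc overrides the type-level default when given.
        return SubRecord(definition, doc=doc if doc is not None else type_doc)

    return construct


# The type is defined once and instantiated per field.
Certificate = SubRecordType({"fingerprint": "string"}, doc="An X.509 certificate.")

chain = {
    "IssuerCert": Certificate(doc="The server certificate's issuer"),
    "SubjectCert": Certificate(),
}
```

Both fields share one definition, but each instance carries its own doc, which is the field-versus-type distinction raised above.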

It looks like ListOf doesn't recognize doc; should that be pushed up to Keyable?

My initial thinking here was that a list just lists things of a given type, so the documentation from the type being listed ought to be sufficient. Thinking more, that is wrong: a list of strings has a distinct meaning from a string. It makes sense for a list itself to have documentation describing what the list actually does — the context for the objects of this type — and that should be easy to add (though even if some part of this moves up to Keyable, we'll still have to modify the initializer for ListOf).

So maybe the problem is using SubRecord(), which returns an instance of the type instead of returning a constructor that can be used like String(); i.e., we need something like Certificate = SubRecordType({ ... definition ...}), then "IssuerCert": Certificate(doc="The server certificate's issuer"). That might fit better with the current model than trying to introduce separate docs for fields and types.

…

This actually dovetails with a broader issue I've come across, where we are attaching things like doc to the types, where they should really be attached to fields.

If I can rephrase, to be sure we're on the same page: a record definition that gets reused, like Certificate, currently cannot have different documentation depending on the context where it is reused.

This is a definite limitation that's worth calling out. My thinking here is basically "hopefully it's clear from context what the (certificate, etc.) is," e.g. the subject and issuer certs can be differentiated by the fact that one is named "subject" and the other is named "issuer."

Removing this limitation would be a major change that would, I think, require some nontrivial changes to zschema and major changes to our schema. I am not sure we want to undertake that right now.

Removing this limitation would be a major change that would, I think, require some nontrivial changes to zschema and major changes to our schema. I am not sure we want to undertake that right now.

I agree with that, @cdzombak; nice to have, and should happen, but out of scope of this PR.

It makes sense for a list itself to have documentation describing what the list actually does — the context for the objects of this type — and that should be easy to add (though even if some part of this moves up to Keyable, we'll still have to modify the initializer for ListOf).

@justinbastress & @cdzombak: do we want to implement that as part of this PR?

do we want to implement that as part of this PR?

I'll work on it now.

5be9e26 introduces a doc field for ListOf.
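For readers without the commit at hand, here is a rough sketch of what a list-level doc might look like. It is not the code from 5be9e26: the position of the `doc` keyword, the minimal `String` leaf, and the `repeated` output key are all assumptions.

```python
class String(object):
    """Hypothetical minimal leaf, just enough to be listed."""

    def __init__(self, doc=None):
        self.doc = doc

    def docs_bq(self, parent_category=None):
        return {"type": "STRING", "doc": self.doc, "category": parent_category}


class ListOf(object):
    def __init__(self, object_, max_items=10, doc=None, category=None):
        self.object_ = object_
        self.max_items = max_items
        self.doc = doc
        self.category = category

    def docs_bq(self, parent_category=None):
        category = self.category or parent_category
        retv = self.object_.docs_bq(parent_category=category)
        # The list's own doc, when present, describes the field in context
        # and replaces the doc of the element type.
        if self.doc is not None:
            retv["doc"] = self.doc
        retv["repeated"] = True  # assumed key marking the field as a list
        return retv


names = ListOf(String(doc="A DNS name"),
               doc="All names on the certificate",
               category="Certificates")
names_doc = names.docs_bq()
plain_doc = ListOf(String(doc="A DNS name")).docs_bq(parent_category="TLS")
```

When the list has no doc of its own, the element type's doc is kept, which matches the earlier intuition that the listed type's documentation is often sufficient.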

justinbastress · 2018-04-10T15:45:04Z

zschema/compounds.py

@@ -33,9 +34,22 @@ def to_bigquery(self, name):
        retv["mode"] = "REPEATED"
        return retv

+    def docs_bq(self, parent_category=None):
+        retv = self.object_.docs_bq()
+        category = self.category if self.category else parent_category


Should this be pulled out into a method (e.g. get_category())?

Should this be pulled out into a method (e.g. get_category())?

Good note, @justinbastress. While it might be an intuitive restructuring, it wouldn't gain us much since we still have to fall back to the parent_category passed into this method. Let's just leave the code as is (it could be made terser with category = self.category or parent_category, but meh).

^ what Andrew said. I do like the more concise suggestion, implemented in 5136ebf
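A quick check that the two spellings discussed here behave the same, including for falsy values such as an empty string. The function names are made up for this comparison.

```python
def pick_conditional(own, parent):
    # Original form from the diff.
    return own if own else parent


def pick_or(own, parent):
    # Terser form suggested in review.
    return own or parent


cases = [("Certificates", "TLS"), (None, "TLS"), ("", "TLS"), (None, None)]
results = [(pick_conditional(own, parent), pick_or(own, parent))
           for own, parent in cases]
```

Both forms treat any falsy category (None or "") as "not set", so an empty string cannot be used to clear an inherited category in either version.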

justinbastress · 2018-04-10T16:56:28Z

zschema/compounds.py

+        retv = {
+            "category": category,
+            "doc": self.doc,
+            "type": self.__class__.__name__,


It seems it would be nice if this could be overridden...

__class__.__name__ doesn't keep package information (?)

Constrains our type names to match python class name rules (and constrains our class names to match our type names rules?)

Beyond the scope of this PR, but I just noticed it.

It seems it would be nice if this could be overridden
…
Beyond the scope of this PR, but I just noticed it.

Good point, @justinbastress. It looks like coupling to __class__.__name__ has been the idiom for a while (e.g., an initial commit for the CLI), but an area for future improvement.
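One possible shape for the override mentioned above, offered only as a sketch. The `TYPE_NAME` attribute, the `type_name()` and `qualified_name()` helpers, and the `IPv4Address` class are hypothetical and not part of zschema.

```python
class Leaf(object):
    # Hypothetical override hook: a subclass may set TYPE_NAME explicitly.
    TYPE_NAME = None

    @classmethod
    def type_name(cls):
        # Look only at this class's own attributes so that an override is
        # not silently inherited by further subclasses.
        return cls.__dict__.get("TYPE_NAME") or cls.__name__


def qualified_name(cls):
    # __name__ alone drops the package; __module__ restores it.
    return "%s.%s" % (cls.__module__, cls.__name__)


class AnalyzedString(Leaf):
    pass


class IPv4Address(Leaf):
    TYPE_NAME = "ipv4-address"  # not a valid Python identifier
```

With this, the emitted type name no longer has to follow Python's class naming rules, and the default behaviour stays the same as `self.__class__.__name__`.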

justinbastress · 2018-04-10T18:02:42Z

zschema/leaves.py

+            "doc": self.doc,
+            "required": self.required,
+        }
+        if hasattr(self, "values_s") and len(self.values_s):


Why are retv["values"] and retv["examples"] mutually exclusive?

It would be nice if the values could be individually documented (something like "Algorithms": Enum(documented_values={"value1": "docs for value1"}), except more natural)

It seems reasonable to let different types define their own type of '.doc' property. The default could be to just hand back ._doc but could also do something else like compile a list of enumerated values into a doc string.

Why are retv["values"] and retv["examples"] mutually exclusive?

If there is an exhaustive list of possible values, I see no reason to support a list of example values as well.
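The reasoning above amounts to the following rule, sketched with simplified stand-in classes. The `values_s` attribute name comes from the diff under review; the `docs()` method name, the constructor signatures, and the sorting of values are assumptions.

```python
class Leaf(object):
    def __init__(self, doc=None, required=False, examples=None):
        self.doc = doc
        self.required = required
        self.examples = examples or []

    def docs(self):
        retv = {
            "type": self.__class__.__name__,
            "doc": self.doc,
            "required": self.required,
        }
        values = getattr(self, "values_s", None)
        if values:
            # An exhaustive list of values makes examples redundant.
            retv["values"] = sorted(values)
        elif self.examples:
            retv["examples"] = list(self.examples)
        return retv


class String(Leaf):
    pass


class Enum(Leaf):
    def __init__(self, values=None, **kwargs):
        super(Enum, self).__init__(**kwargs)
        self.values_s = set(values or [])


cert_type = Enum(values=["leaf", "intermediate", "root"], examples=["leaf"]).docs()
server_name = String(examples=["example.com"]).docs()
```

The enum reports only "values" even though examples were supplied, while the plain string reports only "examples".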

It would be nice if the values could be individually documented (something like "Algorithms": Enum(documented_values={"value1": "docs for value1"}), except more natural)

Maybe, though I'm unconvinced this is worth the work to implement. The two examples of enums I recall offhand in our schema are elliptic curves and certificate types (leaf/intermediate/root), both of which seem plenty self-explanatory given the existence of a docstring for the field and a list of possible values.

It seems reasonable to let different types define their own type of '.doc' property. The default could be to just hand back ._doc but could also do something else like compile a list of enumerated values into a doc string.

@zakird I'm not sure I follow what you're suggesting here. As of this PR, types can have a docstring and a list of example values (or a list of possible values, for enums). What are you suggesting should change?
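One possible reading of the suggestion being asked about, sketched under assumptions: each type exposes a `doc` property, the base class hands back the stored `_doc`, and an enum folds its values into the string. The class shapes and the "Possible values" wording are illustrative only and may not match what was intended.

```python
class Leaf(object):
    def __init__(self, doc=None):
        self._doc = doc

    @property
    def doc(self):
        # Default: hand back the stored docstring unchanged.
        return self._doc


class Enum(Leaf):
    def __init__(self, values, doc=None):
        super(Enum, self).__init__(doc=doc)
        self.values = list(values)

    @property
    def doc(self):
        # Compile the enumerated values into the doc string.
        base = self._doc or ""
        listing = "Possible values: " + ", ".join(self.values) + "."
        return (base + " " + listing).strip()


role = Enum(["leaf", "intermediate", "root"], doc="The certificate's role.")
bare = Enum(["a"])
```

Under this reading the output format would not need a separate "values" key, at the cost of consumers having to parse values out of prose.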

andrewsardone

I'm 👍 to merge once the tests go green. Things 🆗 on your end, @justinbastress?

I say we ❕

andrewsardone · 2018-04-10T20:19:16Z

Minor point of order

I don't think this PR closes #21, though they are related. #21 is about adding support for JSON Schema proper as an output format, whereas this PR outputs a zschema-specific media type. The issue remains an open backlog item.

justinbastress

Ran through with my schema updates -- Looks good.

cdzombak requested review from andrewsardone and zakird April 6, 2018 16:27

zakird reviewed Apr 6, 2018

cdzombak changed the title ~~Add human-annotated Elastic Search and BigQuery output formats~~ Add JSON schema descriptions for BQ and ES Apr 9, 2018

cdzombak force-pushed the cdz/add-human-accessible-outputs branch from 552ea55 to b294950 Compare April 9, 2018 20:46

justinbastress reviewed Apr 10, 2018

cdzombak added 15 commits April 10, 2018 15:37

Remove html output formats

dcb5085

Improve usage docs

90a5397

Add an annotated elasticsearch output, which includes docs for (sub)r…

b14cbd8

…ecords & leaves

Remove unimplemented ‘text’ command

9d1ec4e

Include a more detailed type for annotated ES output

dece722

Add annotated bigquery output

7d36532

Include possible enum values in annotated docs

e5d0263

Allow leaves to include a list of examples

b61f45a

Remove unused Leaf.to_autocomplete method

909ab52

Add a cascading “category” property for every field in annotated output

0ad3719

Implement new docs-es schema output

6f60bfd

Implement new docs-bq output

83809cd

[minor] more concise Python

87952a2

h/t @andrewsardone

Allow ListOf/NestedListOf to have docs

69be485

Indicate when an ES field is a list in doc output

1b0cfb7

cdzombak force-pushed the cdz/add-human-accessible-outputs branch from 1ffd929 to 1b0cfb7 Compare April 10, 2018 19:37

andrewsardone and others added 2 commits April 10, 2018 16:14

Add docs to tests schema

eaef702

This is so we can assert on the format when testing our new "docs" output. This also required updating the BigQuery inline fixture. Co-authored-by: Chris Dzombak <[email protected]>

Add minimal assertion tests around new “docs” format

62a58a8

Co-authored-by: Chris Dzombak <[email protected]>

andrewsardone approved these changes Apr 10, 2018

justinbastress approved these changes Apr 10, 2018

cdzombak merged commit 0e3d8fd into master Apr 10, 2018

cdzombak deleted the cdz/add-human-accessible-outputs branch April 10, 2018 20:28
