Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add JSON schema descriptions for BQ and ES #18

Merged
merged 17 commits into from
Apr 10, 2018

Conversation

cdzombak
Copy link
Contributor

@cdzombak cdzombak commented Apr 6, 2018

This PR adds two new output formats, docs-es and docs-bq. These formats are a JSON description of the schema, with a similar format between the two schemas. See https://gist.github.com/cdzombak/1d47457fd2b82f6e7bbe95508786c4a7 for example output based on the current ZTag schema.

This includes…

  • The possible values for Enum leaves
  • The doc field for records and leaves
  • A new examples field, available on records and leaves
  • The "detail type" for leaves. This is the type in our schema, and it'll be useful in cases where eg. BigQuery represents something as a string, but we internally represent it as a type that communicates more info (like an enum).
  • A new cascading category property on records and leaves

The category field is intended to be a human-usable categorization of a record and its subrecords. This value cascades, so if you set it on a record it'll apply that category to all subrecords & leaves. A subrecord or leaf can also set a different category for itself and its children.

I have also removed some unused code/ivars/arguments, unimplemented methods, and soon-to-be-unused output formats (we'll use these annotated outputs to build better documentation tools).

closes #21

@cdzombak cdzombak requested review from andrewsardone and zakird April 6, 2018 16:27
@@ -27,26 +28,15 @@ def main():
print json.dumps(record.to_bigquery())
elif command == "elasticsearch":
print json.dumps(record.to_es(recname))
elif command == "es-annotated":
print json.dumps(record.to_es(recname, annotated=True))
Copy link
Member

@zakird zakird Apr 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's es-annotated mean? Why would we output annotated data to elasticsearch? What is annotated data?

@cdzombak cdzombak changed the title Add human-annotated Elastic Search and BigQuery output formats Add JSON schema descriptions for BQ and ES Apr 9, 2018
@cdzombak
Copy link
Contributor Author

cdzombak commented Apr 9, 2018

Test failure is fixed in #22.

@cdzombak cdzombak force-pushed the cdz/add-human-accessible-outputs branch from 552ea55 to b294950 Compare April 9, 2018 20:46
@@ -10,9 +10,10 @@ def _is_valid_object(name, object_):

class ListOf(Keyable):

def __init__(self, object_, max_items=10):
def __init__(self, object_, max_items=10, category=None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like ListOf doesn't recognize doc; should that be pushed up to Keyable?

Copy link
Contributor

@justinbastress justinbastress Apr 10, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I've been looking at this wrong; for the primitive types at least, everything works because e.g. String() returns a new instance each time -- "this field is an instance of the String type with these options".

So maybe the problem is using SubRecord(), which returns an instance of the type instead of returning a constructor that can be used like String() ; i.e. we need something like Certificate = SubRecordType({ ... definition ...}), then "IssuerCert": Certificate(doc="The server certificate's issuer"). That might fit better with the current model than trying to introduce separate docs for fields and types.

To circle back, ListOf() already works like this (unless you wanted to create an "alias" of a list type -- in which case you would need e.g. CertPool = ListOfType(Certificate(doc="A certificate pool entry."))

This actually dovetails with a broader issue I've come across, where we are attaching things like doc to the types, where they should really be attached to fields.

An easy example is for certificates; you would probably want different docs on the issuer certificate and the subject certificate (though the docs for the fields inside would obviously be the same between them).

We could hack around this by making different "types" for each field, but that doesn't seem right.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another example -- the SSH logs include both the client EndpointId and the server EndpointId, which are the same struct at the go level, but in the schema, one contains AnalyzedString() types, while the other has String() types.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like ListOf doesn't recognize doc; should that be pushed up to Keyable?

My initial thinking here was that a list just lists things of a given type, so the documentation from the type being listed ought to be sufficient. Thinking more, that is wrong: a list of strings has a distinct meaning from a string. It makes sense for a list itself to have documentation describing what the list actually does — the context for the objects of this type — and that should be easy to add (though even if some part of this moves up to Keyable, we'll still have to modify the initializer for ListOf).

So maybe the problem is using SubRecord(), which returns an instance of the type instead of returning a constructor that can be used like String() ; i.e. we need something like Certificate = SubRecordType({ ... definition ...}), then "IssuerCert": Certificate(doc="The server certificate's issuer"). That might fit better with the current model than trying to introduce separate docs for fields and types.

This actually dovetails with a broader issue I've come across, where we are attaching things like doc to the types, where they should really be attached to fields.

If I can rephrase, to be sure we're on the same page: a record definition that gets reused, like Certificate, currently cannot have different documentation depending on the context where it is reused.

This is a definite limitation that it's worth calling out. My thinking here is basically "hopefully it's clear from context what the (certificate, etc.) is," eg. the subject and issuer certs can be differentiated by the fact that one is named "subject" and the other is named "issuer."

Removing this limitation would be a major change that would, I think, require some nontrivial changes to zschema and major changes to our schema. I am not sure we want to undertake that right now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing this limitation would be a major change that would, I think, require some nontrivial changes to zschema and major changes to our schema. I am not sure we want to undertake that right now.

I agree with that, @cdzombak; nice to have, and should happen, but out of scope of this PR.

It makes sense for a list itself to have documentation describing what the list actually does — the context for the objects of this type — and that should be easy to add (though even if some part of this moves up to Keyable, we'll still have to modify the initializer for ListOf).

@justinbastress & @cdzombak: do we want to implement that as part of this PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we want to implement that as part of this PR?

I'll work on it now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

5be9e26 introduces a doc field for ListOf.

@@ -33,9 +34,22 @@ def to_bigquery(self, name):
retv["mode"] = "REPEATED"
return retv

def docs_bq(self, parent_category=None):
retv = self.object_.docs_bq()
category = self.category if self.category else parent_category
Copy link
Contributor

@justinbastress justinbastress Apr 10, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be pulled out into a method (e.g. get_category())?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be pulled out into a method (e.g. get_category())?

Good note, @justinbastress. While it might be an intuitive restructuring, it wouldn't gain us much since we still have to fallback to parent_category passed into this method. Let's just leave the code as is (it could be made terser with category = self.category or parent_category , but meh).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

^ what Andrew said. I do like the more concise suggestion, implemented in 5136ebf

retv = {
"category": category,
"doc": self.doc,
"type": self.__class__.__name__,
Copy link
Contributor

@justinbastress justinbastress Apr 10, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems it would be nice if this could be overridden...

  • __class__.__name__ doesn't keep package information (?)
  • Constrains our type names to match python class name rules (and constrains our class names to match our type names rules?)

Beyond the scope of this PR, but I just noticed it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems it would be nice if this could be overridden

Beyond the scope of this PR, but I just noticed it.

Good point, @justinbastress. It looks like coupling to __class__.__name__ has been the idiom for a while (e.g., an initial commit for the CLI), but an area for future improvement.

"doc": self.doc,
"required": self.required,
}
if hasattr(self, "values_s") and len(self.values_s):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are retv["values"] and retv["examples"] mutually exclusive?

It would be nice if the values could be individually documented (something like "Algorithms": Enum(documented_values={"value1": "docs for value1"}), except more natural)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It se ms reasonable to let different types define their own type of '.doc' property. The default could be to just hand back._doc but could also do something else like compilea list of enumerated values into a doc string.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are retv["values"] and retv["examples"] mutually exclusive?

If there is an exhaustive list of possible values, I see no reason to support a list of example values as well.

It would be nice if the values could be individually documented (something like "Algorithms": Enum(documented_values={"value1": "docs for value1"}), except more natural)

Maybe, though I'm unconvinced this is worth the work to implement. The two examples of enums I recall offhand in our schema are elliptic curves and certificate types (leaf/intermediate/root), both of which seem plenty self-explanatory given the existence of a docstring for the field and a list of possible values.

It se ms reasonable to let different types define their own type of '.doc' property. The default could be to just hand back._doc but could also do something else like compilea list of enumerated values into a doc string.

@zakird I'm not sure I follow what you're suggesting here. As of this PR, types can have a docstring and a list of example values (or a list of possible values, for enums). What are you suggesting should change?

@cdzombak cdzombak force-pushed the cdz/add-human-accessible-outputs branch from 1ffd929 to 1b0cfb7 Compare April 10, 2018 19:37
andrewsardone and others added 2 commits April 10, 2018 16:14
This is so we can assert on the format when testing our new "docs"
output. This also required updating the BigQuery inline fixture.

Co-authored-by: Chris Dzombak <[email protected]>
Copy link
Contributor

@andrewsardone andrewsardone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm 👍 to merge once the tests go green. Things 🆗 on your end, @justinbastress?

I say we :shipit:

@andrewsardone
Copy link
Contributor

andrewsardone commented Apr 10, 2018

Minor point of order

I don't think this PR closes #21, though they are related. #21 is about adding support for JSON Schema proper as an output format, whereas this PR outputs a zschema-specific media type. The issue remains an open backlog item.

Copy link
Contributor

@justinbastress justinbastress left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ran through with my schema updates -- Looks good.

@cdzombak cdzombak merged commit 0e3d8fd into master Apr 10, 2018
@cdzombak cdzombak deleted the cdz/add-human-accessible-outputs branch April 10, 2018 20:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement compilation to json-schema
4 participants