Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Questions and remarks about supported types, missing types and named variables #213

Open
KrzysFR opened this issue Oct 16, 2024 · 36 comments
Assignees

Comments

@KrzysFR
Copy link

KrzysFR commented Oct 16, 2024

Great idea! I've used tuples for a long time, and always wished there was a standard for representing them and querying them. Hopefully this could be the one :)

I apologize for the flood, they are notes I took when implementing a parser for fql in C#/.NET

I have a few remarks, coming from a long time user of tuples, mostly in the context of writing complex layers in .NET/C#, and not 100% familiar with go conventions.

  • Numbers: Why the distinction between int and uint? Is this a "go" thing, or is there an actual reason?
    • For -123 it is clear that it is not "uint", but what about 123? From my point of view it is both "int" and "uint"
    • What about 32-bit vs 64-bit?
    • Comparatively, there is no distinction with float between 32-bit and 64-bit IEEE numbers?
    • How do you handle NaN or infinities?
  • Uuids: I find it difficult to parse a UUID that is not bounded by either {...} or any other character.
    • I frequently see uuids escaped as either {xxxx-xxx...} or "xxxx-xxxx-...". (or maybe it is a Windows thing?)
    • Using {...} for uuids would make it non-ambiguous for the parser
  • Bytes: when parsing 0x1234, you first see the 0 which could the be start of a number, a uuid, or a byte sequence.
    • I don't have a nice solution for this, but maybe use [ 1234 ] or '\x12\x34' (single quote, like it was in the python 2 tuple encoder) for bytes?
    • Are hexa digits always upper? lower? either? the syntax definition is a bit ambiguous
    • For me, 0xFF and 0xFE is ambiguous, because they are the byte prefix for the System subspace and Directory Layer, which are a single byte, where here I think they would be encoded as 01 FF 00 and 01 FE 00 instead?
  • Strings: How do you handle escaping of unicode characters? What about unicode text like "こんにちは世界" , high/low surrogates?
    • Could we define \uXXXX to encode any codepoint ?
  • Tuples: does (...) means "any tuple, empy or not? For ex, in (1, <int>, (...), <int>), does the middle part means "any tuple" ?
  • Ends With: is ("hello", 123, ..., <int>) supported? This would help with "variable sized" tuples in some layers, where you still need to parse the last one or two parts of a tuple.

There are a few types that I use frequently in tuples, and that are missing:

  • Empty: most indexes have empty values. In the old python binding, it used '' (two single quotes) to define them. What would we use here? nil seems weird because it is different (for me) than the concept of "empy". Maybe add <empty> or <none> for values?
  • Version Stamps: maybe add a new stamp or versionstamp or vs type in variables? ex: (1, <stamp>, ...)
    • There are two types of version stamps, 80-bits and 96-bits. They use prefix 0x32 and 0x33 in the tuple encoding, followed by 10 or 12 bytes.
    • I usually don't really distinguish between both sizes, they are all "stamps" for me.
  • Dates: this is unfortunately not specified, but I usually represent times as days (float) since Unix Epoch
  • Durations: same, I store them as seconds (float), but this may not be a standard representation.
  • Custom types: The original spec had a way to define "custom types" for the application, which would have a custom header byte, followed by 0 or more bytes (custom to the app).
    • In practice, I only used them to define the Directory (0xFE) and System (0xFF) prefix, which are useful in tuples that have to query the system space, or inside the Directory Layer (each nested partition adds another 0xFE to the bytes).
    • It would help resolve the ambiguity in (0xFE, "hello") which for me reads as "the key 'hello' in the top-level Directory Layer" and encoded as FE 02 h e l l o 00, where I guess here it would be encoded as 01 FE 00 02 h e l l o 00 which is not the same.
  • System subspace: how can I represent xFF/metadataVersion or other system keys? They usually don't use the tuple encoding for the keys.
  • Regex/Patterns/Constraints: Would it make sense to be able to impose constraints on types? Like a regex on a string, a range on a number, a maximum/minimum/exact size for string/bytes?
    • could be useful for counters which are usually exactly or up to 4/8 bytes.
    • The uint vs int distinction could be emulated with a "must be positive" or "must be exactly 64-bits" constraint
    • Same for 80-bit vs 96-bit version stamps

Regarding directories:

  • What if I use partitions/sub-partitions ? This is a way to "lock" an application into a specific prefix (ie: if "/foo" is a partition with prefix 15 2A (== (42, )), all keys will have this prefix, even sub-directories of this partition.
    • If the app is locked to sub-partition /foo/bar, then ALL keys would start with /foo/bar/... so in practice we represent them without the prefix, so something like .../my/dir (similar to a webapp that could be hosted under any path)
    • Since ... means "zero or more" in the tuples, maybe use ./my/dir or ~/my/dir to represent "from the root defined for this application" ?
  • Directory names are string, and I also use strings 99%+ of the time, but technically, the names can be any sequence of bytes, in order to represent numbers, uuids, etc... I'm not sure if this is frequently used in the field, but if it is, maybe we would also need variables in directory names? like /foo/<string>/bar or /foo/<uuid>/bar ?

On top of querying, I see this as very useful to encoded the "schema" of a layer somehow, so that a UI could automatically decode data into an arbitrary subspace (using the optional Layer Id in directories).

For example, I used the following format to define the schema of a custom layer, like a typical "table with index + change log" mini layer:

  • (..., <metadata_key>) = <metadata_value>
  • (..., 0, <doc_id>) = <json_bytes>
  • (..., 1, <index_id>, <value>, <doc_id>) = ''
  • (..., 2, <index_id>, <value>) = <counter>
  • (..., 3, <version_stamp>, <doc_id>) = <change_event>

Legend:

  • ... means the prefix of the Directory where this layer is stored.
  • <name> was a placeholder for a type of data, but the type was not specified

I think this could be adapted to use fql as the syntax, but this would required adding the support of named variables:

  • either <foo:int> or <int:foo> would define foo to be a variable of type int
  • would need a solution for "of any type", either <foo:any>/<any:foo> or <foo:>/<:foo>
  • for counters (atomic add/increment), we need to specify a size, usually 32 or 64 bits: <uint32>/<uint64> ? <uint:32>/<uint:64> ? <uint,32>/<uint,64> ?

The above could become:

  • ~/(<string:metadata_key>) = <any:metadata_value>
  • ~/(0, <uint:index_id>, <uint:doc_id>) = <bytes:json>
  • ~/(1, <uint:index_id>, <int|string|bytes:value>, <uint:doc_id>) = <empty>
  • ~/(2, <uint:index_id>, <int|string|bytes:value>) = <uint64>
  • ~/(3, <stamp:timestamp>, <uint:doc_id>) = <bytes:delta>
@janderland
Copy link
Owner

I'll answer all these questions in a follow up comment, but I'll start by saying that I'm almost done with a proper manual for FQL which may clarify a lot of these questions.

@janderland
Copy link
Owner

janderland commented Oct 16, 2024

Why the distinction between int and uint?

It's modeled after Go's implementation of the tuple layer. Thanks for pointing this out. I will need to think about what I want to do with this. Go's tuple layer makes all the integers look like int64 or uint64.

How do you handle NaN or infinities?

I have not added support yet. This will be supported, most likely as the tokens like nan, inf, -inf.

I find it difficult to parse a UUID that is not bounded by either {...} or any other character.

I haven't encountered any problems with parsing UUIDs myself. The context of the FQL query provided me with enough information to parse them without additional bounding characters.

Bytes: when parsing 0x1234, you first see the 0 which could the be start of a number, a uuid, or a byte sequence.

Ah, this gives me a clue as to why you had problems parsing UUIDs. I allow my parser to look ahead, which allows me to see that the 0x is followed by hexadecimal digits, and hence I know it's a hex number.

Are hexa digits always upper? lower? either? the syntax definition is a bit ambiguous.

I currently use Go's standard library to parse the hex string, which allows for either. In the syntax definition I only allow for uppercase, but I'm planning on changing this. I prefer lower case myself.

For me, 0xFF and 0xFE is ambiguous, because they are the byte prefix for the System subspace and Directory Layer, which are a single byte, where here I think they would be encoded as 01 FF 00 and 01 FE 00 instead?

Yes, you are correct. (0xff) would be packed as 0x01ff00.

Also, FQL doesn't support reading/writing key-value outside of the directory layer. Therefore, it doesn't support reading/writing to the system subspace. I may change this in the future.

How do you handle escaping of unicode characters?

Unicode is not currently supported, only ASCII. I do plan to add unicode support in the future.

does (...) means "any tuple, empy or not?

Yes.

...is ("hello", 123, ..., <int>) supported?

Not currently supported, though I may add support for this in the future.

What would we use here? nil seems weird because it is different (for me) than the concept of "empty".

nil is what I'm using for empty values. This allows the empty value to logically mirror the empty element within a tuple: (nil). I don't plan to change this part, though I am glad you told me your opinion as it helps me see how other people's intuition works.

Version Stamps: maybe add a new stamp or versionstamp or vs type in variables?

Yes, version stamps will be supported sometime in the future. They are not supported yet.

Would it make sense to be able to impose constraints on types? Like a regex on a string, a range on a number, a maximum/minimum/exact size for string/bytes?

Yes, I've considered this. I may add this in the future.

What if I use partitions/sub-partitions ? This is a way to "lock" an application into a specific prefix (ie: if "/foo" is a partition with prefix 15 2A (== (42, )), all keys will have this prefix, even sub-directories of this partition.

I have not added support for partitions yet. I still need to look into this. When you set up an FQL instance, you must provide a root directory which would contain all queries to within that directory. I expect partition to work in a similar way.

Directory names are string, and I also use strings 99%+ of the time, but technically, the names can be any sequence of bytes...

I don't plan to support reading/writing all possible key-values right now. For the near future, I'm focused on supporting the 99% of use cases which only includes key-values encoded using a directory (made of strings) and a tuple. After I have this working, documented, and well tested, then I may implement support for other cases like this one.

This is similar to the question about system keys. Most user don't need to access these, so I will focus on the most common use cases first.

On top of querying, I see this as very useful to encoded the "schema" of a layer somehow, so that a UI could automatically decode data into an arbitrary subspace (using the optional Layer Id in directories).

Yes, this is one of the goals of the project.

named variables...

Yes, this is a feature which I will explain in the manual. The manual should be available within the next week or so.

@KrzysFR
Copy link
Author

KrzysFR commented Oct 19, 2024

I've been able to write a very naive first version of a fql parser and query runner for range reads, and this is already VERY handy! In one of my tests, I used a schema for a very simple document+index layer I had, and simply by using fql queries, I was able to implement a query system on top of that, without much effort.

This could be a game changer when writing layers, where adding long running queries has always been the most difficult task! If a fql Layer could implement proper long range reads, it could be reused by a lot of other layers.

By the way, should it spelled "fql" or "FQL" for the "official" name?

It's modeled after Go's implementation of the tuple layer. Thanks for pointing this out. I will need to think about what I want to do with this. Go's tuple layer makes all the integers look like int64 or uint64.

I had the same issue when porting the original Tuple Layer from python to .NET. The layer has been mostly design in Python, so inherits its type system. I've been told that sorting python tuples or sorting the bytes from the packed tuples should yield the same order. I had to replicate this behavior in .NET, and it's probably that Go would have the same issue.

In practice, I treat all numbers are equivalent, whatever their types, and only treat signed/unsigned and 32/64 bits as a limitation in the domains. Meaning that <uint> would reject a negative number, while <int> would include ALL the integers (but not decimal numbers). And this would be independent of the size of the integer: .NET recently got 128-bit unsigned and signed integers which add another 2 combinations :)

For decimal numbers, this is a bit complicated, because for me "1.0" is the same a "1", but in Python floating point numbers where encoded with a different prefix in the tuple layer.

In my mind, the criteria for "is the same as" would be "do they produce the same bytes when encoded with the tuple layer" ?

Ah, this gives me a clue as to why you had problems parsing UUIDs. I allow my parser to look ahead, which allows me to see that the 0x is followed by hexadecimal digits, and hence I know it's a hex number.

My parser is also "look ahead", in fact it can see the whole statement. The remark was more regarding having a very noticeable visual queue when a human looks at the statement. This may vary from person to person, but for me, between these two strings, I can immediately spot the uuid in the second one:

  • ("hello", 12456789, 4567, b8d02d81-6329-ef96-8a4d-55b376d8b25a, 0xabcdef, "world")
  • ("hello", 12456789, 4567, {b8d02d81-6329-ef96-8a4d-55b376d8b25a}, 0xabcdef, "world")

Also, FQL doesn't support reading/writing key-value outside of the directory layer. Therefore, it doesn't support reading/writing to the system subspace. I may change this in the future.

I you want to add support for system keys (0xff) and the directory layer (0xfe), then you can use custom user types that exist in the current tuple layer spec. I don't know if they are implemented by the Go binding. They simply define an item with a byte header, followed by any number of bytes (so like any other pre-defined types).

In the .NET binding, I have the following "well known" custom types:

  • Byte header 0xFF at the start of a tuple: this is a system key, defined as a custom type "FF" with payload being the rest of the tuple. So \xff/metadataVersion is a TupleCustomType(255, "/metadataVersion").
  • Byte header 0xFE, here I only create custom type 254 with 0 payload bytes. The Directory Layer uses the tuple encoding, so the rest of the bytes would decode nicely anyway.

Unicode is not currently supported, only ASCII. I do plan to add unicode support in the future.

Whatever you do, I hope you will standardize on UTF-8 being the only encoding. It supports ASCII out of the box, and you could simply use the same syntax as in JSON (https://datatracker.ietf.org/doc/html/rfc8259#section-7)

@KrzysFR
Copy link
Author

KrzysFR commented Oct 19, 2024

I have not added support for partitions yet. I still need to look into this. When you set up an FQL instance, you must provide a root directory which would contain all queries to within that directory. I expect partition to work in a similar way.

So, if you use '/foo' has the root directory, does this mean that the query /bar/baz means in that the absolute path /foo/bar/baz ? For me, I use the file system path analogy for paths in the Directory layer, so seeing '/something' means "go back to the root the filesystem/database", and would have expected either bar/baz or ./bar/baz to mean "from the current working directory".

Since I had to deal with this in my tests (they are automatically running in a dedicated test partition with a deep path, including the worker host name, process id, etc..) I temporarily used "./bar/bar" to differentiate with "/bar/baz".

I also need to be able to distinguish between "the root of the application" (which is fixed) and "the current directory", since I plan to add this to my command line shell, where you can cd into directories, like you would do in a file system, and so it must be clear to the user if the query would be relative to:

  • the current path ("hello", <int>),
  • a child directory: foo/bar("hello", <int>),
  • a parent directory: ../../some/other/layer/("hello", <int>),
  • or "the root": /tenants/ACME/foo/bar/...

[named variables] Yes, this is a feature which I will explain in the manual. The manual should be available within the next week or so.

One very frequent thing is using integer constants to sub-divide a directory subspace, without involving the Directory Layer. For ex, in pseudo code:

  • (<metadata_key>) = <metadata_value>
  • (0, <rid>) = <json>
  • (1, <doc_id>) = <rid>
  • (2, <artist>, <rid>) = ''
  • (3, <genre>, <rid>) = ''
  • (4, <year>, <rid>) = ''

If would be nice to be able to define constants to give names for 0, 1, 2, and 3, such as docs, 'pk', idx_artist, idx_genre, idx_year.

These could be encoded in some standardized JSON object that would be stored with a standard key at the root of a layer. Layers are supposed to have a non-empty layer_id with the name when created with the Directory Layer (which may not be a frequently used feature?), which make it easy to detect: You you enter a directory that has a non-empty "layer id", look for a "_schema" key (or something else), decode the JSON, and use the content to obtain a set of one or more "predefined" fql queries that would include named constants and variables:

pseudo code:

{
  /* ... */
  "queries": {
    "metadata": "(<key:string>) = <value:any>",
    "doc": "(0, <rid:int>) = <json:bytes>",
    "pk": "(1, <doc_id:string>) = <rid:int>",
    "idx_artist": "(2, <artist:string>, <rid:int>) = nil",
    "idx_genre": "(3, <genre:string>, <rid:int>) = nil",
    "idx_year": "(4, <year:uint>, <rid:int>) = nil"
  }
}

In a shell, the query short cut would be converted into an actual fql command:

  • metadata to list all metadata => (<string>)
  • metadata "count" to read a single value => ("count")
  • pk "album123" to lookup a primary key into a record id => (1, "album123")
  • doc 42 to get a document from its record id
  • idx_genre "Metal" or idx_year 1999 to lookup values in an index => (3, "Metal", <int>) and (4, 1999, <int>).

Ideally, the value could also describe it's type, so that it could be decoded into a human readable format. Most values are usually counters, numbers, strings, json bytes, protobuf bytes, etc...

@janderland
Copy link
Owner

janderland commented Oct 19, 2024

In one of my tests, I used a schema for a very simple document+index layer I had, and simply by using fql queries, I was able to implement a query system on top of that, without much effort.

Amazing. I'm glad these ideas are useful to you.

By the way, should it spelled "fql" or "FQL" for the "official" name?

I don't have a strong opinion. I usually write it as FQL.

I've been told that sorting python tuples or sorting the bytes from the packed tuples should yield the same order. I had to replicate this behavior in .NET, and it's probably that Go would have the same issue.

I don't think I've ever needed to replicate the ordering of tuples client side. I just allow FDB to order the tuples for me.

For decimal numbers, this is a bit complicated, because for me "1.0" is the same a "1", but in Python floating point numbers where encoded with a different prefix in the tuple layer.

According to FQL's specification, 1.0 is no the same as 1. All floating point number require a decimal point.

Whatever you do, I hope you will standardize on UTF-8 being the only encoding.

Yes, this is the plan.

So, if you use '/foo' has the root directory, does this mean that the query /bar/baz means in that the absolute path /foo/bar/baz ? For me, I use the file system path analogy for paths in the Directory layer, so seeing '/something' means "go back to the root the filesystem/database", and would have expected either bar/baz or ./bar/baz to mean "from the current working directory".

FQL doesn't currently have any concept of a "current directory". What I linked was the Go API for using FQL programmatically in Go without including query strings.

I do plan to add a meta language which allows you to alias snippets of a query for use in the queries. For example, the syntax may look something like this:

let dir -> /my/root/dir

${dir}/sub/dir("hello", "world")=nil

The let expression is the meta language. This would allow you to define snippets of a query for re-use in actual queries. I may change this in the future. Having a "current directory" could be useful.

If would be nice to be able to define constants to give names for 0, 1, 2, and 3, such as docs, 'pk', idx_artist, idx_genre, idx_year.

Yup. This could be handled by the meta language I describe above.

One very frequent thing is using integer constants to sub-divide a directory subspace, without involving the Directory Layer. For ex, in pseudo code...

I don't understand how this is better than using separate directories. Can you explain to me?

In a shell, the query short cut would be converted into an actual fql command...

Yes, the meta language would include macros which could do stuff like this. For instance:

let doc rid -> (0,rid)=<json:bytes>

# This would translate to /my/dir(0,22)=<json:bytes>
/my/dir${doc 22} 

I'm still working on this meta language syntax, so don't take any of these examples as law quite yet.

@KrzysFR
Copy link
Author

KrzysFR commented Oct 19, 2024

I don't understand how this is better than using separate directories. Can you explain to me?

The directory layer has a non-trivial overhead: for each directory that you need to traverse, you need to perform sequential reads to the cluster. If you have a path of depth N, you need at least N round-trips to the database, before being able to read actual data! This can really add up ! (see below for a log of a transaction that opens the path /Foo/Bar/Baz !)

Directory subspaces are not safe to cache either, since they could be renamed/moved/deleted/re-created and get a new prefix. If /foo/bar/baz previously had prefix (7,), but was deleted/re-created and now has (42,), and one of your backend still uses the old cached prefix, it will read/write into a dead zone of the database. This delete/recreate or move/create is common when you do unit testing, or when you are upgrading the schema of some table in the database (create a temp directory, copy/upgrade the old data into it, once you are done, swap the names of the old/new tables, and then drop the old table).

I added a lot of infrastructure in the .NET binding just to try to solve this issue (deferred value-checks), but this is not magic, and I don't think this exists in the Go binding.

If you don't have a massive amount of data, the easiest solution is to have the minimum number of directories, and sub-divide them using a simple prefix like 0 (1 byte), or a small integer (2 bytes). In fact, this was how it was done before the Directory Layer was introduced, the notion of "subspaces" is adding a common prefix to a parent "subspace" to subdivide it. The Directory Layer only acts as a global map that translates long strings (directory names) into small integer + adding a hierarchy.

It could be argued that for example in /path/to/table/albums(1, <artist>, <rid>), the 1 is a surrogate for a sub-directory named "index_by_artists", but you don't need the extra round-trip to the database to find the prefix for the subspace.

The last small advantage, is that splitting a subspace "by hand" will ensure that the data stay collocated, while if you were using sub-directories, they could have very different prefix, and be separated by other data from other layers/tenant in between them.

Example log that shows the impact of the Directory Layer:

Transaction #133 (read/write, 22 operations, '#' = 250 µs, started 22:05:36.4199821Z [1729375536.419], ended 22:05:36.4261257Z [1729375536.426])
┌  oper. ┬──────────────────────────┬──── start ──── end ── duration ──┬─ sent  recv ┐
│ 0   G °│ #####&                   │ T+  0.010 ~   1.146 (  1,136 µs) │    22    12 │ Get ("DL", |Directory|, `<02>DL<00><FE>`, `version`) => <01><00><00><00><00><00><00><00><00><00><00><00>
│ 1   G  │      ^:                  │ T+  1.151 ~   1.250 (    100 µs) │    20     4 │ Get ("DL", |Directory|, `<02>DL<00><FE>`, `stamp`) => '\0\0\0\0'
│ 2   G  │       *                  │ T+  1.255 ~   1.354 (     99 µs) │    19     6 │ Get ("DL", |Directory|, `<02>DL<00><FE>`, 0, "Foo") => <02>DL<00><15><16>
│ 3   G  │       ^x                 │ T+  1.358 ~   1.456 (     98 µs) │    21     0 │ Get ("DL", |Directory|, `<02>DL<00><15><16>`, `layer`) => ''
│ 4   G  │            ◦             │ T+  2.294 ~   2.329 (     35 µs) │    22    12 │ Get ("DL", |Directory|, `<02>DL<00><FE>`, `version`) => <01><00><00><00><00><00><00><00><00><00><00><00>
│ 5   G  │            ^,            │ T+  2.336 ~   2.356 (     20 µs) │    19     6 │ Get ("DL", |Directory|, `<02>DL<00><FE>`, 0, "Foo") => <02>DL<00><15><16>
│ 6   G  │             ·            │ T+  2.357 ~   2.368 (     11 µs) │    21     0 │ Get ("DL", |Directory|, `<02>DL<00><15><16>`, `layer`) => ''
│ 7   G  │             *            │ T+  2.369 ~   2.469 (    100 µs) │    20     6 │ Get ("DL", |Directory|, `<02>DL<00><15><16>`, 0, "Bar") => <02>DL<00><15><1A>
│ 8   G  │             x.           │ T+  2.473 ~   2.573 (    100 µs) │    21     0 │ Get ("DL", |Directory|, `<02>DL<00><15><1A>`, `layer`) => ''
│ 9   G  │               ◦          │ T+  2.828 ~   2.856 (     28 µs) │    22    12 │ Get ("DL", |Directory|, `<02>DL<00><FE>`, `version`) => <01><00><00><00><00><00><00><00><00><00><00><00>
│ 10  G  │               ◦          │ T+  2.862 ~   2.881 (     19 µs) │    19     6 │ Get ("DL", |Directory|, `<02>DL<00><FE>`, 0, "Foo") => <02>DL<00><15><16>
│ 11  G  │               ◦          │ T+  2.882 ~   2.906 (     24 µs) │    21     0 │ Get ("DL", |Directory|, `<02>DL<00><15><16>`, `layer`) => ''
│ 12  G  │               ◦          │ T+  2.911 ~   2.928 (     17 µs) │    20     6 │ Get ("DL", |Directory|, `<02>DL<00><15><16>`, 0, "Bar") => <02>DL<00><15><1A>
│ 13  G  │               `,         │ T+  2.929 ~   2.940 (     11 µs) │    21     0 │ Get ("DL", |Directory|, `<02>DL<00><15><1A>`, `layer`) => ''
│ 14  G  │                  ◦       │ T+  3.375 ~   3.411 (     35 µs) │    22    12 │ Get ("DL", |Directory|, `<02>DL<00><FE>`, `version`) => <01><00><00><00><00><00><00><00><00><00><00><00>
│ 15  G  │                  •       │ T+  3.418 ~   3.468 (     51 µs) │    19     6 │ Get ("DL", |Directory|, `<02>DL<00><FE>`, 0, "Foo") => <02>DL<00><15><16>
│ 16  G  │                  ◦       │ T+  3.473 ~   3.498 (     26 µs) │    21     0 │ Get ("DL", |Directory|, `<02>DL<00><15><16>`, `layer`) => ''
│ 17  G  │                  ^x      │ T+  3.501 ~   3.603 (    103 µs) │    20     0 │ Get ("DL", |Directory|, `<02>DL<00><15><16>`, 0, "Baz") => not_found
│ 18  G  │                        ◦ │ T+  4.608 ~   4.644 (     36 µs) │    22    12 │ Get ("DL", |Directory|, `<02>DL<00><FE>`, `version`) => <01><00><00><00><00><00><00><00><00><00><00><00>
│ 19  G  │                        ◦ │ T+  4.651 ~   4.670 (     20 µs) │    19     6 │ Get ("DL", |Directory|, `<02>DL<00><FE>`, 0, "Foo") => <02>DL<00><15><16>
│ 20  G  │                        · │ T+  4.672 ~   4.683 (     11 µs) │    21     0 │ Get ("DL", |Directory|, `<02>DL<00><15><16>`, `layer`) => ''
│ 21  G  │                        , │ T+  4.684 ~   4.695 (     11 µs) │    20     0 │ Get ("DL", |Directory|, `<02>DL<00><15><16>`, 0, "Baz") => not_found
└────────┴──────────────────────────┴──────────────────────────────────┴─────────────┘
> Read 106 bytes in 6.143 ms and 1 attempt(s)

@janderland
Copy link
Owner

Directory subspaces are not safe to cache either, since they could be renamed/moved/deleted/re-created and get a new prefix.

I guess this is where our workloads differ. I have never used the directory rename/move feature, so on application start up we open all the directories we plan to use and cache them for the entire life of the application.

If you are not caching directories then I understand the concern.

@KrzysFR
Copy link
Author

KrzysFR commented Oct 19, 2024

so on application start up we open all the directories we plan to use and cache them for the entire life of the application.

I used to do the same thing a long time ago and got bitten by this, so I encourage you to look into this.

Maybe now you are not using tools that do this, but maybe in the future you will have to use a third party tool, or change how you backup/restore in such a way that the directories could be recreated in a different order (with different prefixes, ...). You will also have to address the issue of in-place schema migration at some point.

You only need one rogue process somewhere in the cluster that was NOT properly killed or updated, or someone that resumes a paused VM, a network split that isolated part of the nodes, etc...

The bad news is that the data layout used by the Directory Layer is not very friendly for caching, so there is no easy solution. And this hasn't changed in ~10 years now, and likely never will. It was designed before VersionStamps and atomic operations existed, so today it would be very different I guess.

One solution I used was to cache all the prefixes of all the parents paths needed to traverse /foo/bar/baz (so 3 nodes in this case), and when you start a new transaction, issue an async read immediately to check that these keys are still there with the expected value, but don't wait for the result yet. Run the rest of the query, but before commit, you have to await the result of the async reads, and if there is even one difference, bust the cache, and retry. That's what I called "deferred value-check". The IFdbTransaction object has a map of pending reads, that it will have to inspect when Commit() is called, and simulate a conflict if this does not match.

You could use watches to maybe flush the cache when a directory key is changed, but watches are async, it is possible in some cases that they won't fire, and also there is a limited number of them.

Recently, I ended up with deeper and deeper directories, which made this too slow, so I took the bullet, and made a slight change to the DL, by adding a metadata key to each partition, that is atomically incremented every time a directory is deleted or renamed (not created). This way, I only need to check one key per partition (a single key if you only use the root partition), instead of N keys.

@janderland janderland self-assigned this Oct 19, 2024
@janderland
Copy link
Owner

You only need one rogue process somewhere in the cluster that was NOT properly killed or updated, or someone that resumes a paused VM, a network split that isolated part of the nodes, etc...

Yea, we've accidentally recreated/moved directories, but because all our processes are stateless we just restarted them so that they cleared their cache and obtained the correct directory. So I agree it is a problem, just not a very bad problem in our case.

You will also have to address the issue of in-place schema migration at some point.

The way we address this at my job is by either using a new directory path or by supporting multiple KV schemas within a directory. In other words, we try to keep things backwards compatible as much as possible.

@janderland
Copy link
Owner

janderland commented Oct 19, 2024

Anyhow, using a manual prefix for subspaces is supported by FQL as long as it's within a directory. It also sounds like your use case could be supported by simply allowing keys to only contain a tuple, without a directory.

Do you often use keys without the tuple layer?

@KrzysFR
Copy link
Author

KrzysFR commented Oct 19, 2024

No, I've always used the Tuple layer because it's so much more easier to use when designing a layer, and is nicely printable in logs.

Though I've heard of people using keys encoded with raw uuids, with protobuf or some other binary protocol. There are APIs in the DL to register such a subspace by telling the DL to NOT allocate a prefix that would collide with it by random chance.

The only keys outside the DL I read are mostly system keys such as \xff/metadataVersion, or some other special keys that are there. But 99.9+% of my keys are encoded with tuples, and probably in a sub-partition (one per tenant) with paths with a very shallow depth to reduce the burden on the DL cache check.

Note that in 7.3 there is a new Tenant API that basically does the same thing as partitions in the DL, but at the native client level. I have not yet had the chance to play with it. I've added basic support for it in the .NET Binding, but the cluster needs to be specifically configured for it, and it seems tailored for big users.

I don't think this would impact FQL very much, except that the tool would need to select a tenant, and then it would be as if it was the only one in the cluster's keyspace (with its own Directory Layer instance). This makes path like /Prod/Tenants/ACME/tables/foo a bit shorter as /tables/foo, but prohibits you from doing cross-tenant queries.

@janderland
Copy link
Owner

janderland commented Oct 19, 2024

By the way, your feedback has been very helpful. Online, it's difficult to see if the other person is enjoying the conversation, so I wanted to make it clear. You are helping me consider other workloads and use cases, which is good.

@KrzysFR
Copy link
Author

KrzysFR commented Oct 19, 2024

Btw, I encourage you to go play with versiontamps: they are the one feature that most changed the way I design layers, and so I (ab)use them everywhere :)

They are usually used as sequential keys in change logs, or as out-of-the-box sequential uuid in data stores, or as a cache busting mechanism. This makes them usually present in the very first part of a tuple, which means that FQL queries would frequently have to "say their name". They can also be present in the value as a pointer to another key in another subspace.

And since I use them a lot, I'd vote for a shorter type name like <vs> or <stamp> instead of <versionstamp> :)

For their text representation, I'm not sure if there is a standard. I use @81985529216486895-123 and @81985529216486895-123#456 since it makes it easier to identify the readVersion, the commit batch, and the optional user version. They can be written in hex, but you lose the readability of the sequential ordering. Also, read versions are somewhat in sync with the clock, so by looking at the delta between two stamps, you can get a rough estimation of the wallclock time between the two events.

@janderland
Copy link
Owner

I encourage you to go play with versiontamps

Yes, we have used them at my job. At one point we implemented a message queue on top of Foundation DB and used the versionstamps to order the messages. They are very useful and will absolutely be included in FQL.

I'd vote for a shorter type name like or instead of

Yes, you may have noticed I preferred shorter type names. vstamp or vs is what I will use.

For their text representation, I'm not sure if there is a standard.

Yea, I think they should have their own unique textual form as you are saying. Simply using a hex string wouldn't be enough.

Also, read versions are somewhat in sync with the clock, so by looking at the delta between two stamps, you can get a rough estimation of the wallclock time between the two events.

At one point we were using versionstamps as a clock. Eventually we decided to use our own clock key. We have a clock client which basically updating the clock key once per second and we synchronize all our services against this clock value.

@KrzysFR
Copy link
Author

KrzysFR commented Oct 28, 2024

After implementing support for Big Integer in the tuple layer, I am even more confused by the whole distinction between int/uint/bint, and I think this is because the original tuple encoding spec is very vague, and each binding had to interpret it and match it with the its own platforms conventions and limitations.

Looking at the spec, it only has a single concept of "integer values", where both int 123, uint 123 and bigint 123 would serialize into the same bytes 15 7B. Only floating point numbers like float 123.0 would produce different bytes (and there is multiple possible encodings as well depending on the original precision!).

This means that if you find the bytes 15 7B in a key or value in the database, without knowing the metadata or schema via a side-channel, you would not be able to figure out the original "type" of the value, as intended by the origin, and this could lead to false negative when filtering.

The only cases where this is evident, just by looking at the bytes, is for negative values, or very large integers (> 64 bits).

  • Negative values, obviously, could not be a uint.
  • The tuple header types 0x0B and 0x1D only exist for negative or positive values that are above +/- 2^64, which would not fit into a 64-bit signed or unsigned variable in code.

I've looked at the implementation in the Java binding, and it is also mixing long and java.math.BigInteger in the same test case, which makes me think that these types are somewhat interchangeable. See the test cases and the encoder

Looking at the C++ implementation (which maybe could be considered the real reference implementation, since this is the one use the the client and server code at runtime?), there is also evidence that all encoded values from 0x0B to 0x1D (so negative/positive, big or small) have the same type ElementType::INT)

This makes me believe that the distinction between int/uint/bint in the query itself may add another layer of confusion for some application and languages?

For my case, in the .NET space, I had to adapt the actual runtime types that exist to represent numbers, such as short, int, long, ushort, uint, ulong, Int128, UInt128, BigInteger (so many types!!). I have methods to encode/decode from each of these types to bytes, and back. But they all are equivalent and would return true when compared for equality. At least for values that fall in their domain: conversion of a negative value in a tuple into a uint would throw at runtime, same as reading a large 64-bit number as a 'short' (16-bits).

I would propose adding a number variable type, that would include ALL possible representations of numbers in the tuple layer, so integers (negative or positive), floating point numbers (single, double, triple, decimal, ...). With this type, the rule is that 1 == 1U == 1.0

Then, add another type integer (shortened to int?) for only integer types (0x0B..0x1D), and float for all decimal types (0x20, 0x21, 0x22). When using these types, the rule would be that 1 == 1U but 1 != 1.0.

If one needs to filter only positive values, then this would be the job of a constraint on the type, equivalent to being able to specify a valid domain for values or constants (pos, [0, 7), [0, 0xFFFF], ...).

@janderland
Copy link
Owner

I am even more confused by the whole distinction between int/uint/bint...

What if FQL only supported a single integer type, the same way there is only a single float type? When using the query language itself, not much would change. When using the Go API, I could provide this integer as either an int64 or a BigInt, depending on how large the number is.

This means that if you find the bytes 15 7B in a key or value in the database, without knowing the metadata or schema via a side-channel, you would not be able to figure out the original "type" of the value, as intended by the origin, and this could lead to false negative when filtering.

In the case of values, FQL currently supports "raw" values which are not encoded with the tuple layer. If you want to read an integer encoded using the tuple layer in the value you would write a query like this:

/my/dir("my","key")=(<int>)

This is different from querying a "raw" value which would store the raw 64-bit integer in the value position. That kind of query looks like this:

/my/dir("my","key")=<int>

I would propose adding a number variable type, that would include ALL possible representations of numbers in the tuple layer, so integers (negative or positive), floating point numbers (single, double, triple, decimal, ...). With this type, the rule is that 1 == 1U == 1.0

I don't see a point for this "number" type. You can specify a variable which supports both types already by doing <int|num>. int is the integer type and num is the float type.

@KrzysFR
Copy link
Author

KrzysFR commented Oct 28, 2024

When using the Go API, I could provide this integer as either an int64 or a BigInt, depending on how large the number is.

The issue could be interop with dynamic languages that have only a single type for numbers, like JavaScript where all numbers are 64-bit floats. This was a known issue at the time, but the Node.js binding did not exist at this point in time, so it was probably left unresolved in the spec. I'm not exactly sure how the maintainers of the Node.js binding as solved the issue of representation of out-of-range numbers?

I'm not familiar with how Go handles the conversion from one type of number to the other (implicit? requires explicit casting? needs a method call?), but in .NET there are a lot of auto-casting, and in the .NET API I've handled this in two ways:

  • Provide a set of "single type" methods, like for example int64 getInt64() and where the caller can clearly specify "I'm expecting a 64 bit signed integer, or something that has a valid conversion path to an int64 without overflow". This method would try very hard to adapt a different type, like a floating point that would be an integer, even parsing a string that contains a base-10 number.
  • Provide a method with a generic type argument T getValue<T>() that has a set of valid conversions from valid tuple encoding to T. For examples, if T is string, then 01 ... 00 will still be considered valid for a string.

That's unfortunately a lot of cases to handle (the type conversion matrix can become big very fast!) but thankfully that's code you need to write once, and then forget about it).

Side note 1: This is a good example of Python 2.x idiosyncrasies where a lot of layer code would encode (ASCII) strings using the 0x01 byte header (intended for bytes), instead of the 0x02 byte header (indented for strings), because in Python 2, bytes and ASCII strings would be the same type. I had to basically encode this behavior in .NET where, like Java, 0x01 is mapped to byte[] and 0x02 is mapped to strings, without overlap. This was required if I wanted to be able to decoded keys generated by python code, and vice versa.

Side note 2: to be fair, I had the reverse issue in .NET where Guid, the type for uuids, uses the Windows encoding with some of the bytes swapped compared to the standard (because Microsoft in the 90's!). They produce the same string, but they don't produce the same bytes, so I had to create a custom type that emulates the behavior of Uuids when encoded into tuples, then "swaps" the bytes in memory so that it can work well inside a .NET application.

In the case of values, FQL currently supports "raw" values which are not encoded with the tuple layer.

I realize I did not consider the case of raw values, where here it makes a lot more sense to specify signed vs unsigned and size (for counters, etc...). Without a schema, you would not even be able to decode it anyway (and heuristics will always fail for some of the values)

For keys, the scenario where you could use a generic "number" type (as in ℝ, the set of all real numbers) would be for indexes on a number that could be anything but is most usually an integer, and infrequently with a decimal part (and for this it could ".5", ".25"). The index could use the much more compact integer representation for actual integers and only pay the cost of 9 bytes when required. You would lose ordering, but not all number-based indexes have the concept of ranges.

Then, if you want to go down a level and work with ℕ (set of positive integers) or ℤ (set of positive and negative integers), that's where you could make use of int or uint.

Side note 3: maybe that's why it looks weird for me, because having integers vs floats, sort of imply that integers (ℕ) are NOT real numbers (ℝ), which in maths is non-sense, but for us developers is somewhat "normal" ? :D. It seems more natural for me that ℝ would match all numbers, then ℤ would match only the part that are integers, and then ℕ only those that are positive (or non-negative, depending on if you include 0 or not). We are lucky to not have to deal with ℂ (complex numbers!) :)

The type for "all kinds of numbers" (as in how it would exist in JavaScript) could indeed be constructed with <int|num|uint|....> but maybe an alias for "a number I don't care what kind, by opposition to a string" could be easier to use than have to list all of them, especially if additional types are added down the line.

@KrzysFR
Copy link
Author

KrzysFR commented Oct 28, 2024

Looking at the node.js binding, I see this entry: https://github.com/josephg/fdb-tuple?tab=readme-ov-file#js-to-tuple-value-mapping

Danger 💣: If you mix integers and floats in your database keys, getRange will not work correctly. The tuple encoding sorts all integers before all floating point numbers. So if your values have keys [0.9], [1], [1.1], your keys will sort in the wrong order (you'll get [1], [0.9], [1.1]). And if you call tn.getRange([0], [2]), you will only get [1]. You can force the tuple encoder to always use floating point encoding for integers by wrapping the number in an object: pack([{type: 'double', value: 1}, ...]) instead of pack([1, ...]).

Seems like fun times, especially for folks that uses "64-bit integers as uuids" 😨

The encoder seems to check if a number is an actual integer or not: https://github.com/josephg/fdb-tuple/blob/master/lib/index.ts#L253-L281

It would be very easy for an application using the node.js binding to let a few floating points numbers "escape" in the keys, even if they only intended them to be only integers (and thus use <int> in an FQL query), especially if you factor in the classical issues where 3 * (1 / 3) could not be equal to 1 in some corner cases, and this 1 would be encoded as a 9-byte float, instead of a 2-byte int !!!

@janderland
Copy link
Owner

janderland commented Oct 28, 2024

Side note 3: maybe that's why it looks weird for me, because having integers vs floats, sort of imply that integers (ℕ) are NOT real numbers (ℝ), which in maths is non-sense, but for us developers is somewhat "normal" ? :D. It seems more natural for me that ℝ would match all numbers, then ℤ would match only the part that are integers, and then ℕ only those that are positive (or non-negative, depending on if you include 0 or not). We are lucky to not have to deal with ℂ (complex numbers!)

Fair point. I have a math minor with my degree, so this makes sense to me. I think most programmers would prefer the traditional integer/float split though.

The issue could be interop with dynamic languages that have only a single type for numbers, like JavaScript where all numbers are 64-bit floats.

I think this can be solved by including custom types in the API. For instance, the Go API for FQL doesn't use the raw 'int' which Go provides. I define a custom type (AKA a "box") which contains an integer. This provides me with extra runtime information. I'd probably do the same for JS.

@janderland
Copy link
Owner

BTW, based on these conversation I'm pretty sure I'm going to remove the uint type from FQL. If it's not represented in the Tuple layer bytes then it shouldn't be in FQL.

@KrzysFR
Copy link
Author

KrzysFR commented Oct 28, 2024

BTW, based on these conversation I'm pretty sure I'm going to remove the uint type from FQL. If it's not represented in the Tuple layer bytes then it shouldn't be in FQL.

They could still be useful for a raw value? I'm not sure if you have plans to be able to represent counters, which I use a lot in layers to maintain count of documents, statistics on indexes, or even as a simple change token that is watched or used for cache busting.

I think if you had the ability to represent constraint on types (number of bytes, signed/unsigned, high-/low-endian, ...), PLUS the ability to create type aliases, then something like a "counter64" could be an alias on an integer + constraints.

The one thing I've always been missing, is the ability to encode that the value is a JSON document, a counter updated with atomic_add, or some simple 64-bit integer or versionstamp. Bonus side-quest: being able to specify which .proto schema is used when encoding protobufs !

If you had just a set of "core" type: numbers (split in integer vs non integer maybe), strings (unicode), uuid, bytes, stamps, tuples and bools, + a constraint representation + creating type aliases, you'd be able to construct all the rest from that. You could ship with a set of predefined aliases (to get back 'int', 'uint', etc...) to make it easier.

I define a custom type (AKA a "box") which contains an integer. This provides me with extra runtime information. I'd probably do the same for JS.

I've seen other bindings use a similar concept, if one does not already exist in the language. I've used the same "Number" box concept in the context of JSON Numbers (JSON has the same issues with representing numbers, since it comes from the JavaScript side of the world...)

@janderland
Copy link
Owner

janderland commented Oct 29, 2024

I think if you had the ability to represent constraint on types (number of bytes, signed/unsigned, high-/low-endian, ...), PLUS the ability to create type aliases, then something like a "counter64" could be an alias on an integer + constraints.

I have a vague idea for a syntax component called "options". They can change the way a transaction, query, & variable behaves. I may have them apply to values as well.

Transaction options control:

  • If writes are allowed
  • Max bytes before transaction splitting
  • Query options for entire tx

Query options control:

  • Range read limits & direction
  • Endianess of raw values
  • Filter invalid keys or throw on error when encountered

Variable options control:

  • Constraints on the variable like regex or integer ranges

Value options control:

  • Integer size, unsigned vs. signed?
  • Float size?

So instead of handling these integer sizes at the type level, I may make them a value option. I don't know how the option syntax would look, but I'm thinking I could use a list of options after the object it's modifying. Something like this...

/my/dir(<int> [pos, even])=nil

The option list [pos, even] requires the <int> variable's value to be positive & even. This syntax doesn't look very nice, so I want to think about it a bit more.

@janderland
Copy link
Owner

OK. I gave this some more thought last night. The following is a first draft of my ideas:

Options (maybe called Modifiers) are a collection of arguments which are included during object construction. You can think of them as parameters to a constructor, or as settings of the object.

Options can modify transactions, queries, types, & elements. The options are specified as a list wrapped in braces ([, ]).

Here is an example of an element being constructed with options:

32[u16,little]

The expression 32 constructs an integer. This integer includes the option u16 which specifies that it's binary format should be a 16-bit unsigned integer. The option little specifies that it's binary format is little endian. These options will affect how the element is serialized as a value, amongst other things.

Types can also have options:

int[u16]

This specifies that when this type is used in a variable, the binary format is expected to be an unsigned 16-bit integer. I will also include pseudo types which are syntactical sugar for this kind of option. The type u16 is shorthand for the type int[u16].

You can use this type in a variable like so:

% these queries are semantically identical
/my/dir("key")=<int[u16]>
/my/dir("key")=<u16>

Options can also be used to constrain a type:

str[/ab+c/]

Here, this instance of the str type is constrained by the regex option. When matching in a query, only strings which conform to the given regex will match.

Options can also be used to modify properties of queries. The main difference is that query options are specified on the line before the query itself:

[reverse,limit:500]
/my/dir(...)

The reverse option specifies that the range-read query should be done in reverse. The limit option includes a value of 500 which sets the range-read limit.

Options can also be used to modify transactions. I don't have the syntax for transaction defined yet, so I won't show examples, but the options for transactions would include things like read or write, byte limits, and max number of retries.

Let me know what you think.

@KrzysFR
Copy link
Author

KrzysFR commented Oct 29, 2024

The [...] notation makes me think of Attributes in C#, that are used to annotate a type, method, property, etc... (seems the equivalent in go would be struct tags?), so this kind of makes sense to me.

The option little specifies that it's binary format is little endian. These options will affect how the element is serialized as a value, amongst other things.

The literal little means there is probably a big counterpart? I think this is maybe a bit too long to write, and big could be confused with big integers. Why not le/he which is something I've seen frequently in docs or APIs ? Admittedly, these could also be confused for less or equal / higher or equal, but if you use </> in constraints to represent this concept, it could be fine?

I'm unclear about what should be the default endianness. Parts of the FoundationDB API use high-endian (tuple encoding), while other uses little-endian (atomic_add). I'm not sure if the default should depend on where the variable is used, but I think that whatever the default will be, there will always be a common issue of forgetting to specify the endianness of the other. Typically a key that has both a counter in the key (high-endian) but is used to atomically increment a counter (which is high-endian)

This could be fixed by defining a set of type aliases for :

# id expected to be used in keys, little-endian by default
type id64 = int[u64]
# raw value expected to be updated with atomic_add
type counter64 = int[u64,he] 

/path/to/...("counters", <id64:docId>) = <counter64:count>

The modifiers could also encode things like json, or utf-8 or protobuf, etc?

In my experience, there's always some of these tags that will require a value, as in foo vs foo=123, so maybe allow the modifiers to have an optional part?

  • format=json, format=protobuf
  • encoding=base64, encoding=hex, encoding=base62, ...
  • version=2.0
  • schema=user
  • custom=abc123
  • label="User"
  • `description="The unique ID of the user, must be a 64-bit integer"

Combined with the ability to use let (or const) to give name to number constants used to split a directory into subspaces, I could almost define entire layers like this:

let USERS = 1
type userId = int[u64, label="UserId"]
type user = bytes[format=json, schema=user] 

let ROLES = 2
type roleId = int[u64, label="RoleId"]
type role = bytes[format=json, schema=user] 

let MESSAGES = 3
type msgId = versionstamp[label="MessageId"]
type msgPayload = bytes[format=protobuf, schema=messaage_v2.proto, param=something] 

/path/to(USERS, <userId>) = <user>

/path/to(ROLES, <roleId>) = <role>

/path/to(MESSAGES, <msgId>) = <msgPayload>

@janderland
Copy link
Owner

janderland commented Oct 29, 2024

there's always some of these tags that will require a value, as in foo vs foo=123

Yes, look at the query option example. I specify argument after a colon: [reverse,limit:500].

Combined with the ability to use let (or const) to give name to number constants used to split a directory into subspaces, I could almost define entire layers like this

Exactly. That's the goal.

The modifiers could also encode things like json, or utf-8 or protobuf, etc?

I'm hesitant to include JSON or Protobuf into the language. I'd rather have the programmer connect FQL to other tools like jq or grpcurl via pipes so that FQL doesn't need to worry about understanding those formats. What use case were you thinking of for working with JSON?

@KrzysFR
Copy link
Author

KrzysFR commented Oct 29, 2024

I'm hesitant to include JSON or Protobuf into the language.

This could be done by convention, but for example the modifier format:json, format:protobuf, format:bson, ....

From FQL point of view, this is simply another key/value in a list of modifiers, and could simply ignore it. But the a custom UI could use this modifier to select a better renderer for the value.

What use case were you thinking of for working with JSON?

The common issue I have with JSON is that some documents can be very large, but in a condensed view, you frequently only want to see the most important fields, like the id / name / title, and use the |...| or expand button to show the full JSON. You can only go so far with heuristics that target common names like id, but this is not very practical and sometimes will show a random field instead of a more interesting one.

Same cases here, by convention, a UI could use a schema:xyz modifier and use the attached value to select a better renderer for this value.

This is especially useful for binary formats that do not include the attribute names in the document (like JSON/XML do), because here a generic would have no way of rendering something useful, without access to the schema (.proto file, etc...).

@janderland
Copy link
Owner

Ah OK. Yea I think we can leave room for custom options used by custom formatters or UIs.

@KrzysFR
Copy link
Author

KrzysFR commented Oct 30, 2024

Have you already worked on writing a regular expression that would be able to match a FQL query?

I'm looking for a true/false this is valid and possibly complete. I don't need to actually parse it. I'm working on the prompt that will see the query typed character by character, and I was thinking of using a regexp to validate, as well as know when the query is complete, in order to start parsing the rest of the command (options, ...)

The shell has several commands, a few of which will accept FQL queries as arguments, with a set of optional parameters. While the user is typing the FQL query itself, I want to validate and show in orange/red whenever the syntax is incorrect (immediate feedback), as well as start using the partially typed query to run the auto-completion tool in the background.

note: _ is where the cursor is currently.

The first part if outside the scope of FQL, I want to use the query command with syntax query <fql> [--foo --bar <value> ...]

prompt > _
prompt > q_
prompt > quer_
prompt > query_
prompt > query _

From this point, I know that the next token is of type "FQL", so I can use the regexp to validate the rest, AND detect when this is complete (to continue parsing other types of tokens, like --foo bar parameters, ...

In the middle of writing a path:

prompt > query /_
prompt > query /pat_
prompt > query /path/to/us_
prompt > query /path/to/users_

Whenever a character updates the query, and it is validated, I can already see that, if you have typed "/path/to/" already, this is still incomplete, but would be a valid query on its own, so I could fetch the subdirectories of /path/to and propose them for auto-completion.

The next is ( we are in the middle of typing a tuple expression:

prompt > query /path/to/users(_   
prompt > query /path/to/users(1,_
prompt > query /path/to/users(1, <_
prompt > query /path/to/users(1, <str_   
prompt > query /path/to/users(1, <str>, _

Same here, whenever I see a ,, I could implicitly add a ...) and start looking for the top 10 keys that could match the partial query, and show some results already.

We have finished the query, additional characters are not part of the FQL query itself:

prompt > query /path/to/users(1, <str>, <uint>)_     # press ENTER
prompt > query /path/to/users(1, <str>, <uint>) _    # space to start adding another option/parameter
prompt > query /path/to/users(1, <str>, <uint>) --_  # I can run the autocomplete for existing options to this command
prompt > query /path/to/users(1, <str>, <uint>) --export foo/bar/file.json_
prompt > query /path/to/users(1, <str>, <uint>) --export foo/bar/file.json_

As soon has I see the last ) I know the FQL query part is finished, and now I'm waiting for either the ENTER key, or will auto-complete arguments to the query command (again, specific to this tool, outside of the scope of the FQL syntax)

@janderland
Copy link
Owner

janderland commented Oct 30, 2024

FQL is a context-free language. Every regular language is context-free, but not the other way around, so there is a chance that FQL will not be parsable by RegEx. I have not tried.

For this kind of validation, I'm planning on running my parser.

If you combine many RegEx expressions with a state machine then you can definitely do what you're asking. This is how I implemented syntax highlighting for the language docs.

@janderland
Copy link
Owner

As soon has I see the last ) I know the FQL query part is finished, and now I'm waiting for either the ENTER key, or will auto-complete arguments to the query command (again, specific to this tool, outside of the scope of the FQL syntax)

I'm planning on implementing auto-complete via my parser as well. In normal mode, the parser returns an error if the expression stops before a valid end-state. In auto-complete mode, if the expression stops before a valid end-state then the parser will return possible options for continuing the expression.

@KrzysFR
Copy link
Author

KrzysFR commented Oct 31, 2024

I guess I could re-execute the parser every time the expression is changed, and keep around the position of the last token that is not truncated.

When in the process of typing /users, all the intermediate states /u, /us, /use, etc..., are all technically valid, but would maybe not match an existing directory. This makes me think that the syntax is missing a "name starting with..." operator for directories, something like /use* that would match users. This may not be something intended for standard queries, but would be handy for auto-completion queries !

So when the expression is still in progress, but we have only parsed up to /use, we could implicitly change it into /use*(...) and run a quick query with that. At least until there is at least one ( in the expression like for /users( where this could be changed to /users(...) and executed.

I guess there are a few different cases that should be handled, like if you are in the middle of a variable (/users(<st), where this could be temporarily replaced with <> (or if you are and optimist, figure out that the only valid type that starts with st is str and substitute this instead).

Or when typing a string (/users("hello w_) where you could add a wildcard and run /users("hello w"*) for "tuples where the first element is a string that starts with "hello w".

It is possible to hack the tuple encoding, to make it act like a "starts with" for strings or bytes. If you have the tuple (..., "abc"), it will end with ... 02 a b c 00, but if you chop off the last byte, and do a range read from ... 02 a b c to ... 02 a b c FF, you will get all the tuples that have a string which starts with "hello" at this position. Same thing for bytes. This is as simple as chopping off the last byte for the begin key, and replacing the 00 with FF for the end key.

The only tricky thing right now would be the API of the Directory Layer that probably only has a "List" method, but you could just filter the results of listing all the sub-directories. Though technically, implementing a "starts with" version of listing sub-directories is possible, since the children names are stored as tuples, so the same trick described above would work.

@janderland
Copy link
Owner

Here is the rough draft of the FQL documentation. I still need to proofread it, rephrase certain things, etc.

This document maps out what I plan to implement from now till summer 2025.

Note that the grammar file has not yet been updated.

Let me know what you think.

https://jander.land/fql

@KrzysFR
Copy link
Author

KrzysFR commented Nov 29, 2024

Here is a list of notes while reading the draft:

  • "Foundation DB" -> "FoundationDB", single word

  • Data Elements > "int" > "Signed Integer". I think it also includes unsigned integers?

  • Element Encoding, "/raw_value()=4000", this is the first time () is introduced but I'm not sure what it does, does it adds any bytes to the query? is it required to distinguish a query that lists directories, vs one that would mention the key with the directory prefix? Would "/raw_value=4000" be equivalent?

  • big/lil, may be add "Big endian (default)", so that it is clear which is the default when omitted.

  • lil: I thought it was a typo, but it seems that it is in use as another of the many names for little-endian. https://en.wikipedia.org/wiki/Endianness does not use "lil" though it explains the origin from "Lilliputians", and I've seen "lil endian" used in some discussions. I'm more used to "BE" / "LE" notation and Big/Lil makes me think of Big/Little cpu architecture.

  • Space & Comments: are inlined comments something that would be supported sometime? (/* ... */ vs // ....).

  • Options: the doc only shows an example of limit:5 with an integer, but I guess it would also be possible to specify strings, booleans, bytes etc... using the already existing syntax defined for them? so foo:123, foo:"bar", foo:true, foo:0x123456, ...

  • Options: maybe it would be useful to define a convention for casing of multi-word option names? [helloWorld] vs [hello_world] vs [hello-world] vs [hElLOwOrlD]..... Are option names case sensitive, etc...

  • Filtering, the section about filtering client-side as soon as there is a variable or hole: technically, if there is a single variable with a known type, it is possible to tweak the key range prefix to not read values with another type. For example for /people(3392, <str>, <>), since <str> is a string, and all strings start with \x02 when encoded, you can perform the range read with \x02 suffix and it will automatically ignore keys with a different type. Though, in practice, if the value is a string, there would probably not be any other type to filter out... For mixed types, like <str|int> it could be also tweaked to run two queries, one with the \x02 suffix, and another one with the range of all integer header types. I don't know if this would be actually beneficial at runtime, but it is possible.

  • Indirection: the C API has support for fdb_transaction_get_mapped_range (https://github.com/apple/foundationdb/wiki/Everything-about-GetMappedRange) which I haven't had time to use yet, but it looks like it would be really useful in this case. There is a JAVA unit test here and a C++ test here. They use a somewhat different mapper syntax ("hello", "{K[2]}", "{K[3]}", "{...}"), but it is definitely something that could be generated from an FQL query.

@janderland
Copy link
Owner

Thanks. I'm gonna try to wrap up the manual this week. I will address these critiques.

@janderland
Copy link
Owner

lil: I thought it was a typo, but it seems that it is in use as another of the many names for little-endian.

In American slang, "lil" means "little". But I'll use LE & BE instead so it's clearer to everyone.

@janderland
Copy link
Owner

For example for /people(3392, , <>), since is a string, and all strings start with \x02 when encoded, you can perform the range read with \x02 suffix and it will automatically ignore keys with a different type.

This is a good idea. I'll add it to my TODO list as a future optimization.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants