refreshing the schemas: freeze the p5subset, add it to our vc, update the syntax in the ODD #62
Comments
Another hopefully minor issue (but actually part of a larger issue suitable for a separate task in a separate ticket) is the way to make sure that the newly derived RNG is still valid for all the dictionary databases. I seem to recall that the Freedict make system had a 'validate' target, so I imagine that, after regenerating the RNG, I would only have to run make with the specific parameter, and watch for error messages. @humenda , do you sense any trouble in this regard, please?
If it is about applying the RNG to the TEI file (xmllint, etc.), I
would say that's fine. The only stepping stone here is that the RNGs are
sometimes symlinked and a broken symlink can cause trouble :).
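A minimal sketch of the two checks mentioned here, assuming the schemas are `.rng` files symlinked into each dictionary directory. The helper names `broken_symlinks` and `xmllint_command` are hypothetical and not part of the FreeDict tools; `--noout` and `--relaxng` are standard `xmllint` options.

```python
import os
from pathlib import Path


def broken_symlinks(root, suffix=".rng"):
    """Return symlinks under root with the given suffix whose target is missing."""
    broken = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames + dirnames:
            path = Path(dirpath) / name
            # is_symlink() is true even for dangling links; exists() follows the link.
            if path.suffix == suffix and path.is_symlink() and not path.exists():
                broken.append(path)
    return sorted(broken)


def xmllint_command(schema, tei_file):
    """Build the xmllint call that validates one TEI file against a RELAX NG schema."""
    return ["xmllint", "--noout", "--relaxng", str(schema), str(tei_file)]
```

Running `broken_symlinks(".")` before validation would surface dangling schema links early; the list returned by `xmllint_command` can be handed to `subprocess.run`, where a non-zero return code should indicate a problem with the file or the schema.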
BTW, if the schemas were in tools/, we wouldn't need to copy / symlink the
schemas to each dictionary; they would simply be part of the tooling. Does this
sound sensible? If so, I would like to make this shift at some point.
Envisioned action sequence:
1. derive the current p5subset (on my disk, against the current snapshot of the TEI and TEI Stylesheets)
You can always check out a new branch and commit any temporary state
there. That at least gives people the chance to see progress.
2. freeze the `p5subset` by adding it to Freedict version control (where? under `shared/` or elsewhere?)
What is it? If that subset manifests itself as ODDs and schemas, then why not
move it straight to tools and adapt the build system to use it from there? If
it is a preformat and we decide to move the schemas to tools, this preformat
should also be in tools.
5. freeze the newly derived `freedict_p5subset` next to the `p5subset`; this one should be regenerated by hand after each modification of the Freedict ODD (one has to remember about that); recall: it's frozen for convenience, to shield it from any ensuing modifications in the TEI Stylesheets
I am lost here. Please go ahead if you have your plan :).
Replying to specific points:
I don't think document grammars should be seen as part of the tooling. ODD and schemas are what provides semantic and syntactic rules for the interpretation of dictionary documents. I would definitely advise keeping them within the fd-dictionaries repository and either symlinking them, the way it's done now, or making the dictionaries point to the shared/ directory to identify the schema. I have just posted #66 to outline that. [EDIT: I would be completely comfortable (or even outright happy) with scratching issue #66 and maintaining the current status quo]
Thanks :-) I understand that some of the above may be unclear (and I think I will reduce the procedure somewhat, to save some time), but indeed, I'm going to work on that in a separate branch, so nothing will be affected until I'm finished and it looks good.
Trying to keep the off-topic to a single ticket, so I am reposting Sebastian's comment from elsewhere. I am not sure if Sebastian had read my reply above before posting that comment.
Tools operate on the semi-structured databases (as our XML dictionaries can be treated) in many cases thanks to the document grammars that flesh out the semantics of the particular components or regulate the relationships between components. Think very early HTML with all the styling info inside. Separating the styling info into CSS leaves us with a skeleton that the styling information from the CSS attaches to. You need to put the two together in order to get a pleasant, readable web page.
If you were to take the schemas away, you would only leave part of the relevant information in fd-dictionaries. They would be half-useless as XML documents, until the schemas were located or (imperfectly) inferred from the existing structure. There is nothing natural at all in snatching schemas away from the dictionary documents. I don't think it is a good approach for an open project to say, "fd-dictionaries contain bare XML documents; in order to make them meaningful, you have to install the other repository as well". That just isn't user-friendly.
The TEI ODD makes the connection between the XML documents and schemas even more explicit, and it is my fault that I have not maintained our ODD for a long time and have failed to exploit some of its features. I intend to amend that situation, and this ticket outlines the first steps towards that goal.
Looping back to the beginning of this particular comment: I believe that there exist good arguments for keeping schemas in fd-dictionaries rather than in fd-tools. I would like to suggest that we maintain the status quo in this regard, and don't try to fix something that is not broken.
[1] A minor note: it is part of TEI compliance requirements that in order to qualify as a TEI document, an XML document has to be (among other things) accompanied by the ODD document that defines its schema. But I believe that my argument above stands even without this further detail.
If you were to take the schemas away, you would only leave part of the relevant information in fd-dictionaries. They would be half-useless as XML documents, until the schemas were located or (imperfectly) inferred from the existing structure. There is completely nothing natural in snatching schemas away from the dictionary documents. I don't think it is a good approach for an open project to say, "fd-dictionaries contain bare XML documents; in order to make them meaningful, you have to install the other repository as well". That just isn't user-friendly.
fd-dictionaries in the current form (with schemas) do not _require_ the fd-tools in order to become useful to people who do not wish to build distribution packages. They can safely exist on their own and the shared/ directory contains enough information (even if some of it is outdated) to get people started using or even fixing or extending fd-dictionaries with an XML editor. fd-tools make fd-dictionaries even more valuable, but they are not essential for fd-dictionaries to function on their own, _if_ fd-documents are accompanied by their schema and their ODD.[1]
I can only stress what is in the README. It says, among other things, that
this repository only contains the dictionaries that are not auto-imported
(anymore). There are more dictionaries these days that we automatically import
than those we maintain by hand. The dictionaries in this repository are in that
sense only half of the story. What about our auto-imported dictionaries?
Why is it "user friendly" that somebody who reads about them and visits
https://download.freedict.org/generated doesn't find the shared folder with
the schemas? They are currently copy-pasted there, just because we have a
rather sloppy way of treating schemas somewhere between data and tools.
I'm looking at it from the perspective of a distributor. The schemas are one
thing, the data is something else. They belong in different packages. The
`shared` folder was a great thing as long as everything was in one repository.
Schema updates automatically propagated to the dictionaries. Today, this is
no longer the case. It could well happen that the schemas change in an
incompatible way and there's no way for external users to figure out which
schema version should be used for the dictionary they have at hand. I am all
for strict versioning here. We could of course separate the schemas into a separate repository, but IMO
versioning together with tools is more convenient.
Is there a compromise we could find to resolve this discrepancy between
dictionaries in this repo and from other sources?
I would like to update the existing ODD in two steps. This ticket is meant for the first and gentler of them, namely a rewrite of the current ODD to the current TEI idiom. Ideally, that should mean just a cosmetic change that does not affect the extension (i.e., the patterns/grammars defined by RNG, XSD, DTD). In practice, the extension is going to be affected due to the changes in the TEI that have happened over the years, so some tinkering may be in order, along with a lot of test runs across all the databases.
In doing that, I would like to add two files to our version control, for strictly internal purposes: so that we can trace the changes in the TEI internals without investigating the git history of the TEI itself each time.
Let me sketch some background:
- […] `p5subset`. It is called an 'integrated ODD'.
- […] `p5subset`) silently creates something that can be called Freedict integrated ODD; it is not visible to the outside eyes, because it is regenerated each time that the Freedict ODD is manipulated by the TEI Stylesheets.
- […] `p5subset` as it was defined by the TEI years ago. So while the Freedict ODD hasn't been modified since then, the result of its application on the current `p5subset` is going to be extensionally different from what was used years ago. I don't think it's a major issue (because we only use a very small subset of the TEI), but it's definitely something to be aware of.
- […] `p5subset` in our version control is that, if one doesn't have full control of the TEI environment, their ODDs may reference the current 'blessed' TEI ODD, recreated after each release in the TEI Vault, or the current snapshot of the TEI under control of their Jenkins environment, or the local `p5subset` on the user's hard drive; what I propose reduces this potential complexity and adds a lot of transparency.
A hopefully minor complication is that our RNG was edited by hand since it got derived. Since it is version-controlled, I can extract the modifications and reapply them at the ODD level.
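One way to make the hand edits visible, sketched here under assumptions: instead of walking the git history, compare a freshly derived RNG with the hand-edited one using Python's standard `difflib`. The function names and the default file labels are hypothetical, and this is an alternative illustration rather than the procedure actually used.

```python
import difflib


def schema_diff(derived_text, edited_text,
                derived_name="derived.rng", edited_name="edited.rng"):
    """Return a unified diff from the freshly derived schema to the hand-edited one."""
    return list(
        difflib.unified_diff(
            derived_text.splitlines(keepends=True),
            edited_text.splitlines(keepends=True),
            fromfile=derived_name,
            tofile=edited_name,
        )
    )


def changed_lines(diff):
    """Keep only added/removed lines; the first two diff lines are file headers."""
    return [
        line.rstrip("\n")
        for line in diff[2:]
        if line.startswith(("+", "-"))
    ]
```

The lines returned by `changed_lines` are a rough checklist of what would have to be re-expressed at the ODD level; a line-based diff is sensitive to whitespace and attribute order, so the output still needs a human reading.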
Another hopefully minor issue (but actually part of a larger issue suitable for a separate task in a separate ticket) is the way to make sure that the newly derived RNG is still valid for all the dictionary databases. I seem to recall that the Freedict make system had a 'validate' target, so I imagine that, after regenerating the RNG, I would only have to run make with the specific parameter, and watch for error messages. @humenda, do you sense any trouble in this regard, please?
EDIT: this is now the topic of freedict/tools#28 and I have an interim solution
I mentioned adding two files to the version control. I meant the current `p5subset` and the Freedict integrated ODD (call it... `freedict_p5subset`?). The first one freezes the current state of the TEI, so that, in the future, we can diff that. The second is to expose the Freedict integrated ODD for similar comparisons. I could probably live without the latter, since it depends on the former, but it also depends on the TEI stylesheets, and those are under constant development as well. Bottom line: it's far more convenient in case one has to investigate some schema-related issue across time, to have both these files handy, because both of them can only be recreated in the future after tinkering with two very dynamic repositories (TEI Guidelines and TEI Stylesheets).
Envisioned action sequence:
1. derive the current `p5subset` (on my disk, against the current snapshot of the TEI and TEI Stylesheets)
2. freeze the `p5subset` by adding it to Freedict version control (where? under `shared/` or elsewhere?)
[…] `freedict_p5subset` by using the current Freedict ODD, with one change: its `@source` attribute will now point at the `p5subset` frozen at step (2)
5. freeze the newly derived `freedict_p5subset` next to the `p5subset`; this one should be regenerated by hand after each modification of the Freedict ODD (one has to remember about that); recall: it's frozen for convenience, to shield it from any ensuing modifications in the TEI Stylesheets
[…] `freedict_p5subset` just to document any modifications that could have crept in at step (6)
At this point, after all the above actions, we should still be at the status quo, except with (a) two new files, kept for reproducibility checks, and (b) a newer Freedict ODD, ready to be modified further.
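The `@source` change mentioned in the action sequence could be sketched roughly as follows, assuming the attribute sits on the ODD's `<schemaSpec>` element in the TEI namespace. The function name `point_source_at` and any path passed to it are hypothetical; the actual ODD may be organised differently.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def point_source_at(odd_xml, frozen_path):
    """Return the ODD text with schemaSpec/@source set to frozen_path."""
    # Keep TEI as the default namespace when serialising.
    ET.register_namespace("", TEI_NS)
    root = ET.fromstring(odd_xml)
    specs = list(root.iter(f"{{{TEI_NS}}}schemaSpec"))
    if not specs:
        raise ValueError("no <schemaSpec> found in the ODD")
    for spec in specs:
        # Point the derivation at the frozen copy instead of a moving target.
        spec.set("source", frozen_path)
    return ET.tostring(root, encoding="unicode")
```

Note that `xml.etree.ElementTree` discards comments when parsing, so for a hand-maintained ODD this is better read as an illustration of the change than as a tool to run on the real file; editing the attribute by hand or with XSLT would preserve the document as written.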