(Kind of personal notes) Brainstorm on documentation/developer project mantainability #1220

daquintero · 2023-10-30T12:58:26Z

daquintero
Oct 30, 2023

Hi guys,

I thought to write this as a personal notebook of thoughts when looking at the documentation/developer structure but realized this is an open-source section of the project, and what makes it powerful is our ability to collaborate so I've decided to write my notes here.

I'm going through the code in detail. Even though I'm just a humble developer, I am looking at the code with fresh eyes, and I wanted to write down ideas for ways the development structure could be made easier, even if they are not necessarily useful. The current structure is good, and I am happy to start developing with it. This discussion contains some proposals and personal notes on potential maintainability ideas that are really my brainstorm or self-explanation of the structure of the current development flow.

In the documentation repo, we install the dependency requirements through submodules. This seems to be partly because we want the tidy3d development requirements synchronized with the documentation development, which of course makes sense. The complexity lies in maintaining the synchronization between the main codebase and the requirements. Is there a simpler way to achieve this that simplifies the installation process without development within development? Can the two projects be treated independently, yet linked by their common usage?

Say we script up a virtual environment whose requirements are updated locally, so that people can develop the documentation and the project separately, but linked by a common environment, something following this reproducible mamba structure. This would guarantee that the pandoc recipes properly install the toolset in the corresponding Windows/Linux/macOS way, since mamba installation scripts can include shell scripts, and it would distribute environments in a more common way. It could also be integrated into a mambaforge distribution like gdsfactory does. How different is this from providing installation shell scripts for each platform? Maybe it guarantees that the right section of the script gets run for the corresponding platform, and it is distributable to users as part of a mamba or conda installation.

However, I am not personally super convinced of my proposed approach, because the current requirements.txt installation means that the tool is more generic for people developing in raw Python and is also more easily distributed between platforms. It would still be nicer if we could package up the developer installation within a script. gdsfactory has a pip install gdsfactory[develop] that installs all the developer requirements and is part of the pyproject.toml, but this may not necessarily apply to us since we want to keep the documentation separate from the other repo due to the size of the project. Having a distributed virtual environment should guarantee that the sphinx compilation works well too. Hmm, I have only just begun thinking about this, so I'm sure some other approaches will come to mind later.

We are currently using nbdime to handle notebook collaboration and merges, which is a great tool for the job. gdsfactory has all their examples in jupytext, which makes collaboration and change tracking work as Python text-file line tracking, which is native to GitHub for example, rather than the raw .ipynb change tracking that nbdime makes easier to manage. This also means you can use standard tools such as black and others without depending on jupyter notebook specific tools. The drawback is that jupytext becomes essential to everyone. However, it also creates a level of standardization with the gdsfactory documentation, so that people using both tools use the same example documentation flow. I have personally used jupytext extensively and have found that for most applications of jupyter notebooks, it is an equivalent tool. However, the problem with it is that it does not maintain a cache, which is great for tracking changes, but less useful if we want to save notebooks in a particular state.
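
To illustrate the trade-off between the two notebook formats, here is a minimal sketch of what a text-based representation amounts to. It is not jupytext's implementation, and notebook_to_percent is a hypothetical helper written only for this example; it just shows that once a notebook is flattened to "# %%" delimited Python text, the diff is line-based but the cell outputs are no longer stored.

```python
import json


def notebook_to_percent(nb_json: str) -> str:
    """Render notebook cells as '# %%' delimited Python text, dropping outputs."""
    nb = json.loads(nb_json)
    chunks = []
    for cell in nb.get("cells", []):
        source = cell.get("source", "")
        # nbformat stores the source either as one string or as a list of lines
        if isinstance(source, list):
            source = "".join(source)
        if cell.get("cell_type") == "code":
            chunks.append("# %%\n" + source)
        elif cell.get("cell_type") == "markdown":
            commented = "\n".join(
                ("# " + line) if line else "#" for line in source.split("\n")
            )
            chunks.append("# %% [markdown]\n" + commented)
    return "\n\n".join(chunks) + "\n"


# A tiny made-up notebook with one markdown cell and one executed code cell.
example = json.dumps({
    "cells": [
        {"cell_type": "markdown", "source": ["# Title"], "metadata": {}},
        {
            "cell_type": "code",
            "source": ["x = 1\n", "print(x)"],
            "outputs": [{"output_type": "stream", "name": "stdout", "text": ["1\n"]}],
            "execution_count": 1,
            "metadata": {},
        },
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
})

text = notebook_to_percent(example)
print(text)
```

The printed text contains the code and the commented markdown but nothing from the "outputs" field, which is the property that makes diffs clean and also the reason saved results are lost.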

daquintero · 2023-10-30T14:40:54Z

daquintero
Oct 30, 2023
Author

However, judging by #1068, it sounds like we don't want to use requirements.txt anymore and want to move towards a pyproject.toml-based form of installation. There's a bit of complexity here in terms of installing the documentation requirements accordingly. Currently, we have requirements.txt everywhere, and again, this is not mutually exclusive with the setup.py or pyproject.toml installations, since it lets environment management tools know what the required package dependencies are.

Let's look at this functionally. Conceptually we have a single project, with the source code and the related documentation. The state of the documentation and the codebase are linked through their corresponding versions. However, if we break it down, we kind of have two separate build flows. The main tidy3d codebase is developed for the actual FDTD operation, whilst the documentation development follows a standard sphinx process. What links them together is the autoapi and the corresponding automatic code-documentation build. So for the purposes of integrating sphinx in the development flow, it makes sense to have the tidy3d main project within tidy3d-docs, since it guarantees that level of integration.

If we were to think of the documentation and main code as integrated together, then it would be easier to follow the standard development design flow. Let's say we write a pyproject.toml that includes the development requirements alongside the other sets of requirements. What that means is that the instructions and requirements for installing the documentation would all live in the main codebase. This also leaves room for integrating all of the docs and the main codebase back together in the future, should that be desired, and then hosting the examples in a separate repo which gets built as part of the sphinx GitHub action, so that everything can be developed separately and is easier to maintain modularly. We would need some fundamental scripts to connect them all together, but these are well-defined and bounded interconnections.
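
As a rough sketch of the idea above, the standard [project.optional-dependencies] table in a pyproject.toml lets one file hold the core requirements plus separate groups. The fragment below is hypothetical: the group names (dev, docs), the package lists, and the helper requirements_for are placeholders for illustration, not the actual tidy3d requirements.

```python
try:
    import tomllib  # standard library from Python 3.11
except ModuleNotFoundError:  # older interpreters need the tomli backport
    import tomli as tomllib

# Hypothetical pyproject.toml fragment; names and groups are placeholders.
PYPROJECT = """
[project]
name = "tidy3d"
version = "0.0.0"
dependencies = ["numpy", "pydantic"]

[project.optional-dependencies]
dev = ["pytest", "black"]
docs = ["sphinx", "nbsphinx"]
"""


def requirements_for(pyproject_text, extras=()):
    """Return the core dependencies plus those of the requested extras."""
    project = tomllib.loads(pyproject_text)["project"]
    reqs = list(project.get("dependencies", []))
    optional = project.get("optional-dependencies", {})
    for extra in extras:
        reqs.extend(optional[extra])  # a KeyError signals an unknown extra
    return reqs


print(requirements_for(PYPROJECT))
print(requirements_for(PYPROJECT, extras=["docs"]))
```

With a layout like this, a plain pip install of the package would pull only the core list, while an install with an extra (in the style of pip install gdsfactory[develop]) would add the matching group, so regular users never see the documentation requirements.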

This does not yet fully solve the pandoc easy installation issue.

I think a good approach is this: we have a pyproject.toml that controls all the requirements for the main project, and documentation + tests. This means that we have full control over how to create a virtual environment that will be functionally correct and that is shared and verified between the two projects. We can build a mamba recipe for multiple platforms based on this that automatically installs the pandoc requirement through a script that works based on the platform; besides, the mamba pandoc recipe already includes the actual binary.
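
For the pandoc part, a platform-aware check could look roughly like the sketch below. It only detects whether pandoc is on the PATH and suggests a command; the suggested commands are common package-manager invocations listed here as assumptions, and pandoc_hint is a hypothetical helper, not an existing tidy3d script.

```python
import platform
import shutil

# Suggested commands are assumptions about what a developer might have installed.
INSTALL_HINTS = {
    "Linux": "sudo apt-get install pandoc",
    "Darwin": "brew install pandoc",
    "Windows": "choco install pandoc",
}


def pandoc_hint(system=None, which=shutil.which):
    """Return None if pandoc is on PATH, else an install hint for the platform."""
    if which("pandoc") is not None:
        return None
    system = system or platform.system()
    # Fall back to the conda-forge package, which ships the pandoc binary.
    return INSTALL_HINTS.get(system, "conda install -c conda-forge pandoc")


hint = pandoc_hint()
print("pandoc found" if hint is None else f"pandoc missing, try: {hint}")
```

Passing the lookup function in as a parameter keeps the helper easy to test without depending on what is installed on the machine running it.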

0 replies

tylerflex · 2023-10-30T15:23:12Z

tylerflex
Oct 30, 2023
Maintainer

Thanks Dario, your insights are super appreciated, so thank you for writing them down like this. It's invaluable to us. As you can imagine, a lot of the way things are structured at this point has more to do with maintaining legacy code and decisions made a while ago, when we had different problems than we do now. So a big way you can help us out is to take a fresh look at our development process and come up with better ways to do things so that we can be more efficient and scale as things get even more complex in the coming months.

The docs / frontend integration is one major pain point that I think we need to figure out, so it's great that we're starting with this. Note also that we have various private repos that are similarly integrated with the front end, so there's a lot of interdependency that makes development a bit tricky. For example, if I introduce a new feature, oftentimes I need to have 3 PRs (docs, frontend, backend) that are all linked through submodules and need to be merged one by one. It's a bit of a mess to be honest, so having things integrated would really help.

For a bit of context, we decided to split tidy3d-docs and tidy3d to avoid notebook and docs commits cluttering the tidy3d git history and bloating the package. This was before we introduced nbdime so it's totally possible we can integrate them but just want to be aware of this aspect. In fact, I'm definitely in favor of it if we can find a good way.

About your individual comments, I'll try to answer a bit / give some more context:

virtual environments / requirements.

We want to really simplify and modernize this and avoid messing with parsing requirements.txt in setup.py like we have now. I'm not as familiar with existing python frameworks for managing things, but we have some ideas. As you pointed out, we want to move towards a single pyproject.toml. We're also considering using poetry to handle requirements. We deal with some "dependency hell" situations from time to time, or packages upgrading and breaking versions of tidy3d, so we want to have a better system for dealing with this. I think if we use poetry, this gives the flexibility for users to pip install tidy3d (we want to maintain a very simple installation command and probably not require tools like conda), but also lets us control the requirements much more tightly in production and gives users that ability. Does that fit in with your proposed solution?

Notebook diffs

@lucas-flexcompute originally suggested nbdime and it's been working pretty well, but I'm not opposed to trying something else, like jupytext, if it is better suited for us. I think making it required is not a problem. Not sure about the cache feature; we at least haven't used this. Would it be useful for us, do you think? When we make a new minor version, we generally re-run all of the notebooks to generate a new state, which can be quite a lot of work. Would this allow us to maybe only update some notebooks and manage their states more easily?

frontend / docs integration

I think a good approach is this: we have a pyproject.toml that controls all the requirements for the main project, and documentation + tests. This means that we have full control over how to create a virtual environment that will be functionally correct and that is shared and verified between the two projects.

I think this sounds good to me, again, assuming that regular users wouldn't have to install or worry about docs at all. So would we just move tidy3d into tidy3d/docs?

An important note: tidy3d-docs is "owned" by the flexcompute-readthedocs account because we didn't want to give readthedocs read access to our flexcompute site, which has private repos. Do you think it's possible to still configure readthedocs to use the original tidy3d-docs repo without much complication?

0 replies

daquintero · 2023-10-31T10:51:03Z

daquintero
Oct 31, 2023
Author

Hi Tyler,

I am very grateful to you for taking the time to write up your valuable insights. I have been thinking about them since yesterday.

I like the poetry + pyproject.toml dependency management approach. This would allow us to leverage a few things. If we use it as the main "control" for dependency management of the projects, we should be able to create reproducible development environments across platforms locally and distributed. And we should be able to set up reproducible documentation builds from it as subscripts.
I have been thinking about how to manage the backend -> frontend -> docs flow you describe and manage multiple project sections beyond submodules. Another thing related to this is that if we have multiple versions of each project, then there are bound to be unreproducible bugs without well-defined guaranteed dependency management between these projects. How about if we were to do something like an "environment project version lock"? We use the pyproject.toml to configure a virtual environment, and have a well-defined verified project dependency configuration, say with the existing dependency or any future project/plugins, etc. We use a bump2version-esque version manager between the projects so that we can script up say a snapshot that enables us to commit multiple project versions in one go if we wanted - and track a bug across multiple tools. We could use the pyproject.toml to configure a virtual environment, in which the development versions can be easily installed. If we have well-defined version numbers it should be possible to track and commit the state of multiple projects working together be it plugins or dependencies. I think this would allow us to move in the direction of treating the projects a bit like a microservice, but with a well-defined and reproducible environment dependency manager that controls how well-integrated each of their functionality might be with each other. Each plugin/project has the minimal pyproject.toml for its own functionality, but the thing that could integrate them together is a main tidy3d pyproject.toml that integrates them together, and manages how integrated the system is at a particular development or deployment version. This would enable us to move away from submodules (except in the case of the documentation autoapi, which so far I understand is the main requirement), and then be able to treat each project semi-independently and reproducibly together. 
It's still in the back of my mind other ways to approach this multi-project management structure.
Thanks for the explanation on the documentation, yeah that makes sense. I noticed the same thing you mentioned regarding readthedocs having organization access and I wasn't a fan of it either to be honest. Sounds good, it makes sense to keep working on the multi organisation structure as it is now (although maybe moving the build requirements of the documentation dependency back into the main project, to treat it as a pure subdirectory which just includes the docs, and we just have some sort of version matching?). I'm happy to structure working towards this approach. I'm also curious though. Since GitHub already has access to all organization files, and GitHub Pages is also another way of hosting sphinx-generated stuff. Is it that readthedocs is cheaper than GitHub Pages? This might be an old discussion you guys already had and I'm sure the methodology was chosen for a good reason so happy to stick with readthedocs - although I'm curious if it was a pricing thing. This might be particularly handy if the plan is to release other projects/submodules each with their own documentation as that way it would be managed together. However, I'm sure you guys already thought about that so I'll just catch up on maybe the old decision in time, and keep going with the current structure.
I am still thinking about the notebooks. Sorry I wasn't originally suggesting using the jupyter caches, it's mainly that's what nbdime is cleaning and what the jupyter tools have to deal with. In previous experience, generally relying on jupyter caches is a massive pain and causes unreproducible bugs and non-working code. There is something to be said for reproducibility in terms of running everything again. I think I meant that we can still have jupyter files but save as .py text files via jupytext which also enables us to use other IDE functionality when making them. It's really a thing of preference if we want to keep using .ipynb files via nbdime or .py files via jupytext, as both enable clean commits and merging. I'm happy to continue with the current methodology so was just a suggestion if of interest. I definitely agree on the importance of the base pip installation. Upon further thought, it makes sense for people to install their own pandoc in their own way, rather than us packaging it in the conda development environment recipe which may really not be necessary. I was thinking in terms of how to set up a reproducible development documentation build environment, rather than the actual main codebase. Haha just wanted to clarify what I was thinking there! I don't think this is crucial currently but was just thinking about it.
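
To make the "environment project version lock" idea above a bit more concrete, here is a minimal sketch of taking a snapshot of the installed versions of several projects and comparing it against a previously committed lock. The function names and the plugin name are hypothetical, and a real setup would more likely lean on an existing lock-file tool; this only shows the bookkeeping.

```python
import json
from importlib import metadata


def snapshot_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def find_drift(locked, current):
    """Return {name: (locked, current)} for packages whose versions differ."""
    return {
        name: (locked[name], current.get(name))
        for name in locked
        if locked[name] != current.get(name)
    }


# "tidy3d-hypothetical-plugin" is a made-up name used only for illustration.
current = snapshot_versions(["tidy3d", "tidy3d-hypothetical-plugin"])
print(json.dumps(current, indent=2))
```

Committing the snapshot alongside a bug report would record which combination of project versions was in use, and find_drift would flag when a developer's environment no longer matches it.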

What do you think?

0 replies

lucas-flexcompute · 2023-10-31T11:14:47Z

lucas-flexcompute
Oct 31, 2023
Maintainer

Hey @daquintero, thanks for the discussion points! I just wanted to share my thoughts from when I looked into the notebooks issue.

I was also looking into jupytext when trying to improve our workflow, but I could not find an easy way to have the run results also cached in the repo. The discussion I had with Tyler and Momchil at the time was that having the cell outputs in the repository is very desirable, because PR reviews are much easier when we can compare plots side-by-side, for example, straight in GitHub. It's also convenient to have the notebook results ready when we're making any changes without having to run them before we start (some notebooks can be very long).

My ideal solution would be to have the jupytext version of each notebook in the repository history and the ipynb versions only for the branch tip, but I don't know if that's viable. If jupytext and ipynb are kept in sync, we might be able to set up a git hook that removes old versions of the ipynb when we commit, but both working with synced files and using commit hooks are prone to error.

1 reply

daquintero Oct 31, 2023
Author

Hi Lucas! Hope you're well. Yeah, that totally makes sense and it's a good methodology. It's hard to get that only with jupytext and the nbdime solution works "which we don't want to break" haha

tylerflex · 2023-10-31T16:26:50Z

tylerflex
Oct 31, 2023
Maintainer

I like the poetry + pyproject.toml dependency management approach. This would allow us to leverage a few things. If we use it as the main "control" for dependency management of the projects, we should be able to create reproducible development environments across platforms locally and distributed. And we should be able to set up reproducible documentation builds from it as subscripts.

Sounds good to me, then!

I have been thinking about how to manage the backend -> frontend -> docs flow you describe and manage multiple project sections beyond submodules. Another thing related to this is that if we have multiple versions of each project, then there are bound to be unreproducible bugs without well-defined guaranteed dependency management between these projects. How about if we were to do something like an "environment project version lock"? We use the pyproject.toml to configure a virtual environment, and have a well-defined verified project dependency configuration, say with the existing dependency or any future project/plugins, etc. We use a bump2version-esque version manager between the projects so that we can script up say a snapshot that enables us to commit multiple project versions in one go if we wanted - and track a bug across multiple tools. We could use the pyproject.toml to configure a virtual environment, in which the development versions can be easily installed. If we have well-defined version numbers it should be possible to track and commit the state of multiple projects working together be it plugins or dependencies. I think this would allow us to move in the direction of treating the projects a bit like a microservice, but with a well-defined and reproducible environment dependency manager that controls how well-integrated each of their functionality might be with each other. Each plugin/project has the minimal pyproject.toml for its own functionality, but the thing that could integrate them together is a main tidy3d pyproject.toml that integrates them together, and manages how integrated the system is at a particular development or deployment version. This would enable us to move away from submodules (except in the case of the documentation autoapi, which so far I understand is the main requirement), and then be able to treat each project semi-independently and reproducibly together. 
It's still in the back of my mind other ways to approach this multi-project management structure.

I think something along these lines could be useful. The locking between backend and front end is not so difficult at the moment, but it might be good to standardize. I'd say this is a lower priority than merging front end and docs though. Once you join, it would be useful to talk with some of the backend-focused developers to get a feeling for what their pain points are regarding this; I'm less informed on that end.

Thanks for the explanation on the documentation, yeah that makes sense. I noticed the same thing you mentioned regarding readthedocs having organization access and I wasn't a fan of it either to be honest. Sounds good, it makes sense to keep working on the multi organisation structure as it is now (although maybe moving the build requirements of the documentation dependency back into the main project, to treat it as a pure subdirectory which just includes the docs, and we just have some sort of version matching?). I'm happy to structure working towards this approach.

Something like this could work, yea. I think the main thing that would be nice is if the new feature PR could be contained entirely in the frontend repo, tidy3d. That would generally mean the front end source code changes, notebook changes, and some tidy3d-docs/docs/source/api.rst changes. Maybe an option is to move that stuff into tidy3d/docs and have tidy3d-docs just grab these things from the git submodule.

I'm also curious though. Since GitHub already has access to all organization files, and GitHub Pages is also another way of hosting sphinx-generated stuff. Is it that readthedocs is cheaper than GitHub Pages? This might be an old discussion you guys already had and I'm sure the methodology was chosen for a good reason so happy to stick with readthedocs - although I'm curious if it was a pricing thing. This might be particularly handy if the plan is to release other projects/submodules each with their own documentation as that way it would be managed together. However, I'm sure you guys already thought about that so I'll just catch up on maybe the old decision in time, and keep going with the current structure.

Yea, I'll have to refer you to the person on our other team who directed me to use flexcompute-readthedocs. There might be a way around this, so it would be good to double check. I think we just need to be 100% sure that our private repos are not exposed, but it's a good point that GitHub already can see them.

I am still thinking about the notebooks. Sorry I wasn't originally suggesting using the jupyter caches, it's mainly that's what nbdime is cleaning and what the jupyter tools have to deal with. In previous experience, generally relying on jupyter caches is a massive pain and causes unreproducible bugs and non-working code. There is something to be said for reproducibility in terms of running everything again. I think I meant that we can still have jupyter files but save as .py text files via jupytext which also enables us to use other IDE functionality when making them. It's really a thing of preference if we want to keep using .ipynb files via nbdime or .py files via jupytext, as both enable clean commits and merging. I'm happy to continue with the current methodology so was just a suggestion if of interest.

That sounds fine. I think Lucas' comment on this explains the situation quite well: we generally want the notebook diffs to be quite small to not bloat the repo, but it is also useful to have the outputs when reviewing. Maybe you and Lucas can work out a good solution for that.

I definitely agree on the importance of the base pip installation. Upon further thought, it makes sense for people to install their own pandoc in their own way, rather than us packaging it in the conda development environment recipe which may really not be necessary. I was thinking in terms of how to set up a reproducible development documentation build environment, rather than the actual main codebase. Haha just wanted to clarify what I was thinking there! I don't think this is crucial currently but was just thinking about it.

I would say that it's very rare for any external users to compile the docs themselves, so I am less concerned generally about how we install the docs requirements, but want to keep the pip install tidy3d simple. I think at this point only about 1/3 of our team seems able to compile the docs locally anyway, which isn't good :P so anything will be an improvement there.

Let's schedule a time to talk tomorrow, maybe my morning (EST), after you are set up with onboarding, and we can discuss some more over zoom. Maybe just email me tomorrow once you're set up.

0 replies
