
Discussion : Wait for installer state to be ok before continuing. #110

Open
wants to merge 2 commits into
base: master
from

Conversation

zipkid
Contributor

@zipkid zipkid commented Feb 9, 2018

We often see the CRX installer fail because AEM is restarting or is otherwise unable to handle the necessary queries/commands.
This adds a 'wait for OK to install' state check to the CRX installer provider.
The same kind of wait may also be needed in the other providers, possibly with a different check than 'Sling+OSGi+Installer.json'.
This code is certainly not ready to be merged, but we would like to discuss where and how this could be done to ensure clean Puppet runs.

Maybe this should be part of https://github.com/bstopp/crx-packmgr-api-client-gem, but that is generated from https://github.com/bstopp/swagger-aem, which I don't know how to work with.

@wimsymons

This might fix #82 as well.

@zipkid zipkid force-pushed the feature/wait_for_install_ok branch from f29031d to 8375759 Compare February 9, 2018 08:57
require 'net/http'
retries ||= @resource[:retries]
retry_timeout = @resource[:retry_timeout]
host = 'http://localhost:4502'


The hard-coded port '4502' should come from the "port" parameter; see the build_cfg method.

Contributor Author


True, but for that I would have needed to modify the 'build_client' function, which I did not want to do for this PR. It is mainly intended as a discussion entry point, definitely not as a final clean-code PR.
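
For illustration, deriving the base URL from parameters rather than a fixed string might look like the sketch below. `aem_base_uri` and the `:host` key are hypothetical names; only the "port" parameter is mentioned in this thread, and the actual build_cfg / build_client code is not shown here.

```ruby
require 'uri'

# Hypothetical helper: builds the AEM base URI from resource-style
# parameters instead of hard-coding 'http://localhost:4502'.
def aem_base_uri(params)
  host = params.fetch(:host, 'localhost')    # :host is an assumed key
  port = Integer(params.fetch(:port, 4502))  # mirrors the "port" parameter
  URI::HTTP.build(host: host, port: port)
end

puts aem_base_uri(port: 4503)
```

Falling back to 4502 keeps the behaviour of the current diff when no port is configured.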

@zipkid
Contributor Author

zipkid commented Feb 9, 2018

@bstopp, I have updated the .rubocop.yaml to 'TargetRubyVersion: 2.2'.
Can you trigger the checks please?

@bstopp
Owner

bstopp commented Feb 10, 2018

Do you have a use case or manifest set that shows this occurring? What is causing the AEM restart: a Puppet change or a user-initiated change?

What is being experienced right now? A number of subsequent failures?

I am pretty certain I know what the issue is, and this won't solve it; the system already does a check with retries here when Puppet encounters the resource to apply it.

I was pretty sure I opened a ticket somewhere on the underlying issue; if I find it, I'll link it.

@stevengssns

Hi @bstopp,

An exact example of what we have observed is the following:

In our setup we have a clean AEM 6.3 installation, followed by the installation of the Service Pack 1 and Cumulative Fix Pack 2 packages. On a clean install, we have observed that the CFP is often (but not always) only partially installed: in the package manager, a substantial number of its sub-packages are still in an uninstalled state.
When reproducing the issue on a local workstation, I observed that an install hook of one of the CFP sub-packages threw an exception saying that the Dynamic Class Loader service was no longer available.
Further investigation showed that the installation of the CFP package started too soon. When the Service Pack is installed and the package manager API returns, the package manager GUI shows the package as installed, but the installation is actually still in progress. As a result, many OSGi services are still being reloaded because of the ongoing installation(s) when the next package installation starts.

To try and make the package installations more robust, we are trying to add a more reliable check on the installation state. This check is based on the Sling OSGi Installer JMX MBean which is mentioned in the following AEM Gem:

https://docs.adobe.com/content/ddc/en/gems/AEM-Sustenance---Best-Practices-for-deploying-AEM-Maintenance-Releases/_jcr_content/par/download/file.res/AEM-Sustenance-Best-Practices-Gems.pdf

In the meantime I have also learned that the following endpoint provides similar information, although it appears to be undocumented and googling for it returns no Adobe search hits at all.

/crx/packmgr/installstatus.jsp

I've decompiled the code, and it does a very simple check on whether the ActiveResourceCount attribute of the Sling OSGi Installer JMX MBean is '0'.
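
The same kind of check could be sketched in Ruby roughly as follows. `installer_idle?` is a hypothetical helper, and it assumes the status body is JSON with a top-level `ActiveResourceCount` field; the actual response format of installstatus.jsp or the JMX JSON output is not confirmed in this thread.

```ruby
require 'json'

# Returns true when the installer reports no active resources.
# ASSUMPTION: `body` is JSON with a top-level "ActiveResourceCount" key.
def installer_idle?(body)
  data = JSON.parse(body)
  return false unless data.is_a?(Hash)

  Integer(data.fetch('ActiveResourceCount')).zero?
rescue JSON::ParserError, KeyError, ArgumentError, TypeError
  false # an unreadable or incomplete status is treated as "not idle"
end

puts installer_idle?('{"ActiveResourceCount": 0}')
```

Treating parse errors as "not idle" is a deliberate choice here: while AEM is restarting, the endpoint may return an error page rather than a status.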

I hope this clarifies the necessity for these changes, if not I can provide more info.

FYI, there are still a few issues to tackle or think about.

  • A package installation does not necessarily mean a single installation run, so after you observe ActiveResourceCount=0 another run might still start. When installing a Service Pack in particular, I observed ActiveResourceCount=0 several times before the installation was completely done. So a few successful calls are probably needed before deciding the system is done (maybe also checking that the other attributes are no longer changing).
  • When the installation has failed for some reason and some of the (new?) bundles are not starting, ActiveResourceCount remains > 0, so no other package installations will be possible until this is resolved. Not sure if this will always be desired.
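
Both points could be handled by a wait loop along the lines of the sketch below. It is not the code from this PR: `wait_until_stable` and its parameters are hypothetical, and reading "a few successful calls" as "several idle results in a row" is an assumption. The block passed in stands for whatever idle check is used.

```ruby
# Waits until the given block returns true `required` times in a row.
# Returns true on success, false once `max_attempts` checks are used up,
# so a permanently failed installation cannot block the run forever.
def wait_until_stable(required: 3, max_attempts: 30, interval: 2)
  streak = 0
  max_attempts.times do |attempt|
    idle = yield
    streak = idle ? streak + 1 : 0 # any busy result resets the streak
    return true if streak >= required

    sleep(interval) if attempt < max_attempts - 1
  end
  false
end
```

A caller would pass the idle check as the block and decide what to do on `false`, for example fail the resource or log a warning and continue.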

@henrykuijpers

@bstopp any input on this?

@bstopp
Owner

bstopp commented Sep 4, 2019

Can you confirm this wasn't fixed with v3.0.0?

5 participants