Make Spark/Hadoop service installation idempotent #237

Open
steve-drew-strong-bridge opened this issue Feb 20, 2018 · 2 comments

@steve-drew-strong-bridge
  • Flintrock version: 0.9.0
  • Python version: 3.6
  • OS: Amazon Linux / Windows

As a user of Flintrock, I would like to shave a lot of time off spinning up new clusters.
To do so, I would like to copy an AMI from a previous Flintrock install and reuse it.

Expected: without installing HDFS and Spark again, the new AMI is instantiated, the slaves files are updated, the master IP is updated in the appropriate config files, and HDFS/Spark are launched.

Bonus expectation: I would love to tell Flintrock that I've already configured the drives correctly and be able to skip the ephemeral drive allocation step as well.

Actual: today, I have to turn on the installation of each service to get it configured, so there are no time savings from using an AMI with the software pre-installed.

@steve-drew-strong-bridge
Author

Quick thought:
As a shorter path to this, it might be feasible to say that if the service directory (as Flintrock would name it) exists on the instance, then Flintrock considers the service installed. E.g., if the /home/ec2-user/hadoop directory exists, skip the install. It's up to the AMI owner at that point to be sure that Hadoop is correctly installed; all Flintrock does is move config files and start the services.

The same could be done for drive configuration. Assuming we can't just look at df for /media/ephemeral0, /media/ephemeral1, a marker file with a specific name (e.g., FlintrockDrivesInstalledDontBotherDoingItAgain) could signal that the drives are already configured, and Flintrock would skip that step as well.
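
A minimal sketch of both checks, assuming a connected paramiko SSH client (Flintrock uses paramiko); the helper names, paths, and marker-file location below are illustrative, not anything Flintrock currently defines:

```python
import paramiko

def remote_path_exists(client: paramiko.SSHClient, path: str) -> bool:
    """Return True if `path` exists on the remote instance."""
    # `test -e` exits 0 when the path exists, non-zero otherwise.
    _, stdout, _ = client.exec_command(f"test -e '{path}'")
    return stdout.channel.recv_exit_status() == 0

def service_looks_installed(client: paramiko.SSHClient) -> bool:
    # The AMI owner left the service directory in place, so treat the
    # service as installed and only push configs / start daemons.
    return remote_path_exists(client, '/home/ec2-user/hadoop')

def drives_look_configured(client: paramiko.SSHClient) -> bool:
    # Marker-file variant for the ephemeral drives (hypothetical path).
    return remote_path_exists(
        client, '/home/ec2-user/FlintrockDrivesInstalledDontBotherDoingItAgain')
```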

@nchammas
Owner

Service installation and configuration are already separated in Flintrock. We leverage this separation when adding new nodes to a cluster, for example, since when that happens all existing nodes need to have their services reconfigured but not reinstalled.

I believe what you're asking for is that installation be idempotent. One easy example of Flintrock implementing a declarative-style method of managing software is ensure_java8().
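
For illustration, the general shape of such an ensure-style helper is check-then-act. This sketch is not Flintrock's actual ensure_java8() implementation; the probe is simplified (a real implementation would also need to verify the Java version) and the yum package name is an assumption for Amazon Linux:

```python
import paramiko

def ensure_java8(client: paramiko.SSHClient) -> None:
    """Declarative sketch: converge on "Java is installed" rather
    than installing unconditionally."""
    # A non-zero exit code means no JVM is on the PATH.
    _, stdout, _ = client.exec_command('java -version')
    if stdout.channel.recv_exit_status() == 0:
        return  # desired state already holds; do nothing
    # Act only when the check fails.
    _, stdout, _ = client.exec_command('sudo yum install -y java-1.8.0-openjdk')
    stdout.channel.recv_exit_status()  # wait for the install to finish
```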

To accomplish what you're looking for, we'd need to do a few things, some of which you touched on:

  1. Add a new method to FlintrockService, maybe called _is_installed(), which takes the same input as install() and returns a boolean saying whether or not that particular service is installed (see the sketch after this list). That's where we'd capture the logic defining what "installed" means for each service. We'd call _is_installed() somewhere to figure out if we need to do anything. (Maybe this is a good use case for a decorator? I'm not sure.) It may also be better to just add the appropriate logic directly to each install() method.

  2. We need to do something similar for the ephemeral drives. The "proper" way to do it is probably to convert that code into a FlintrockService and follow the _is_installed() pattern, but we can also get away with just updating this code to make the check we want and skip setup when appropriate.

    Detecting when a drive is already set up and doesn't need any work is tricky because EC2 behavior in this regard is, to quote myself, "haphazard". Maybe your idea of having a marker file would work, but I'm concerned about adding a lot of clutter to handle this and muddying the logic for formatting ephemeral vs. EBS volumes.
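
Here's a rough sketch of point 1, including the decorator variant. FlintrockService and the HDFS service exist in Flintrock, but the method name, decorator, signatures, and the install-detection logic below are all hypothetical:

```python
import functools

from flintrock.services import FlintrockService  # import path assumed

def skip_if_installed(install_method):
    """Decorator: consult _is_installed() before running install()."""
    @functools.wraps(install_method)
    def wrapper(self, ssh_client, cluster):
        if self._is_installed(ssh_client, cluster):
            return  # e.g., the service was baked into the AMI
        return install_method(self, ssh_client, cluster)
    return wrapper

class HDFS(FlintrockService):
    def _is_installed(self, ssh_client, cluster) -> bool:
        # "Installed" here means: the directory the installer would
        # have created already exists on the instance.
        _, stdout, _ = ssh_client.exec_command('test -d hadoop')
        return stdout.channel.recv_exit_status() == 0

    @skip_if_installed
    def install(self, ssh_client, cluster):
        ...  # the existing download/unpack logic stays unchanged
```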

This is a good request, and discussing it reminds me again just how close Flintrock comes to reinventing other tools (like Ansible). 😄 Since Flintrock is strictly limited to Apache Spark and Hadoop, I'm fine with refining how we do things as long as it doesn't add a lot of complexity.

It'll take a bit of work here to implement this in a non-hacky way, but I think it's possible, especially for the main services like Spark and Hadoop.

@nchammas nchammas changed the title Separate service config from service install. Make Spark/Hadoop service installation idempotent Mar 8, 2018