
Review lesson objectives #15

Open
tobyhodges opened this issue Aug 1, 2022 · 9 comments
Labels
hackathon · status:in progress (Contributor working on issue) · type:discussion (Discussion or feedback about the lesson)

Comments

@tobyhodges (Contributor)

To guide the development of this lesson, it would be a good idea to have a clear outline of which parts of the current lesson are important to keep for the HPC Carpentry setting, which are non-essential but not harmful, and which need to be removed.

One way to do this would be to review the current list of objectives for the lesson and discuss them in the context above. Perhaps dividing them into "must be kept", "could be kept", and "should be removed"? And then you will also need a list of new objectives that you want/need to add, which are not in the lesson in its current form.

@tobyhodges (Contributor, Author)

List of objectives in the current version of the lesson:

  • 1. Managing Data Processing Workflow
    • Understand our example problem.
  • 2. Snakefiles
    • Understand the components of a Snakefile: rules, inputs, outputs, and actions.
    • Write a simple Snakefile.
    • Run Snakemake from the shell.
    • Perform a dry-run, to understand your workflow without executing anything.
  • 3. Wildcards
    • Use Snakemake wildcards to simplify our rules.
    • Understand that outputs depend not only on the input data files but also on the scripts or code.
  • 4. Pattern Rules
    • Write Snakemake pattern rules.
  • 5. Snakefiles are Python Code
    • Use Python variables, functions, and imports in a Snakefile.
    • Learn to use the run action to execute Python code as an action.
  • 6. Completing the Pipeline
    • Update existing rules so that dat files are created in a subdirectory.
    • Add a rule to your Snakefile that generates PNG plots of word frequencies.
    • Add an all rule to your Snakefile.
    • Make all the default rule.
  • 7. Resources and Parallelism
    • Modify your pipeline to run in parallel.
  • 8. Make your workflow portable and reduce duplication
    • Learn to use configuration files to make workflows portable.
    • Learn a safe way to mix global variables and snakemake wildcards.
    • Learn to use configuration files, global variables, and wildcards in a systematic way to reduce duplication and make your workflows less error-prone.
  • 9. Scaling a pipeline across a cluster
    • Understand the Snakemake cluster job submission workflow.
  • 10. Final notes
    • Understand how to perform a dry-run of your workflow.
    • Understand how to configure logging so that each rule generates a separate log.
    • Understand how to visualise your workflow.

@tobyhodges (Contributor, Author)

Based on discussion in the first sprint session at CarpentryCon, it sounds like some discussion of the --profile option is something that will need to be added to the lesson for an HPC setting.
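For reference, a profile is a directory containing a config.yaml whose keys mirror snakemake's command-line options, selected at run time with snakemake --profile <name>. A minimal sketch, assuming Slurm and the --cluster interface of Snakemake 7 (the profile name, partition, and resource name below are illustrative, not from the lesson):

```yaml
# ~/.config/snakemake/cluster-demo/config.yaml
# Each key corresponds to a snakemake command-line option.
jobs: 10                   # --jobs 10: at most 10 jobs in flight at once
cluster: "sbatch --partition=normal --ntasks={resources.tasks} --time=5"
```

Invoking snakemake --profile cluster-demo then applies these settings, so learners do not have to retype the submission command for every run.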

@tobyhodges (Contributor, Author)

Related to discussion in #4, familiarity with Python is not currently a prerequisite for the lesson, so episode 5 should probably be removed altogether (after checking that it does not contain other useful content that is not specific to Python).

@vinisalazar

> Based on discussion in the first sprint session at CarpentryCon, it sounds like some discussion of the --profile option is something that will need to be added to the lesson for an HPC setting.

I will try working on that on the upcoming sprints by submitting a PR to episode 9.

@tkphd added the good first issue (Good for newcomers) label on Aug 4, 2022
@bkmgit (Contributor) commented Aug 10, 2022

Some of the flexibility comes from Python. Programming experience from the shell lesson should be enough to understand the Python code if it is introduced correctly.

@bkmgit (Contributor) commented Aug 10, 2022

The lesson is quite long (6 hours); it would be good if it were about 4 hours. Removing some of the repeated parts may work.

@reid-a (Member) commented Oct 20, 2022

Draft outline of objectives for the revised lesson, for when we get there. Some of this material is cut-and-pasted from the Sprint notes.

  1. Episode 1 is mostly motivation/stage-setting. Brief refresh of Amdahl example, running the script on laptops, or the head node (if allowed). Motivation question: what is the relationship between width and time taken?
    • Objective: by the end, learners should be able to... compare performance indicators from tasks with different parameters run on the same system
    • Note: the note in this episode about Bash scripts is less important for the HPC Carpentry setting
  2. Episode 2 introduces a lot of concepts: rules, inputs, outputs, actions, running a snakefile, graphs (maybe not explicitly introduced, but used in figures), and dry runs. It needs more exercises to assess progress on all of this.
    • Note: we might aim to write a very short Snakefile: one rule with no inputs (Amdahl again); see the first sketch after this list. Run it twice and see that it does nothing the second time. Change the output file name and run again. An exercise could be to adjust the width for the Amdahl command.
    • Objective: by the end, learners should be able to... write a rule that produces an output file.
    • Objective: by the end, learners should be able to... predict (correctly!) whether a rule will run based on its output and the files in the project folder.
    • Note: our episode three would then extend this to multiple rules, introducing a rule to plot results and showing how these connect together. Add at least one more Amdahl rule with different widths, plot the results, demonstrate the importance of rule order, and introduce a rule to clean things up.
    • Objective: by the end, learners should be able to write a basic Snakefile and run it.
  3. Episode 3 is about wildcards. Andrew found it helpful to write a Snakefile that iterates over a list of values, which become parameters/arguments to a task. This will introduce the wildcards object, preparing us to introduce the resources object later, by analogy with the wildcards object.
    • Note: we could bridge from our episode 3 to this by discussing how to replace those repetitive rules with a single rule and wildcard width values; see the wildcard sketch after this list.
    • Objective: by the end, learners should be able to write a Snakefile that iterates over a set of values and generates multiple outputs using wildcards.
  4. Episode 4 is about pattern rules, which we think we do not really need, but may be a concept that is important to thinking about workflows. Right now, HPC Carpentry's next episode should probably be more concerned with introducing the cluster config.
    • Note: important ideas for the cluster config are the resources object and our defined function get_tasks(), which is executed at runtime for the rule it is written into; see the callable-resource sketch after this list.
    • We will need to be careful about how much Python we end up talking about from here on in. Cognitive load is probably very large for any learner who is seeing a Python function definition for the first time.
    • We are also looking at config.yaml for the first time; hopefully we can bridge that gap by calling back to SBATCH parameters.
  5. Episode 5 is about Python code in Snakefiles. It makes the point that e.g. the "input" item can have sub-parts that you can refer to by attribute, e.g. "input.cmd"; they still refer to files. It also introduces the "expand" primitive.
    • We do want the "expand" primitive (see the wildcard sketch after this list). The access-by-attribute comes late for us; we will have done this with "resources" in the previous episode. Using Python code as actions depends on whether we want to require Python as a prerequisite -- arguably this is already implied by get_tasks(). Do we care about the --quiet option? Python is probably a prerequisite at this point.
    • Objective: Learners should be able to write a Snakemake rule with a Python-code body.
  6. "Completing the pipeline", default rules, Snakemake best practices. Introduces a "plotcount.py" for making plots. General book-keeping.
    • Unlike our case, their workflow has plotting in the middle of the DAG, whereas we are aggregating many outputs into the plot. Some of the data management and data movement is perhaps optional for us. HPC Intro also discusses storage, though, so this is an opportunity to reinforce it.
  7. Resources and parallelism
    • The ability to do parallelism may be limited by HPC resource policy. Learners will be able to run parallel rules on their laptops, but may not have the Amdahl code there.
    • Objective: Learners should be able to write and run a Snakefile that runs the rules concurrently, and control the "width" of this parallelism.
  8. Make your workflow portable, reduce duplication.
    • Unclear to me if this is a separate task -- we have been introducing wildcards and global stateful configuration as appropriate as we've gone along, so this episode might be a no-op for us?
  9. Scaling a pipeline across a cluster.
    • At this point we have probably deviated pretty far from the original lesson. Running on the cluster may have already happened in the "parallelism" stage? For us, running on a cluster has less novelty, because we are following the HPC Intro lesson, so this might have already happened when we introduced the cluster config and global state info. Workflow is pretty murky at this point.
    • Objective: Learners should definitely be able to use Snakemake to dispatch workflow to the cluster, and at this point, should be able to aggregate results data from the cluster on the head node and analyze it.
  10. "Final notes"
  • Also pretty murky.
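As mentioned in item 2's note, a minimal sketch of the one-rule, no-input Snakefile (the mpiexec invocation and the --terse flag are assumptions about the Amdahl package's interface, and the output file name is illustrative):

```python
# Snakefile -- a single rule with no inputs
rule amdahl_run:
    output:
        "amdahl_np_2.json"
    shell:
        "mpiexec -n 2 amdahl --terse > {output}"
```

Running snakemake --cores 1 twice shows the second invocation doing nothing, because the output already exists; changing the output file name makes the rule fire again.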
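For item 3, a sketch of iterating over a set of width values with a wildcard; this also happens to use the expand primitive discussed under item 5 (again, the amdahl command line and the paths are assumptions):

```python
# Snakefile -- one wildcard rule generates an output per task count
TASKS = [1, 2, 4, 8]

rule all:
    input:
        # expand() builds the full list of target file names
        expand("runs/amdahl_np_{n}.json", n=TASKS)

rule amdahl_run:
    output:
        "runs/amdahl_np_{n}.json"
    shell:
        # the wildcards object supplies the matched value of {n}
        "mpiexec -n {wildcards.n} amdahl --terse > {output}"
```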
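And for item 4, a sketch of the callable-resource idea: get_tasks() is not evaluated when the Snakefile is parsed, but when each job is scheduled, with that job's wildcards (the function and resource names follow the pattern described above and are otherwise illustrative):

```python
# Evaluated at runtime, once per job, with that job's wildcards
def get_tasks(wildcards):
    return int(wildcards.n)

rule amdahl_run:
    output:
        "runs/amdahl_np_{n}.json"
    resources:
        tasks=get_tasks
    shell:
        "mpiexec -n {resources.tasks} amdahl --terse > {output}"
```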

@tkphd (Member) commented Oct 20, 2022

Can we (should we) convert this list to a Project?

@tobyhodges added the status:in progress (Contributor working on issue) and type:discussion (Discussion or feedback about the lesson) labels and removed the good first issue (Good for newcomers) label on Oct 20, 2022
@reid-a (Member) commented Apr 20, 2023

In light of the decision made at the April co-working meeting, viz. that we should never run the Amdahl code on the head node or on learners' laptops, some revisions to the lesson content are appropriate.

Episode one's goal needs some revision --- as currently planned, learners run the Amdahl code on the head node and collect some preliminary performance data, by way of re-familiarizing themselves with the code and illustrating the difference between the "bare metal" run and the Snakemake-enclosed one. We can still make this point, but we will need to do the initial runs on the cluster, using batch files. Graduates of HPC Intro will have seen this, but realistically we should add some time for refreshing this knowledge for actual human learners with imperfect retention.

So a high-level version of the set of tasks now maybe looks something like this:

  1. Run the amdahl code on the cluster. Learners should be able to identify what output files the code generates, and know what data is in them.
  2. Introduce the Snakemake tool, and construct a "Hello, world" snakefile. Learners should be able to correctly predict whether the rule in the snakefile will fire or not, based on the presence and currency of the output file.
  3. Generate a multi-rule snakefile, with a dependency, to introduce the concept of the task graph and illustrate the order of operations. We can continue to use "Hello, world" level executables here. Learners should be able to correctly predict which snakemake rules will fire on an invocation, and in what order, based on the presence and currency of the output targets.
  4. Generate a single-rule snakefile that runs on the cluster. At first, manually specify all the cluster stuff, like the partition name and so forth, to foreground it; see the sketch after this list. Learners should be able to predict how their snakefile will dispatch to the cluster, and predict the location and character of the resulting outputs.
  5. Introduce the cluster config file, and populate it for the local cluster. Repeat the task of the previous lesson, but with the cluster info implicit in the configuration. Same learner capability, I guess?
  6. Dispatch multiple jobs to the cluster via snakemake. Observe that the snakemake process itself remains active on the head node until the jobs are finished. (Deal with the thing where a cluster rule exits at dispatch-time, but the target doesn't appear until later?) Once this content is more developed, the goal can probably be clarified, beyond the obvious "learners should be able to correctly predict the sequence of operations that will result from running their snakefile", which is the emerging theme here.
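A sketch of the progression from step 4 to step 5: first every submission option is explicit on the command line, then the same settings move into a profile and become implicit (the partition, time limit, and profile name are illustrative, and --cluster is the Snakemake 7 interface):

```bash
# Step 4: cluster options spelled out in full, to foreground them
snakemake --jobs 3 \
    --cluster "sbatch --partition=normal --ntasks={resources.tasks} --time=5"

# Step 5: the same settings now live in the profile's config.yaml
snakemake --profile cluster-demo
```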

From here, the tasks get a bit more murky in my mind, but the two beats to hit include:

  1. Plan and execute the workflow that generates the data needed for the Amdahl plot.
  2. Actually generate the Amdahl plot, and observe and appreciate the diminishing returns to increased parallelism.

The mapping of these goals to the existing lesson material is the next step. Hopefully much of it is reusable, but a clear and coherent lesson structure is more important than re-using prior content, IMO.
