From 6a6d6a3b65add8a451db6296d2cce2bef55dc663 Mon Sep 17 00:00:00 2001
From: Marina Gourtovaia
Date: Thu, 16 May 2024 12:13:58 +0100
Subject: [PATCH] Explained merge options in README

---
 README.md | 41 +++++++++++++++++++++++++++++++++--------
 1 file changed, 33 insertions(+), 8 deletions(-)

diff --git a/README.md b/README.md
index bb29f5db..b4b292ce 100644
--- a/README.md
+++ b/README.md
@@ -1,8 +1,8 @@
 # NPG Pipelines for Processing Illumina Sequencing Data
 
 This software provides the Sanger NPG team's automation for analysing and
-internally archiving Illumina sequencing on behalf of DNA Pipelines for their
-customers.
+internally archiving Illumina sequencing data on behalf of DNA Pipelines for
+their customers.
 
 There are two main pipelines:
 
@@ -18,16 +18,16 @@ sequencing flowcell, or each tagged library (within a pool on the flowcell).
 ## Batch Processing and Dependency Tracking with LSF or wr
 
 With this system, all of a pipeline's jobs for its steps are submitted for
-execution to the LSF, or wr, batch/job processing system as the pipeline is
+execution to the `LSF` or `wr` batch/job processing system as the pipeline is
 initialised. As such, a _submitted_ pipeline does not have an orchestration
 script or daemon running: managing the runtime dependencies of jobs within an
 instance of a pipeline is delegated to the batch/job processing system.
 
 How is this done? The job representing the start point of a graph is submitted
-to LSF, or wr, in a suspended state and is resumed once all other jobs have been
+to `LSF` or `wr` in a suspended state and is resumed once all other jobs have been
 submitted thus ensuring that the execution starts only if all steps are
-successfully submitted to LSF, or wr. If an error occurs at any point during job
-submissions, all submitted jobs, apart from the start job, are killed.
+successfully submitted. If an error occurs at any point during job submissions,
+all submitted jobs, apart from the start job, are killed.
 
 ## Pipeline Creation
 
@@ -84,8 +84,8 @@ The input for an instance of the pipeline is the instrument output run folder
 (BCL and associated files) and LIMS information which drives appropriate
 processing.
 
-The key data products are aligned CRAM files and indexes, or unaligned CRAM
-files. However per study (a LIMS datum) pipeline configuration allows for the
+The key data products are aligned or unaligned CRAM files and indexes.
+However, per-study (a LIMS datum) pipeline configuration allows for the
 creation of GATK gVCF files, or the running for external tool/pipeline e.g.
 ncov2012-artic-nf
 
@@ -135,3 +135,28 @@ flow DAGs.
 
 Also, the [npg_irods](https://github.com/wtsi-npg/npg_irods) system is
 essential for the internal archival of data products.
+
+## Data Merging across Lanes of a Flowcell
+
+If the same library is sequenced in different lanes of a flowcell, under certain
+conditions the pipeline will automatically merge all data for a library into
+a single end product. Data for spiked-in PhiX libraries and data not assigned
+to any tag (tag zero) are never merged. The following scenarios trigger the merge:
+
+* NovaSeq Standard flowcell - a merge across all lanes (two or four) is performed.
+
+* Any flowcell run on a NovaSeqX instrument - if multiple lanes belong to the
+  same pool, the data for individual libraries will be merged across those
+  lanes. Thus the output of a NovaSeqX run might contain a mixture of merged
+  and unmerged products.
+
+If the data quality in a lane is poor, the lane should be excluded from the
+merge. The `--process_separately_lanes` pipeline option is used to list such
+lanes. This option is usually given to the analysis pipeline, which caches
+the supplied lane numbers so that the archival pipeline can generate a list
+of data products consistent with the analysis pipeline. The same applies to
+the `npg_run_is_deletable` script. The cached value is retrieved only if the
+`--process_separately_lanes` argument is not set when any of these scripts
+are invoked.
+
+
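The lane-exclusion option documented by this patch would presumably be passed once per excluded lane. The sketch below is hypothetical: the entry-point name `npg_pipeline_central`, the `--id_run` argument, the run id `47995`, and the repeated-argument syntax are all assumptions, so the real invocation should be checked against the pipeline's own help; the snippet only prints the command it would run.

```shell
#!/bin/sh
# Hypothetical sketch: exclude lanes 2 and 3 from the lane merge.
# "npg_pipeline_central", "--id_run" and the repeated-flag syntax are
# assumptions, not confirmed by this patch.
set -eu

lanes_to_exclude="2 3"

# Build the repeated --process_separately_lanes arguments.
set --
for lane in $lanes_to_exclude; do
  set -- "$@" --process_separately_lanes "$lane"
done

# Print rather than execute, since the pipeline script may not be installed.
echo "npg_pipeline_central --id_run 47995 $*"
```

Per the new README section, such an invocation of the analysis pipeline would also cache the lane numbers for later use by the archival pipeline and `npg_run_is_deletable`.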