Add support for excluded runs and libraries (#38)
* begin implementation of excluded runs and libraries

* further implementation

* Update README

* update README

* fix typos
dfornika authored Jul 23, 2024
1 parent ccf4847 commit 20e6673
Showing 2 changed files with 132 additions and 11 deletions.
64 changes: 54 additions & 10 deletions README.md
@@ -7,19 +7,31 @@ Create fastq symlinks for selected samples in sequencer output directories based
 ## Usage
 
 ```
-usage: symlink-seqs [-h] [-p PROJECT_ID] [-r RUN_ID] [-i IDS_FILE] [-s] [-c CONFIG] [--copy] [--csv] [-o OUTDIR]
+usage: symlink-seqs [-h] [-p PROJECT_ID] [-r RUN_ID] [-i IDS_FILE] [-s] [-c CONFIG] [--copy] [--csv] [--skip-qc-status-check] [--exclude-run EXCLUDE_RUN] [--excluded-runs-file EXCLUDED_RUNS_FILE]
+                    [--excluded-libraries-file EXCLUDED_LIBRARIES_FILE] [-o OUTDIR]
 
 optional arguments:
-  -h, --help            Show this help message and exit.
+  -h, --help            show this help message and exit
   -p PROJECT_ID, --project-id PROJECT_ID
   -r RUN_ID, --run-id RUN_ID
-  -i IDS_FILE, --ids-file IDS_FILE  File of sample IDs (one sample ID per line)
-  -s, --simplify-sample-id  Simplify filenames of symlinks to include only sample-id_R{1,2}.fastq.gz
-  -c CONFIG, --config CONFIG  Config file (json format).
-  --copy                Create copies instead of symlinks.
-  --csv                 Print csv-format summary of fastq file paths for each sample to stdout.
-  --skip-qc-status-check  Skip checking qc status for runs.
-  -o OUTDIR, --outdir OUTDIR  Output directory, where symlinks (or copies) will be created.
+  -i IDS_FILE, --ids-file IDS_FILE
+                        File of sample IDs (one sample ID per line)
+  -s, --simplify-sample-id
+                        Simplify filenames of symlinks to include only sample-id_R{1,2}.fastq.gz
+  -c CONFIG, --config CONFIG
+                        Config file (json format).
+  --copy                Create copies instead of symlinks.
+  --csv                 Print csv-format summary of fastq file paths for each sample to stdout.
+  --skip-qc-status-check
+                        Skip checking qc status for runs.
+  --exclude-run EXCLUDE_RUN
+                        A single run ID to exclude.
+  --excluded-runs-file EXCLUDED_RUNS_FILE
+                        File containing list of run IDs to exclude. (single column, one run ID per line, no header)
+  --excluded-libraries-file EXCLUDED_LIBRARIES_FILE
+                        File containing list of run ID, library ID pairs to exclude. (csv format, no header)
+  -o OUTDIR, --outdir OUTDIR
+                        Output directory, where symlinks (or copies) will be created.
 ```
 
 If you add the `-s` (or `--simplify-sample-id`) flag, then the filenames of the symlinks will be simplified to only `sample-id_R1.fastq.gz`, instead of
@@ -37,6 +49,34 @@ ID,R1,R2
 
 If a run ID is supplied, then only samples from that run will be symlinked. If no sample IDs file is supplied, then all samples on that run will be symlinked.
 
+### Excluded Runs and Libraries
+
+Specific sequencing runs and libraries may be excluded when searching for fastq files to symlink.
+
+The `--excluded-runs-file` flag can be used to provide a list of sequencing run IDs for runs to be skipped when searching for fastq files.
+The file should be a simple list of sequencing run IDs with no header:
+
+```
+220318_M00123_0128_000000000-AGTBE
+220521_M00456_0132_000000000-AB5RA
+```
+
+A single run may be excluded using the `--exclude-run` flag, with the sequencing run ID provided as a command-line argument:
+
+```
+symlink-seqs --ids-file sample_ids.csv --exclude-run "220318_M00123_0128_000000000-AGTBE" --outdir symlinks
+```
+
+The `--excluded-libraries-file` flag can be used to provide a list of specific libraries to be skipped. The file should be a two-column, comma-separated file with no header,
+with a sequencing run ID in the first column and a library ID in the second column:
+
+```
+220318_M00123_0128_000000000-AGTBE,sample-01
+220318_M00123_0128_000000000-AGTBE,sample-08
+220318_M00123_0128_000000000-AGTBE,sample-12
+220521_M00456_0132_000000000-AB5RA,sample-06
+```
+
 ## Configuration
 
 The tool reads a config file from `~/.config/symlink-seqs/config.json` by default. An alternative config file can be provided using the `-c` or `--config` flags.
@@ -63,7 +103,9 @@ Additional settings may be added to the config:
     "/path/to/sequencer-03/output"
   ],
   "simplify_sample_id": true,
-  "skip_qc_status_check": true
+  "skip_qc_status_check": true,
+  "excluded_runs_file": "/path/to/excluded_runs.csv",
+  "excluded_libraries_file": "/path/to/excluded_libraries.csv"
 }
 ```

@@ -75,5 +117,7 @@ Additional settings may be added to the config:
 | `csv` | False | Boolean | When set to `true`, print a csv summary of fastq files per sample |
 | `skip_qc_status_check` | False | Boolean | When set to `true`, the QC status of runs will not be checked |
 | `outdir` | False | Path | Directory to create symlinks or copies under |
+| `excluded_runs_file` | False | Path | Path to file containing a list of sequencing run IDs to be excluded when searching for fastq files (single column, no header) |
+| `excluded_libraries_file` | False | Path | Path to file containing a list of (sequencing run ID, library ID) pairs to be excluded when searching for fastq files (two columns, comma-separated, no header) |

The file must be in valid JSON format.
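
The two-column format that the README describes for `--excluded-libraries-file` (sequencing run ID, library ID, comma-separated, no header) can be checked before a file is handed to the tool. The following is a minimal sketch and not part of this commit: `find_malformed_rows` is a hypothetical helper, and it assumes that blank lines should be ignored and that every other row must have exactly two non-empty fields.

```python
import csv
import io


def find_malformed_rows(lines):
    """Return (line_number, row) for each row that is not exactly two
    non-empty, comma-separated fields. Blank lines are ignored."""
    problems = []
    for line_number, row in enumerate(csv.reader(lines), start=1):
        if not row:
            continue  # csv.reader yields an empty list for a blank line
        fields = [field.strip() for field in row]
        if len(fields) != 2 or not all(fields):
            problems.append((line_number, row))
    return problems


# Hypothetical file content: one valid row, one row with an extra
# column, and one row with an empty library ID.
example = io.StringIO(
    "220318_M00123_0128_000000000-AGTBE,sample-01\n"
    "run_id,library_id,extra\n"
    "220521_M00456_0132_000000000-AB5RA,\n"
)
problems = find_malformed_rows(example)
```

Here `problems` holds entries for the second row (three fields) and the third row (empty library ID). The same function accepts an open file handle in place of the `io.StringIO` object.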
79 changes: 78 additions & 1 deletion symlink-seqs
@@ -17,6 +17,43 @@ def jdump(x):
     print(json.dumps(x, indent=2))
 
 
+def parse_excluded_runs_file(excluded_runs_file):
+    """
+    Parse the excluded runs file, return as a set.
+    :param excluded_runs_file:
+    :type excluded_runs_file: str
+    :return: excluded_runs
+    :rtype: set[str]
+    """
+    excluded_runs = set()
+    with open(excluded_runs_file, 'r') as f:
+        for line in f:
+            excluded_runs.add(line.strip())
+
+    return excluded_runs
+
+
+def parse_excluded_libraries_file(excluded_libs_file):
+    """
+    Parse the excluded libraries file, return as a dict, indexed by sequencing run ID.
+    :param excluded_libs_file:
+    :type excluded_libs_file: str
+    :return: excluded_libraries_by_run
+    :rtype: dict[str, set[str]]
+    """
+    excluded_libraries_by_run = {}
+    with open(excluded_libs_file, 'r') as f:
+        for line in f:
+            sequencing_run_id, library_id = line.strip().split(',')
+            if sequencing_run_id not in excluded_libraries_by_run:
+                excluded_libraries_by_run[sequencing_run_id] = set()
+            excluded_libraries_by_run[sequencing_run_id].add(library_id)
+
+    return excluded_libraries_by_run
+
+
 def parse_config(config_path):
     """
     Parse json-format config file, return as a dict.
@@ -35,6 +72,16 @@ def parse_config(config_path):
             print("ERROR: Error parsing config file: " + config_path, file=sys.stderr)
             print(e, file=sys.stderr)
             exit(-1)
+        if 'excluded_runs_file' in config and os.path.exists(config['excluded_runs_file']):
+            config['excluded_runs'] = parse_excluded_runs_file(config['excluded_runs_file'])
+        else:
+            config['excluded_runs'] = set()
+
+        if 'excluded_libraries_file' in config and os.path.exists(config['excluded_libraries_file']):
+            config['excluded_libraries_by_sequencing_run_id'] = parse_excluded_libraries_file(config['excluded_libraries_file'])
+        else:
+            config['excluded_libraries_by_sequencing_run_id'] = {}
+
     else:
         print("ERROR: config file does not exist: " + config_path, file=sys.stderr)
         exit(-1)
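
In the hunk above, a configured exclusion file is parsed only when its path exists; otherwise the exclusion set falls back to empty without any message. A variant that reports a missing file could look like the sketch below. This is an illustration only and not the commit's behavior: `load_excluded_runs` is a hypothetical helper, and the warning text and the choice to skip blank lines are assumptions.

```python
import os
import sys
import tempfile


def load_excluded_runs(config):
    """Return the set of run IDs listed in config['excluded_runs_file'].

    Returns an empty set when the key is absent. When the key is present
    but the path does not exist, prints a warning to stderr and returns
    an empty set.
    """
    path = config.get('excluded_runs_file')
    if path is None:
        return set()
    if not os.path.exists(path):
        print("WARNING: excluded runs file does not exist: " + path, file=sys.stderr)
        return set()
    with open(path, 'r') as f:
        # Skip blank lines so that '' is never treated as a run ID.
        return {line.strip() for line in f if line.strip()}


# Example: one file that exists (with a blank line in it) and one missing path.
with tempfile.TemporaryDirectory() as tmp_dir:
    runs_file = os.path.join(tmp_dir, 'excluded_runs.csv')
    with open(runs_file, 'w') as f:
        f.write("220318_M00123_0128_000000000-AGTBE\n\n")
    found = load_excluded_runs({'excluded_runs_file': runs_file})
    missing = load_excluded_runs({'excluded_runs_file': os.path.join(tmp_dir, 'nope.csv')})
```

Whether a missing exclusion file should be a warning or a hard error is a design choice; the sketch uses a warning so that it stays close to the fallback in the hunk.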
@@ -70,6 +117,20 @@ def merge_config_with_args(config, args):
     if 'skip_qc_status_check' not in config:
         config['skip_qc_status_check'] = args.skip_qc_status_check
 
+    if args.exclude_run is not None:
+        config['excluded_runs'].add(args.exclude_run)
+
+    if args.excluded_runs_file is not None:
+        excluded_runs_from_args = parse_excluded_runs_file(args.excluded_runs_file)
+        config['excluded_runs'] = config['excluded_runs'].union(excluded_runs_from_args)
+
+    if args.excluded_libraries_file is not None:
+        excluded_libraries_from_args = parse_excluded_libraries_file(args.excluded_libraries_file)
+        for run_id, libraries in excluded_libraries_from_args.items():
+            if run_id not in config['excluded_libraries_by_sequencing_run_id']:
+                config['excluded_libraries_by_sequencing_run_id'][run_id] = set()
+            config['excluded_libraries_by_sequencing_run_id'][run_id] = config['excluded_libraries_by_sequencing_run_id'][run_id].union(libraries)
+
     return config
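
The merge above takes the union of the exclusions from the config file and from the command line, so neither source overrides the other. The same per-run union can be written with `collections.defaultdict`. This is a sketch with a hypothetical function name (`merge_excluded_libraries`); unlike the hunk, it builds a new dict rather than updating `config` in place.

```python
from collections import defaultdict


def merge_excluded_libraries(from_config, from_args):
    """Union two dicts of run ID -> set of library IDs into a new dict."""
    merged = defaultdict(set)
    for source in (from_config, from_args):
        for run_id, libraries in source.items():
            # Adds the IDs to a set owned by `merged`; inputs are not modified.
            merged[run_id] |= libraries
    return dict(merged)


# Hypothetical inputs: one run present in both sources, one only on the command line.
from_config = {'run-A': {'sample-01'}}
from_args = {'run-A': {'sample-08'}, 'run-B': {'sample-06'}}
merged = merge_excluded_libraries(from_config, from_args)
```

After the call, `merged` maps `run-A` to both of its excluded libraries and `run-B` to its single one, while `from_config` and `from_args` are left unchanged.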


@@ -645,13 +706,26 @@ def main(args):
         run_dirs = list(filter(lambda x: os.path.basename(x) == args.run_id, run_dirs))
 
     for run_dir in run_dirs:
+        sequencing_run_id = os.path.basename(run_dir.rstrip('/'))
+        if sequencing_run_id in config['excluded_runs']:
+            print(f"INFO: Skipping run {sequencing_run_id} due to exclusion: {run_dir}", file=sys.stderr)
+            continue
         if not config['skip_qc_status_check']:
             qc_status = check_qc_status(run_dir)
             if qc_status != None and qc_status != 'PASS':
                 print("WARNING: Skipping run due to failed QC: " + run_dir, file=sys.stderr)
                 continue
 
-        fastq_paths = get_fastq_paths(config, run_dir, sample_ids, args.project_id)
+        included_sample_ids = []
+        if sequencing_run_id in config['excluded_libraries_by_sequencing_run_id']:
+            excluded_sample_ids = config['excluded_libraries_by_sequencing_run_id'][sequencing_run_id]
+            for sample_id in sample_ids:
+                if sample_id not in excluded_sample_ids:
+                    included_sample_ids.append(sample_id)
+        else:
+            included_sample_ids = sample_ids
+
+        fastq_paths = get_fastq_paths(config, run_dir, included_sample_ids, args.project_id)
 
         if config['copy']:
             create_copies(fastq_paths)
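
The loop above applies two filters: a whole run is skipped when its ID is in `excluded_runs`, and otherwise the requested sample IDs are reduced by that run's excluded libraries. The sketch below isolates both steps as small functions so they can be tried on their own. The function names are hypothetical, `get_fastq_paths` is not involved, and the path is made up for the example.

```python
import os


def run_id_from_dir(run_dir):
    """Derive the run ID from a run directory path, tolerating a trailing slash."""
    return os.path.basename(run_dir.rstrip('/'))


def select_sample_ids(sample_ids, run_id, excluded_libraries_by_run):
    """Return the sample IDs that are not excluded for this run, preserving order."""
    excluded = excluded_libraries_by_run.get(run_id, set())
    return [sample_id for sample_id in sample_ids if sample_id not in excluded]


excluded_libraries_by_run = {
    '220318_M00123_0128_000000000-AGTBE': {'sample-01', 'sample-08'},
}
run_id = run_id_from_dir('/path/to/sequencer-01/output/220318_M00123_0128_000000000-AGTBE/')
kept = select_sample_ids(['sample-01', 'sample-02', 'sample-08'], run_id, excluded_libraries_by_run)
```

Without the `rstrip('/')`, `os.path.basename` returns an empty string for a path that ends in a slash, so such a run would never match an entry in the exclusion set.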
@@ -676,6 +750,9 @@ if __name__ == '__main__':
     parser.add_argument('--copy', action='store_true', help="Create copies instead of symlinks.")
     parser.add_argument('--csv', action='store_true', help="Print csv-format summary of fastq file paths for each sample to stdout.")
     parser.add_argument('--skip-qc-status-check', action='store_true', help="Skip checking qc status for runs.")
+    parser.add_argument('--exclude-run', help="A single run ID to exclude.")
+    parser.add_argument('--excluded-runs-file', help="File containing list of run IDs to exclude. (single column, one run ID per line, no header)")
+    parser.add_argument('--excluded-libraries-file', help="File containing list of run ID, library ID pairs to exclude. (csv format, no header)")
     parser.add_argument('-o', '--outdir', required=True, help="Output directory, where symlinks (or copies) will be created.")
     args = parser.parse_args()
     main(args)
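
The three new options are plain string arguments with no explicit default, so `argparse` leaves them as `None` when they are not given; the `is not None` checks in `merge_config_with_args` rely on this. The sketch below rebuilds only those three options in a standalone parser (for illustration; the tool's other arguments are omitted) and parses an example argument list.

```python
import argparse

# Standalone parser with only the three options added in this commit.
parser = argparse.ArgumentParser(prog='symlink-seqs')
parser.add_argument('--exclude-run', help="A single run ID to exclude.")
parser.add_argument('--excluded-runs-file', help="File containing list of run IDs to exclude.")
parser.add_argument('--excluded-libraries-file', help="File containing list of run ID, library ID pairs to exclude.")

# Dashes in option names become underscores in attribute names.
args = parser.parse_args(['--exclude-run', '220318_M00123_0128_000000000-AGTBE'])
print(args.exclude_run)
print(args.excluded_runs_file)
```

Because these options use the default `store` action, repeating `--exclude-run` on the command line keeps only the last value; excluding several runs is what `--excluded-runs-file` is for.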
