simplify all scripts and Snakefile
bast committed Mar 27, 2023
1 parent 848b95a commit f4ca474
Showing 19 changed files with 91 additions and 246 deletions.
8 changes: 2 additions & 6 deletions .gitignore
@@ -1,6 +1,2 @@
*~
__pycache__/
venv/
results/*.png
results/*.txt
processed_data/*.dat
.snakemake
*/*.log
48 changes: 13 additions & 35 deletions Snakefile
@@ -1,49 +1,27 @@
# a list of all the books we are analyzing
DATA = glob_wildcards('data/{book}.txt').book

# this is for running on HPC resources
localrules: all, make_archive

# the default rule
rule all:
    input:
        'zipf_analysis.tar.gz'
        expand('statistics/{book}.data', book=DATA),
        expand('plot/{book}.png', book=DATA)

# count words in one of our books
# logfiles from each run are put in .log files"
rule count_words:
    input:
        wc='source/wordcount.py',
        script='statistics/count.py',
        book='data/{file}.txt'
    output: 'processed_data/{file}.dat'
    threads: 4
    log: 'processed_data/{file}.log'
    shell:
        '''
        python {input.wc} {input.book} {output} >> {log} 2>&1
        '''
    output: 'statistics/{file}.data'
    conda: 'environment.yml'
    log: 'statistics/{file}.log'
    shell: 'python {input.script} {input.book} > {output}'

# create a plot for each book
rule make_plot:
    input:
        plotcount='source/plotcount.py',
        book='processed_data/{file}.dat'
    output: 'results/{file}.png'
    shell: 'python {input.plotcount} {input.book} {output}'

# generate summary table
rule zipf_test:
    input:
        zipf='source/zipf_test.py',
        books=expand('processed_data/{book}.dat', book=DATA)
    output: 'results/results.txt'
    shell: 'python {input.zipf} {input.books} > {output}'

# create an archive with all of our results
rule make_archive:
    input:
        expand('results/{book}.png', book=DATA),
        expand('processed_data/{book}.dat', book=DATA),
        'results/results.txt'
    output: 'zipf_analysis.tar.gz'
    shell: 'tar -czvf {output} {input}'
        script='plot/plot.py',
        book='statistics/{file}.data'
    output: 'plot/{file}.png'
    conda: 'environment.yml'
    log: 'plot/{file}.log'
    shell: 'python {input.script} --data-file {input.book} --plot-file {output}'
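The simplified `rule all` asks for one `statistics/{book}.data` file and one `plot/{book}.png` file per book found under `data/`. The following is a rough plain-Python sketch of how that target list comes about. It only approximates Snakemake's `glob_wildcards` and `expand`: the helper names `book_names`, `expand_targets`, and `default_targets` are hypothetical and not part of the commit, and the sorting is added here for readability.

```python
from pathlib import Path


def book_names(data_dir):
    """Collect the stems of data_dir/*.txt, roughly what
    glob_wildcards('data/{book}.txt').book yields (sorted here for stability)."""
    return sorted(p.stem for p in Path(data_dir).glob("*.txt"))


def expand_targets(pattern, books):
    """Fill the {book} placeholder once per book, similar in spirit to expand()."""
    return [pattern.format(book=b) for b in books]


def default_targets(books):
    """Files the new default rule requests: one data file and one plot per book."""
    return (
        expand_targets("statistics/{book}.data", books)
        + expand_targets("plot/{book}.png", books)
    )


if __name__ == "__main__":
    # Book names taken from the files added in this commit.
    for target in default_targets(["abyss", "isles", "last", "sierra"]):
        print(target)
```

Because the targets are derived from whatever sits in `data/`, adding another book there would extend the list without touching the rules.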
1 change: 0 additions & 1 deletion doc/README.md

This file was deleted.

23 changes: 5 additions & 18 deletions environment.yml
@@ -1,22 +1,9 @@
name: coderefinery
name: word-count
channels:
- conda-forge
- defaults
- bioconda
dependencies:
- python>3.7
- click=7.1.2
- ipywidgets=7.6.3
- jupyterlab=3.0.14
- jupyterlab-git=0.30.0
- matplotlib=3.4.1
- numpy=1.20.2
- pandas=1.2.4
- pytest=6.2.3
- seaborn=0.11.1
- snakemake-minimal=6.2.1
- sphinx=3.5.4
- sphinx_rtd_theme=0.5.2
- pip
# - pip:
# - jupyterlab-github==2.0.0
- python>3.9
- click=8.1.3
- matplotlib=3.7.0
- snakemake-minimal=7.22.0
Binary file added plot/abyss.png
Binary file added plot/isles.png
Binary file added plot/last.png
31 changes: 31 additions & 0 deletions plot/plot.py
@@ -0,0 +1,31 @@
import matplotlib.pyplot as plt
import click


def plot_bar_chart(x_values, y_values, title, plot_file):
    plt.figure(figsize=(10, 5))
    plt.bar(x_values, y_values)
    plt.title(title)
    plt.savefig(plot_file)


@click.command()
@click.option(
    "--data-file", required=True, help="Input data file", type=click.Path(exists=True)
)
@click.option("--plot-file", required=True, help="Output plot file")
def main(data_file, plot_file):
    # read data from input_file
    x_values = []
    y_values = []
    for line in open(data_file, "r").readlines():
        word, count = line.split()
        x_values.append(word)
        y_values.append(int(count))

    # now plot the data
    plot_bar_chart(x_values, y_values, "10 most common words", plot_file)


if __name__ == "__main__":
    main()
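The reading loop in `main` expects every line of the data file to hold a word and an integer count separated by whitespace. Below is a minimal sketch of that parsing step on its own, so it can be tried without matplotlib or click. `read_counts` is a hypothetical helper, not part of the commit. Unlike the loop in `plot.py`, it skips blank lines and accepts any iterable of lines, which is an assumption made for convenience.

```python
import io


def read_counts(lines):
    """Split 'word count' lines into two parallel lists.

    Blank lines are skipped (an addition in this sketch); a line that is not
    exactly two whitespace-separated fields raises ValueError, as in plot.py.
    """
    words = []
    counts = []
    for line in lines:
        if not line.strip():
            continue
        word, count = line.split()
        words.append(word)
        counts.append(int(count))
    return words, counts


if __name__ == "__main__":
    # First three lines of statistics/abyss.data from this commit.
    sample = io.StringIO("the 4044\nand 2807\nof 1907\n")
    words, counts = read_counts(sample)
    print(words, counts)
```

With a real file, the same helper could be called as `read_counts(f)` inside a `with open(path) as f:` block, which also closes the file handle once reading is done.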
Binary file added plot/sierra.png
Empty file removed processed_data/.gitkeep
Empty file.
Empty file removed results/.gitkeep
Empty file.
33 changes: 0 additions & 33 deletions source/plotcount.py

This file was deleted.

130 changes: 0 additions & 130 deletions source/wordcount.py

This file was deleted.

23 changes: 0 additions & 23 deletions source/zipf_test.py

This file was deleted.

10 changes: 10 additions & 0 deletions statistics/abyss.data
@@ -0,0 +1,10 @@
the 4044
and 2807
of 1907
a 1594
to 1515
in 1221
i 974
was 695
it 680
for 675
File renamed without changes.
10 changes: 10 additions & 0 deletions statistics/isles.data
@@ -0,0 +1,10 @@
the 3822
of 2460
and 1723
to 1479
a 1308
in 997
is 894
that 652
by 607
it 573
10 changes: 10 additions & 0 deletions statistics/last.data
@@ -0,0 +1,10 @@
the 12244
and 5566
to 5073
of 4952
a 4015
in 2699
we 2649
is 2302
it 2102
on 1861
10 changes: 10 additions & 0 deletions statistics/sierra.data
@@ -0,0 +1,10 @@
the 4242
and 2469
of 2190
a 1319
to 1292
in 1175
i 621
is 564
as 524
on 513
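Each `statistics/*.data` file above lists ten lowercase words with their counts in descending order. The contents of `statistics/count.py` are not shown in this diff, so the following is only a sketch of how output in this shape could be produced with the standard library. The function names `top_words` and `format_counts` and the tokenization regex are assumptions, not the commit's actual code.

```python
import re
from collections import Counter


def top_words(text, n=10):
    """Return the n most frequent words as (word, count) pairs.

    Tokenization is an assumption: the text is lowercased and split into
    runs of ASCII letters and apostrophes.
    """
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(n)


def format_counts(pairs):
    """Render (word, count) pairs one per line, matching the 'word count' layout."""
    return "\n".join(f"{word} {count}" for word, count in pairs)


if __name__ == "__main__":
    sample = "The cat and the hat and the bat"
    print(format_counts(top_words(sample, n=2)))
```

Output written this way has the two-field layout that the reading loop in `plot/plot.py` splits with `line.split()`.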
