Reorganization and updates #18

Merged 40 commits on Mar 2, 2020
Commits
5ed9403
Reorganization
timlai4 Feb 4, 2020
4b56c48
Added new plots
timlai4 Feb 10, 2020
737f66e
Added titles to new plots
timlai4 Feb 10, 2020
381cf1c
Changed delta to 100
timlai4 Feb 18, 2020
b16c680
Added median statistics
timlai4 Feb 18, 2020
690098a
Renamed variable
timlai4 Feb 23, 2020
4546c61
Changed sum to max in chain length
timlai4 Feb 23, 2020
fdb95d8
Fixed formula for length
timlai4 Feb 23, 2020
ad823d1
Fixed tiebreaker computation
timlai4 Feb 24, 2020
a6c7486
Updated data
timlai4 Feb 24, 2020
1a0d731
Updated plots
timlai4 Feb 24, 2020
89b1fd9
Fixed typo
timlai4 Feb 24, 2020
b21fff2
Adjusted formatting
timlai4 Feb 24, 2020
c54667c
Added redundancy checks to tiebreakers
timlai4 Feb 24, 2020
05ffe5e
Initial commit
timlai4 Feb 25, 2020
9cf942a
Fixed stuck iter bug
timlai4 Mar 2, 2020
384702d
Initial commit
timlai4 Mar 2, 2020
295f6ac
Chaining script
timlai4 Mar 2, 2020
fb10a7c
Merge branch 'comparisons'
timlai4 Mar 2, 2020
6ed60d3
Reorganization
timlai4 Feb 4, 2020
cdce86a
Added new plots
timlai4 Feb 10, 2020
6947972
Added titles to new plots
timlai4 Feb 10, 2020
8acf1da
Changed delta to 100
timlai4 Feb 18, 2020
b431920
Added median statistics
timlai4 Feb 18, 2020
52d14d6
Renamed variable
timlai4 Feb 23, 2020
4cb7f2d
Changed sum to max in chain length
timlai4 Feb 23, 2020
c381ae2
Fixed formula for length
timlai4 Feb 23, 2020
0c830e5
Fixed tiebreaker computation
timlai4 Feb 24, 2020
7d75489
Updated data
timlai4 Feb 24, 2020
9369f48
Updated plots
timlai4 Feb 24, 2020
833a46d
Fixed typo
timlai4 Feb 24, 2020
ddc1ba6
Adjusted formatting
timlai4 Feb 24, 2020
70d06a7
Added redundancy checks to tiebreakers
timlai4 Feb 24, 2020
682ff30
Initial commit
timlai4 Feb 25, 2020
20f536e
Fixed stuck iter bug
timlai4 Mar 2, 2020
f55f968
Initial commit
timlai4 Mar 2, 2020
0e18184
Chaining script
timlai4 Mar 2, 2020
431e25a
Merge branch 'master' of https://github.com/timlai4/IntervalLoci
timlai4 Mar 2, 2020
537184b
Reorganized workflow structure
timlai4 Mar 2, 2020
62a8d3a
Fix genhub-build.py
timlai4 Mar 2, 2020
407 changes: 286 additions & 121 deletions 03-b-daphnia.ipynb

Large diffs are not rendered by default.

6 changes: 3 additions & 3 deletions 03-c-volvox.md
@@ -11,9 +11,9 @@ We then use a custom Python script to assign each iLocus a provisional status ba

```bash
cd chlorophyta
-genhub-build.py --cfgdir=config/ --batch=chlorophyta+ \
---workdir=../data/ --numprocs=4 \
-download format prepare stats cluster
+fidibus --cfgdir=config/ --refrbatch=chlorophyta+ \
+--workdir=../data/ --numprocs=13 \
+download prep iloci breakdown stats cluster
python status.py GenHub.hiloci.tsv > Chlorophyta.hiLocus.pre-status.tsv
```

4 changes: 0 additions & 4 deletions Atha/AT_iloci.tsv

This file was deleted.

4 changes: 0 additions & 4 deletions Atha/AT_miloci.tsv

This file was deleted.

4 changes: 0 additions & 4 deletions Atha/AT_piloci.tsv

This file was deleted.

14 changes: 0 additions & 14 deletions Atha/README.md

This file was deleted.

16 changes: 0 additions & 16 deletions Atha/phisigma-Atha-min2Mb.tsv

This file was deleted.

16 changes: 0 additions & 16 deletions Atha/phisigma-Atha-min500kb.tsv

This file was deleted.

16 changes: 0 additions & 16 deletions Atha/phisigma-Atha.tsv

This file was deleted.

22,534 changes: 22,534 additions & 0 deletions compare/Amel/Amel.iloci.tsv

Large diffs are not rendered by default.

15,946 changes: 15,946 additions & 0 deletions compare/Amel/Amh3.iloci.tsv

Large diffs are not rendered by default.

23 changes: 23 additions & 0 deletions compare/Amel/chaining.sh
@@ -0,0 +1,23 @@
#!/bin/bash
set -e
set -u
set -o pipefail

lastz Amh3.iloci.fa[multiple] Amel.iloci.fa --match=1,9 --filter=identity:95 --chain \
format=general:name1,length1,size1,name2,length2,size2,identity,nmatch \
> entire.tsv
lastz Amh3.iloci.fa[multiple] Amel.filoci.fa --match=1,9 --filter=identity:95 --chain \
format=general:name1,length1,size1,name2,length2,size2,identity,nmatch \
> fi.tsv
lastz Amh3.iloci.fa[multiple] Amel.ciloci.fa --match=1,9 --filter=identity:95 --chain \
format=general:name1,length1,size1,name2,length2,size2,identity,nmatch \
> ci.tsv
lastz Amh3.iloci.fa[multiple] Amel.niloci.fa --match=1,9 --filter=identity:95 --chain \
format=general:name1,length1,size1,name2,length2,size2,identity,nmatch \
> ni.tsv
lastz Amh3.iloci.fa[multiple] Amel.iiloci.fa --match=1,9 --filter=identity:95 --chain \
format=general:name1,length1,size1,name2,length2,size2,identity,nmatch \
> ii.tsv
lastz Amh3.iloci.fa[multiple] Amel.siloci.fa --match=1,9 --filter=identity:95 --chain \
format=general:name1,length1,size1,name2,length2,size2,identity,nmatch \
> si.tsv
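
The six `lastz` calls in `chaining.sh` differ only in the query file and the output name. As a rough sketch (not part of this PR), the same command lines could be generated from a table. The function name `build_commands` and the `QUERIES` mapping are hypothetical; the option strings are copied verbatim from the script above, and the sketch only prints the commands rather than running `lastz`.

```python
import shlex

# Hypothetical mapping: output prefix -> query FASTA, following the
# naming pattern visible in chaining.sh.
QUERIES = {
    "entire": "Amel.iloci.fa",
    "fi": "Amel.filoci.fa",
    "ci": "Amel.ciloci.fa",
    "ni": "Amel.niloci.fa",
    "ii": "Amel.iiloci.fa",
    "si": "Amel.siloci.fa",
}
FIELDS = "name1,length1,size1,name2,length2,size2,identity,nmatch"


def build_commands(target="Amh3.iloci.fa[multiple]"):
    """Return {output file: argv list}, one lastz call per query file."""
    commands = {}
    for prefix, query in QUERIES.items():
        commands[prefix + ".tsv"] = [
            "lastz", target, query,
            "--match=1,9", "--filter=identity:95", "--chain",
            # Option spelling copied as-is from chaining.sh.
            "format=general:" + FIELDS,
        ]
    return commands


if __name__ == "__main__":
    # Print shell-quoted command lines; nothing is executed here.
    for out, argv in build_commands().items():
        print(" ".join(shlex.quote(a) for a in argv), ">", out)
```

Quoting the target through `shlex.quote` also keeps the shell from treating `[multiple]` as a glob pattern.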
52 changes: 52 additions & 0 deletions compare/Amel/ci-count.py
@@ -0,0 +1,52 @@
import pickle
import pandas as pd

species = "Amh3"
iloci = pd.read_csv(species + '.iloci.tsv',sep='\t')
ii = iloci['LocusClass'] == "iiLocus"
iiloci = iloci[ii]
fi = iloci['LocusClass'] == "fiLocus"
filoci = iloci[fi]
siloci = iloci[iloci['LocusClass'] == "siLocus"]
niloci = iloci[iloci['LocusClass'] == "niLocus"]
ciloci = iloci[iloci['LocusClass'] == "ciLocus"]

ii_ids = set(iiloci['LocusId'])
fi_ids = set(filoci['LocusId'])
si_ids = set(siloci['LocusId'])
ni_ids = set(niloci['LocusId'])
ci_ids = set(ciloci['LocusId'])

with open('Amel_ci-relations','rb') as f:
    rels = pickle.load(f)
si_count = 0
ci_count = 0
ii_count = 0
ni_count = 0
fi_count = 0
ties = 0
for key in rels:
    if len(rels[key]) < 1:
        continue
    elif len(rels[key]) > 1:
        ties += 1
    else:
        for match in rels[key]:
            if match in si_ids:
                si_count += 1
            elif match in ci_ids:
                ci_count += 1
            elif match in ii_ids:
                ii_count += 1
            elif match in ni_ids:
                ni_count += 1
            elif match in fi_ids:
                fi_count += 1
            else:
                raise ValueError("Wrong id")
print(si_count)
print(ci_count)
print(ni_count)
print(ii_count)
print(fi_count)
print(ties)
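
The tally in `ci-count.py` could arguably be written more compactly with a lookup table and `collections.Counter`. The following is a minimal sketch under that assumption, not the code in this PR: `tally_matches` and `class_of` are hypothetical names, and the example data is made up.

```python
from collections import Counter


def tally_matches(rels, class_of):
    """Count unique matches per locus class; multi-target entries count as ties.

    rels:     dict mapping a query locus to a list of matched locus ids
    class_of: dict mapping a locus id to its LocusClass string (hypothetical input)
    """
    counts = Counter()
    for targets in rels.values():
        if not targets:
            continue  # no match recorded for this locus
        if len(targets) > 1:
            counts["ties"] += 1  # unresolved tie, not attributed to any class
            continue
        match = targets[0]
        if match not in class_of:
            raise ValueError("Wrong id: " + str(match))
        counts[class_of[match]] += 1
    return counts


# Made-up example data, for illustration only.
example = {"q1": ["a"], "q2": ["a", "b"], "q3": [], "q4": ["b"]}
classes = {"a": "siLocus", "b": "ciLocus"}
print(tally_matches(example, classes))
```

With the table loaded as in the script, the lookup could be built with `dict(zip(iloci['LocusId'], iloci['LocusClass']))`, which replaces the five separate id sets.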
54 changes: 54 additions & 0 deletions compare/Amel/ci-hsp.py
@@ -0,0 +1,54 @@
import pandas as pd
import pickle
blast = pd.read_csv("ci.tsv", sep='\t')
blast[['num1','num2']] = blast['identity'].str.split('/',expand=True)
blast[['num1','num2']] = blast[['num1','num2']].apply(pd.to_numeric)
blast.rename(columns = {'#name1' : 'name1'}, inplace = True)
iloci = list(set(blast.name2))
num_matches = 0
# Warning: this isn't the total number of giLoci.
# These are giLoci that have some chains, we later filter out based on 95% length
# To get the actual number of giLoci, we need to go back to the giLoci data files

num_conserved = 0
relations = {}
locus_lengths = {}
for locus in iloci:
    indices = blast.name2 == locus
    ilocus = blast[indices]
    query_length = ilocus.iloc[0]['size2']
    match_loci = list(set(ilocus['name1']))
    chains = {}
    for match in match_loci:
        indices = ilocus.name1 == match
        hsp = ilocus[indices]
        length = hsp['nmatch'].max()
        if length / query_length >= 0.9:
            chains[match] = length
    try:
        targets = [key for m in [max(chains.values())] for key,val in chains.items() if val == m]
        if chains[targets[0]] > ilocus[ilocus.name1 == targets[0]].iloc[0]['size1'] * 0.9:
            num_conserved += 1
        if len(targets) > 1:
            ids = {}
            for target in targets:
                search = ilocus[ilocus.name1 == target]
                if len(search) > 1:
                    max_len = search['nmatch'].max()
                    search = search[search.nmatch == max_len]
                    assert len(search) <= 1
                ids[target] = search['num1'].sum() / search['num2'].sum()
            tiebreakers = [key for m in [max(ids.values())] for key,val in ids.items() if val == m]
            relations[locus] = tiebreakers
        else:
            relations[locus] = targets
    except ValueError:
        continue
with open('Amel_ci-relations','wb') as f:
    pickle.dump(relations,f)
for locus in relations:
    if len(relations[locus]) > 0:
        num_matches += 1

print(str(num_matches) + ' iLoci had at least one match')
print('Conserved: ' + str(num_conserved))
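
The selection logic in `ci-hsp.py` (keep targets whose longest chain covers at least 90% of the query, take the longest, break ties by identity) may be easier to review as a small pure-Python function. This is an illustrative sketch only: `best_targets` and its tuple layout are hypothetical, the 0.9 threshold is taken from the code above, and, unlike the script, a target with several equally long chains simply keeps the first one seen instead of tripping an assertion.

```python
def best_targets(hits, query_length, min_cov=0.9):
    """Pick the best-matching target(s) for one query locus.

    hits: list of (target, nmatch, id_num, id_den) tuples, one per chain,
          where id_num/id_den is the identity fraction reported by lastz.
    """
    # Longest chain per target, remembering that chain's identity.
    best = {}
    for target, nmatch, num, den in hits:
        if target not in best or nmatch > best[target][0]:
            best[target] = (nmatch, num / den)

    # Keep only targets whose longest chain covers enough of the query.
    kept = {t: v for t, v in best.items() if v[0] / query_length >= min_cov}
    if not kept:
        return []

    # Targets tied for the longest chain, mapped to their identity.
    top_len = max(v[0] for v in kept.values())
    tied = {t: v[1] for t, v in kept.items() if v[0] == top_len}

    # Tie-break on identity; several targets may still survive.
    top_id = max(tied.values())
    return sorted(t for t, ident in tied.items() if ident == top_id)


# Made-up chains: A and B tie on length, A has the higher identity.
hits = [("A", 950, 940, 950), ("B", 950, 930, 950), ("C", 500, 500, 500)]
print(best_targets(hits, query_length=1000))
```

Returning an empty list when no chain passes the coverage filter avoids relying on `max()` raising `ValueError` for control flow.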