## This project is to create an XML parser in Python capable of reading an LRG file
# LRG IN; FASTA OUT
# GUIDE:
To run this program, open a command line terminal and navigate to the directory containing this
program (XML_Parser.py) and desired LRG files. Enter:
"python XML_Parser.py LRG_FILE_TITLE INTRONIC_PADDING_LENGTH"
Where LRG_FILE_TITLE is an LRG format file ending in ".xml"
And INTRONIC_PADDING_LENGTH is a number between 0 and 2000, specifying the length of
intronic flanking sequence around the printed exons
By default, flanking sequences are lower case and exonic sequence is upper case.
This will run the program and create a new folder (if it does not already exist) called "outputFiles".
If the output file already exists in that folder, the program will report this at the command line and
ask for confirmation to continue.
For a file to count as a duplicate, it must have been created with the same input and flanking length.
# Input:
Pass the file title to the program as a string argument (command line)
Optional argument to specify the intronic sequence length around exons
Optional argument to specify genomic, cDNA, protein... (genomic is the default)
The program creates an output file based on the sequences used
# Method:
Command line arguments are supplied to specify the input file and specific parameters
The appropriate sequence is read into a dictionary (multiple sequences where appropriate)
Tree iteration is used to find the coordinate details for all exons
The name of the output file is created using the input file title (LRG #)
Using the coordinates and the chosen sequence type, the portions of sequence corresponding to each
exon are written to an output file
The presence of an existing file of the same title is checked
If a file already exists, the user is prompted to overwrite (Y/N)
If the user chooses to overwrite, the program continues
If the user chooses not to, the program exits and reports that no output was created
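The extraction steps above can be sketched roughly as follows. This is a minimal illustration, not the code in XML_Parser.py: the element and attribute names (fixed_annotation, sequence, transcript, exon, coordinates, coord_system, label) describe an assumed, heavily simplified LRG-like layout, and SAMPLE_XML and extract_exons are hypothetical names invented for this example. Coordinates are treated as 1-based and inclusive.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified LRG-like document (an assumption for
# illustration; real LRG files contain far more detail).
SAMPLE_XML = """
<lrg>
  <fixed_annotation>
    <id>LRG_TEST</id>
    <sequence>AAAACCCCGGGGTTTTAAAACCCC</sequence>
    <transcript name="t1">
      <exon label="1">
        <coordinates coord_system="LRG_TEST" start="5" end="8"/>
      </exon>
      <exon label="2">
        <coordinates coord_system="LRG_TEST" start="13" end="16"/>
      </exon>
    </transcript>
  </fixed_annotation>
</lrg>
""".strip()


def extract_exons(xml_text, padding=0):
    """Return (transcript, exon label, start, end, sequence) tuples.

    Flanking sequence is lower case and exonic sequence is upper case.
    """
    root = ET.fromstring(xml_text)
    fixed = root.find("fixed_annotation")
    lrg_id = fixed.findtext("id")
    genomic = fixed.findtext("sequence")

    exons = []
    for transcript in fixed.iter("transcript"):
        for exon in transcript.iter("exon"):
            for coords in exon.findall("coordinates"):
                # Only use coordinates given relative to the genomic sequence.
                if coords.get("coord_system") != lrg_id:
                    continue
                start = int(coords.get("start"))
                end = int(coords.get("end"))
                # Clamp the padded window to the ends of the sequence.
                left = max(start - 1 - padding, 0)
                right = min(end + padding, len(genomic))
                sequence = (genomic[left:start - 1].lower()
                            + genomic[start - 1:end].upper()
                            + genomic[end:right].lower())
                exons.append((transcript.get("name"), exon.get("label"),
                              start, end, sequence))
    return exons
```

With the sample above, extract_exons(SAMPLE_XML, padding=2) gives "aaCCCCgg" for exon 1 and "ggTTTTaa" for exon 2.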
# Output:
FastA file format
Each exon is individually identified and paired with its corresponding sequence
The description line identifies the exon number and transcript, so each can be used in isolation.
This line also contains the length of the exon for quality control checking
The output file name is created using:
-the title of the input file
-the amount of flanking intron requested
-the transcript name (to prevent overwriting in the case of multiple transcripts)
At the time of writing, an os test is used to determine if a file with the specified name already
exists. If so, the user is prompted to continue or exit.
This means that only a file which would be an exact duplicate of an existing file (all parameters the
same) can replace the current version.
This may be desirable if writing with a new version of the program.
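A rough sketch of the naming, header and overwrite logic described above. The exact file name pattern, the ".fasta" extension and the header layout are assumptions made for illustration, and the function names (build_output_path, fasta_record, confirm_overwrite) are hypothetical rather than taken from XML_Parser.py.

```python
import os


def build_output_path(input_title, padding, transcript, folder="outputFiles"):
    """Combine input title, flanking length and transcript name (assumed pattern)."""
    base = os.path.splitext(os.path.basename(input_title))[0]
    return os.path.join(folder, "{}_{}_{}.fasta".format(base, padding, transcript))


def fasta_record(transcript, exon_label, start, end, sequence):
    """Build one FastA record; the header layout here is an assumption."""
    exon_length = end - start + 1  # exon only, excluding any flanking sequence
    header = ">{} exon {} start={} end={} length={}".format(
        transcript, exon_label, start, end, exon_length)
    return header + "\n" + sequence + "\n"


def confirm_overwrite(path, ask=input):
    """Return True if it is safe to write to path.

    If the path already exists, the user is asked to confirm (Y/N).
    The ask parameter exists only so the prompt can be replaced in tests.
    """
    if not os.path.exists(path):
        return True
    answer = ask("{} already exists. Overwrite? (Y/N) ".format(path))
    return answer.strip().upper() == "Y"
```

Because padding and transcript name are both part of the file name, only a run with identical parameters can trigger the overwrite prompt.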
# Testing:
Correct performance of this program was confirmed by a combination of assert statements and
error handling techniques throughout the code. Try/except blocks were used to handle loops and
element access which could produce issues in faulty XML files. Deliberately faulty XML files
were used to check the performance of these measures.
Assert statements are used to prevent users from passing the wrong inputs to the program,
and to make sure that the number of valid command line arguments is not exceeded.
Assert statements are also used throughout the indexing for string slicing to ensure that the indexes
used are not out of bounds (such as when the exon coordinates for transcripts are used for protein sequences).
Try/except blocks are used when opening the input file, accessing the sequence elements and determining the
intronic padding to be used when outputting the exons.
A draft loop to detect an additional option for genomic, transcript and protein sequences has been written
and commented out. This would cause the program to choose between one of three separate methods, one for
each type. Only one has been written as of 20/11/2014, although this program is extensible.
Exceptions were also used to provide users with readable error messages if parameters or file contents
were not found at runtime
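The two kinds of check described in this section might look something like the sketch below. It is illustrative only: checked_slice and read_sequence are hypothetical names, and the element path "fixed_annotation/sequence" is an assumption about the input layout rather than a statement about the real program.

```python
import xml.etree.ElementTree as ET


def checked_slice(sequence, start, end):
    """Slice using 1-based inclusive coordinates, asserting they are in bounds."""
    assert 1 <= start <= end, "start must be at least 1 and not exceed end"
    assert end <= len(sequence), "end coordinate lies beyond the sequence"
    return sequence[start - 1:end]


def read_sequence(xml_text):
    """Return the sequence text, or raise ValueError with a readable message."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        raise ValueError("Input is not well-formed XML: {}".format(err))
    # Assumed location of the sequence element in the input file.
    sequence = root.findtext("fixed_annotation/sequence")
    if sequence is None:
        raise ValueError("No sequence element found in the input file")
    return sequence
```

A deliberately faulty input, such as a truncated file or one with the sequence element removed, should then produce a short message instead of a traceback from deep inside the parser.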
*** Note: This program was edited 03/12/2014 to add the coordinates of the exon start/finish to the printed FastA file. This was requested by the genetics lab so they could use the application for creating reference sequence hard copies (getting helpless trainees to spend hours making references and calling it a 'competency')
---
The added functionality is summarised below:
* The program now takes a third argument by default (program name, target LRG, intronic padding, option)
* Options and padding are not mandatory, but currently padding is required to input an option (argument ordering, I'll look into changing this)
* Available options are
- '-g': default, genomic sequence only
- '-p': protein sequence only
- '-pg' | '-gp': genomic and protein sequence. Layout is header, DNA, protein
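A minimal sketch of how the positional arguments and options above could be read. The default padding of 0 and the name parse_arguments are assumptions for this example; the real program may handle defaults differently. As described above, an option can only be given if padding is also supplied.

```python
VALID_OPTIONS = {"-g", "-p", "-pg", "-gp"}


def parse_arguments(argv):
    """Return (lrg_file, padding, show_genomic, show_protein) from an argv-style list."""
    assert 2 <= len(argv) <= 4, \
        "Usage: python XML_Parser.py LRG_FILE_TITLE [INTRONIC_PADDING_LENGTH [OPTION]]"

    lrg_file = argv[1]
    assert lrg_file.endswith(".xml"), "Input file must end in .xml"

    padding = 0  # assumed default when no padding is supplied
    if len(argv) >= 3:
        padding = int(argv[2])
        assert 0 <= padding <= 2000, "Padding must be between 0 and 2000"

    option = "-g"  # genomic sequence only is the documented default
    if len(argv) == 4:
        option = argv[3]
        assert option in VALID_OPTIONS, "Option must be one of -g, -p, -pg, -gp"

    return lrg_file, padding, "g" in option, "p" in option
```

For example, parse_arguments(["XML_Parser.py", "LRG_1.xml", "100", "-pg"]) selects both genomic and protein output with 100 bases of padding.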
I will be going through this soon to standardise variable naming and to test that sequence extracts are correct (content and length).
M. Welland