-
Notifications
You must be signed in to change notification settings - Fork 3
Refactoring Code
The current satay code base contains large scripts (such as transposon-mapping.py) that perform many tasks and some modules that are called in these scripts. In order keep the code base easy to understand and accessible for testing, the code needs to be refactored. Refactoring is the technique of changing an application (either the code or the architecture) so that it behaves the same way on the outside, but internally has improved. These improvements can be stability, performance, or reduction in complexity. Generally, refactoring existing code bases with large scripts aims at breaking the code into smaller modules.
Below are some general considerations and an example on refactoring and writing modular code. If you would like to have a more hands-on training, please have a look at these links:
Modular programming refers to the process of breaking a large, unwieldy programming task into separate, smaller, more manageable subtasks or modules. Individual modules can then put combined like building blocks to create a larger application.
There are several advantages to modularizing code in a large application:
-
Simplicity: Rather than focusing on the entire problem at hand, a module typically focuses on one relatively small portion of the problem. If you’re working on a single module, you’ll have a smaller problem domain to wrap your head around. This makes development easier and less error-prone.
-
Maintainability: Modules are typically designed so that they enforce logical boundaries between different problem domains. If modules are written in a way that minimizes interdependency, there is decreased likelihood that modifications to a single module will have an impact on other parts of the program. (You may even be able to make changes to a module without having any knowledge of the application outside that module.) This makes it more viable for a team of many programmers to work collaboratively on a large application.
-
Reusability: Functionality defined in a single module can be easily reused (through an appropriately defined interface) by other parts of the application (or other applications entirely). This eliminates the need to duplicate code.
-
Scoping: Modules typically define a separate namespace, which helps avoid collisions between identifiers in different areas of a program.
From: https://realpython.com/python-modules-packages/
The first task in the script transposon-mapping.py is loading a BAM file.
#%% LOADING BAM FILE
if bamfile is None:
path = os.path.join('/home', 'gregoryvanbeek', 'Documents', 'data_processing')
# filename = 'E-MTAB-4885.WT2.bam'
filename = 'SRR062634.filt_trimmed.sorted.bam'
bamfile = os.path.join(path,filename)
else:
filename = os.path.basename(bamfile)
path = bamfile.replace(filename,'')
assert os.path.isfile(bamfile), 'Bam file not found at: %s' % bamfile #check if given bam file exists
This single task would be easier to maintain and test if it was a separate function. Some observations:
- The code is not actually loading a BAM file, only verifying its presence.
- The code contains references to local files and folders
This can be turned into a separate function with one task:
def verify_presence_bamfile(path=None, filename=None):
"""
Docstring
"""
if path and filename:
bamfile = os.path.join(path,filename)
elif not path or not filename:
# Some default behaviour, such as looking for the .bam extension in the current folder.
However, this would add a second functionality to the function, namely find_bamfile
assert os.path.isfile(bamfile), 'Bam file not found at: %s' % bamfile
This new function (or module) can then be added to a script load_bamfile.py
in a folder containing all scripts pertaining to data importing. This script could then later also contain the import routines for bamfiles. The existing code in transposon-mapping.py
can the be replaced with
# Add to module import
from satay.importing import verify_presence_bamfile
# Set parameters, preferably in a separate config file
PATH = "path/to/file"
FILENAME = "filename.bam"
# Run code
verify_presence_bamfile(path=PATH, filename=FILENAME)
Of course, if more filetypes need to be verified, then the refactored function can be made more general further still.
While you're refactoring the code base to make it more modular, you might find better ways of writing the code altogether. Might as well take the opportunity! Below are some pointers to think of.
Commented out code is confusing. It bloats the codebase and makes it difficult to follow the true execution paths of the code. It also gets in the way of genuine comments, which can become lost in a sea of out of date code lines. Since the code base is under version control, there is no risk of losing the code. Just give the commit where you delete the comments a suitable name and they will be easy to find.
Project details
- Genetic Interactions Pipeline
- DCC Project Goals
- Software Maintenance Plan
- Knowledge Transfer Workshop
FAIR
Code base
Dependencies
Git and GitHub
Resources