Coding for biologists:
You should submit a single zipped file containing the entire work directory for the assignment.
This should include: all FASTA files, all of your code, and an iPython notebook with the details of your work. All code should be either included in the notebook, or written in separate files that are either imported or run from the notebook via the %run iPython magic, see https://ipython.org/ipython-doc/rel-0.10.2/html/interactive/tutorial.html
All output and all comments should appear in the notebook. It should be possible to run the entire notebook by running the cells sequentially from the beginning to the end (check that this works by restarting the kernel and working through the notebook from the top). Code and graphic output not linked to (directly or indirectly) from the notebook will not be marked. Only the notebook, Python code, text files required by the software and graphic output produced by the software will be marked. Comment your code thoroughly and format it properly.
Your work will be marked based on:
? completeness and correctness: 60%
? quality of the algorithmic solutions (including appropriate use of data and control
structures, use of functions, etc.): 30%
? coding style (comments, variable names, readability of code): 10%
For this assignment, you will implement a simple utility for drawing dotplots comparing two proteins. You can refer to the dotter program and the lecture notes for the Computational Genomics module for inspiration. The assignment is presented as a sequence of stages.
Attempt all questions in the “Required functionality” part before implementing any features marked as “Optional functionality”. You can implement any subset you like of the optional functionality. Check that your program runs correctly in the terminal, then use the iPython %run magic to run it from within a notebook. Include sample output for each functionality you implement and any other relevant information in the notebook.
Indicate clearly near the top of the notebook which of the questions you have attempted.
a) Write a dotplot program that reads two proteins from FASTA files specified on the command line (see sys.argv in the Python documentation). The program should output a simple dotplot to the terminal. The dotplot should involve only the first 70 residues of the sequence displayed horizontally and the first 20 residues of the sequence displayed vertically, so as to fit in the standard terminal screen. The first row and the first column should display the two sequences. In the dotplot proper, an asterisk (*) should mark locations corresponding to matching entries, while the rest should be left empty. A sample output (limited here for convenience to 10 residues from one sequence and 5 from the other) should look like:
Include a sample output in your notebook. [30 marks]
b) Code a simple help message to be displayed when the program is invoked with wrong or insufficient arguments or with the string help on the command line. Run your program from within the notebook to display the help message. To allow for easy modification and translation, the help message should be stored in a separate text file and loaded and displayed upon request.
c) Program a simple menu system of the type found in clustalw that allows the user to specify the names of the input files, obtain help, and quit the program. The menu should be displayed if the program is invoked without command-line arguments, or in any case after a dotplot is produced. You should wait for the user to press the enter key before reverting to the menu, to avoid wiping out the dotplot immediately when running in a terminal. For clarity, print the following line just below the dotplot: Hit enter to return to menu:
Include a screenshot of the menu in the notebook.
d) Implement panning through the sequences to visualise the rest of the dotplot. When a dotplot is displayed, the user should have a choice to press one of five keys to “page” forwards or backwards through either sequence, or return to the main menu. Following this a different portion of the dotplot should be displayed, or the user should be returned to the main menu. For example, a text line printed just below the dotplot should read:
Enter [r]ight, [l]eft, [u]p, [d]own or [m]enu:
The system should be able to handle sequences with a number of residues that isn't a multiple of 20 or 70. Demonstrate this feature in the notebook. [5 marks]
e) Use a scoring matrix instead than a simple identity check to score corresponding amino acids. Only plot a (*) if the score is above a threshold. The scoring matrix should be stored in a separate file that is loaded as required. The user should be able to select the threshold with a command line option and through the menu; for example mydotplot -t0.3 proteinA.fasta proteinB.fasta should select a threshold value of 0.3. Include sample output in the notebook and comment on the difference with respect to the simpler scoring scheme, if any (you can return to identity matching by choosing the identity matrix as your scoring scheme). [15 marks]
f) Implement filtering with a window of length w.
If you are not implementing (e): only draw a (*) at position (i,j) on the dotplot if the number of matching residues in corresponding positions within windows of length w centred at positions i (respectively j) on the two sequences is above a threshold t. So for instance if w=5 and t=3 a (*) should appear at any given position only if at least 3 corresponding residues within windows of length 5 match (both in the sense that they are the same residue, and that they are in the same position within the window; so for example if the two filtering windows contain “APKTR” and “AKQWR” then A and R count as a matches but K does not).
If you are implementing (e): For each position (i,j) in the two sequences, pairs of amino acids in corresponding positions in the filtering windows should be scored using the scoring matrix. These scores should be averaged and compared against the threshold. A (*) should then be printed only if the resulting average score is above the threshold.
In either case you should implement a command line option -f to allow the user to request the use of the filter and specify the length of the window, and an option -t for threshold selection. For instance mydotplot -f5 -t2.0 proteinA.fasta proteinB.fasta should produce a dotplot of protein A vs protein B, filtered with a window of length 5 and a threshold of 2.0. The same functionality should also be accessible through the menu. Invoke your dotplot program on two sample sequences, without and with filtering, include the output in the notebook and comment on the differences. [20 marks]
g) Give the user the option to display the dotplot for the entire sequences using a graphic library. I suggest the imshow function from the matplotlib library, but other equivalent choices are also fine (if this library is not present on your system, use the software installer to install python-matplotlib). http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.imshow
Note that you will not be able to display the sequences with imshow, only the dots will be displayed as an image.
For this to work, you will need to create a two-dimensional array of the appropriate size and set each single entry to 0.0 (black) or 1.0 (white) to differentiate between dots and background. You can pass the keyword argument cmap='gray' to imshow to select a grey scale colormap. If you have implemented point (e) and/or (f), you may want to display the matching score itself as a grey level, instead of creating a black-and-white (two-level) dotplot. It is still useful to set a threshold below which the point is set to white. According to your scoring scheme, you may need to rescale/normalize the thresholded scores for display with imshow (read the function description and the example carefully). Graphic output should be selectable from the command line (via option -g) and from the menu of your program. Include sample output in the notebook (you can get matplotlib images to display directly in the notebook by running the magic %matplotlib inline). [10 marks]
- Notebook contains links to all relevant code and all output required
- Notebook runs in a sequence from the first cell to the last with a fresh kernel
- Notebook and software include name of author and/or student number
- No Microsoft Word or other files other than Python code, text and a notebook file, and images generated by the code (with links in the notebook)
- All relevant files are included in the submission as a single .zip file