Overview
Teaching: 0 min
Exercises: 180 min
Questions
NA
Objectives
Create a repository
Write data to it
Clean and tidy data
Make a plot
Create a new directory in your work directory. Title it with your last name and “datapractical”.
Within the directory, create scripts, output and data subdirectories.
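The layout can be created from the shell or a file browser. For anyone who prefers to script it, here is a minimal Python sketch using `pathlib`. The directory name `lastname_datapractical` is a placeholder, and the sketch builds everything inside a temporary directory so it is safe to run as-is.

```python
import tempfile
from pathlib import Path

# Stand-in for your work directory; swap in the real path when you use this.
base = Path(tempfile.mkdtemp())

# Hypothetical project name: replace "lastname" with your own last name.
project = base / "lastname_datapractical"

# Create the three subdirectories (and the project directory itself).
for sub in ("scripts", "output", "data"):
    (project / sub).mkdir(parents=True, exist_ok=True)

created = sorted(p.name for p in project.iterdir())
print(created)
```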
Initialize a git repository in your directory. Create a README. Copy the text from this document into the README. You will fill out your answers in the README.
Import the data from the class repository into your project repo. You will need to git pull to get updated files. The files will be in the _homework/takehome folder of the class repo.
The next several steps will involve data cleaning in Python. I will ask you to make commits at various points, or to paste in the command you used, or to save a copy of a datafile.
Load both the “good” and “bad” data files into Pandas. Standardize the columns across the two datafiles. This may involve splitting columns up, or dropping columns from the dataframes.
Save these two dataframes to files with names that mark them as distinct from the raw data.
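As a rough illustration of the standardizing step, the sketch below uses two small made-up dataframes. The column names (`read_id`, `pos_qual`, `notes`) and the values are hypothetical and will not match the course data files. It splits a fused column in two, drops a column the other frame lacks, and reorders the columns so both frames match.

```python
import pandas as pd

# Hypothetical stand-ins for the "good" and "bad" files; real columns will differ.
good = pd.DataFrame({
    "read_id": ["r1", "r2"],
    "position": [1, 2],
    "quality": ["I", "5"],
})
bad = pd.DataFrame({
    "read_id": ["r3", "r4"],
    "pos_qual": ["1_#", "2_+"],   # position and quality fused in one column
    "notes": ["x", "y"],          # extra column absent from the good data
})

# Split the fused column at the first underscore into two new columns.
bad[["position", "quality"]] = bad["pos_qual"].str.split("_", n=1, expand=True)
bad["position"] = bad["position"].astype(int)

# Drop what is no longer needed and use the same column order as `good`.
bad = bad.drop(columns=["pos_qual", "notes"])[list(good.columns)]

# Save under names that cannot be confused with the raw files, for example:
# good.to_csv("data/good_standardized.csv", index=False)
# bad.to_csv("data/bad_standardized.csv", index=False)
```

The `to_csv` calls are commented out so the sketch does not write anything; the file names in them are only suggestions.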
Now, combine each of these dataframes with the file of the same name with the suffix “_organism_info”. How does it make sense to add the “organism info” column to the dataframe? Hint: it does not have to be a join. We have covered other syntax for adding a column.
Save these two new dataframes to files with names that mark them as distinct from the raw data.
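One way to add a column without a join is plain assignment. The sketch below assumes the info file has exactly one row per row of the dataframe, in the same order. That is an assumption about the data, so check it before relying on this. All names and values here are invented for illustration.

```python
import pandas as pd

# Hypothetical data: one organism label per row, in the same row order.
reads = pd.DataFrame({"read_id": ["r1", "r2", "r3"], "quality": ["I", "5", "#"]})
info = pd.DataFrame({"organism": ["E. coli", "E. coli", "phage"]})

# Positional assignment is only valid if the two tables line up row for row.
assert len(reads) == len(info)

# .to_numpy() assigns by position and sidesteps index alignment.
reads_with_info = reads.assign(organism=info["organism"].to_numpy())

# A single label for a whole file can be broadcast as a scalar instead.
reads_labelled = reads.assign(dataset="good")
```

If the rows do not line up one to one, a join on a shared key column is the safer choice.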
Combine the two dataframes. You should then have one dataframe. Paste below the command that you used to do this join or concatenation. Save the dataframe of the joined or concatenated data with a name indicating what it is.
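For reference, stacking two frames that share the same columns might look like the sketch below. The frames are made up, and the `dataset` label column is an addition of this sketch, there so that each row can still be traced to its source file after combining.

```python
import pandas as pd

# Hypothetical frames with identical columns.
good = pd.DataFrame({"read_id": ["r1", "r2"], "quality": ["I", "5"]})
bad = pd.DataFrame({"read_id": ["r3"], "quality": ["#"]})

# Label each frame first so rows stay traceable after stacking.
good = good.assign(dataset="good")
bad = bad.assign(dataset="bad")

# Stack the rows; ignore_index=True renumbers the index from 0.
combined = pd.concat([good, bad], ignore_index=True)
```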
Exit Pandas. Make sure all the datafiles generated in steps 1-9 are in your data directory, and commit them. From the project directory in the course repo, copy pore_plotting.py into your scripts directory. Run
python pore_plotting.py -h
to see the help. To understand what is going on, you may need to look at the code. You will not understand all of it, and that’s OK.
Call the script on each data frame. Write the plots to the output folder.
Make two comments in the script: one comment on line 12, explaining what the code in lines 8-11 is doing. Make one other comment somewhere indicating something you don’t understand.
Save and commit your modified script.
For the last bit, we will make a plot from our joined data with organism information (i.e. the data file you output in Pandas practical step 9).
Load the joined dataset.
Convert the quality scores in this dataframe to numeric values.
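How to convert depends on how the scores are stored. If they are digits saved as text, `pd.to_numeric` is enough. The sketch below assumes instead that each read carries a FASTQ-style quality string in Phred+33 encoding, where the score is the character's ASCII code minus 33. That is an assumption about the data, and the column names are hypothetical.

```python
import pandas as pd

def phred33_to_scores(quality_string):
    """Decode a Phred+33 quality string into a list of integer scores."""
    return [ord(char) - 33 for char in quality_string]

# Hypothetical rows; the real joined file will have its own column names.
reads = pd.DataFrame({
    "dataset": ["good", "bad"],
    "quality": ["II5", "#+!"],
})

# One list of integer scores per read, in read-position order.
reads["scores"] = reads["quality"].apply(phred33_to_scores)
print(reads["scores"].tolist())  # [[40, 40, 20], [2, 10, 0]]
```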
Make a box-and-whiskers plot of these values. The plot should show the quality per position in the read, across reads. For each position, there should be two boxes: one for the “good” dataset, and one for the “bad”. Paste your code below.
Save your plot to the output directory. Embed the plot below.
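One possible shape for the plotting code is sketched below with matplotlib. It assumes the data has already been reshaped to long form, with one row per read position and a numeric score. The frame, its column names, and the output file name are all made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # render to a file without needing a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical long-form data: one row per (read, position) with a numeric score.
long_df = pd.DataFrame({
    "dataset":  ["good"] * 6 + ["bad"] * 6,
    "position": [1, 2, 3] * 4,
    "score":    [40, 38, 35, 39, 37, 30, 12, 10, 8, 15, 9, 5],
})

positions = sorted(long_df["position"].unique())
offsets = {"good": -0.2, "bad": 0.2}            # shift the paired boxes apart
colors = {"good": "tab:blue", "bad": "tab:red"}
boxes = {}                                      # one handle per dataset, for the legend

fig, ax = plt.subplots()
for name, offset in offsets.items():
    subset = long_df[long_df["dataset"] == name]
    # One list of scores per position, in position order.
    groups = [subset.loc[subset["position"] == p, "score"].tolist() for p in positions]
    drawn = ax.boxplot(
        groups,
        positions=[p + offset for p in positions],
        widths=0.3,
        patch_artist=True,
        boxprops={"facecolor": colors[name]},
    )
    boxes[name] = drawn["boxes"][0]

# Put one tick per read position, centred between the two boxes.
ax.set_xticks(positions)
ax.set_xticklabels([str(p) for p in positions])
ax.set_xlabel("Position in read")
ax.set_ylabel("Quality score")
ax.legend(list(boxes.values()), list(boxes.keys()))
# fig.savefig("output/quality_by_position.png")
```

The `savefig` line is commented out so the sketch writes nothing; uncomment it and adjust the path to save into your output directory.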
Compress the whole project directory into a .tgz file. Either commit and push to your fork of the course repo, or email it to me.
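The archive is most often made with a command-line tool, but Python's standard `tarfile` module can do the same job. The sketch below builds a throwaway project in a temporary directory so it is safe to run; the directory and file names are placeholders.

```python
import tarfile
import tempfile
from pathlib import Path

# Build a small throwaway project so the sketch does not touch real files.
workdir = Path(tempfile.mkdtemp())
project = workdir / "lastname_datapractical"   # hypothetical directory name
(project / "data").mkdir(parents=True)
(project / "README").write_text("answers go here\n")

archive = workdir / "lastname_datapractical.tgz"
with tarfile.open(archive, "w:gz") as tar:
    # arcname keeps paths inside the archive relative to the project folder.
    tar.add(project, arcname=project.name)

# List the archive contents to confirm what was packed.
with tarfile.open(archive, "r:gz") as tar:
    names = tar.getnames()
print(names)
```

Note that adding the whole project directory this way also packs its hidden `.git` folder, if one exists.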
Key Points