Jump to content
Toggle sidebar
Neurobiology.Dev
Search
Create account
Personal tools
Create account
Log in
Pages for logged out editors
learn more
Talk
Contributions
Navigation
Main page
Records
Recent changes
Random page
Tools
What links here
Related changes
Special pages
Page information
Editing
Nanopore RNA Sequencing Protocol
(section)
Page
Discussion
English
Read
Edit
View history
More
Read
Edit
View history
Warning:
You are not logged in. Your IP address will be publicly visible if you make any edits. If you
log in
or
create an account
, your edits will be attributed to your username, along with other benefits.
Anti-spam check. Do
not
fill this in!
===EpiNano=== {{warning|Before we get started, I highly recommend that you take a look at your output from the previous steps in the ''Integrative Genomics Viewer (IGV)'', which is [https://software.broadinstitute.org/software/igv/download available here]. This will allow you to visualize a gene of interest on the chromosome of your choice and ensure that you actually have mapped reads at this stage in the pipeline. Otherwise, you’ll likely get an error when trying to run ''EpiNano'' as there wont be any reads in your input data.}} Next up, we need to process the “error” data on these files using ''EpiNano'' in order to determine the likelihood of there being an '''m6A''' modification present within any <code>RRACH</code> motifs on the strand. To do so, we need to first activate our <code>conda stack</code> that contains the ''EpiNano'' packages that we plan to use. Then, we will change directories to the root ''EpiNano'' folder, and execute some code. This first bit of code extracts the “error” basecalling data for each nucleotide and spits it out as a <code>.csv</code> file called <code>${SAMPLEID}.per.site.var.csv</code>. We will use this to determine the likelihood of an m6A modification being present at a given site. <b>Note:</b> If you plan on using ''EpiNano-Error'' (i.e., comparing two samples), then you need to run this step and the next few on both sample conditions. <code>Bash</code> <syntaxhighlight lang="Bash"> conda activate --stack epinano python $EPI/Epinano_Variants.py -n 20 \ -R $GREF/${GREFERENCEID} \ -b $NALI/${SAMPLEID}/${SAMPLEID}_${CHR}.bam \ -s $EPI/misc/sam2tsv.jar --type g </syntaxhighlight> Keep in mind that the number in the line python <code>Epinano_Variants.py -n 20</code> is specifying how many CPU cores to utilize for this process. Therefore, your number here may need to change depending upon what system you are using. The last line argument <code>--type g</code> tells ''EpiNano'' that your reference is a genome, rather than a transcriptome (the latter of which could be specified instead with <code>--type t</code>). Now we have too ways of proceeding from here: '''EpiNano-Error''' or '''EpiNano-SVM'''. ''EpiNano-SVM'' is useful if you want to determine the likelihood of an m6A modification being present at an RRACH motif in a single sample (since you compare your data to a pre-trained model), but it requires Guppy 3.1.5. EpiNano-Error can be used with any basecalling algorithm, but requires you to compare modifications between samples, with one of those samples being a knock down or knock out control. ====EpiNano-Error (Pair-wise)==== If we have two samples, one that is '''unmodified''' and one that is '''modified''' (in <code>Rscript Epinano_DiffErr.R ...</code>, the '''ko.csv''' is the unmodified sample), we can compare them using the ''R script'' <code>Epinano_DiffErr.R</code>. This script fits a linear regression model between paired samples, then determines outliers that often are due to base modifications. <code>Bash</code> <syntaxhighlight lang="Bash"> SAMPLEID_UNM="FAR91556" \ SAMPLEID_MOD="FAM95931" \ Rscript $EPI/Epinano_DiffErr.R \ -k $NALI/${SAMPLEID_UNM}/${SAMPLEID_UNM}_${CHR}.plus_strand.per.site.csv \ -w $NALI/${SAMPLEID_MOD}/${SAMPLEID_MOD}_${CHR}.plus_strand.per.site.csv \ -t 3 \ -o ${SAMPLEID_UNM}_vs_${SAMPLEID_MOD}_${CHR} \ -c 30 -f sum_err -d 0.1 -p </syntaxhighlight> In the above code, <code>-t</code> indicates the minimum z-score threshold (i.e., standard deviations from the mean) you would like to reach for a mod to be detected. <code>-o</code> indicates the name prefix of the output file you wish to generate from the run. <code>-f</code> indicates the feature (via column name) used to predict the modification. <code>-d</code> indicates the minimum deviance required of the selected feature between the two samples. -p is included here, which generates EpiNano plots, but you can omit that and run Epinano_Plot.R instead if you wish. Lastly, I’ll want to move these files to the EpiNano output folder. We can do that with this block: <code>Bash</code> <syntaxhighlight lang="Bash"> ETARGET="$EPIOUT/${SAMPLEID_UNM}_vs_${SAMPLEID_MOD}" \ mkdir $ETARGET mv ${SAMPLEID_UNM}_vs_${SAMPLEID_MOD}_${CHR}.delta-sum_err.prediction.csv \ ${SAMPLEID_UNM}_vs_${SAMPLEID_MOD}_${CHR}.linear-regression.prediction.csv \ ${CHR}.${SAMPLEID_UNM}_vs_${SAMPLEID_MOD}_${CHR}.delta-sum_err.bar.pdf \ ${CHR}.${SAMPLEID_UNM}_vs_${SAMPLEID_MOD}_${CHR}.linear-regression.scatter.pdf $ETARGET </syntaxhighlight> We will now move forward and I will show you how to do the alternative and run single samples with ''EpiNano-SVM''. ====EpiNano-SVM (Single Sample)==== {{warning|'''EpiNano-SVM''' uses a pretrained <code>RRACH</code> model from ''Guppy 3.1.5''. Therefore, if you use this model, you will need to also have basecalled your data using ''Guppy 3.1.5''. If you are using a different basecaller algorithm, or ''Guppy'' of a different version, please do a pairwise comparison with EpiNano-Error as described above, insert m6A controls, or consider using ''m6ANet''.}} If we would like to extract the likelihood of m6A modifications using the pretrained models, we can do so by running the script <code>Epinano_Predict.py</code> and referencing our <code>RRACH</code> model. To do this, though, we must first divide our reads into kmers of 5 and isolate just the <code>RRACH</code> motifs. Isolating into 5 kmers: <code>Bash</code> <syntaxhighlight lang="Bash"> python $EPI/misc/Slide_Variants.py $NALI/${SAMPLEID}/${SAMPLEID}_${CHR}.plus_strand.per.site.csv 5 python $EPI/misc/Slide_Variants.py $NALI/${SAMPLEID}/${SAMPLEID}_${CHR}.minus_strand.per.site.csv 5 </syntaxhighlight> Filter for ''only'' <code>RRACH</code> <code>Bash</code> <syntaxhighlight lang="Bash"> cd $NALI/${SAMPLEID}/ awk -F"," 'NR==1; NR > 1{if($1=="AAACA" || $1=="AAACT" || $1=="GAACT" || $1=="GGACA" || $1=="AGACT" || $1=="GGACT" || \ $1=="AAACC" || $1=="AGACC" || $1=="GAACC" || $1=="GAACA" || $1=="AGACA" || $1=="GGACC") print}' \ ${SAMPLEID}_${CHR}.plus_strand.per.site.5mer.csv > ${SAMPLEID}_${CHR}_plus_RRACH.csv awk -F"," 'NR==1; NR > 1{if($1=="AAACA" || $1=="AAACT" || $1=="GAACT" || $1=="GGACA" || $1=="AGACT" || $1=="GGACT" || \ $1=="AAACC" || $1=="AGACC" || $1=="GAACC" || $1=="GAACA" || $1=="AGACA" || $1=="GGACC") print}' \ ${SAMPLEID}_${CHR}.minus_strand.per.site.5mer.csv > ${SAMPLEID}_${CHR}_minus_RRACH.csv </syntaxhighlight> Now that we have that, we can go ahead and run <code>Epinano_Predict.py</code>. Note that the code below is duplicated, because we will run it also on the data table that contains mRNA reads mapped to the reverse strand. <code>Bash</code> <syntaxhighlight lang="Bash"> python $EPI/Epinano_Predict.py \ --model $EPI/models/rrach.q3.mis3.del3.linear.dump \ --predict ${SAMPLEID}_${CHR}_plus_RRACH.csv \ --out_prefix ${SAMPLEID}_${CHR}_plus \ --columns 8,13,23 python $EPI/Epinano_Predict.py \ --model $EPI/models/rrach.q3.mis3.del3.linear.dump \ --predict ${SAMPLEID}_${CHR}_minus_RRACH.csv \ --out_prefix ${SAMPLEID}_${CHR}_minus \ --columns 8,13,23 </syntaxhighlight> So let’s break down what the above is doing. First, we are telling the console that we want to run the python script <code>Epinano_Predict.py</code> with the pre-trained model specified by <code>--model</code>, which will be used to detect m6A modifications. <code>--predict</code> specifies the table file that will be used for making predictions (i.e., your processed data), while <code>--out_prefix</code> specifies the prefix output file name. Finally, <code>--columns</code> specify the columns within the dataset that contain the features that will be used for the prediction of m6A sites. The final output will be in the same folder as you ran the command (in this case, in the <code>sorted_bam</code> subfolder), and will be saved as your specified <code>--out_prefix</code> + <code>.q3.mis3.del3.MODEL.rrach.q3.mis3.del3.linear.dump.csv</code> if you used the same model as I did. The next thing we want to do is import this data into ''R'' for data analysis and further visualization. <code>Bash</code> <syntaxhighlight lang="Bash"> # Load a few packages needed. suppressMessages (library(tidyverse)) suppressMessages (library(stringr)) # Specify the input file in the R working directory. You can check the working directory with getwd(). # Note: You could also specify the path to the output file location and point directly to it there too. plusstrandfile = paste(paste0(OUTPUT_PATH, SAMPLEID), CHR, "plus.q3.mis3.del3.MODEL.rrach.q3.mis3.del3.linear.dump.csv", sep = "_") minusstrandfile = paste(paste0(OUTPUT_PATH, SAMPLEID), CHR, "minus.q3.mis3.del3.MODEL.rrach.q3.mis3.del3.linear.dump.csv", sep = "_") # Read the csv input file and separate necessary columns. epi_res_plus <- read.csv (plusstrandfile, header=T)[,c("Window","Ref","Strand","ProbM","prediction")] epi_res_minus <- read.csv (minusstrandfile, header=T)[,c("Window","Ref","Strand","ProbM","prediction")] epi_res <- rbind(epi_res_plus, epi_res_minus) # Split the "Window" into two integers. split <- data.frame (str_split_fixed (epi_res$Window, '-', 2)) # Grab the split integer columns. split$first <- as.integer (as.character(split$X1)) + 2 split$middle <- as.integer (as.character(split$X2)) + 2 # Add the integer columns to the data frame. epi_res$Start <- split$first epi_res$End <- split$middle # Remove "Window" column after split. Left with "Ref" (chromosome), "Position" (start), "Strand", "ProbM", and "prediction". epi_res<- epi_res[,c("Ref", "Start", "End", "Strand", "ProbM", "prediction")] colnames(epi_res) <- c("Ref", "Start", "End", "Strand", "m6A_Prob", "m6A_Pred") # Assign the epinano output a name based on sample ID and chromosome. assign(paste(SAMPLEID, CHR, "epinano_result", sep = "_"), epi_res) </syntaxhighlight> This will import the data and isolate the columns for ''chromosome'', Start and End positions on the chromosome (with regard to GRCh38, our reference genome used), the ''Strand'', the ''probability of an m6A modification'', and the ''m6A EpiNano Prediction'': either listed as “mod” (for “modified”) or “unm” (for “unmodified”). =====Probing Genes===== Now we want to be able to visualize the data, preferably between two samples. This next code block can be executed directly in ''R'', all you have to do is change the first set of values to indicate which gene you want to probe and your samples you want to include. <code>R</code> <syntaxhighlight lang="R"> # Define your parameters. Note that your sample IDs are the same as above. GENE="ATF5" SAMPLE1_ID="FAR91556" # Usually WT SAMPLE2_ID="FAM95931" # Usually Mut # Code to run </syntaxhighlight> =====Plotting Results (ATF5)===== Here we can plot the results for the ATF5 gene. This will be modified in a future iteration. <code>R</code> <syntaxhighlight lang="R"> suppressMessages (library(reshape2)) suppressMessages (library(ggplot2)) suppressMessages (library(ggrepel)) suppressMessages (library(tidyverse)) suppressMessages (library(stringr)) # Isolate ATF5 WT122021_nano_ATF5 <- subset(FAR91556_chr19_epinano_result, Start >= 49928403) WT122021_nano_ATF5 <- subset(WT122021_nano_ATF5, End <= 49934438) IDHMUT_nano_ATF5 <- subset(FAM95931_chr19_epinano_result, Start >= 49928403) IDHMUT_nano_ATF5 <- subset(IDHMUT_nano_ATF5, End <= 49934438) #WT_mod <- subset(WT122021_nano_ATF5, m6A_Prob >= 0.05) #WT_unm <- subset(WT122021_nano_ATF5, m6A_Prob < 0.05) # Note: Some of these data were generated view the MotifFinder script. #49928702-49933935 # ATF5 Gene, Chr19:49929205-49933935 merge <- ggplot (WT122021_nano_ATF5) + ggtitle ("Merge") + # Underlying Strand geom_segment(aes(x=49928403, xend=49934438, y=0.125, yend=0.125), color="#383838", size=4) + # Exon 1 49,928,890 - 49,929,022 geom_segment(aes(x=49928890, xend=49929022, y=0.125, yend=0.125), color="#858585", size=6) + # Exon 2 49,929,223 - 49,929,421 geom_segment(aes(x=49929223, xend=49929421, y=0.125, yend=0.125), color="#858585", size=6) + # Exon 3 49,930,681 - 49,931,079 geom_segment(aes(x=49930681, xend=49931079, y=0.125, yend=0.125), color="#858585", size=6) + # Exon 4 49,932,271 - 49,934,086 geom_segment(aes(x=49932271, xend=49934086, y=0.125, yend=0.125), color="#858585", size=6) + # Location of RRACH sites in ATF5. This was ascertained using the Motif Finder function. #geom_linerange(data=ATF5_RRACH, aes(x=Hg38Position, ymin=0, ymax=1), color="#d9fbcb", size=0.4) + #geom_linerange(data=ATF5_MAZF, aes(x=Hg38Position, ymin=0, ymax=1), color="#476ad0", size=0.4) + # WT Dots geom_point(data=WT122021_nano_ATF5, aes(x=Start, y=m6A_Prob),shape = 21, colour = "black", fill = "#959595", size=2, stroke = 1.0) + # IDH Mut Dots geom_point(data=IDHMUT_nano_ATF5, aes(x=Start, y=m6A_Prob),shape = 21, colour = "black", fill = "#dd1818", size=2, stroke = 1.0) + # Instead of entering in the chromosome, paste it to the label from the data. xlab (paste0("Position on ", WT122021_nano_ATF5[1,1])) + ylab ("m6A Prob.") + theme_bw ()+ theme (axis.text.x = element_text(face="bold", color="black",size=11), axis.text.y = element_text(face="bold", color="black", size=11), plot.title = element_text(color="black", size=24, face="bold.italic", hjust = 0.5), axis.title.x = element_text(color="black", size=15, face="bold"), axis.title.y = element_text(color="black", size=15, face="bold"), panel.background = element_blank(), legend.position = "none", axis.line = element_line(colour = "black", size=0.5)) #geom_text_repel (data=mod, aes(x=Position, y=ProbM, label=Position), color="red", box.padding=3, point.padding=3, segment.size = 0.6, segment.color = "green") </syntaxhighlight> Final export on one graph: <code>R</code> <syntaxhighlight lang="R"> suppressMessages(library(gridExtra)) grid.arrange(wt, idh, merge, nrow = 2) </syntaxhighlight>
Summary:
Please note that all contributions to Neurobiology.Dev may be edited, altered, or removed by other contributors. If you do not want your writing to be edited mercilessly, then do not submit it here.
You are also promising us that you wrote this yourself, or copied it from a public domain or similar free resource (see
Neurobiology.Dev:Copyrights
for details).
Do not submit copyrighted work without permission!
Cancel
Editing help
(opens in new window)