Jump to content
Toggle sidebar
Neurobiology.Dev
Search
Create account
Personal tools
Create account
Log in
Pages for logged out editors
learn more
Talk
Contributions
Navigation
Main page
Records
Recent changes
Random page
Tools
What links here
Related changes
Special pages
Page information
Editing
Nanopore RNA Sequencing Protocol
(section)
Page
Discussion
English
Read
Edit
View history
More
Read
Edit
View history
Warning:
You are not logged in. Your IP address will be publicly visible if you make any edits. If you
log in
or
create an account
, your edits will be attributed to your username, along with other benefits.
Anti-spam check. Do
not
fill this in!
=m6A Detection= ===m6ANet=== {{warning|Before we get started, I highly recommend that you take a look at your output from the previous steps in the ''Integrative Genomics Viewer (IGV)'', which is [https://software.broadinstitute.org/software/igv/download available here]. This will allow you to visualize a gene of interest and ensure that you actually have mapped reads at this stage in the pipeline. Otherwise, you’ll be just wasting your time trying to extract methylation status on 0 reads! <b>Note:</b> I had issues running the setup install for ''m6ANet'' over an existing ''Conda'' installation. Likely there are some package version incompatibilities. I highly recommend running the setup and executing ''m6ANet'' from its own ''Conda'' stack, after installing only <b>Python 3.8</b>. This is covered in the Installations section above.}} Now that we have basecalled, mapped, and sorted RNA reads, we now want to try and detect m6A modifications. In the past, we have used various other methods such as ''EpiNano'' and ''Tombo'', but a more recent and powerful method has emerged called '''m6ANet'''. The advantage of ''m6ANet'' is that it uses very few software requirements out of the box (''python, Nanopolish, Pytorch,'' and ''Samtools''), does not require a specific basecaller version (like ''EpiNano-SVM'' does), evaluates <code>DRACH</code> motifs instead of just <code>RRACH</code> motifs, and has much greater accuracy than most other contemporarily available models. Therefore, we will use ''m6ANet'' for our detection. ====Data Preparation==== Before we can begin, however, we need to extract some data from our raw <code>.fast5</code> files using ''Nanopolish'' and do some data formatting. First up, we need to index our raw reads using ''Nanopolish'', which can be done via the following: <code>Bash</code> <syntaxhighlight lang="Bash"> cd $NBASE nanopolish index -d $NRAW/${SAMPLEID} ${SAMPLEID}_master.fastq </syntaxhighlight> Note that this should be called from the same directory as your .fastq master file. You can include the ''Guppy'' sequencing summary output by adding in the line <code>--sequencing-summary=sequencing_summary.txt</code> after <code>$NRAW/${SAMPLEID}</code> to speed up the indexing process, but I recommend against it if you get an error. In my experience, ''Guppy'' tells ''Nanopolish'' where files are located through this process, which may or may not be accurate. Next, we need to generate the <code>eventalign</code> output from ''Nanopolish'', which aligns events from the raw data to the transcriptome for you. To do this, you need to indicate the location of your reference transcriptome file, your sorted and aligned <code>.bam</code> file, and your basecalled master <code>.fastq</code> file. The output of <code>eventalign</code> is '''''very large''''' with a good nanopore run (~23+ GB per chromosome of data), so I highly recommend considering mapping your reads to a transcriptome “subset” first and then running <code>eventalign</code> on that instead. <code>Bash</code> <syntaxhighlight lang="Bash"> nanopolish eventalign \ -r ${SAMPLEID}_master.fastq \ -b $NALI/${SAMPLEID}_aligned.bam \ -g $GREF/${GREFERENCEID} \ --scale-events \ --signal-index \ --threads 50 > ${SAMPLEID}_eventalign.txt </syntaxhighlight> Now we can do some ''m6ANet'' data prep and generate a few files: <code>Bash</code> <syntaxhighlight lang="Bash"> conda activate m6anet m6anet-dataprep \ --eventalign $NBASE/${SAMPLEID}_eventalign.txt \ --out_dir $M6AO/${SAMPLEID} \ --n_processes 4 </syntaxhighlight> The output of this will generate the following: * <code>data.index</code>: Indexing of <code>data.json</code> to allow faster access to the file. * <code>data.json</code>: json file containing the features to feed into ''m6ANet'' model for prediction. * <code>data.log</code>: Log file containing all the transcripts that have been successfully pre-processed. * <code>data.readcount</code>: File containing the number of reads for each <code>DRACH</code> positions in <code>eventalign.txt</code> document. * <code>eventalign.index</code>: Index file created during <code>dataprep</code> to allow faster access of the ''Nanopolish'' <code>eventalign.txt</code> data during this step. ====Run Inference==== {{tip|M6ANet may run into issues if your shell open file limit is too low. The max number of open files can be viewed with the <code>ulimit -n</code> command. To change this, simply add a number following the command (e.g., <code>ulimit -n 100000</code>).}} Finally, we can execute the final bit of code to generate the final output: <code>Bash</code> <syntaxhighlight lang="Bash"> m6anet-run_inference \ --input_dir $M6AO \ --out_dir $M6AO \ --infer_mod_rate \ --n_processes 4 </syntaxhighlight> Note that the <code>--input_dir</code> here is the output folder from the previous step that contains your <code>eventalign.index</code> file, <code>data.index</code> file, etc. The <code>--out_dir</code> can be specified as something else, but I wanted to put the output from this last step back into the final folder. The output is a <code>.gz</code> file, which might need to be unzipped via the command line if you have trouble using the GUI to do so: <code>Bash</code> <syntaxhighlight lang="Bash"> cd $M6AO gunzip data.result.csv.gz mv data.result.csv ${SAMPLEID}_m6anet_result.csv </syntaxhighlight> ====R Analysis and Plots==== Then in <code>R</code>, we can import these results using the <code>read.csv</code> function like so: <code>R</code> <syntaxhighlight lang="Bash"> # Change name based upon what you saved it as in the previous step, if you renamed it based upon sample name. m6ANet_result <- read.csv(file=paste0(M6ANET_PATH, "/data.result.csv")) </syntaxhighlight> ===EpiNano=== {{warning|Before we get started, I highly recommend that you take a look at your output from the previous steps in the ''Integrative Genomics Viewer (IGV)'', which is [https://software.broadinstitute.org/software/igv/download available here]. This will allow you to visualize a gene of interest on the chromosome of your choice and ensure that you actually have mapped reads at this stage in the pipeline. Otherwise, you’ll likely get an error when trying to run ''EpiNano'' as there wont be any reads in your input data.}} Next up, we need to process the “error” data on these files using ''EpiNano'' in order to determine the likelihood of there being an '''m6A''' modification present within any <code>RRACH</code> motifs on the strand. To do so, we need to first activate our <code>conda stack</code> that contains the ''EpiNano'' packages that we plan to use. Then, we will change directories to the root ''EpiNano'' folder, and execute some code. This first bit of code extracts the “error” basecalling data for each nucleotide and spits it out as a <code>.csv</code> file called <code>${SAMPLEID}.per.site.var.csv</code>. We will use this to determine the likelihood of an m6A modification being present at a given site. <b>Note:</b> If you plan on using ''EpiNano-Error'' (i.e., comparing two samples), then you need to run this step and the next few on both sample conditions. <code>Bash</code> <syntaxhighlight lang="Bash"> conda activate --stack epinano python $EPI/Epinano_Variants.py -n 20 \ -R $GREF/${GREFERENCEID} \ -b $NALI/${SAMPLEID}/${SAMPLEID}_${CHR}.bam \ -s $EPI/misc/sam2tsv.jar --type g </syntaxhighlight> Keep in mind that the number in the line python <code>Epinano_Variants.py -n 20</code> is specifying how many CPU cores to utilize for this process. Therefore, your number here may need to change depending upon what system you are using. The last line argument <code>--type g</code> tells ''EpiNano'' that your reference is a genome, rather than a transcriptome (the latter of which could be specified instead with <code>--type t</code>). Now we have too ways of proceeding from here: '''EpiNano-Error''' or '''EpiNano-SVM'''. ''EpiNano-SVM'' is useful if you want to determine the likelihood of an m6A modification being present at an RRACH motif in a single sample (since you compare your data to a pre-trained model), but it requires Guppy 3.1.5. EpiNano-Error can be used with any basecalling algorithm, but requires you to compare modifications between samples, with one of those samples being a knock down or knock out control. ====EpiNano-Error (Pair-wise)==== If we have two samples, one that is '''unmodified''' and one that is '''modified''' (in <code>Rscript Epinano_DiffErr.R ...</code>, the '''ko.csv''' is the unmodified sample), we can compare them using the ''R script'' <code>Epinano_DiffErr.R</code>. This script fits a linear regression model between paired samples, then determines outliers that often are due to base modifications. <code>Bash</code> <syntaxhighlight lang="Bash"> SAMPLEID_UNM="FAR91556" \ SAMPLEID_MOD="FAM95931" \ Rscript $EPI/Epinano_DiffErr.R \ -k $NALI/${SAMPLEID_UNM}/${SAMPLEID_UNM}_${CHR}.plus_strand.per.site.csv \ -w $NALI/${SAMPLEID_MOD}/${SAMPLEID_MOD}_${CHR}.plus_strand.per.site.csv \ -t 3 \ -o ${SAMPLEID_UNM}_vs_${SAMPLEID_MOD}_${CHR} \ -c 30 -f sum_err -d 0.1 -p </syntaxhighlight> In the above code, <code>-t</code> indicates the minimum z-score threshold (i.e., standard deviations from the mean) you would like to reach for a mod to be detected. <code>-o</code> indicates the name prefix of the output file you wish to generate from the run. <code>-f</code> indicates the feature (via column name) used to predict the modification. <code>-d</code> indicates the minimum deviance required of the selected feature between the two samples. -p is included here, which generates EpiNano plots, but you can omit that and run Epinano_Plot.R instead if you wish. Lastly, I’ll want to move these files to the EpiNano output folder. We can do that with this block: <code>Bash</code> <syntaxhighlight lang="Bash"> ETARGET="$EPIOUT/${SAMPLEID_UNM}_vs_${SAMPLEID_MOD}" \ mkdir $ETARGET mv ${SAMPLEID_UNM}_vs_${SAMPLEID_MOD}_${CHR}.delta-sum_err.prediction.csv \ ${SAMPLEID_UNM}_vs_${SAMPLEID_MOD}_${CHR}.linear-regression.prediction.csv \ ${CHR}.${SAMPLEID_UNM}_vs_${SAMPLEID_MOD}_${CHR}.delta-sum_err.bar.pdf \ ${CHR}.${SAMPLEID_UNM}_vs_${SAMPLEID_MOD}_${CHR}.linear-regression.scatter.pdf $ETARGET </syntaxhighlight> We will now move forward and I will show you how to do the alternative and run single samples with ''EpiNano-SVM''. ====EpiNano-SVM (Single Sample)==== {{warning|'''EpiNano-SVM''' uses a pretrained <code>RRACH</code> model from ''Guppy 3.1.5''. Therefore, if you use this model, you will need to also have basecalled your data using ''Guppy 3.1.5''. If you are using a different basecaller algorithm, or ''Guppy'' of a different version, please do a pairwise comparison with EpiNano-Error as described above, insert m6A controls, or consider using ''m6ANet''.}} If we would like to extract the likelihood of m6A modifications using the pretrained models, we can do so by running the script <code>Epinano_Predict.py</code> and referencing our <code>RRACH</code> model. To do this, though, we must first divide our reads into kmers of 5 and isolate just the <code>RRACH</code> motifs. Isolating into 5 kmers: <code>Bash</code> <syntaxhighlight lang="Bash"> python $EPI/misc/Slide_Variants.py $NALI/${SAMPLEID}/${SAMPLEID}_${CHR}.plus_strand.per.site.csv 5 python $EPI/misc/Slide_Variants.py $NALI/${SAMPLEID}/${SAMPLEID}_${CHR}.minus_strand.per.site.csv 5 </syntaxhighlight> Filter for ''only'' <code>RRACH</code> <code>Bash</code> <syntaxhighlight lang="Bash"> cd $NALI/${SAMPLEID}/ awk -F"," 'NR==1; NR > 1{if($1=="AAACA" || $1=="AAACT" || $1=="GAACT" || $1=="GGACA" || $1=="AGACT" || $1=="GGACT" || \ $1=="AAACC" || $1=="AGACC" || $1=="GAACC" || $1=="GAACA" || $1=="AGACA" || $1=="GGACC") print}' \ ${SAMPLEID}_${CHR}.plus_strand.per.site.5mer.csv > ${SAMPLEID}_${CHR}_plus_RRACH.csv awk -F"," 'NR==1; NR > 1{if($1=="AAACA" || $1=="AAACT" || $1=="GAACT" || $1=="GGACA" || $1=="AGACT" || $1=="GGACT" || \ $1=="AAACC" || $1=="AGACC" || $1=="GAACC" || $1=="GAACA" || $1=="AGACA" || $1=="GGACC") print}' \ ${SAMPLEID}_${CHR}.minus_strand.per.site.5mer.csv > ${SAMPLEID}_${CHR}_minus_RRACH.csv </syntaxhighlight> Now that we have that, we can go ahead and run <code>Epinano_Predict.py</code>. Note that the code below is duplicated, because we will run it also on the data table that contains mRNA reads mapped to the reverse strand. <code>Bash</code> <syntaxhighlight lang="Bash"> python $EPI/Epinano_Predict.py \ --model $EPI/models/rrach.q3.mis3.del3.linear.dump \ --predict ${SAMPLEID}_${CHR}_plus_RRACH.csv \ --out_prefix ${SAMPLEID}_${CHR}_plus \ --columns 8,13,23 python $EPI/Epinano_Predict.py \ --model $EPI/models/rrach.q3.mis3.del3.linear.dump \ --predict ${SAMPLEID}_${CHR}_minus_RRACH.csv \ --out_prefix ${SAMPLEID}_${CHR}_minus \ --columns 8,13,23 </syntaxhighlight> So let’s break down what the above is doing. First, we are telling the console that we want to run the python script <code>Epinano_Predict.py</code> with the pre-trained model specified by <code>--model</code>, which will be used to detect m6A modifications. <code>--predict</code> specifies the table file that will be used for making predictions (i.e., your processed data), while <code>--out_prefix</code> specifies the prefix output file name. Finally, <code>--columns</code> specify the columns within the dataset that contain the features that will be used for the prediction of m6A sites. The final output will be in the same folder as you ran the command (in this case, in the <code>sorted_bam</code> subfolder), and will be saved as your specified <code>--out_prefix</code> + <code>.q3.mis3.del3.MODEL.rrach.q3.mis3.del3.linear.dump.csv</code> if you used the same model as I did. The next thing we want to do is import this data into ''R'' for data analysis and further visualization. <code>Bash</code> <syntaxhighlight lang="Bash"> # Load a few packages needed. suppressMessages (library(tidyverse)) suppressMessages (library(stringr)) # Specify the input file in the R working directory. You can check the working directory with getwd(). # Note: You could also specify the path to the output file location and point directly to it there too. plusstrandfile = paste(paste0(OUTPUT_PATH, SAMPLEID), CHR, "plus.q3.mis3.del3.MODEL.rrach.q3.mis3.del3.linear.dump.csv", sep = "_") minusstrandfile = paste(paste0(OUTPUT_PATH, SAMPLEID), CHR, "minus.q3.mis3.del3.MODEL.rrach.q3.mis3.del3.linear.dump.csv", sep = "_") # Read the csv input file and separate necessary columns. epi_res_plus <- read.csv (plusstrandfile, header=T)[,c("Window","Ref","Strand","ProbM","prediction")] epi_res_minus <- read.csv (minusstrandfile, header=T)[,c("Window","Ref","Strand","ProbM","prediction")] epi_res <- rbind(epi_res_plus, epi_res_minus) # Split the "Window" into two integers. split <- data.frame (str_split_fixed (epi_res$Window, '-', 2)) # Grab the split integer columns. split$first <- as.integer (as.character(split$X1)) + 2 split$middle <- as.integer (as.character(split$X2)) + 2 # Add the integer columns to the data frame. epi_res$Start <- split$first epi_res$End <- split$middle # Remove "Window" column after split. Left with "Ref" (chromosome), "Position" (start), "Strand", "ProbM", and "prediction". epi_res<- epi_res[,c("Ref", "Start", "End", "Strand", "ProbM", "prediction")] colnames(epi_res) <- c("Ref", "Start", "End", "Strand", "m6A_Prob", "m6A_Pred") # Assign the epinano output a name based on sample ID and chromosome. assign(paste(SAMPLEID, CHR, "epinano_result", sep = "_"), epi_res) </syntaxhighlight> This will import the data and isolate the columns for ''chromosome'', Start and End positions on the chromosome (with regard to GRCh38, our reference genome used), the ''Strand'', the ''probability of an m6A modification'', and the ''m6A EpiNano Prediction'': either listed as “mod” (for “modified”) or “unm” (for “unmodified”). =====Probing Genes===== Now we want to be able to visualize the data, preferably between two samples. This next code block can be executed directly in ''R'', all you have to do is change the first set of values to indicate which gene you want to probe and your samples you want to include. <code>R</code> <syntaxhighlight lang="R"> # Define your parameters. Note that your sample IDs are the same as above. GENE="ATF5" SAMPLE1_ID="FAR91556" # Usually WT SAMPLE2_ID="FAM95931" # Usually Mut # Code to run </syntaxhighlight> =====Plotting Results (ATF5)===== Here we can plot the results for the ATF5 gene. This will be modified in a future iteration. <code>R</code> <syntaxhighlight lang="R"> suppressMessages (library(reshape2)) suppressMessages (library(ggplot2)) suppressMessages (library(ggrepel)) suppressMessages (library(tidyverse)) suppressMessages (library(stringr)) # Isolate ATF5 WT122021_nano_ATF5 <- subset(FAR91556_chr19_epinano_result, Start >= 49928403) WT122021_nano_ATF5 <- subset(WT122021_nano_ATF5, End <= 49934438) IDHMUT_nano_ATF5 <- subset(FAM95931_chr19_epinano_result, Start >= 49928403) IDHMUT_nano_ATF5 <- subset(IDHMUT_nano_ATF5, End <= 49934438) #WT_mod <- subset(WT122021_nano_ATF5, m6A_Prob >= 0.05) #WT_unm <- subset(WT122021_nano_ATF5, m6A_Prob < 0.05) # Note: Some of these data were generated view the MotifFinder script. #49928702-49933935 # ATF5 Gene, Chr19:49929205-49933935 merge <- ggplot (WT122021_nano_ATF5) + ggtitle ("Merge") + # Underlying Strand geom_segment(aes(x=49928403, xend=49934438, y=0.125, yend=0.125), color="#383838", size=4) + # Exon 1 49,928,890 - 49,929,022 geom_segment(aes(x=49928890, xend=49929022, y=0.125, yend=0.125), color="#858585", size=6) + # Exon 2 49,929,223 - 49,929,421 geom_segment(aes(x=49929223, xend=49929421, y=0.125, yend=0.125), color="#858585", size=6) + # Exon 3 49,930,681 - 49,931,079 geom_segment(aes(x=49930681, xend=49931079, y=0.125, yend=0.125), color="#858585", size=6) + # Exon 4 49,932,271 - 49,934,086 geom_segment(aes(x=49932271, xend=49934086, y=0.125, yend=0.125), color="#858585", size=6) + # Location of RRACH sites in ATF5. This was ascertained using the Motif Finder function. #geom_linerange(data=ATF5_RRACH, aes(x=Hg38Position, ymin=0, ymax=1), color="#d9fbcb", size=0.4) + #geom_linerange(data=ATF5_MAZF, aes(x=Hg38Position, ymin=0, ymax=1), color="#476ad0", size=0.4) + # WT Dots geom_point(data=WT122021_nano_ATF5, aes(x=Start, y=m6A_Prob),shape = 21, colour = "black", fill = "#959595", size=2, stroke = 1.0) + # IDH Mut Dots geom_point(data=IDHMUT_nano_ATF5, aes(x=Start, y=m6A_Prob),shape = 21, colour = "black", fill = "#dd1818", size=2, stroke = 1.0) + # Instead of entering in the chromosome, paste it to the label from the data. xlab (paste0("Position on ", WT122021_nano_ATF5[1,1])) + ylab ("m6A Prob.") + theme_bw ()+ theme (axis.text.x = element_text(face="bold", color="black",size=11), axis.text.y = element_text(face="bold", color="black", size=11), plot.title = element_text(color="black", size=24, face="bold.italic", hjust = 0.5), axis.title.x = element_text(color="black", size=15, face="bold"), axis.title.y = element_text(color="black", size=15, face="bold"), panel.background = element_blank(), legend.position = "none", axis.line = element_line(colour = "black", size=0.5)) #geom_text_repel (data=mod, aes(x=Position, y=ProbM, label=Position), color="red", box.padding=3, point.padding=3, segment.size = 0.6, segment.color = "green") </syntaxhighlight> Final export on one graph: <code>R</code> <syntaxhighlight lang="R"> suppressMessages(library(gridExtra)) grid.arrange(wt, idh, merge, nrow = 2) </syntaxhighlight>
Summary:
Please note that all contributions to Neurobiology.Dev may be edited, altered, or removed by other contributors. If you do not want your writing to be edited mercilessly, then do not submit it here.
You are also promising us that you wrote this yourself, or copied it from a public domain or similar free resource (see
Neurobiology.Dev:Copyrights
for details).
Do not submit copyrighted work without permission!
Cancel
Editing help
(opens in new window)