We investigated the application of movAPA on a poly(A) site dataset of 14 tissues in Oryza sativa japonica from 3’end sequencing (Fu, et al., 2016). We used a subset of the rice data containing 1233 PACs in 455 genes from three tissues (embryo, anther, and mature pollen) for demonstration. The original dataset containing full list of PACs can be downloaded from plantAPAdb (Zhu et al, 2019). Here the poly(A) sites are already poly(A) site clusters (PACs) which were grouped from nearby cleavage sites.
movAPA implemented the PACdataset object for storing the expression levels and annotation of PACs from various conditions/samples. Almost all analyses of poly(A) site data in movAPA are based on the PACdataset. The “counts” matrix is the first element in the array list of PACdataset, which stores non-negative values representing expression levels of PACs. The “colData” matrix records the sample information and the “anno” matrix stores the genome annotation or additional information of the poly(A) site data.
The moveAPA package includes an example rice PAC data stored as a PACdataset object, which contains 1233 PACs from 455 genes. First load movAPA by library(movAPA) and then load the example data.
library(movAPA, warn.conflicts = FALSE, quietly=TRUE)
data("PACds")
PACds
#> PAC# 1233
#> gene# 455
#> nPAC
#> 3UTR 482
#> 5UTR 20
#> CDS 44
#> Ext_3UTR 391
#> intergenic 181
#> intron 115
#> sample# 9
#> anther1 anther2 anther3 embryo1 embryo2 ...
#> groups:
#> @colData...[9 x 1]
#> group
#> anther1 anther
#> anther2 anther
#> @counts...[1233 x 9]
#> anther1 anther2 anther3 embryo1 embryo2 embryo3
#> Os01g0151600:2792379 0 1 0 2 1 1
#> Os01g0151600:2795487 11 16 17 60 55 51
#> maturePollen1 maturePollen2 maturePollen3
#> Os01g0151600:2792379 0 0 0
#> Os01g0151600:2795487 24 3 10
#> @colData...[9 x 1]
#> group
#> anther1 anther
#> anther2 anther
#> @anno...[1233 x 10]
#> chr UPA_start UPA_end strand coord ftr gene
#> Os01g0151600:2792379 1 2792363 2792427 + 2792379 intron Os01g0151600
#> Os01g0151600:2795487 1 2795427 2795509 + 2795487 3UTR Os01g0151600
#> gene_type ftr_start ftr_end
#> Os01g0151600:2792379 protein_coding 2792174 2792920
#> Os01g0151600:2795487 protein_coding 2795347 2795857
summary(PACds)
#> PAC# 1233
#> sample# 9
#> summary of expression level of each PA
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 1 5 26 275 163 77248
#> summary of expressed sample# of each PA
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 1.000 3.000 6.000 5.637 8.000 9.000
#> gene# 455
#> nPAC
#> 3UTR 482
#> 5UTR 20
#> CDS 44
#> Ext_3UTR 391
#> intergenic 181
#> intron 115
# Transform the older version of PACdataset to newer version; the counts slot was converted from data.frame to matrix or dgCMatrix.
# PACds@counts=asAnyMatrix(PACds@counts)
The reference genome is not necessary, while it is required for removing internal priming or poly(A) signal analyses. movAPA uses reference genome sequences that are represented as a BSgenome object or stored in a fasta file. The BSgenome of rice for this example can be downloaded from the github website of movAPA. Please refer to the BSgenome package for making a BSgenome object if there is no corresponding BSgenome package for your species. Alternatively, the genome assembly can be stored in a fasta file, which can also be used as input for movAPA.
devtools::load_all("/media/bmi/My Passport/scPACext_HC_288cells/movAPA/movAPA/BSgenome.Oryza.ENSEMBL.IRGSP1")
library("BSgenome.Oryza.ENSEMBL.IRGSP1", quietly = TRUE)
bsgenome <- BSgenome.Oryza.ENSEMBL.IRGSP1
Genome annotation stored in a GFF/GTF file or a TXDB R object can be used for annotating PACs. The function parseGff or parseGenomeAnnotation is used to parse the given annotation and the processed annotation can be saved into an rdata object for further use. The GFF file or the processed rdata file of rice for this example can be downloaded from the github website of movAPA.
gffFile="Oryza_sativa.IRGSP-1.0.42.gff3"
gff=parseGff(gffFile)
save(gff, file='Oryza_sativa.IRGSP-1.0.42.gff.rda')
load('Oryza_sativa.IRGSP-1.0.42.gff.rda')
Internal priming (IP) artifacts can be removed by the removePACdsIP function. Here, PACs with six consecutive or more than six As within the -10 to +10 nt window are considered as internal priming artifacts. We scan the internal priming artifacts in PACds and get two PACdatasets recording internal priming PACs and real PACs. Since IP artifacts are already removed in the example PACds, we did not perform this step in this case study.
Note: removePACdsIP step should be performed in caution, because different parameter setting in removePACdsIP may result in very different number of internal priming artifacts.
PACdsIP=removePACdsIP(PACds, bsgenome, returnBoth=TRUE,
up=-10, dn=10, conA=6, sepA=7)
#> 345 IP PACs; 888 real PACs
length(PACdsIP$real)
#> [1] 888
length(PACdsIP$ip)
#> [1] 345
The function mergePACds can be used to group nearby cleavage sites into PACs. Here is an example to group nearby PACs within 100 bp into one PAC.
PACdsClust=mergePACds(PACds, d=100)
summary(PACds)
#> PAC# 1233
#> sample# 9
#> summary of expression level of each PA
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 1 5 26 275 163 77248
#> summary of expressed sample# of each PA
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 1.000 3.000 6.000 5.637 8.000 9.000
#> gene# 455
#> nPAC
#> 3UTR 482
#> 5UTR 20
#> CDS 44
#> Ext_3UTR 391
#> intergenic 181
#> intron 115
summary(PACdsClust)
#> PAC# 1132
#> sample# 9
#> summary of expression level of each PA
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 1.0 5.0 29.0 299.5 175.0 77248.0
#> summary of expressed sample# of each PA
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 1.000 3.000 6.000 5.691 8.000 9.000
The function mergePACds can also be used to merge multiple PACdatasets. Notably, the annotation columns (e.g., gene, ftr) are lost after merging, you need call annotatePAC to annotate the merged PACds.
In movAPA 0.2.0, a reference PACds can be used for merging PACdsList in a smarter way. Providing reference PACds for merging is useful when there are multiple large PAC lists to be merged, which can prevent generating PACs with a very wide range. If there is reference PACs from 3’seq, it is recommended to use it. Please see the help document of mergePACds for details.
## Constuct another demo PACdataset for merging.
PACds2=PACds
PACds2@anno$coord = PACds2@anno$coord + sample(-50:50, 1)
## You may also change the sample names and group names.
# rownames(PACds2@colData)=paste0(rownames(PACds2@colData),'v2')
# PACds2@colData$group=paste0(PACds2@colData$group,'v2')
# colnames(PACds2@counts)=paste0(colnames(PACds2@counts),'v2')
## Construct a list of PACds to be merged.
PACdsList=list(pac1=PACds, pac2=PACds2)
## Merge two PACdatasets, nearby PACs within 24bp of each other
## will be merged into one PAC.
pp=mergePACds(PACdsList, d=24)
#> mergePACds: there are 9 duplicated sample names in the PACdsList, will add .N to sample names of each PACds
#> mergePACds: total 2466 redundant PACs from 2 PACds to merge
#> mergePACds without refPACds: 2466 separate PACs reduce to 1233 PACs (d=24nt)
#> mergePACds: melted all counts tables, total 13900 triplet rows
#> mergePACds: link 2466 old PA IDs to 1233 new PA IDs by merge
#> mergePACds: convert 13900 triplets to dgCMatrix
#> mergePACds: construct Matrix[PA, sample], [1233, 18]
summary(pp)
#> PAC# 1233
#> sample# 18
#> summary of expression level of each PA
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 2 10 52 550 326 154496
#> summary of expressed sample# of each PA
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 2.00 6.00 12.00 11.27 16.00 18.00
The function normalizePACds can be called for normalization, which implements three strategies including TPM (Tags Per Million), the normalization method of DESeq (Anders and Huber, 2010), and the TMM method used in EdgeR (Robinson, et al., 2010).
Note: normalization should be performed in caution, because different methods would have significant and different impact on the data and downstream analysis!
## Here normalization method TMM (or EdgeR) is used,
## while you may also choose TPM or DESeq.
PACds=normalizePACds(PACds, method='TMM')
#> converting counts to integer mode
## Library sizes after normalization.
colSums(PACds@counts)
#> anther1 anther2 anther3 embryo1 embryo2
#> 20318 21529 21640 30468 31384
#> embryo3 maturePollen1 maturePollen2 maturePollen3
#> 30768 76027 62261 54242
Users can use annotatePAC to annotate a PACdataset with a GFF/GTF file or a TXDB R object. Here we parse the genome annotation file in GFF3 format and save the processed annotation into a rdata object for further use.
load('Oryza_sativa.IRGSP-1.0.42.gff.rda')
Here is an example to annotate PACds with the genome annotation. Because the demo data already contains the annotation, we removed the annotation columns before calling annotatePAC.
PACds1=PACds
PACds1@anno[,c('gene','ftr','gene_type','ftr_start','ftr_end')]=NULL
PACds1=annotatePAC(PACds1, gff)
We can output the annotated PACds and the sample information to text files.
writePACds(PACds1, file='rice_pac_data.txt',
colDataFile = 'rice_pac_data.coldata.txt')
Genes with or without annotated 3’UTR could be assigned an extended
3’UTR of a given length using the function ext3UTRPACds
,
which can improve the “recovery” of poly(A) sites falling within
authentic 3’UTRs.
Before extending, we can calculate the number of PACs falling into extended 3’UTRs of different lengths.
testExt3UTR(PACds1, seq(1000, 10000, by=1000))
#> ext3UTRLen addedPAnum
#> 1 1000 410
#> 2 2000 432
#> 3 3000 454
#> 4 4000 468
#> 5 5000 475
#> 6 6000 481
#> 7 7000 487
#> 8 8000 490
#> 9 9000 500
#> 10 10000 503
Here we extended 3’UTR length for 2000 bp. After extension, 70 PACs in intergenic region are now in extended 3’UTRs.
table(PACds1@anno$ftr)
#>
#> 3UTR 5UTR CDS intergenic intron
#> 482 18 44 572 117
PACds1=ext3UTRPACds(PACds1, ext3UTRlen=2000)
#> 432 PACs in extended 3UTR (ftr=intergenic >> ftr=3UTR)
#> Get 3UTR length (anno@toStop) for 3UTR/extended 3UTR PACs
table(PACds1@anno$ftr)
#>
#> 3UTR 5UTR CDS intergenic intron
#> 914 18 44 140 117
To make statistics of distributions of PACs for each sample, first we pooled replicates.
PACds1=subsetPACds(PACds, group='group', pool=TRUE)
head(PACds1@counts)
#> anther embryo maturePollen
#> Os01g0151600:2792379 1 3 0
#> Os01g0151600:2795487 33 116 65
#> Os01g0151600:2795636 51 60 11
#> Os01g0151600:2795858 17 45 3
#> Os01g0179300:4125553 6 13 0
#> Os01g0179300:4125845 3 1 0
Then we can make statistics of distribution of PACs using different PAT cutoffs. minPAT=5 means that only PACs with >=5 reads are used for statistics.
pstats=movStat(PACds1, minPAT=c(1, 5, 10, 20, 50, 60), ofilePrefix=NULL)
names(pstats)
#> [1] "pat1" "pat5" "pat10" "pat20" "pat50" "pat60"
pstats$pat10
#> nPAC nPAT nGene nAPAgene APAextent 3UTR_nPAT 5UTR_nPAT CDS_nPAT
#> anther 524 61855 340 135 0.3970588 33008 102 31
#> embryo 507 91051 307 150 0.4885993 66158 61 631
#> maturePollen 513 191317 332 122 0.3674699 47998 67 0
#> total 709 344223 388 200 0.5154639 147164 230 662
#> Ext_3UTR_nPAT intergenic_nPAT intron_nPAT 3UTR_nPAC 5UTR_nPAC
#> anther 25951 2235 528 274 5
#> embryo 17494 6090 617 288 2
#> maturePollen 138323 3793 1136 277 3
#> total 181768 12118 2281 359 5
#> CDS_nPAC Ext_3UTR_nPAC intergenic_nPAC intron_nPAC
#> anther 3 199 30 13
#> embryo 7 182 16 12
#> maturePollen 0 185 35 13
#> total 9 255 57 24
Statistical results can be visualized by barplots to show PAC#, PAT#, APA gene%, PAC%, PAT% across samples and genomic regions. Here we plot all statistical results with cutoffs 5 and 10, with each plot having two smaller plots corresponding to the two cutoffs.
plotPACdsStat(pstats, pdfFile='PACds_stat.pdf', minPAT=c(5,10))
Plot specific cutoffs and conditions.
plotPACdsStat(pstats, pdfFile='PACds_stat_anther_embryo.pdf',
minPAT=c(5,10), conds=c('anther1','embryo1'))
Plot the overall distributions using pooled samples (total) and two cutoffs.
plotPACdsStat(pstats, pdfFile='PACds_stat_total.pdf',
minPAT=c(5,10), conds=c('total'))
Plot the overall distributions using pooled samples (total) and one cutoff.
plotPACdsStat(pstats, pdfFile='PACds_stat_total_PAT10.pdf',
minPAT=c(10), conds=c('total'))
Plot figures to the current device.
plotPACdsStat(pstats, pdfFile=NULL, minPAT=c(5,10))
#> Plot Number of PACs
#> Plot Number of PATs
#> Plot APA extent
#> Plot PAC# distribution
#> Plot PAC% distribution
#> Plot PAT# distribution
#> Plot PAT% distribution
movAPA provides several functions, including annotateByPAS, faFromPACds, kcount, and plotATCGforFAfile, for sequence extraction and poly(A) signal identification.
Annotate PACs by corresponding signal of AATAAA located upstream 50 bp of the PAC.
PACdsPAS=annotateByPAS(PACds, bsgenome, grams='AATAAA',
from=-50, to=-1, label=NULL)
summary(PACdsPAS@anno$AATAAA_dist)
#> Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
#> 16.00 22.00 25.00 26.92 30.00 50.00 1132
Scan AATAAA’s 1nt variants.
PACdsPAS=annotateByPAS(PACds, bsgenome, grams='V1',
from=-50, to=-1, label=NULL)
table(PACdsPAS@anno$V1_gram)
#>
#> AAAAAA AACAAA AAGAAA AATAAA AATAAC AATAAG AATAAT AATACA AATAGA AATATA AATCAA
#> 91 24 50 74 15 31 31 25 26 55 26
#> AATGAA AATTAA ACTAAA AGTAAA ATTAAA CATAAA GATAAA TATAAA
#> 56 21 4 21 27 13 11 36
Scan custom k-grams.
PACdsPAS=annotateByPAS(PACds, bsgenome,
grams=c('AATAAA','ATTAAA','GATAAA','AAAA'),
from=-50, to=-1, label='GRAM')
table(PACdsPAS@anno$GRAM_gram)
#>
#> AAAA AATAAA ATTAAA GATAAA
#> 409 48 24 8
Scan patterns with priority: AATAAA >> ATTAAA >> remaining k-grams.
PACdsPAS=annotateByPAS(PACds, bsgenome,
grams=c('AATAAA','ATTAAA','GATAAA','AAAA'),
priority=c(1,2,3,3),
from=-50, to=-1, label='GRAM')
table(PACdsPAS@anno$GRAM_gram)
#>
#> AAAA AATAAA ATTAAA GATAAA
#> 337 101 44 7
Plot signal logos.
pas=PACdsPAS@anno$GRAM_gram[!is.na(PACdsPAS@anno$GRAM_gram)]
plotSeqLogo(pas)
Here we show another example to scan mouse signals in rice PACs. First, we get mouse signals and set the priority.
v=getVarGrams('mm')
priority=c(1,2,rep(3, length(v)-2))
Then scan upstream regions of PACs for mouse signals.
PACdsMM=annotateByPAS(PACds, bsgenome, grams=v,
priority=priority,
from=-50, to=-1, label='mm')
Prepare the data to plot PAS distributions.
library(magrittr)
#>
#> 载入程辑包:'magrittr'
#> The following object is masked from 'package:GenomicRanges':
#>
#> subtract
library(dplyr)
#>
#> 载入程辑包:'dplyr'
#> The following objects are masked from 'package:Biostrings':
#>
#> collapse, intersect, setdiff, setequal, union
#> The following object is masked from 'package:XVector':
#>
#> slice
#> The following objects are masked from 'package:GenomicRanges':
#>
#> intersect, setdiff, union
#> The following object is masked from 'package:GenomeInfoDb':
#>
#> intersect
#> The following objects are masked from 'package:IRanges':
#>
#> collapse, desc, intersect, setdiff, slice, union
#> The following objects are masked from 'package:S4Vectors':
#>
#> first, intersect, rename, setdiff, setequal, union
#> The following objects are masked from 'package:BiocGenerics':
#>
#> combine, intersect, setdiff, union
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
pas=as.data.frame(cbind(region=PACdsMM@anno$ftr, PAS=PACdsMM@anno$mm_gram))
pas$PAS[is.na(pas$PAS)]='NOPAS'
pas$PAS[pas$PAS %in% v[-c(1:2)]]='Variants'
n=pas %>% dplyr::group_by(region, PAS) %>% dplyr::summarise(nPAC=n())
#> `summarise()` has grouped output by 'region'. You can override using the
#> `.groups` argument.
n2=pas %>% dplyr::group_by(region) %>% dplyr::summarise(nTot=n())
n=merge(n, n2)
n$PAC=n$nPAC/n$nTot
n=n[n$PAS!='NOPAS', ]
n$PAS=factor(n$PAS, levels=rev(c('AATAAA', 'ATTAAA','Variants', 'NOPAS')))
n$region=factor(n$region,
levels=c('3UTR','Ext_3UTR', 'intergenic','intron','CDS','5UTR'))
Plot PAS distributions.
library(ggplot2)
ggplot(data=n, aes(x=region, y=PAC, fill=PAS)) +
geom_bar(stat="identity") +
ylab("PAC Fraction") + theme_bw()
The faFromPACds function provides various options to extract sequences of interest.
## Extract the sequence of PACs, from UPA_start to UPA_end.
faFromPACds(PACds, bsgenome, what='pac', fapre='pac')
## Extract upstream 300 bp ~ downstream 100 bp around PACs,
## where the position of PAC is 301.
faFromPACds(PACds, bsgenome, what='updn', fapre='updn',
up=-300, dn=100)
## Divide PACs into groups of genomic regions and then extract sequences for each group.
faFromPACds(PACds, bsgenome, what='updn', fapre='updn',
up=-100, dn=100, byGrp='ftr')
## Extract sequences for only 3UTR PACs.
faFromPACds(PACds, bsgenome, what='updn', fapre='updn',
up=-300, dn=100, byGrp=list(ftr='3UTR'))
## Extract sequences for only 3UTR PACs and separate sequences by strand.
faFromPACds(PACds, bsgenome, what='updn', fapre='updn',
up=-300, dn=100,
byGrp=list(ftr='3UTR', strand=c('+','-')))
## Extract sequences of genomic regions where PACs are located.
faFromPACds(PACds, bsgenome, what='region', fapre='region', byGrp='ftr')
Here we show some examples to extract sequences from different poly(A) signal regions.
## The suggested signal regions when species is 'chlamydomonas_reinhardtii'.
files=faFromPACds(PACds, bsgenome, what='updn', fapre='Chlamy.NUE',
up=-25, dn=-5, byGrp='ftr')
files=faFromPACds(PACds, bsgenome, what='updn', fapre='Chlamy.FUE',
up=-150, dn=-25, byGrp='ftr')
files=faFromPACds(PACds, bsgenome, what='updn', fapre='Chlamy.CE',
up=-5, dn=5, byGrp='ftr')
files=faFromPACds(PACds, bsgenome, what='updn', fapre='Chlamy.DE',
up=-5, dn=30, byGrp='ftr')
## The suggested signal regions when species is plant.
## In Arabidopsis or rice, signal regions are: FUE -200~-35, NUE -35~-10, CE -10~15.
files=faFromPACds(PACds, bsgenome, what='updn', fapre='plants.NUE',
up=-35, dn=-10, byGrp='ftr')
files=faFromPACds(PACds, bsgenome, what='updn', fapre='plants.FUE',
up=-200, dn=-35, byGrp='ftr')
files=faFromPACds(PACds, bsgenome, what='updn', fapre='plants.CE',
up=-10, dn=15, byGrp='ftr')
The function plotATCGforFAfile is for plotting single nucleotide profiles for given fasta file(s), which is particularly useful for discerning base compositions surrounding PACs.
First trim sequences surrounding PACs. Sequences surrounding PACs in different genomic regions are extracted into files. The PAC position is 301.
faFiles=faFromPACds(PACds, bsgenome, what='updn', fapre='updn',
up=-300, dn=100, byGrp='ftr')
#> 115 >>> updn.intron.fa
#> 482 >>> updn.3UTR.fa
#> 391 >>> updn.Ext_3UTR.fa
#> 44 >>> updn.CDS.fa
#> 181 >>> updn.intergenic.fa
#> 20 >>> updn.5UTR.fa
Then plot base compositions for specific sequence file(s).
faFiles=c("updn.3UTR.fa", "updn.Ext_3UTR.fa", "updn.intergenic.fa", "updn.intron.fa")
## Plot single nucleotide profiles using the extracted sequences and merge all plots into one.
plotATCGforFAfile(faFiles, ofreq=FALSE, opdf=FALSE,
refPos=301, mergePlots = TRUE)
We can also plot a single fasta file and specify a region.
plotATCGforFAfile (faFiles='updn.intron.fa',
ofreq=FALSE, opdf=FALSE, refPos=301)
plotATCGforFAfile (faFiles='updn.intron.fa',
ofreq=FALSE, opdf=FALSE, refPos=NULL,
filepre ='NUE', start=250, end=350)
Users can also generate these plots into a PDF file and save the calculated base compositions.
plotATCGforFAfile (faFiles, ofreq=TRUE, opdf=TRUE, refPos=301,
filepre='singleBasePlot', mergePlots = TRUE)
After extracting sequences, we can call the kcount function to obtain the number of occurrences or frequencies of k-grams from the whole sequences or a specified region of sequences. Particularly, specific k-grams (e.g., AAUAAA, AUUAAA) or a value of k (e.g., k=6 means all hexamers) can be set.
## Count top 10 hexamers (k=6) in the NUE region
## (normally from 265~295 if the PAC is at 301).
fafile='updn.3UTR.fa'
kcount(fafile=fafile, k=6, from=265, to=295, topn=10)
#> grams count perc
#> 1 AAAAAA 74 0.005904883
#> 274 ATATAT 38 0.003032237
#> 3073 GAAAAA 34 0.002713055
#> 2 AAAAAT 31 0.002473667
#> 257 ATAAAA 31 0.002473667
#> 5 AAAATA 30 0.002393872
#> 65 AATAAA 30 0.002393872
#> 1366 TTTTTT 28 0.002234280
#> 449 ATGAAA 27 0.002154485
#> 769 AGAAAA 27 0.002154485
## Count given hexamers.
kcount(fafile=fafile, grams=c('AATAAA','ATTAAA'),
from=265, to=295, sort=FALSE)
#> grams count perc
#> 1 AATAAA 30 0.7142857
#> 2 ATTAAA 12 0.2857143
## Count AATAAA and its 1nt variants in a given region.
kcount(fafile=fafile, grams='v1', from=265, to=295, sort=FALSE)
#> grams count perc
#> 1 AATAAA 30 0.092024540
#> 2 TATAAA 14 0.042944785
#> 3 CATAAA 8 0.024539877
#> 4 GATAAA 9 0.027607362
#> 5 ATTAAA 12 0.036809816
#> 6 ACTAAA 3 0.009202454
#> 7 AGTAAA 8 0.024539877
#> 8 AAAAAA 74 0.226993865
#> 9 AACAAA 13 0.039877301
#> 10 AAGAAA 21 0.064417178
#> 11 AATTAA 14 0.042944785
#> 12 AATCAA 11 0.033742331
#> 13 AATGAA 19 0.058282209
#> 14 AATATA 26 0.079754601
#> 15 AATACA 9 0.027607362
#> 16 AATAGA 8 0.024539877
#> 17 AATAAT 23 0.070552147
#> 18 AATAAC 7 0.021472393
#> 19 AATAAG 17 0.052147239
movAPA provides various metrics to measure the usages of PACs across samples, including three metrics for the quantification of the usage of each single poly(A) site by the movPAindex function and four metrics for the quantification of APA site usage of a gene by the movAPAindex function.
movPAindex provides three metrics for the quantification of each PAC in a gene, including “ratio”, “Shannon”, and “geo”. First you can merge replicates of the same sample and remove lowly expressed PACs before calculate the index.
p=subsetPACds(PACds, group='group', pool=TRUE, totPACtag=20)
Calculate the tissue-specificity. Q or H=0 means that the PAC is only expressed in one tissue. NA means the PAC is not expressed in the respective tissue.
paShan=movPAindex(p, method='shan')
#> Using count for Shannon.
#> Tissue-specific PAC's H_cutoff (mean-2*sd): 0.275623
#> Tissue-specific PAC's Q_cutoff (mean-2*sd): 0.2052354
#> Tissue-specific PAC# (H<H_cutoff): 35
#> Tissue-specific PAC# (Q<Q_cutoff): 25
#> Constitutive PAC's H_cutoff (mean+2*sd): 1.957418
#> Constitutive PAC's Q_cutoff (mean+2*sd): 3.42726
#> Constitutive PACs (H>H_cutoff): 0
#> Constitutive PACs (Q>Q_cutoff): 0
## Show some rows with low H value (which means high overall tissue-specificity).
head(paShan[paShan$H<0.2742785, ], n=2)
#> H Q_min Q_min_cond anther embryo
#> Os01g0266100:9088974 0.2006223 0.246426 embryo 5.200622 0.246426
#> Os01g0571300:21939527 0.0000000 0.000000 embryo NA 0.000000
#> maturePollen
#> Os01g0266100:9088974 NA
#> Os01g0571300:21939527 NA
Use the relative expression levels (ratio) to calculate tissue-specificity.
paShan2=movPAindex(p, method='shan', shan.ratio = TRUE)
#> Using ratio for Shannon.
#> Tissue-specific PAC's H_cutoff (mean-2*sd): 0.6462506
#> Tissue-specific PAC's Q_cutoff (mean-2*sd): 0.9463281
#> Tissue-specific PAC# (H<H_cutoff): 20
#> Tissue-specific PAC# (Q<Q_cutoff): 24
#> Constitutive PAC's H_cutoff (mean+2*sd): 2.053762
#> Constitutive PAC's Q_cutoff (mean+2*sd): 3.81471
#> Constitutive PACs (H>H_cutoff): 0
#> Constitutive PACs (Q>Q_cutoff): 0
head(paShan2, n=2)
#> H Q_min Q_min_cond anther embryo
#> ENSRNA049472915:32398829 0.7060639 0.9176406 embryo 4.950709 0.9176406
#> ENSRNA049472915:32398407 1.5828394 3.1107952 anther 3.110795 3.2821041
#> maturePollen
#> ENSRNA049472915:32398829 4.285457
#> ENSRNA049472915:32398407 3.116965
Cacluate the geo metric, which is only suitable for APA genes. NA means no PAC of the gene is expressed in the respective tissue. geo>0 means the PAC is used more than average usage of all PACs in the gene. geo~0 means similar usage; <0 means less usage.
paGeo=movPAindex(p, method='geo')
head(paGeo, n=2)
#> anther embryo maturePollen
#> ENSRNA049472915:32398829 -3.454947 -1.44546 -3.084963
#> ENSRNA049472915:32398407 3.454947 1.44546 3.084963
Cacluate the ratio metric, which is only suitable for APA genes. NA means no PAC of the gene is expressed in the respective tissue.
paRatio=movPAindex(p, method='ratio')
head(paRatio)
#> anther embryo maturePollen
#> ENSRNA049472915:32398829 0.007231405 0.1183852 0.01146789
#> ENSRNA049472915:32398407 0.992768595 0.8816148 0.98853211
#> Os01g0151600:2795487 0.326732673 0.5248869 0.82278481
#> Os01g0151600:2795636 0.504950495 0.2714932 0.13924051
#> Os01g0151600:2795858 0.168316832 0.2036199 0.03797468
#> Os01g0179300:4126216 0.927272727 0.8699634 0.99152542
Plot a heatmap to show the distribution of tissue-specificity of PACs. It is only reasonable to plot the heatmap of the Shanno metric. Or you may filter the proximal or distal PAC of the gene first and plot the ratio or geo metrics.
First, remove rows with NA and then plot the heatmap.
paShanHm=paShan[, -(1:3)]
paShanHm=paShanHm[rowSums(is.na(paShanHm))==0, ]
library(ComplexHeatmap, quietly = TRUE)
Heatmap(paShanHm, show_row_names=FALSE, cluster_columns = FALSE,
heatmap_legend_param = list(title = 'Tissue\nspecificity'))
Calculate the tissue-specificity for each replicate.
paShan=movPAindex(PACds, method='shan')
#> Using count for Shannon.
#> Tissue-specific PAC's H_cutoff (mean-2*sd): 0.269235
#> Tissue-specific PAC's Q_cutoff (mean-2*sd): 0.3825375
#> Tissue-specific PAC# (H<H_cutoff): 91
#> Tissue-specific PAC# (Q<Q_cutoff): 91
#> Constitutive PAC's H_cutoff (mean+2*sd): 3.69439
#> Constitutive PAC's Q_cutoff (mean+2*sd): 6.527702
#> Constitutive PACs (H>H_cutoff): 0
#> Constitutive PACs (Q>Q_cutoff): 0
## Plot heamap to show the consistency among replicates.
paShanHm=paShan[, -(1:3)]
paShanHm=paShanHm[rowSums(is.na(paShanHm))==0, ]
Heatmap(paShanHm, show_row_names=FALSE, cluster_columns = TRUE,
heatmap_legend_param = list(title = 'Tissue\nspecificity'))
data## Quantification of APA by movAPAindex The movAPAindex function provides four gene-level metrics for the quantification of APA site usage, including RUD (Relative Usage of Distal PAC) (Ji, et al., 2009), WUL (Weighted 3’ UTR Length) (Ulitsky, et al., 2012; Fu, et al., 2016), SLR (Short to Long Ratio) (Begik, et al., 2017), and GPI (Geometric Proximal Index) (Shulman and Elkon, 2019).
Get APA index using the smart RUD method (available in movAPA v2.0).
pd=get3UTRAPApd(pacds=p, minDist=50, maxDist=1000, minRatio=0.05, fixDistal=FALSE, addCols='pd')
rud=movAPAindex(pd, method="smartRUD", sRUD.oweight=TRUE)
head(rud$rud)
head(rud$weight)
geneRUD=rud$rud
geneRUD=geneRUD[rowSums(is.na(geneRUD))==0, ]
head(geneRUD, n=2)
Heatmap(geneRUD, show_row_names=FALSE, cluster_columns = F,
heatmap_legend_param = list(title = 'RUD'))
Get APA index using the WUL method.
geneWUL=movAPAindex(p, method="WUL", choose2PA=NULL)
head(geneWUL, n=2)
#> anther embryo maturePollen
#> Os01g0151600 231.4643 191.7955 162.5658
#> Os01g0254900 265.4275 286.5083 233.5099
Plot gene’s metric values across samples by heatmap with the ComplexHeatmap package.
## Remove NA rows before plotting heatmap.
geneWUL=geneWUL[rowSums(is.na(geneWUL))==0, ]
Heatmap(geneWUL, show_row_names=FALSE)
Get APA index using the RUD method.
geneRUD=movAPAindex(p, method="RUD",
choose2PA=NULL, RUD.includeNon3UTR=TRUE)
geneRUD=geneRUD[rowSums(is.na(geneRUD))==0, ]
head(geneRUD, n=2)
Heatmap(geneRUD, show_row_names=FALSE, cluster_columns = F,
heatmap_legend_param = list(title = 'RUD'))
Get APA index by method=SLR, using the proximal and distal PACs.
geneSLR=movAPAindex(p, method="SLR", choose2PA='PD')
head(geneSLR, n=2)
geneSLR=geneSLR[rowSums(is.na(geneSLR))==0, ]
Heatmap(geneSLR, show_row_names=FALSE)
Get APA index by method=GPI, using the proximal and distal PACs.
geneGPI=movAPAindex(PACds, method="GPI", choose2PA='PD')
head(geneGPI)
#> anther1 anther2 anther3 embryo1 embryo2
#> Os01g0151600 -0.4587689 -0.3107442 -0.1400540 0.3912043 0.6609640
#> Os01g0254900 0.2721603 0.3175016 0.3064884 -0.1752486 -0.1705185
#> Os01g0261200 -1.0253130 -0.9437626 -0.6497801 -0.7075187 -0.1315172
#> Os01g0263600 -0.5000000 NaN -0.5000000 -1.2924813 -0.2924813
#> Os01g0314000 1.1609640 0.7924813 1.0000000 0.5000000 1.0000000
#> Os01g0524500 -0.3140156 -0.5000000 -0.2237295 NaN NaN
#> embryo3 maturePollen1 maturePollen2 maturePollen3
#> Os01g0151600 3.552467e-01 2.014874 0.5000000 0.35021986
#> Os01g0254900 -9.965440e-02 1.347822 1.2221596 0.35975231
#> Os01g0261200 -3.203427e-16 -2.043731 -0.5351947 -1.00000000
#> Os01g0263600 -1.292481e+00 NaN NaN NaN
#> Os01g0314000 2.924813e-01 NaN NaN NaN
#> Os01g0524500 0.000000e+00 1.100817 0.2311716 0.03700029
geneGPI=geneGPI[rowSums(is.na(geneGPI))==0, ]
Heatmap(geneGPI, show_row_names=FALSE, cluster_columns = TRUE,
heatmap_legend_param = list(title = 'GPI'))
3’ seq data have been demonstrated informative in quantifying expression levels of genes by summing up 3’ seq reads of all PACs in a gene (Lianoglou, et al., 2013). To detect DE genes between samples with 3’ seq, we implemented the function movDEgene with the widely used R package DESeq2.
Note: DE detection should be performed in caution, because different methods would have significant and different impact on the DE results!
First we show an example of detecting DE genes for two conditions.
library(DESeq2)
## Subset two conditions first.
pacds=subsetPACds(PACds, group='group', cond1='anther', cond2='embryo')
## Detect DE genes using DESeq2 method,
## only genes with total read counts in all samples >=50 are used.
DEgene=movDEGene(PACds=pacds, method='DESeq2', group='group', minSumPAT=50)
Make statistics of the DE gene results; genes with padj<0.05 & log2FC>=0.5 are considered as DE genes.
stat=movStat(object=DEgene, padjThd=0.05, valueThd=0.5)
stat$nsig
#> sig.num
#> anther.embryo 219
head(stat$siglist$anther.embryo)
#> [1] "ENSRNA049472915" "Os01g0151600" "Os01g0179300" "Os01g0210600"
#> [5] "Os01g0247600" "Os01g0254900"
We can also detect DE genes among more than two conditions.
DEgene=movDEGene(PACds=PACds, method='DESeq2', group='group', minSumPAT=50)
stat=movStat(object=DEgene, padjThd=0.05, valueThd=1)
## Number of DE genes in each pair of conditions.
stat$nsig
#> sig.num
#> anther.embryo 150
#> anther.maturePollen 77
#> embryo.maturePollen 192
## Overlap between condition pairs.
stat$ovp
#> pair n1.sig.num n2.sig.num novp.sig.num
#> 1 anther.embryo-anther.maturePollen 150 77 47
#> 2 anther.embryo-embryo.maturePollen 150 192 122
#> 3 anther.maturePollen-embryo.maturePollen 77 192 62
Output movStat results into files: “DEgene.plots.pdf” and ‘DEgene.stat’. Several heatmaps are generated.
outputHeatStat(heatStats=stat, ostatfile='DEgene.stat', plotPre='DEgene')
You can further call movSelect() to select DE gene results with more information. Select DE gene results with full information including the read counts in each sample.
selFull=movSelect(DEgene, condpair='embryo.anther', padjThd=0.05, valueThd=1,
out='full', PACds=PACds)
#> Warning: condpair is flip of movRes@conds, so movRes@pairwise$value*(-1)
head(selFull)
#> gene anther1 anther2 anther3 embryo1 embryo2 embryo3 maturePollen1
#> 1 Os01g0179300 38 42 39 195 195 170 58
#> 2 Os01g0210600 59 65 52 376 359 295 22
#> 3 Os01g0224200 1 1 0 15 13 16 0
#> 4 Os01g0238500 8 11 12 0 0 2 62
#> 5 Os01g0247600 106 132 123 20 15 12 6
#> 6 Os01g0524500 26 25 24 0 0 2 138
#> maturePollen2 maturePollen3 padj value
#> 1 31 29 3.594718e-10 2.234465
#> 2 60 82 5.726760e-08 2.548997
#> 3 5 1 4.048010e-03 4.459420
#> 4 17 111 6.343326e-03 -3.954189
#> 5 63 7 1.080655e-03 -2.941261
#> 6 105 76 1.104847e-05 -5.228812
Select DE gene results with only padj and value. Here value is log2(anther/embryo).
sel=movSelect(DEgene, condpair='anther.embryo',
padjThd=0.05, valueThd=1, out='pv')
head(sel)
#> padj value
#> Os01g0179300 3.594718e-10 -2.234465
#> Os01g0210600 5.726760e-08 -2.548997
#> Os01g0224200 4.048010e-03 -4.459420
#> Os01g0238500 6.343326e-03 3.954189
#> Os01g0247600 1.080655e-03 2.941261
#> Os01g0524500 1.104847e-05 5.228812
Output gene names of DE genes.
sel=movSelect(DEgene, condpair='embryo.anther',
padjThd=0.05, upThd=0.5, out='gene')
#> Warning: condpair is flip of movRes@conds, so movRes@pairwise$value*(-1)
head(sel)
#> [1] "ENSRNA049472915" "Os01g0179300" "Os01g0210600" "Os01g0224200"
#> [5] "Os01g0571300" "Os01g0586600"
movAPA provides the function movDEPAC to identify DE PACs between samples. Three strategies were utilized: (i) using DESeq2 with replicates; (ii) using DEXseq with replicates; (iii) using chi-squared test without replicates (“chisq”).
First we show an example of detecting DE PACs among all pairwise conditions using three different methods. Only PACs with total read counts in all samples >=20 are used.
DEPAC=movDEPAC(PACds, method='DESeq2', group='group', minSumPAT=20)
DEXPAC=movDEPAC(PACds, method='DEXseq', group='group', minSumPAT=20)
DEqPAC=movDEPAC(PACds, method='chisq', group='group', minSumPAT=20)
Number of DE PACs among methods.
library(ggplot2)
## Get significant DE results.
stat1=movStat(object=DEPAC, padjThd=0.05, valueThd=1)
stat2=movStat(object=DEXPAC, padjThd=0.05, valueThd=1)
stat3=movStat(object=DEqPAC, padjThd=0.05, valueThd=0.95)
## Count the number of DE PACs by different methods.
nsig=as.data.frame(cbind(stat1$nsig, stat2$nsig, stat3$nsig))
colnames(nsig)=c('DESeq2','DEXseq','Chisq.test')
nsig$tissueA.tissueB=rownames(nsig)
nsig
#> DESeq2 DEXseq Chisq.test tissueA.tissueB
#> anther.embryo 242 146 58 anther.embryo
#> anther.maturePollen 114 121 41 anther.maturePollen
#> embryo.maturePollen 313 202 93 embryo.maturePollen
## Plot a barplot.
nsig=reshape2::melt(nsig, variable.name='Method')
#> Using tissueA.tissueB as id variables
ggplot(data=nsig, aes(x=tissueA.tissueB, y=value, fill=Method)) +
geom_bar(stat="identity", position=position_dodge()) +
ylab("DE PAC#") + theme_bw()
We can also detect DE PACs between two given conditions.
## First subset PACs in two conditions.
PACds1=subsetPACds(PACds, group='group',
cond1='anther', cond2='embryo', choosePA='apa')
## Detect DE PACs.
DEPAC1=movDEPAC(PACds1, method='DESeq2', group='group', minSumPAT=10)
DEXPAC1=movDEPAC(PACds1, method='DEXseq', group='group', minSumPAT=10)
DEqPAC1=movDEPAC(PACds1, method='chisq', group='group', minSumPAT=10)
Make statistics of the DE PACs result by DESeq2 method (DEPAC).
stat=movStat(object=DEPAC, padjThd=0.05, valueThd=1)
## Number of DE PACs between conditions.
stat$nsig
#> sig.num
#> anther.embryo 242
#> anther.maturePollen 114
#> embryo.maturePollen 313
## Overlap of DE PACs between different pairs of conditions.
head(stat$ovp)
#> pair n1.sig.num n2.sig.num novp.sig.num
#> 1 anther.embryo-anther.maturePollen 242 114 68
#> 2 anther.embryo-embryo.maturePollen 242 313 199
#> 3 anther.maturePollen-embryo.maturePollen 114 313 90
## DE PAC list
head(stat$siglist[[1]])
#> [1] "Os01g0151600:2795487" "Os01g0179300:4126216" "Os01g0179300:4126779"
#> [4] "Os01g0238500:7668102" "Os01g0247600:8130944" "Os01g0247600:8131074"
We can also plot a venn diagram to show the overlap among results from different pairwise comparisons.
library(VennDiagram, quietly = TRUE)
x=venn.diagram(stat$siglist, fill=brewer.pal(3, "Set1"), cex=2,
cat.fontface=4, filename='DEPAC.venn')
Stat the DE PAC result from the chisq-test method, here the value column of DEqPAC is 1-pvalue_of_the_gene. So using padjThd=0.05 and valueThd=0.95 means filtering DE PACs with adjusted pvalue of PAC <0.05 and adjusted pvalue of gene <0.05.
stat=movStat(object=DEqPAC, padjThd=0.05, valueThd=0.95)
We can use movSelect to output full or simple list of DE PACs.
## Here method is DEXseq, so the valueThd (log2FC) threshold is automatelly determined.
sel=movSelect(aMovRes=DEXPAC, condpair='embryo.anther',
padjThd=0.1, out='full', PACds=PACds)
#> Warning: condpair is flip of movRes@conds, so movRes@pairwise$value*(-1)
#> Warning: movRes is DEXPAC, but valueThd/upThd/dnThd are all NULL, mannually set valueThd=min(maxfc)= 0.002953872
#> Warning: movRes is DEXPAC, also filter by rowMax(movRes@pairwise$value)
head(sel, n=2)
#> PA chr UPA_start UPA_end strand coord ftr
#> 1 ENSRNA049444301:25040070 12 25040068 25040071 - 25040070 intergenic
#> 2 ENSRNA049472915:32398407 3 32398154 32398573 - 32398407 Ext_3UTR
#> gene gene_type ftr_start ftr_end anther1 anther2 anther3 embryo1
#> 1 ENSRNA049444301 tRNA 25043356 25032325 0 0 0 0
#> 2 ENSRNA049472915 snoRNA 32398830 32398830 285 324 352 510
#> embryo2 embryo3 maturePollen1 maturePollen2 maturePollen3 padj
#> 1 0 0 24 3 0 6.971614e-02
#> 2 558 548 130 191 110 1.081613e-13
#> value
#> 1 -0.7671602
#> 2 0.8623713
## You can also mannually set a log2FC threshold.
sel=movSelect(aMovRes=DEXPAC, condpair='embryo.anther',
padjThd=0.1, valueThd=2, out='pa');
#> Warning: condpair is flip of movRes@conds, so movRes@pairwise$value*(-1)
#> Warning: movRes is DEXPAC, also filter by rowMax(movRes@pairwise$value)
head(sel)
#> [1] "Os01g0263600:8948833" "Os01g0327400:12630082" "Os01g0327400:12630218"
#> [4] "Os01g0812200:34534443" "Os01g0841000:36096864" "Os01g0881300:38257576"
## Filter only up-regulated PACs in embryo
## (value=log2(embryo_this_others/anther_this_others)).
sel=movSelect(aMovRes=DEXPAC, condpair='embryo.anther',
padjThd=0.1, upThd=2, out='full', PACds=PACds)
#> Warning: condpair is flip of movRes@conds, so movRes@pairwise$value*(-1)
#> Warning: movRes is DEXPAC, also filter by rowMax(movRes@pairwise$value)
head(sel, 2)
#> PA chr UPA_start UPA_end strand coord ftr
#> 1 ENSRNA049472915:32398407 3 32398154 32398573 - 32398407 Ext_3UTR
#> 2 Os01g0327400:12630082 1 12629955 12630117 + 12630082 intergenic
#> gene gene_type ftr_start ftr_end anther1 anther2 anther3
#> 1 ENSRNA049472915 snoRNA 32398830 32398830 285 324 352
#> 2 Os01g0327400 protein_coding 12582256 12646586 1 1 1
#> embryo1 embryo2 embryo3 maturePollen1 maturePollen2 maturePollen3
#> 1 510 558 548 130 191 110
#> 2 0 0 0 0 0 0
#> padj value
#> 1 1.081613e-13 0.8623713
#> 2 3.171817e-02 11.5717831
Here we take the DEPAC result for example to show the visualization of DE PACs in a gene.
## Filter less results and plot the heatmap clearly.
stat=movStat(object=DEPAC, padjThd=0.001, valueThd=8)
outputHeatStat(heatStats=stat, ostatfile='DEPAC.stat', plotPre='DEPAC',
show_rownames = TRUE)
Visualize DE PACs in an example gene by movViz. First, we examine all PACs in this gene. There are three PACs, two in 3’UTR and one in extended 3’UTR. But the expression level of the PAC in extended 3UTR is only 3.
gene='Os05g0541900'
gp=PACds[PACds@anno$gene==gene, ]
cbind(gp@anno$ftr, rowSums(gp@counts))
#> [,1] [,2]
#> Os05g0541900:26902813 "3UTR" "79"
#> Os05g0541900:26902492 "3UTR" "40"
#> Os05g0541900:26902388 "3UTR" "8"
#> Os05g0541900:26902140 "3UTR" "340"
#> Os05g0541900:26901800 "Ext_3UTR" "3"
#> Os05g0541900:26900274 "intergenic" "2"
Visualize PACs of this gene in individual conditions. Here the Y-axis is read count, the scale of which is different among conditions.DE PACs identified by DESeq2 method with padj < padjThd are highlighted in dashed yellow lines.
movViz(object=DEPAC, gene=gene, txdb=gff, PACds=PACds, collapseConds=FALSE,
padjThd=0.01, showRatio=FALSE, showAllPA=TRUE)
We can also show condition pairs in individual tracks and only display and/or highlight given condition pairs. If padjThd is given, then the DE PACs (padj < padjThd) will be highlighted (dashed yellow line).
movViz(object=DEPAC, gene=gene, txdb=gff, PACds=PACds, collapseConds=TRUE,
padjThd=0.01, showPV=TRUE, showAllPA=FALSE, showRatio=F,
conds=DEPAC@conds[c(1,3), ], highlightConds=DEPAC@conds[c(3), ])
APA dynamics (i.e., APA site switching or 3’UTR lengthening/shortening) of a gene can be deduced by comparing the ratios of expression levels of one poly(A) site (e.g., the short isoform) over the other poly(A) site (e.g., the long isoform) between two biological samples. For unity, here we refer 3’UTR lengthening/shortening to 3’UTR switching, and refer APA dynamics involving a pair of PACs to APA site switching. Function movUTRtrend is used to identify 3’UTR switching events between samples. We developed three methods in movUTRtrend for detecting 3’UTR switching events from samples with or without replicates: (i) the strategy based on the chi-squared test for trend in proportions (“linearTrend”); (ii) the strategy based on DE PACs from DESeq2 (“DE”); (iii) the strategy based on DE PACs from DEXSeq (“DEX”).
First, we used the ‘linearTrend’ method to detect 3’UTR switching events. Only PACs and genes with average read count between the two conditions >=10 and >=20 are used.
utr=movUTRtrend(PACds, group='group', method='linearTrend',
avgPACtag=10, avgGeneTag=20)
#> anther.embryo
#> anther.maturePollen
#> embryo.maturePollen
## Number of genes for analyzing, including those not significant.
lapply(utr@fullList, nrow)
#> $anther.embryo
#> [1] 44
#>
#> $anther.maturePollen
#> [1] 31
#>
#> $embryo.maturePollen
#> [1] 47
head(utr@fullList[["anther.embryo"]], n=2)
#> gene nPAC geneTag1 geneTag2 avgUTRlen1 avgUTRlen2
#> Os01g0151600 Os01g0151600 2 28 59 231.4643 191.5085
#> Os01g0254900 Os01g0254900 2 221 101 265.5520 286.9604
#> pvalue padj change cor logRatio
#> Os01g0151600 0.018098744 0.4886661 -1 -0.2534036 -1.072588
#> Os01g0254900 0.008878056 0.2840978 1 0.1458238 1.128916
#> PAs1
#> Os01g0151600 Os01g0151600:2795487=11;Os01g0151600:2795636=17
#> Os01g0254900 Os01g0254900:8475658=133;Os01g0254900:8475521=88
#> PAs2
#> Os01g0151600 Os01g0151600:2795487=39;Os01g0151600:2795636=20
#> Os01g0254900 Os01g0254900:8475658=45;Os01g0254900:8475521=56
Make statistics of the results; genes with padj<0.1 and abs(cor)>0 are considered as 3’UTR switching.
stat=movStat(object=utr, padjThd=0.1, valueThd=0)
#> All cond pairs in heat@colData, get de01 and deNum
stat$nsig
#> sig.num
#> anther.embryo 9
#> anther.maturePollen 4
#> embryo.maturePollen 24
Output 3’UTR switching results for a pair of conditions.
## Only output gene ids.
out=movSelect(aMovRes=utr, condpair='anther.embryo',
padjThd=0.1, valueThd=0, out='gene')
## Output PAC ids.
out=movSelect(aMovRes=utr, condpair='anther.maturePollen',
padjThd=0.1, valueThd=0, out='pa')
## Output gene ids with padj and value.
out=movSelect(aMovRes=utr, condpair='anther.embryo',
padjThd=0.1, valueThd=0, out='pv')
## Output full information with expression levels, 3UTR length,
## read counts of each PA in each sample, etc.
out=movSelect(aMovRes=utr, condpair='anther.embryo',
padjThd=0.1, valueThd=0, out='full')
## Output full information for 3UTR lengthening genes from anther to embryo (change=1).
out=movSelect(aMovRes=utr, condpair='anther.embryo',
padjThd=0.1, upThd=0, out='full')
## Output full information for 3UTR shortening genes from anther to embryo (change=-1).
out=movSelect(aMovRes=utr, condpair='anther.embryo',
padjThd=0.1, dnThd=0, out='full')
head(out, n=2)
#> gene nPAC geneTag1 geneTag2 avgUTRlen1 avgUTRlen2
#> Os02g0759700 Os02g0759700 2 77 54 539.7662 323.3704
#> Os05g0438800 Os05g0438800 2 93 197 348.7204 294.7005
#> pvalue padj change cor logRatio
#> Os02g0759700 2.499868e-09 1.049944e-07 -1 -0.5208556 0.5111023
#> Os05g0438800 6.160952e-06 2.464381e-04 -1 -0.2654699 -1.0820747
#> PAs1
#> Os02g0759700 Os02g0759700:31988970=10;Os02g0759700:31989403=67
#> Os05g0438800 Os05g0438800:21501003=4;Os05g0438800:21500764=89
#> PAs2
#> Os02g0759700 Os02g0759700:31988970=34;Os02g0759700:31989403=20
#> Os05g0438800 Os05g0438800:21501003=53;Os05g0438800:21500764=144
Here is another example of using DEX method to detect 3’UTR switching events. First get DE PAC results by DEXseq and then get 3’UTR switching events.
DEXPAC=movDEPAC(PACds, method='DEXseq', group='group', minSumPAT=10)
swDEX=movUTRtrend(PACds, group='group', method='DEX',
avgPACtag=10, avgGeneTag=20,
aMovDEPACRes=DEXPAC, DEPAC.padjThd=0.01,
mindist=50, fisherThd=0.01, logFCThd=1, selectOne='farest')
Get 3’UTR switching genes with padj<0.1 and |log2FC|>=1.
stat=movStat(object=swDEX, padjThd=0.01, valueThd=1)
#> All cond pairs in heat@colData, get de01 and deNum
stat$nsig
#> sig.num
#> anther.embryo 6
#> anther.maturePollen 1
#> embryo.maturePollen 15
out=movSelect(aMovRes=swDEX, condpair='anther.embryo',
padjThd=0.01, valueThd=1, out='full')
head(out, n=2)
#> gene nPAC geneTag1 geneTag2 avgUTRlen1 avgUTRlen2 fisherPV
#> 1 Os02g0759700 2 77 54 539.7662 323.3704 4.912714e-09
#> 2 Os05g0438800 2 93 197 348.7204 294.7005 3.852680e-06
#> logFC change PA1 PA2 dist nDEPA
#> 1 -3.364997 -1 Os02g0759700:31988970 Os02g0759700:31989403 434 2
#> 2 -2.744903 -1 Os05g0438800:21501003 Os05g0438800:21500764 240 1
#> nSwitchPair PAs1
#> 1 1 Os02g0759700:31988970=10;Os02g0759700:31989403=67
#> 2 1 Os05g0438800:21501003=4;Os05g0438800:21500764=89
#> PAs2
#> 1 Os02g0759700:31988970=34;Os02g0759700:31989403=20
#> 2 Os05g0438800:21501003=53;Os05g0438800:21500764=144
Here we used three methods to call 3’UTR switching and then compared the results from these methods.
swLinear=movUTRtrend(PACds, group='group',method='linearTrend',
avgPACtag=10, avgGeneTag=20)
swDEX=movUTRtrend(PACds, group='group', method='DEX',
avgPACtag=10, avgGeneTag=20,
aMovDEPACRes=DEXPAC, DEPAC.padjThd=0.01,
mindist=50, fisherThd=0.01, logFCThd=1, selectOne='fisherPV')
swDE=movUTRtrend(PACds, group='group', method='DE',
avgPACtag=10, avgGeneTag=20,
aMovDEPACRes=DEPAC, DEPAC.padjThd=0.01,
mindist=50, fisherThd=0.01, logFCThd=1, selectOne='fisherPV')
Get significant 3’UTR switching events.
stat1=movStat(object=swLinear, padjThd=0.1, valueThd=0)
stat2=movStat(object=swDEX, padjThd=0.01, valueThd=1)
stat3=movStat(object=swDE, padjThd=0.01, valueThd=1)
Count number of 3’UTR switching events by different methods
nsig=as.data.frame(cbind(stat1$nsig, stat2$nsig, stat3$nsig))
colnames(nsig)=c('LinearTrend','DE-DEXseq','DE-DESeq')
nsig$tissueA.tissueB=rownames(nsig)
nsig
#> LinearTrend DE-DEXseq DE-DESeq tissueA.tissueB
#> anther.embryo 9 6 9 anther.embryo
#> anther.maturePollen 4 1 6 anther.maturePollen
#> embryo.maturePollen 24 15 19 embryo.maturePollen
nsig=reshape2::melt(nsig, variable.name='Method')
#> Using tissueA.tissueB as id variables
ggplot(data=nsig, aes(x=tissueA.tissueB, y=value, fill=Method)) +
geom_bar(stat="identity", position=position_dodge()) +
ylab("3\'UTR switching events #") + theme_bw()
Gene Os02g0759700 is identified as 3’UTR switching. This gene has one PAC in CDS and two PACs in 3UTR; the 3UTR switching happens between anther~embryo and between embryo~maturePollen.
gene='Os02g0759700'
gp=PACds[PACds@anno$gene==gene, ]
cbind(gp@anno$ftr, rowSums(gp@counts))
#> [,1] [,2]
#> Os02g0759700:31988483 "CDS" "5"
#> Os02g0759700:31988970 "3UTR" "204"
#> Os02g0759700:31989403 "3UTR" "649"
#> Os02g0759700:31990233 "intergenic" "4"
Plot all PACs of this gene in all conditions and replicates. Highlight PACs involving in the switching analysis in orange.
movViz(object=swDE, gene=gene, txdb=NULL, PACds=PACds, showRatio=TRUE,
padjThd=0.01, showAllPA=TRUE)
Show in each track a condition pair and use line to link PACs to show the trend. There is 3’UTR switching between anther and maturePollen, and embryo and maturePollen. Highlight specific condition pair with blue background and only show PACs involving the switching analysis with a dashed line in orange.
movViz(object=swDE, gene=gene, txdb=gff, PACds=PACds, collapseConds=TRUE,
conds=swDE@conds, highlightConds=swDE@conds[c(1,3), ], showRatio=TRUE,
linkPAs=TRUE, padjThd=0.01, showAllPA=FALSE, showPV=TRUE)
Show only the condition pair anther~embryo and only PACs involving the 3UTR switching. Do not show gene model but only the genomic region of PACs, and show all PACs but hightlight the switching PACs in dashed yellow line. Show only switching PACs.
movViz(object=swDE, gene=gene, txdb=NULL, PACds=PACds, collapseConds=TRUE,
conds=swDE@conds[1, ], highlightConds=NULL, showRatio=TRUE, linkPAs=TRUE,
padjThd=0.01, showAllPA=FALSE, showPV=FALSE)
This example shows using heatmaps for DEPAC results. First call the differential analysis and then call movStat to stat the results.
stat=movStat(object=swDE, padjThd=0.01, valueThd=1)
#> All cond pairs in heat@colData, get de01 and deNum
stat$nsig
#> sig.num
#> anther.embryo 9
#> anther.maturePollen 6
#> embryo.maturePollen 19
Output stat results into files. The pdf file stores the plots about the number of significant events and the overlap among different condition pairs.
outputHeatStat(heatStats=stat, ostatfile='3UTR_switching_DE.stat',
plotPre='3UTR_switching_DE', show_rownames = TRUE)
To plot heatmap mannually, first convert the movRes object to a heatmap object and then filter switching genes.
heat=movRes2heatmapResults(swDE)
heatUp=subsetHeatmap(heat, padjThd=0.05, valueThd=1)
From the heatmap, we can see gene Os06g0682633 is shorter from anther to embryo (value=-8) and longer from embryo to maturePollen (value=7).
plotHeatmap(heatUp@value, show_rownames=TRUE, plotPre=NULL, cluster_rows=TRUE)
Get the switching information for this gene.
fl=swDE@fullList$anther.embryo
fl[fl$gene=='Os06g0682633',]
#> gene nPAC geneTag1 geneTag2 avgUTRlen1 avgUTRlen2 fisherPV
#> 7 Os06g0682633 2 11 353 1181 558.3938 1.420396e-12
#> logFC change PA1 PA2 dist nDEPA
#> 7 -7.370687 -1 Os06g0682633:28447307 Os06g0682633:28447973 667 1
#> nSwitchPair PAs1
#> 7 3 Os06g0682633:28447307=0;Os06g0682633:28447973=11
#> PAs2
#> 7 Os06g0682633:28447307=330;Os06g0682633:28447973=23
The function movAPAswitch is used to detect both canonical and non-canonical APA site switching events. The strategy of movAPAswitch is similar to the strategy based on DE PACs in movUTRtrend but with higher flexibility. If a gene has more than two PACs, then each pair of PACs (denoted as PA1 and PA2) are analyzed. The following criteria are used to determine a APA switching event: whether PA1 or PA2 are DE; average read count for both sites; distance between PA1 and PA2; average read count for a gene; relative change of PA1 and PA2 (RC); read count ratio (PA1:PA2) >1 in one sample and <1 in another sample; p-value of the Fisher’ s exact test for PA1 and PA2 read counts between samples. Pairs of PACs that meet user specified conditions are considered as APA site switching events. Users can use the movSelect function to filter 3’ UTR switching events or APA site switching events with higher flexibility.
First get DE PAC results by DEXseq.
DEXPAC=movDEPAC(PACds, method='DEXseq', group='group', minSumPAT=10)
Then get 3’UTR switching genes, usig selectOne=NULL to detect all pairs of switching PACs.
swDEX=movAPAswitch(PACds, group='group',aMovDEPACRes=DEXPAC,
avgPACtag=5, avgGeneTag=10,
only3UTR=TRUE,
DEPAC.padjThd=0.1, nDEPAC=1,
mindist=50, fisherThd=0.1, logFCThd=0.5,
cross=FALSE, selectOne=NULL)
Stat the switching results.
stat=movStat(object=swDEX, padjThd=0.1, valueThd=1)
#> All cond pairs in heat@colData, get de01 and deNum
stat$nsig
#> sig.num
#> anther.embryo 32
#> anther.maturePollen 11
#> embryo.maturePollen 38
Output switching genes with full information for anther~embryo.
sel=movSelect(aMovRes=swDEX, condpair='anther.embryo',
padjThd=0.1, valueThd=1, out='full')
head(sel, n=2)
#> gene nPAC geneTag1 geneTag2 avgUTRlen1 avgUTRlen2 fisherPV
#> 1 Os01g0151600 2 84 176 231.4643 191.7955 6.670538e-05
#> 3 Os01g0655400 2 70 76 144.7857 200.5000 2.241096e-02
#> logFC change PA1 PA2 dist nDEPA
#> 1 -1.552604 -1 Os01g0151600:2795487 Os01g0151600:2795636 150 2
#> 3 1.120104 1 Os01g0655400:26602269 Os01g0655400:26601984 286 1
#> nSwitchPair PAs1
#> 1 1 Os01g0151600:2795487=33;Os01g0151600:2795636=51
#> 3 1 Os01g0655400:26602269=45;Os01g0655400:26601984=25
#> PAs2
#> 1 Os01g0151600:2795487=116;Os01g0151600:2795636=60
#> 3 Os01g0655400:26602269=34;Os01g0655400:26601984=42
Detect APA switching events involving non-3’UTR PACs, using selectOne=NULL to get all pairs of switching PACs.
swDE=movAPAswitch(PACds, group='group', aMovDEPACRes=DEXPAC,
avgPACtag=10, avgGeneTag=20,
only3UTR=FALSE,
DEPAC.padjThd=0.1, nDEPAC=1,
mindist=50, fisherThd=0.1, logFCThd=0.5,
cross=FALSE, selectOne=NULL)
Stat the switching results.
stat=movStat(object=swDE, padjThd=0.1, valueThd=1)
#> All cond pairs in heat@colData, get de01 and deNum
stat$nsig
#> sig.num
#> anther.embryo 43
#> anther.maturePollen 21
#> embryo.maturePollen 57
Output switching genes with full information for anther~embryo.
sw=movSelect(aMovRes=swDE, condpair='anther.embryo',
padjThd=0.01, valueThd=1, out='full')
head(sw[order(sw$fisherPV), ], n=2)
#> gene nPAC geneTag1 geneTag2 fisherPV logFC change
#> 25 Os04g0635800 2 94 3437 7.680786e-75 -6.255737 -1
#> 16 Os02g0790500 2 549 124 6.634803e-35 -3.837820 -1
#> PA1 PA2 dist nDEPA nSwitchPair
#> 25 Os04g0635800:32341126 Os04g0635800:32339713 1414 2 1
#> 16 Os02g0790500:33573091 Os02g0790500:33573166 76 2 1
#> PAs1
#> 25 Os04g0635800:32341126=21;Os04g0635800:32339713=73
#> 16 Os02g0790500:33573091=81;Os02g0790500:33573166=468
#> PAs2 ftr
#> 25 Os04g0635800:32341126=3293;Os04g0635800:32339713=144 3UTR,Ext_3UTR
#> 16 Os02g0790500:33573091=89;Os02g0790500:33573166=35 3UTR,Ext_3UTR
First get list of genes or PACs of switching events, then subset PACds by genes or PACs.
genes=movSelect(aMovRes=swDE, condpair='anther.embryo',
padjThd=0.01, valueThd=1, out='gene')
swPAC=subsetPACds(PACds, genes=genes, verbose=TRUE)
#> count
#> before subsetPACds 1233
#> minExprConds>=1 1233
#> genes 138
table(swPAC@anno$ftr)
#>
#> 3UTR 5UTR CDS Ext_3UTR intergenic intron
#> 66 8 7 31 11 15
PAs=movSelect(aMovRes=swDE, condpair='anther.embryo', padjThd=0.01,
valueThd=1, out='pa')
swPAC=subsetPACds(PACds, PAs=PAs, verbose=TRUE)
#> count
#> before subsetPACds 1233
#> minExprConds>=1 1233
#> PAs 77
table(swPAC@anno$ftr)
#>
#> 3UTR Ext_3UTR intron
#> 53 22 2
Show one switching gene (Os05g0451900), where switching happens between a 3’UTR PAC and an intronic PAC. This gene has 2 PACs in intron and 1 PAC in 3UTR; the APA-site switching happens between anther~maturePollen.
gene='Os05g0451900'
gp=PACds[PACds@anno$gene==gene, ]
cbind(gp@anno$ftr, rowSums(gp@counts))
#> [,1] [,2]
#> Os05g0451900:22185329 "intron" "294"
#> Os05g0451900:22185573 "intron" "6"
#> Os05g0451900:22188660 "3UTR" "223"
Plot all PACs of this gene in all conditions and replicates. Highlight PACs involving in the switching analysis in orange.
movViz(object=swDE, gene=gene, txdb=gff, PACds=PACds,
showRatio=TRUE, padjThd=0.01, showAllPA=TRUE)
Show in each track a condition pair and use line to link PACs to show the trend. Highlight specific condition pair with blue background and only show PACs involving the switching analysis with a dashed line in orange. There is APA-site switching between anther and maturePollen.
movViz(object=swDE, gene=gene, txdb=gff, PACds=PACds, collapseConds=TRUE,
conds=swDE@conds, highlightConds=swDE@conds[c(2,3), ], showRatio=TRUE,
linkPAs=TRUE, padjThd=0.01, showAllPA=FALSE)
The session information records the versions of all the packages used in the generation of the present document.
sessionInfo()
#> R version 4.2.2 (2022-10-31 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 22621)
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=Chinese (Simplified)_China.utf8
#> [2] LC_CTYPE=Chinese (Simplified)_China.utf8
#> [3] LC_MONETARY=Chinese (Simplified)_China.utf8
#> [4] LC_NUMERIC=C
#> [5] LC_TIME=Chinese (Simplified)_China.utf8
#>
#> attached base packages:
#> [1] grid stats4 stats graphics grDevices utils datasets
#> [8] methods base
#>
#> other attached packages:
#> [1] DESeq2_1.38.1 SummarizedExperiment_1.28.0
#> [3] Biobase_2.58.0 MatrixGenerics_1.10.0
#> [5] matrixStats_0.63.0 ComplexHeatmap_2.14.0
#> [7] ggplot2_3.4.0 dplyr_1.0.10
#> [9] magrittr_2.0.3 BSgenome.Oryza.ENSEMBL.IRGSP1_1.4.2
#> [11] BSgenome_1.66.2 rtracklayer_1.58.0
#> [13] Biostrings_2.66.0 XVector_0.38.0
#> [15] GenomicRanges_1.50.1 GenomeInfoDb_1.34.9
#> [17] IRanges_2.32.0 S4Vectors_0.36.0
#> [19] BiocGenerics_0.44.0 movAPA_0.2.0
#>
#> loaded via a namespace (and not attached):
#> [1] utf8_1.2.2 R.utils_2.12.2
#> [3] tidyselect_1.2.0 poweRlaw_0.70.6
#> [5] RSQLite_2.2.18 AnnotationDbi_1.60.0
#> [7] htmlwidgets_1.5.4 BiocParallel_1.32.1
#> [9] munsell_0.5.0 codetools_0.2-18
#> [11] interp_1.1-3 statmod_1.4.37
#> [13] withr_2.5.0 colorspace_2.0-3
#> [15] filelock_1.0.2 OrganismDbi_1.40.0
#> [17] highr_0.9 knitr_1.41
#> [19] rstudioapi_0.14 motifStack_1.42.0
#> [21] labeling_0.4.2 GenomeInfoDbData_1.2.9
#> [23] hwriter_1.3.2.1 bit64_4.0.5
#> [25] farver_2.1.1 vctrs_0.5.1
#> [27] generics_0.1.3 xfun_0.35
#> [29] biovizBase_1.46.0 BiocFileCache_2.6.0
#> [31] R6_2.5.1 doParallel_1.0.17
#> [33] clue_0.3-63 locfit_1.5-9.6
#> [35] AnnotationFilter_1.22.0 grImport2_0.2-0
#> [37] bitops_1.0-7 cachem_1.0.6
#> [39] reshape_0.8.9 DelayedArray_0.24.0
#> [41] assertthat_0.2.1 BiocIO_1.8.0
#> [43] scales_1.2.1 nnet_7.3-18
#> [45] gtable_0.3.1 Cairo_1.6-0
#> [47] ggbio_1.46.0 ensembldb_2.22.0
#> [49] seqLogo_1.64.0 rlang_1.0.6
#> [51] genefilter_1.80.0 GlobalOptions_0.1.2
#> [53] splines_4.2.2 lazyeval_0.2.2
#> [55] dichromat_2.0-0.1 checkmate_2.1.0
#> [57] BiocManager_1.30.19 yaml_2.3.6
#> [59] reshape2_1.4.4 GenomicFeatures_1.50.2
#> [61] backports_1.4.1 Hmisc_5.0-0
#> [63] RBGL_1.74.0 tools_4.2.2
#> [65] ellipsis_0.3.2 jquerylib_0.1.4
#> [67] RColorBrewer_1.1-3 Rcpp_1.0.9
#> [69] plyr_1.8.8 base64enc_0.1-3
#> [71] progress_1.2.2 zlibbioc_1.44.0
#> [73] RCurl_1.98-1.9 prettyunits_1.1.1
#> [75] rpart_4.1.19 deldir_1.0-6
#> [77] GetoptLong_1.0.5 cluster_2.1.4
#> [79] data.table_1.14.6 magick_2.7.3
#> [81] circlize_0.4.15 ProtGenerics_1.30.0
#> [83] hms_1.1.2 evaluate_0.18
#> [85] xtable_1.8-4 XML_3.99-0.12
#> [87] jpeg_0.1-10 gridExtra_2.3
#> [89] shape_1.4.6 compiler_4.2.2
#> [91] biomaRt_2.54.0 tibble_3.1.8
#> [93] crayon_1.5.2 R.oo_1.25.0
#> [95] htmltools_0.5.3 tzdb_0.3.0
#> [97] Formula_1.2-4 TFBSTools_1.36.0
#> [99] geneplotter_1.76.0 DBI_1.1.3
#> [101] dbplyr_2.2.1 MASS_7.3-58.1
#> [103] rappdirs_0.3.3 Matrix_1.5-3
#> [105] ade4_1.7-22 readr_2.1.3
#> [107] cli_3.4.1 R.methodsS3_1.8.2
#> [109] parallel_4.2.2 pkgconfig_2.0.3
#> [111] GenomicAlignments_1.34.0 TFMPvalue_0.0.9
#> [113] foreign_0.8-83 xml2_1.3.3
#> [115] foreach_1.5.2 annotate_1.76.0
#> [117] bslib_0.4.1 DirichletMultinomial_1.40.0
#> [119] DEXSeq_1.44.0 stringr_1.4.1
#> [121] VariantAnnotation_1.44.0 digest_0.6.30
#> [123] pracma_2.4.2 graph_1.76.0
#> [125] CNEr_1.34.0 rmarkdown_2.18
#> [127] htmlTable_2.4.1 restfulr_0.0.15
#> [129] curl_4.3.3 Rsamtools_2.14.0
#> [131] gtools_3.9.4 rjson_0.2.21
#> [133] lifecycle_1.0.3 jsonlite_1.8.3
#> [135] fansi_1.0.3 pillar_1.8.1
#> [137] lattice_0.20-45 GGally_2.1.2
#> [139] KEGGREST_1.38.0 fastmap_1.1.0
#> [141] httr_1.4.4 survival_3.4-0
#> [143] GO.db_3.16.0 glue_1.6.2
#> [145] png_0.1-7 iterators_1.0.14
#> [147] bit_4.0.5 stringi_1.7.8
#> [149] sass_0.4.4 blob_1.2.3
#> [151] latticeExtra_0.6-30 caTools_1.18.2
#> [153] memoise_2.0.1
[1] Fu, H., Yang, D., Su, W., et al. Genome-wide dynamics of alternative polyadenylation in rice. Genome Res. 2016;26(12):1753-1760.
[2] Zhu, S., Ye, W., Ye, L., et al. PlantAPAdb: A Comprehensive Database for Alternative Polyadenylation Sites in Plants. Plant Physiol. 2020;182(1):228-242.
[3] Anders, S. and Huber, W. Differential expression analysis for sequence count data. Genome Biol. 2010;11(10):2010-2011.
[4] Robinson, M.D., McCarthy, D.J. and Smyth, G.K. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 2010;26(1):139-140.
[5] Ji, Z., Lee, J.Y., Pan, Z., et al. Progressive lengthening of 3’ untranslated regions of mRNAs by alternative polyadenylation during mouse embryonic development. Proc. Natl. Acad. Sci. USA 2009;106(17):7028-7033.
[6] Ulitsky, I., Shkumatava, A., Jan, C.H., et al. Extensive alternative polyadenylation during zebrafish development. Genome Res. 2012;22(10):2054-2066.
[7] Begik, O., Oyken, M., Cinkilli Alican, T., et al. Alternative Polyadenylation Patterns for Novel Gene Discovery and Classification in Cancer. Neoplasia 2017;19(7):574-582.
[8] Shulman, E.D. and Elkon, R. Cell-type-specific analysis of alternative polyadenylation using single-cell transcriptomics data. Nucleic Acids Res 2019;47(19):10027-10039.
[9] Lianoglou, S., Garg, V., Yang, J.L., et al. Ubiquitously transcribed genes use alternative polyadenylation to achieve tissue-specific expression. Genes Dev. 2013;27(21):2380-2396.