This documentation describes how to read an external file of single-cell poly(A) sites generated by scAPAtrap and analyze it with movAPA.
Actually, PAC list with columns chr/strand/coord and counts can be
easily converted as PACdataset
and loaded into movAPA by
movAPA::createPACdataset
or readPACds
.
In this document, “PA, PAC, or pA” all are short for a poly(A) site.
movAPA implemented the PACdataset object for storing the expression levels and annotation of PACs from various conditions/samples. Almost all analyses of poly(A) site data in movAPA are based on the PACdataset. The “counts” matrix is the first element in the array list of PACdataset, which stores non-negative values representing expression levels of PACs. The “colData” matrix records the sample information and the “anno” matrix stores the genome annotation or additional information of the poly(A) site data.
The demo_peaks dataset in the movAPA package contains 5w peaks in
2538 cells, including two variables peaks.meta
and
peaks.demo
. This file was obtained by scAPAtrap.
library(movAPA)
# load demo peaks generated by scAPAtrap,
# which contains a list(peaks.meta, peaks.count)
data("demo_peaks")
peaks=demo_peaks
names(peaks)
#> [1] "peaks.meta" "peaks.count"
head(peaks$peaks.meta)
#> peakID chr start end strand coord
#> peak_83012 peak_83012 chr1 237148255 237148422 + 237148422
#> peak_627325 peak_627325 chr10 104147771 104148105 + 104148105
#> peak_1566732 peak_1566732 chr8 18856854 18856917 - 18856854
#> peak_983111 peak_983111 chr20 25228760 25229062 + 25229062
#> peak_2087967 peak_2087967 chrX 48698679 48698884 - 48698679
#> peak_87658 peak_87658 chr1 248867076 248867152 + 248867152
head(peaks$peaks.count[, 1:10])
#> 6 x 10 sparse Matrix of class "dgCMatrix"
#> [[ suppressing 10 column names 'AAACCCACAAATTGCC', 'AAACCCAGTCAATGGG', 'AAACCCATCTTAGCAG' ... ]]
#>
#> peak_83012 . . . . . . . . . .
#> peak_627325 . . . . . . . . . .
#> peak_1566732 . . . . . . . . . .
#> peak_983111 . . . . . . . . . .
#> peak_2087967 . . . . . . . . . .
#> peak_87658 . . . . . . . . . .
Then we can create a PACdataset object.
PACds=createPACdataset(counts=peaks$peaks.count, anno=peaks$peaks.meta)
PACds
#> PAC# 50000
#> sample# 2538
#> AAACCCACAAATTGCC AAACCCAGTCAATGGG AAACCCATCTTAGCAG AAACGAAAGATTGAGT AAACGCTAGGAGTCTG ...
#> groups:
#> @colData...[2538 x 1]
#> group
#> AAACCCACAAATTGCC group1
#> AAACCCAGTCAATGGG group1
#> @counts...[50000 x 2538]
#> 2 x 10 sparse Matrix of class "dgCMatrix"
#> [[ suppressing 10 column names 'AAACCCACAAATTGCC', 'AAACCCAGTCAATGGG', 'AAACCCATCTTAGCAG' ... ]]
#>
#> peak_83012 . . . . . . . . . .
#> peak_627325 . . . . . . . . . .
#> @colData...[2538 x 1]
#> group
#> AAACCCACAAATTGCC group1
#> AAACCCAGTCAATGGG group1
#> @anno...[50000 x 6]
#> peakID chr start end strand coord
#> peak_83012 peak_83012 chr1 237148255 237148422 + 237148422
#> peak_627325 peak_627325 chr10 104147771 104148105 + 104148105
rm(peaks)
This dataset contains many PAs (also called peaks in scAPAtrap), which may not be suitable for downstream analysis. We can first remove extremely lowly expressed peaks.
First, we make summary of the PACdataset. It seems that >50% of PAs with less than 2 reads.
summary(PACds)
#> PAC# 50000
#> sample# 2538
#> summary of expression level of each PA
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 1.00 1.00 2.00 11.63 7.00 16419.00
#> summary of expressed sample# of each PA
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 1.00 1.00 2.00 10.03 6.00 1944.00
Here we remove PAs with <10 tags supported by all cells and PAs that are expressed in <10 cells. This filtering removed ~80% PAs.
PACds=subsetPACds(PACds, totPACtag = 10, minExprConds = 10, verbose=TRUE)
#> count
#> before subsetPACds 50000
#> totPACtag>=10 9449
#> minExprConds>=10 9141
summary(PACds)
#> PAC# 9141
#> sample# 2538
#> summary of expression level of each PA
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 10.00 14.00 22.00 51.09 43.00 16419.00
#> summary of expressed sample# of each PA
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 10.00 13.00 21.00 42.79 40.00 1944.00
After read the data into a PACdataset, users can use many functions in movAPA for removing internal priming artifacts, annotating PACs, polyA signal analysis, etc. Please follow the vignette of “movAPA_on_rice_tissues” for more details.
For example, users can remove internal priming artifacts with
removePACdsIP
. Before starting, it is better to check the
chromosome names are consistent between the BSgenome
and
PACdataset
. We need make sure the chromosome name of your
PAC data is the same as the BSgenome.
library(BSgenome.Hsapiens.UCSC.hg38, quietly = TRUE, verbose = FALSE)
bsgenome<-BSgenome.Hsapiens.UCSC.hg38
head(GenomeInfoDb::seqnames(bsgenome))
#> [1] "chr1" "chr2" "chr3" "chr4" "chr5" "chr6"
head(unique(PACds@anno$chr))
#> [1] "chr5" "chr17" "chr2" "chr1" "chr14" "chr11"
PACdsIP=removePACdsIP(PACds, bsgenome, returnBoth=TRUE,
up=-10, dn=10, conA=6, sepA=7)
#> 5013 IP PACs; 4128 real PACs
length(PACdsIP$real)
#> [1] 4128
length(PACdsIP$ip)
#> [1] 5013
# summary of IPs and real PAs
summary(PACdsIP$real)
#> PAC# 4128
#> sample# 2538
#> summary of expression level of each PA
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 10.00 13.75 21.00 53.94 38.00 16419.00
#> summary of expressed sample# of each PA
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 10.00 13.00 20.00 42.51 36.00 1944.00
summary(PACdsIP$ip)
#> PAC# 5013
#> sample# 2538
#> summary of expression level of each PA
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 10.00 15.00 24.00 48.73 46.00 3405.00
#> summary of expressed sample# of each PA
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 10.00 14.00 23.00 43.01 43.00 1341.00
Nearly half of the PACs are internal priming artifacts. We can use the real PAs for further analysis. However, removing internal priming is a non-trifle task, which should be done in caution.
# use real PA for further analysis
PACds=PACdsIP$real
Another way to remove internal priming (IP) artifacts is to use known PAs. We can retain those IP sites that supported by known PAs as real. The following example shows how to use known human PAs from PolyA_DB v3 for IP removing.
First we loaded the known PAs.
annodb=read.table("Human_hg38.PolyA_DB.bed", sep="\t", header = FALSE)
head(annodb)
annodb=annodb[,c(1,2,6)]
colnames(annodb)=c("chr","coord","strand")
annodb=readPACds(annodb, colDataFile = NULL)
Next, we determine the overlap between PACds’ IP and known PAs. The PAs of scAPAtrap is represented by peak, with peak start and end. We believe that if a peak overlaps with any reference PA, then the PA of that peak is considered real.
The findOverapPACds
of movAPA use UPA_Start
and UPA_end
as the column names, so here we first modify
the start and end column names of peak to UPA_Start and UPA_end.
PACdsIP$ip@anno$UPA_start=PACdsIP$ip@anno$start
PACdsIP$ip@anno$UPA_end=PACdsIP$ip@anno$end
PACdsIP$real@anno$UPA_start=PACdsIP$real@anno$start
PACdsIP$real@anno$UPA_end=PACdsIP$real@anno$end
## find overlapping IPs
IP.ovp=findOvpPACds(qryPACds=PACdsIP$ip, sbjPACds=annodb,
d=50,
qryMode='region', sbjMode='point',
returnNonOvp=TRUE)
Merge IPs supported by reference PAs and the original real PAs as the PAs for subsequent analysis.
anno=rbind(PACdsIP$real@anno, IP.ovp$ovp@anno)
count=rbind(PACdsIP$real@counts, IP.ovp$ovp@counts)
PACds=createPACdataset(counts=count, anno=anno)
summary(PACds)
library(TxDb.Hsapiens.UCSC.hg38.knownGene, quietly = TRUE, verbose = FALSE)
txdb=TxDb.Hsapiens.UCSC.hg38.knownGene
# annotate PAs
PACds=annotatePAC(PACds, txdb)
#> Warning: remove PAs in [character(0)] genes 266 rows
# after annotation, the gene and ftr information are present in PACds@anno.
summary(PACds)
#> PAC# 3862
#> sample# 2538
#> summary of expression level of each PA
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 10.00 14.00 21.00 55.65 39.00 16419.00
#> summary of expressed sample# of each PA
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 10.00 13.00 20.00 43.62 37.00 1944.00
#> gene# 2430
#> nPAC
#> 3UTR 392
#> 5UTR 13
#> CDS 104
#> exon 31
#> intergenic 307
#> intron 3015
Genes with or without annotated 3’UTR could be assigned an extended
3’UTR of a given length using the function ext3UTRPACds
,
which can improve the “recovery” of poly(A) sites falling within
authentic 3’UTRs.
Before extending, we can calculate the number of PACs falling into extended 3’UTRs of different lengths.
testExt3UTR(PACds, seq(1000, 10000, by=1000))
#> ext3UTRLen addedPAnum
#> 1 1000 60
#> 2 2000 77
#> 3 3000 90
#> 4 4000 97
#> 5 5000 112
#> 6 6000 120
#> 7 7000 125
#> 8 8000 134
#> 9 9000 145
#> 10 10000 152
You can take a look at the length of the 3’UTRs PACds again to make a rough judgment. Here, we only select the length of the 3’UTRs where PA is located for approximate calculation.
utrid=which(PACds@anno$ftr=='3UTR')
utrs=unique(PACds@anno[utrid, c('ftr_end','ftr_start')])
summary(utrs$ftr_end-utrs$ftr_start+1)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 34 559 1802 2682 3775 25446
Here we extended 3’UTR length for 2000 bp. After extension, 100+ PACs in intergenic region are now in extended 3’UTRs.
# extend 3UTR by 1000bp
PACds=ext3UTRPACds(PACds, 1000)
#> 60 PACs in extended 3UTR (ftr=intergenic >> ftr=3UTR)
#> Get 3UTR length (anno@toStop) for 3UTR/extended 3UTR PACs
# after 3UTR extension
summary(PACds)
#> PAC# 3862
#> sample# 2538
#> summary of expression level of each PA
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 10.00 14.00 21.00 55.65 39.00 16419.00
#> summary of expressed sample# of each PA
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 10.00 13.00 20.00 43.62 37.00 1944.00
#> gene# 2480
#> nPAC
#> 3UTR 452
#> 5UTR 13
#> CDS 104
#> exon 31
#> intergenic 247
#> intron 3015
For single cell data, we suggest analyzing only 3’UTR PAs and discarding PAs from other regions.
PACds=PACds[PACds@anno$ftr=='3UTR']
summary(PACds)
#> PAC# 452
#> sample# 2538
#> summary of expression level of each PA
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 10.0 20.0 42.0 155.8 130.5 3548.0
#> summary of expressed sample# of each PA
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 10.0 19.0 40.0 114.0 120.2 1412.0
#> gene# 440
#> nPAC
#> 3UTR 452
The function plotATCGforFAfile is for plotting single nucleotide profiles for given fasta file(s), which is particularly useful for discerning base compositions surrounding PACs.
First trim sequences surrounding PACs. Sequences surrounding PACs in different genomic regions are extracted into files. The PAC position is 301.
faFiles=faFromPACds(PACds, bsgenome, what='updn', fapre='updn',
up=-300, dn=100, byGrp='ftr')
#> 452 >>> updn.3UTR.fa
Then plot base compositions for specific sequence file(s).
plotATCGforFAfile(faFiles, ofreq=FALSE, opdf=FALSE,
refPos=301, mergePlots = TRUE)
movAPA provides several functions, including annotateByPAS, faFromPACds, kcount, and plotATCGforFAfile, for sequence extraction and poly(A) signal identification.
Here we show another example to scan known human polyA signals. First, we get mouse signals and set the priority.
v=getVarGrams('mm')
priority=c(1,2,rep(3, length(v)-2))
Then scan upstream regions of PACs for mouse signals.
PACdsMM=annotateByPAS(PACds, bsgenome, grams=v,
priority=priority,
from=-50, to=-1, label='mm')
table(PACdsMM@anno$mm_gram)
#>
#> AACAAA AACAAG AAGAAA AATAAA AATAAG AATAAT AATACA AATAGA AATATA AATGAA ACTAAA
#> 5 6 12 117 2 4 7 2 9 5 5
#> AGTAAA ATTAAA ATTACA ATTATA CATAAA GATAAA TATAAA
#> 5 34 4 5 3 6 6
## percent of PA without PAS
noPAS=sum(is.na(PACdsMM@anno$mm_gram))
noPAS/length(PACdsMM)
#> [1] 0.4756637
Plot signal logos.
pas=PACdsMM@anno$mm_gram[!is.na(PACdsMM@anno$mm_gram)]
plotSeqLogo(pas)
The function mergePACds can also be used to merge multiple PACdatasets. Notably, the annotation columns (e.g., gene, ftr) are lost after merging, you need call annotatePAC to annotate the merged PACds.
Here we show a multi-sample merging using the reference PA. We directly copy one more PACds for this example.
## the two PACds for merging
PACdsList=list(ds1=PACds, ds2=PACds)
pacds.merge=mergePACds(PACdsList, d=24, by='coord')
#> mergePACds: there are 2538 duplicated sample names in the PACdsList, will add .N to sample names of each PACds
#> mergePACds: total 904 redundant PACs from 2 PACds to merge
#> mergePACds without refPACds: 904 separate PACs reduce to 452 PACs (d=24nt)
#> mergePACds: melted all counts tables, total 103012 triplet rows
#> mergePACds: link 904 old PA IDs to 452 new PA IDs by merge
#> mergePACds: convert 103012 triplets to dgCMatrix
#> mergePACds: construct Matrix[PA, sample], [452, 5050]
# summary of PACds#1
summary(PACds)
#> PAC# 452
#> sample# 2538
#> summary of expression level of each PA
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 10.0 20.0 42.0 155.8 130.5 3548.0
#> summary of expressed sample# of each PA
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 10.0 19.0 40.0 114.0 120.2 1412.0
#> gene# 440
#> nPAC
#> 3UTR 452
# summary of the merged PACds
summary(pacds.merge)
#> PAC# 452
#> sample# 5050
#> summary of expression level of each PA
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 20.0 40.0 84.0 311.5 261.0 7096.0
#> summary of expressed sample# of each PA
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 20.0 38.0 80.0 227.9 240.5 2824.0
head(pacds.merge@counts[, 1:10])
#> 6 x 10 sparse Matrix of class "dgCMatrix"
#> [[ suppressing 10 column names 'AAACCCACAAATTGCC.1', 'AAACCCAGTCAATGGG.1', 'AAACCCATCTTAGCAG.1' ... ]]
#>
#> M1 . . . . . . 1 . . .
#> M2 . . . . . . . . . .
#> M3 . . . . . . . . 1 .
#> M4 . . . . . . . . . .
#> M5 . . . . . . . . . .
#> M6 2 . . . . . . . . .
head(pacds.merge@anno)
#> chr strand coord UPA_start UPA_end
#> M1 chr1 + 2308755 2308755 2308755
#> M2 chr1 + 9368021 9368021 9368021
#> M3 chr1 + 9614872 9614872 9614872
#> M4 chr1 + 20555007 20555007 20555007
#> M5 chr1 + 20618815 20618815 20618815
#> M6 chr1 + 46220370 46220370 46220370
In movAPA 0.2.0, a reference PACds can be used for merging PACdsList in a smarter way. Providing reference PACds for merging is useful when there are multiple large PAC lists to be merged, which can prevent generating PACs with a very wide range. If there is reference PACs from 3’seq, it is recommended to use it. Please see the help document of mergePACds for details.
Given a reference PACds, buildRefPACdsAnno
can be used
to combine multiple PACds to build a more complete reference. First we
loaded the known PAs.
annodb=read.table("./Human_hg38.PolyA_DB.bed", sep="\t", header = FALSE)
head(annodb)
#> V1 V2 V3 V4 V5 V6
#> 1 chr1 629219 629219 ENSG00000225972 Pseudogene +
#> 2 chr1 629249 629249 ENSG00000225972 Pseudogene +
#> 3 chr1 629284 629284 ENSG00000225972 Pseudogene +
#> 4 chr1 629328 629328 ENSG00000225972 Pseudogene +
#> 5 chr1 629572 629572 ENSG00000225972 Pseudogene +
#> 6 chr1 629626 629626 ENSG00000225972 Pseudogene +
annodb=annodb[,c(1,2,6)]
colnames(annodb)=c("chr","coord","strand")
annodb=readPACds(annodb, colDataFile = NULL)
#> 155808 PACs
## we can build a reference PACds with human known PA and the given PACds list
## or we can use annodb only
refPA=buildRefPACdsAnno(refPACds=annodb, PACdsList=PACdsList,
by='coord', d=24,
min.counts = 50, min.smps=10, max.width=500,
verbose=TRUE)
#> ds1: total=452 --> good=0 [>=min.counts(50) & >=min.smps(10)) & <=max.width(500)]
#> ds2: total=452 --> good=0 [>=min.counts(50) & >=min.smps(10)) & <=max.width(500)]
#> buildRefPACds: 0 highQuality PACs (filtered by min.counts=50, min.smps=10, max.width=500)
#> buildRefPACds: 155808 [155808 ref + 0 highQuality PACs] reduce to 149362 PACs (d=24nt)
#> buildRefPACds: 149362 reference PACs returned
Because the two PACds merged for this example are the same, the number of PACs remains unchanged after merging, but the sample size doubles.
pacds.merge=mergePACds(PACdsList, d=24, by='coord', refPACds=refPA)
#> mergePACds: there are 2538 duplicated sample names in the PACdsList, will add .N to sample names of each PACds
#> mergePACds: total 904 redundant PACs from 2 PACds to merge
#> mergePACds with refPACds: 149726 merged PACs (d=24nt)
#> mergePACds: melted all counts tables, total 103012 triplet rows
#> mergePACds: link 904 old PA IDs to 452 new PA IDs by merge
#> mergePACds: convert 103012 triplets to dgCMatrix
#> mergePACds: construct Matrix[PA, sample], [452, 5050]
#> Warning: in mergePACds, the number of rows (PAs) or rownames after merging is not the same between anno (149726) and counts (452)
#> This may be because refPACds is used. Will use common PAs between anno and counts (452 PAs)
# summary of PACds#1
summary(PACds)
#> PAC# 452
#> sample# 2538
#> summary of expression level of each PA
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 10.0 20.0 42.0 155.8 130.5 3548.0
#> summary of expressed sample# of each PA
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 10.0 19.0 40.0 114.0 120.2 1412.0
#> gene# 440
#> nPAC
#> 3UTR 452
# summary of the merged PACds
summary(pacds.merge)
#> PAC# 452
#> sample# 5050
#> summary of expression level of each PA
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 20.0 40.0 84.0 311.5 261.0 7096.0
#> summary of expressed sample# of each PA
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 20.0 38.0 80.0 227.9 240.5 2824.0
head(pacds.merge@counts[, 1:10])
#> 6 x 10 sparse Matrix of class "dgCMatrix"
#> [[ suppressing 10 column names 'AAACCCACAAATTGCC.1', 'AAACCCAGTCAATGGG.1', 'AAACCCATCTTAGCAG.1' ... ]]
#>
#> M100273 . . . . . . . . . .
#> M100417 1 . . . . . . . . .
#> M100519 . . . . . . . . . .
#> M10060 . . . . . . . . . .
#> M10092 . . . . . . . . . .
#> M101558 . . . . . . . . . .
head(pacds.merge@anno)
#> chr strand coord UPA_start UPA_end
#> M100273 chr3 - 114321139 114321139 114321139
#> M100417 chr3 - 121664101 121664101 121664101
#> M100519 chr3 - 124968500 124968500 124968500
#> M10060 chr1 - 108931856 108931856 108931856
#> M10092 chr1 - 109399042 109399042 109399042
#> M101558 chr3 - 165186766 165186766 165186766
The original annotation information of the merged data will be removed and needs to be re-annotated.
# annotate PAs
pacds.merge=annotatePAC(pacds.merge, txdb)
# after annotation, the gene and ftr information are present in PACds@anno.
summary(pacds.merge)
Get APA index using the smart RUD method (available in movAPA v0.2.0).
The smartRUD indicator is provided in movAPA v0.2.0, and it is recommended to use it. Pay attention to setting clearPAT=1 to remove cases of PAs with only 1 read count. At the same time, check the distance between the two PAs first to select a suitable dist for filtering the proximal and distal PAs of 3’UTR.
s=getNearbyPAdist(PACds)
#> Nearby pA distance summary:
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 281.0 867.8 1602.0 13879.8 30425.5 54144.0
You can see that the average distance is 10K nt, but the median distance is only 3000 nt, so setting 5000 nt is probably enough.
# get proximal and distal PA for each gene
pd=get3UTRAPApd(pacds=PACds, minDist=50, maxDist=5000, minRatio=0.05,
fixDistal=FALSE, addCols='pd')
#> get3UTRAPApd: filtering by minRatio: gene# before: 440 ; after: 11 ; remove: 429
#> get3UTRAPApd: filtering pd (dist between): gene# before: 11 ; after: 7 ; remove: 4
#> get3UTRAPApd: add four columns to pacds@anno: pdWhich, pdScore, pdRatio, pdDist
# get smartRUD index
rud=movAPAindex(pd, method="smartRUD", sRUD.oweight=TRUE, clearPAT=1)
#> clearPAT>0: set elements in @counts <= 1 to 0!
#> movAPAindex (smartRUD): sRUD.oweight=TRUE, output a list(rud, weight)
head(rud$rud[, 1:5])
#> 6 x 5 Matrix of class "dgeMatrix"
#> AAACCCACAAATTGCC AAACCCAGTCAATGGG AAACCCATCTTAGCAG AAACGAAAGATTGAGT
#> 205564 NaN NaN NaN NaN
#> 112399 NaN NaN NaN NaN
#> 25966 NaN NaN NaN NaN
#> 51710 NaN NaN NaN NaN
#> 6654 NaN NaN NaN NaN
#> 84896 NaN NaN NaN NaN
#> AAACGCTAGGAGTCTG
#> 205564 NaN
#> 112399 NaN
#> 25966 NaN
#> 51710 NaN
#> 6654 NaN
#> 84896 NaN
head(rud$weight[, 1:5])
#> 6 x 5 sparse Matrix of class "dgCMatrix"
#> AAACCCACAAATTGCC AAACCCAGTCAATGGG AAACCCATCTTAGCAG AAACGAAAGATTGAGT
#> 205564 . . . .
#> 112399 . . . .
#> 25966 . . . .
#> 51710 . . . .
#> 6654 . . . .
#> 84896 . . . .
#> AAACGCTAGGAGTCTG
#> 205564 .
#> 112399 .
#> 25966 .
#> 51710 .
#> 6654 .
#> 84896 .
## we can also calculate the RUD index
## rud1=movAPAindex(pd, method="RUD")
The vizAPA package is a comprehensive package for visualization of APA dynamics in single cells. vizAPA imports various types of APA data and genome annotation sources through unified data structures. vizAPA also enables identification of genes with differential APA usage (also called APA markers). Four unique modules are provided in vizAPA for visualizing APA dynamics across cell groups and at the single-cell level.
Here we use vizAPA to show the gene structure and PA location.
library(vizAPA, quietly = TRUE)
#>
#> Attaching package: 'vizAPA'
#> The following object is masked _by_ '.GlobalEnv':
#>
#> scPACds
#> The following object is masked from 'package:movAPA':
#>
#> scPACds
# to choose one gene, first we get the total counts of each PA
PACds@anno$TN=rowSums(PACds@counts)
pd@anno$TN=rowSums(pd@counts)
# construct the genome annotation object
annoSource=new("annoHub")
annoSource=addAnno(annoSource, txdb)
gene=pd@anno$gene[1]
vizTracks(gene=gene,
PACds.list=list(pA=PACds, pd=pd),
PA.show=c("pos","TN"),
annoSource=annoSource,
PA.columns="coord", PA.width=10,
space5=1000, space3=1000)
#> The region may be too large to plot [68858 nt], you can set genomicRegion to plot a smaller region
#> Plot tracks for region: chr3:+:196866856:196935714
#> Get gene model track from annoSource[ txdb ]...
#> Parsing transcripts...
#> Parsing exons...
#> Parsing cds...
#> Parsing utrs...
#> ------exons...
#> ------cdss...
#> ------introns...
#> ------utr...
#> aggregating...
#> Done
#> Constructing graphics...
#> Get PACds track...
#> chr3:+:196866856:196935714
#> Get PACds track...
#> chr3:+:196866856:196935714
#> Found more than one class "simpleUnit" in cache; using the first, from namespace 'ggbio'
#> Also defined by 'hexbin'
#> Found more than one class "simpleUnit" in cache; using the first, from namespace 'ggbio'
#> Also defined by 'hexbin'
#> Found more than one class "simpleUnit" in cache; using the first, from namespace 'ggbio'
#> Also defined by 'hexbin'
#> Found more than one class "simpleUnit" in cache; using the first, from namespace 'ggbio'
#> Also defined by 'hexbin'
#> Found more than one class "simpleUnit" in cache; using the first, from namespace 'ggbio'
#> Also defined by 'hexbin'
#> Found more than one class "simpleUnit" in cache; using the first, from namespace 'ggbio'
#> Also defined by 'hexbin'
#> Found more than one class "simpleUnit" in cache; using the first, from namespace 'ggbio'
#> Also defined by 'hexbin'
#> Found more than one class "simpleUnit" in cache; using the first, from namespace 'ggbio'
#> Also defined by 'hexbin'
#> Found more than one class "simpleUnit" in cache; using the first, from namespace 'ggbio'
#> Also defined by 'hexbin'
#> Found more than one class "simpleUnit" in cache; using the first, from namespace 'ggbio'
#> Also defined by 'hexbin'
#> Found more than one class "simpleUnit" in cache; using the first, from namespace 'ggbio'
#> Also defined by 'hexbin'
#> Found more than one class "simpleUnit" in cache; using the first, from namespace 'ggbio'
#> Also defined by 'hexbin'
#> Found more than one class "simpleUnit" in cache; using the first, from namespace 'ggbio'
#> Also defined by 'hexbin'
#> Found more than one class "simpleUnit" in cache; using the first, from namespace 'ggbio'
#> Also defined by 'hexbin'
#> Found more than one class "simpleUnit" in cache; using the first, from namespace 'ggbio'
#> Also defined by 'hexbin'
#> Found more than one class "simpleUnit" in cache; using the first, from namespace 'ggbio'
#> Also defined by 'hexbin'
#> Found more than one class "simpleUnit" in cache; using the first, from namespace 'ggbio'
#> Also defined by 'hexbin'
The gene IDs between different annotations may not be consistent, and we can also use vizAPA to map different IDs.
library(Homo.sapiens)
#> Loading required package: OrganismDbi
#> Loading required package: GO.db
#> Loading required package: org.Hs.eg.db
#>
#> Loading required package: TxDb.Hsapiens.UCSC.hg19.knownGene
orgdb=Homo.sapiens
genes=getAnnoGenes(orgdb)
head(genes)
#> chr strand start end gene_entrezid gene_ensembl gene_symbol
#> 1 chr19 - 58858172 58874214 1 ENSG00000121410 A1BG
#> 2 chr8 + 18248755 18258723 10 ENSG00000156006 NAT2
#> 3 chr20 - 43248163 43280376 100 ENSG00000196839 ADA
#> 4 chr18 - 25530930 25757445 1000 ENSG00000170558 CDH2
#> 5 chr1 - 243651535 244006886 10000 ENSG00000117020 AKT3
#> 6 chrX + 49217763 49233491 100008586 ENSG00000236362 GAGE12F
genes[genes$gene_entrezid==gene, ]
#> chr strand start end gene_entrezid gene_ensembl gene_symbol
#> 6278 chr3 + 196594727 196661584 205564 ENSG00000119231 SENP5
genes[genes$gene_entrezid==10003, ]
#> chr strand start end gene_entrezid gene_ensembl gene_symbol
#> 10 chr11 + 89867818 89925779 10003 ENSG00000077616 NAALAD2
The session information records the versions of all the packages used in the generation of the present document.
sessionInfo()
#> R version 4.2.2 (2022-10-31 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 22621)
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=Chinese (Simplified)_China.utf8
#> [2] LC_CTYPE=Chinese (Simplified)_China.utf8
#> [3] LC_MONETARY=Chinese (Simplified)_China.utf8
#> [4] LC_NUMERIC=C
#> [5] LC_TIME=Chinese (Simplified)_China.utf8
#>
#> attached base packages:
#> [1] grid stats4 stats graphics grDevices utils datasets
#> [8] methods base
#>
#> other attached packages:
#> [1] Homo.sapiens_1.3.1
#> [2] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
#> [3] org.Hs.eg.db_3.16.0
#> [4] GO.db_3.16.0
#> [5] OrganismDbi_1.40.0
#> [6] vizAPA_0.1.0
#> [7] TxDb.Hsapiens.UCSC.hg38.knownGene_3.16.0
#> [8] BSgenome.Hsapiens.UCSC.hg38_1.4.5
#> [9] DESeq2_1.38.1
#> [10] SummarizedExperiment_1.28.0
#> [11] MatrixGenerics_1.10.0
#> [12] matrixStats_0.63.0
#> [13] ComplexHeatmap_2.14.0
#> [14] dplyr_1.0.10
#> [15] magrittr_2.0.3
#> [16] BSgenome.Oryza.ENSEMBL.IRGSP1_1.4.2
#> [17] BSgenome_1.66.2
#> [18] rtracklayer_1.58.0
#> [19] Biostrings_2.66.0
#> [20] XVector_0.38.0
#> [21] TxDb.Mmusculus.UCSC.mm10.ensGene_3.4.0
#> [22] GenomicFeatures_1.50.2
#> [23] AnnotationDbi_1.60.0
#> [24] Biobase_2.58.0
#> [25] GenomicRanges_1.50.1
#> [26] GenomeInfoDb_1.34.9
#> [27] IRanges_2.32.0
#> [28] S4Vectors_0.36.0
#> [29] BiocGenerics_0.44.0
#> [30] ggplot2_3.4.0
#> [31] movAPA_0.2.0
#>
#> loaded via a namespace (and not attached):
#> [1] rappdirs_0.3.3 ggthemes_4.2.4
#> [3] GGally_2.1.2 R.methodsS3_1.8.2
#> [5] SeuratObject_4.1.3 tidyr_1.2.1
#> [7] bit64_4.0.5 knitr_1.41
#> [9] irlba_2.3.5.1 DelayedArray_0.24.0
#> [11] R.utils_2.12.2 hwriter_1.3.2.1
#> [13] data.table_1.14.6 rpart_4.1.19
#> [15] KEGGREST_1.38.0 TFBSTools_1.36.0
#> [17] RCurl_1.98-1.9 AnnotationFilter_1.22.0
#> [19] doParallel_1.0.17 generics_0.1.3
#> [21] RSQLite_2.2.18 future_1.30.0
#> [23] proxy_0.4-27 bit_4.0.5
#> [25] tzdb_0.3.0 xml2_1.3.3
#> [27] assertthat_0.2.1 DirichletMultinomial_1.40.0
#> [29] xfun_0.35 jquerylib_0.1.4
#> [31] hms_1.1.2 evaluate_0.18
#> [33] DEoptimR_1.0-11 fansi_1.0.3
#> [35] restfulr_0.0.15 progress_1.2.2
#> [37] caTools_1.18.2 dbplyr_2.2.1
#> [39] DBI_1.1.3 geneplotter_1.76.0
#> [41] htmlwidgets_1.5.4 reshape_0.8.9
#> [43] purrr_0.3.5 ellipsis_0.3.2
#> [45] RSpectra_0.16-1 backports_1.4.1
#> [47] grImport2_0.2-0 annotate_1.76.0
#> [49] biomaRt_2.54.0 deldir_1.0-6
#> [51] vctrs_0.5.1 SingleCellExperiment_1.20.0
#> [53] ensembldb_2.22.0 Cairo_1.6-0
#> [55] TTR_0.24.3 abind_1.4-5
#> [57] cachem_1.0.6 RcppEigen_0.3.3.9.3
#> [59] withr_2.5.0 progressr_0.12.0
#> [61] robustbase_0.95-0 checkmate_2.1.0
#> [63] vcd_1.4-11 GenomicAlignments_1.34.0
#> [65] xts_0.13.0 prettyunits_1.1.1
#> [67] cluster_2.1.4 lazyeval_0.2.2
#> [69] seqLogo_1.64.0 laeken_0.5.2
#> [71] crayon_1.5.2 genefilter_1.80.0
#> [73] edgeR_3.40.0 pkgconfig_2.0.3
#> [75] labeling_0.4.2 ProtGenerics_1.30.0
#> [77] nnet_7.3-18 globals_0.16.2
#> [79] rlang_1.0.6 lifecycle_1.0.3
#> [81] filelock_1.0.2 BiocFileCache_2.6.0
#> [83] dichromat_2.0-0.1 RcppHNSW_0.4.1
#> [85] lmtest_0.9-40 graph_1.76.0
#> [87] Matrix_1.5-3 carData_3.0-5
#> [89] boot_1.3-28 zoo_1.8-11
#> [91] base64enc_0.1-3 GlobalOptions_0.1.2
#> [93] png_0.1-7 rjson_0.2.21
#> [95] bitops_1.0-7 R.oo_1.25.0
#> [97] blob_1.2.3 shape_1.4.6
#> [99] stringr_1.4.1 parallelly_1.33.0
#> [101] readr_2.1.3 jpeg_0.1-10
#> [103] CNEr_1.34.0 scales_1.2.1
#> [105] memoise_2.0.1 plyr_1.8.8
#> [107] hexbin_1.28.3 zlibbioc_1.44.0
#> [109] compiler_4.2.2 tinytex_0.43
#> [111] BiocIO_1.8.0 RColorBrewer_1.1-3
#> [113] pcaMethods_1.90.0 clue_0.3-63
#> [115] Rsamtools_2.14.0 cli_3.4.1
#> [117] ade4_1.7-22 listenv_0.9.0
#> [119] htmlTable_2.4.1 Formula_1.2-4
#> [121] ggplot.multistats_1.0.0 MASS_7.3-58.1
#> [123] tidyselect_1.2.0 stringi_1.7.8
#> [125] highr_0.9 yaml_2.3.6
#> [127] locfit_1.5-9.6 latticeExtra_0.6-30
#> [129] sass_0.4.4 VariantAnnotation_1.44.0
#> [131] tools_4.2.2 future.apply_1.10.0
#> [133] parallel_4.2.2 circlize_0.4.15
#> [135] rstudioapi_0.14 TFMPvalue_0.0.9
#> [137] foreach_1.5.2 foreign_0.8-83
#> [139] DEXSeq_1.44.0 gridExtra_2.3
#> [141] smoother_1.1 scatterplot3d_0.3-43
#> [143] farver_2.1.1 digest_0.6.30
#> [145] BiocManager_1.30.19 pracma_2.4.2
#> [147] Rcpp_1.0.9 car_3.1-1
#> [149] httr_1.4.4 motifStack_1.42.0
#> [151] ggbio_1.46.0 biovizBase_1.46.0
#> [153] colorspace_2.0-3 XML_3.99-0.12
#> [155] ranger_0.14.1 splines_4.2.2
#> [157] statmod_1.4.37 RBGL_1.74.0
#> [159] sp_1.5-1 xtable_1.8-4
#> [161] jsonlite_1.8.3 poweRlaw_0.70.6
#> [163] destiny_3.12.0 R6_2.5.1
#> [165] Hmisc_5.0-0 pillar_1.8.1
#> [167] htmltools_0.5.3 glue_1.6.2
#> [169] fastmap_1.1.0 VIM_6.2.2
#> [171] BiocParallel_1.32.1 class_7.3-20
#> [173] codetools_0.2-18 utf8_1.2.2
#> [175] bslib_0.4.1 lattice_0.20-45
#> [177] tibble_3.1.8 curl_4.3.3
#> [179] gtools_3.9.4 magick_2.7.3
#> [181] interp_1.1-3 survival_3.4-0
#> [183] limma_3.54.0 rmarkdown_2.18
#> [185] munsell_0.5.0 e1071_1.7-13
#> [187] GetoptLong_1.0.5 GenomeInfoDbData_1.2.9
#> [189] iterators_1.0.14 reshape2_1.4.4
#> [191] gtable_0.3.1