long read isoform reconstruction and quantification

This function takes bam file of genomic alignments and performs isoform recontruction and gene and transcript expression quantification. It also allows saving of read class files of alignments, extending provided annotations, and quantification based on extended annotations. When multiple samples are provided, extended annotations will be combined across samples to allow comparison.

bambu(
    reads = NULL,
    rcFile = NULL,
    rcOutDir = NULL,
    annotations = NULL,
    genome = NULL,
    stranded = FALSE,
    ncore = 1,
    NDR = 0.1,
    yieldSize = NULL,
    opt.discovery = NULL,
    opt.em = NULL,
    trackReads = FALSE,
    returnDistTable = FALSE,
    discovery = TRUE,
    quant = TRUE,
    fusionMode = TRUE,
    verbose = FALSE,
    lowMemory = FALSE
)

Arguments

reads	A string or a vector of strings specifying the paths of bam files for genomic alignments, or a `BamFile` object or a `BamFileList` object (see `Rsamtools`).
rcFile	A string or a vector of strings specifying the read class files that are saved during previous run of `bambu`.
rcOutDir	A string variable specifying the path to where read class files will be saved.
annotations	A `TxDb` object or A GRangesList object obtained by `prepareAnnotations`.
genome	A fasta file or a BSGenome object.
stranded	A boolean for strandedness, defaults to FALSE.
ncore	specifying number of cores used when parallel processing is used, defaults to 1.
NDR	specifying the maximum NDR rate to novel transcript output from detected read classes, defaults to 0.1
yieldSize	see `Rsamtools`.
opt.discovery	A list of controlling parameters for isoform reconstruction process: prefix specifying prefix for new gene Ids (genePrefix.number), defaults to empty remove.subsetTx indicating whether filter to remove read classes which are a subset of known transcripts(), defaults to TRUE min.readCount specifying minimun read count to consider a read class valid in a sample, defaults to 2 min.readFractionByGene specifying minimum relative read count per gene, highly expressed genes will have many high read count low relative abundance transcripts that can be filtered, defaults to 0.05 min.sampleNumber specifying minimum sample number with minimum read count, defaults to 1 min.exonDistance specifying minum distance to known transcript to be considered valid as new, defaults to 35bp min.exonOverlap specifying minimum number of bases shared with annotation to be assigned to the same gene id, defaults to 10bp min.primarySecondaryDist specifying the minimum number of distance threshold, defaults to 5bp min.primarySecondaryDistStartEnd1 specifying the minimum number of distance threshold, used for extending annotation, defaults to 5bp min.primarySecondaryDistStartEnd2 specifying the minimum number of distance threshold, used for estimating distance to annotation, defaults to 5bp min.txScore.multiExon specifying the minimum transcript level threshold for multi-exon transcripts during sample combining, defaults to 0 min.txScore.singleExon specifying the minimum transcript level threshold for single-exon transcripts during sample combining, defaults to 1
opt.em	A list of controlling parameters for quantification algorithm estimation process: maxiter specifying maximum number of run iterations, defaults to 10000 degradationBias correcting for degradation bias, defaults to TRUE conv specifying the covergence threshold control, defaults to 0.0001 minvalue specifying the minvalue for convergence consideration, defaults to 0.00000001
trackReads	When TRUE read names will be tracked and output as metadata in the final output as readToTranscriptMaps detailing. the assignment of reads to transcripts. The output is a list with an entry for each sample.
returnDistTable	When TRUE the calculated distance table between read classes and annotations will be output as metadata as distTables. The output is a list with an entry for each sample.
discovery	A logical variable indicating whether annotations are to be extended for quantification.
quant	A logical variable indicating whether quantification will be performed
fusionMode	A logical variable indicating whether run in fusion mode
verbose	A logical variable indicating whether processing messages will be printed.
lowMemory	Read classes will be processed by chromosomes when lowMemory is specified. This option provides an efficient way to process big samples.

Value

bambu will output different results depending on whether quant mode is on. By default, quant is set to TRUE, so bambu will generate a SummarizedExperiment object that contains the transcript expression estimates. Transcript expression estimates can be accessed by counts(), including the following variables

counts: expression estimates
CPM: sequencing depth normalized estimates
fullLengthCounts: estimates of read counts mapped as full length reads for each transcript
partialLengthCounts: estimates of read counts mapped as partial length reads for each transcript
uniqueCounts: counts of reads that are uniquely mapped to each transcript
theta: raw estimates

Output annotations that are usually the annotations with/without novel transcripts/genes added, depending on whether discovery mode is on can be accessed by rowRanges() Transcript to gene map can be accessed by rowData(), with eqClass that defining equivalent class for each transcript

In the case when quant is set to FALSE, i.e., only transcript discovery is performed, bambu will report the grangeslist of the extended annotations.

Details

Main function

Examples

## =====================
test.bam <- system.file("extdata",
    "SGNex_A549_directRNA_replicate5_run1_chr9_1_1000000.bam",
    package = "bambu")
fa.file <- system.file("extdata", 
    "Homo_sapiens.GRCh38.dna_sm.primary_assembly_chr9_1_1000000.fa", 
    package = "bambu")
gr <- readRDS(system.file("extdata", 
    "annotationGranges_txdbGrch38_91_chr9_1_1000000.rds",
    package = "bambu"))
se <- bambu(reads = test.bam, annotations = gr, 
    genome = fa.file, discovery = TRUE, quant = TRUE)
#> 'getOption("repos")' replaces Bioconductor standard repositories, see
#> '?repositories' for details
#> 
#> replacement repositories:
#>     CRAN: https://cloud.r-project.org
#> Start generating read class files
#> iteration: 
#> not all chromosomes present in reference annotations,
#>             annotations might be incomplete. Please compare objects
#>             on the same reference
#> not all chromosomes from reads present in reference genome 
#>             sequence, reads without reference chromosome sequence are dropped
#> [16:20:41] WARNING: amalgamation/../src/learner.cc:1040: 
#>   If you are loading a serialized model (like pickle in Python, RDS in R) generated by
#>   older XGBoost, please export the model by calling `Booster.save_model` from that version
#>   first, then load it back in current version. See:
#> 
#>     https://xgboost.readthedocs.io/en/latest/tutorials/saving_model.html
#> 
#>   for more details about differences between saving model and serializing.
#> 
#> [16:20:41] WARNING: amalgamation/../src/learner.cc:749: Found JSON model saved before XGBoost 1.6, please save the model using current version again. The support for old JSON model will be discontinued in XGBoost 2.3.
#> [16:20:41] WARNING: amalgamation/../src/learner.cc:438: 
#>   If you are loading a serialized model (like pickle in Python, RDS in R) generated by
#>   older XGBoost, please export the model by calling `Booster.save_model` from that version
#>   first, then load it back in current version. See:
#> 
#>     https://xgboost.readthedocs.io/en/latest/tutorials/saving_model.html
#> 
#>   for more details about differences between saving model and serializing.
#> 
#> [16:20:41] WARNING: amalgamation/../src/learner.cc:1040: 
#>   If you are loading a serialized model (like pickle in Python, RDS in R) generated by
#>   older XGBoost, please export the model by calling `Booster.save_model` from that version
#>   first, then load it back in current version. See:
#> 
#>     https://xgboost.readthedocs.io/en/latest/tutorials/saving_model.html
#> 
#>   for more details about differences between saving model and serializing.
#> 
#> [16:20:41] WARNING: amalgamation/../src/learner.cc:749: Found JSON model saved before XGBoost 1.6, please save the model using current version again. The support for old JSON model will be discontinued in XGBoost 2.3.
#> [16:20:41] WARNING: amalgamation/../src/learner.cc:438: 
#>   If you are loading a serialized model (like pickle in Python, RDS in R) generated by
#>   older XGBoost, please export the model by calling `Booster.save_model` from that version
#>   first, then load it back in current version. See:
#> 
#>     https://xgboost.readthedocs.io/en/latest/tutorials/saving_model.html
#> 
#>   for more details about differences between saving model and serializing.
#> 
#> [16:20:41] WARNING: amalgamation/../src/learner.cc:1040: 
#>   If you are loading a serialized model (like pickle in Python, RDS in R) generated by
#>   older XGBoost, please export the model by calling `Booster.save_model` from that version
#>   first, then load it back in current version. See:
#> 
#>     https://xgboost.readthedocs.io/en/latest/tutorials/saving_model.html
#> 
#>   for more details about differences between saving model and serializing.
#> 
#> [16:20:41] WARNING: amalgamation/../src/learner.cc:749: Found JSON model saved before XGBoost 1.6, please save the model using current version again. The support for old JSON model will be discontinued in XGBoost 2.3.
#> [16:20:41] WARNING: amalgamation/../src/learner.cc:438: 
#>   If you are loading a serialized model (like pickle in Python, RDS in R) generated by
#>   older XGBoost, please export the model by calling `Booster.save_model` from that version
#>   first, then load it back in current version. See:
#> 
#>     https://xgboost.readthedocs.io/en/latest/tutorials/saving_model.html
#> 
#>   for more details about differences between saving model and serializing.
#> 
#> [16:20:41] WARNING: amalgamation/../src/learner.cc:1040: 
#>   If you are loading a serialized model (like pickle in Python, RDS in R) generated by
#>   older XGBoost, please export the model by calling `Booster.save_model` from that version
#>   first, then load it back in current version. See:
#> 
#>     https://xgboost.readthedocs.io/en/latest/tutorials/saving_model.html
#> 
#>   for more details about differences between saving model and serializing.
#> 
#> [16:20:41] WARNING: amalgamation/../src/learner.cc:749: Found JSON model saved before XGBoost 1.6, please save the model using current version again. The support for old JSON model will be discontinued in XGBoost 2.3.
#> [16:20:41] WARNING: amalgamation/../src/learner.cc:438: 
#>   If you are loading a serialized model (like pickle in Python, RDS in R) generated by
#>   older XGBoost, please export the model by calling `Booster.save_model` from that version
#>   first, then load it back in current version. See:
#> 
#>     https://xgboost.readthedocs.io/en/latest/tutorials/saving_model.html
#> 
#>   for more details about differences between saving model and serializing.
#> 
#> Junction correction with not enough data,
#>             precalculated model is used
#> Not enough data points
#> Not enough TRUE/FALSE labels
#> Transcript model not trained. Using pre-trained models
#> [16:20:44] WARNING: amalgamation/../src/learner.cc:1040: 
#>   If you are loading a serialized model (like pickle in Python, RDS in R) generated by
#>   older XGBoost, please export the model by calling `Booster.save_model` from that version
#>   first, then load it back in current version. See:
#> 
#>     https://xgboost.readthedocs.io/en/latest/tutorials/saving_model.html
#> 
#>   for more details about differences between saving model and serializing.
#> 
#> [16:20:44] WARNING: amalgamation/../src/learner.cc:749: Found JSON model saved before XGBoost 1.6, please save the model using current version again. The support for old JSON model will be discontinued in XGBoost 2.3.
#> [16:20:44] WARNING: amalgamation/../src/learner.cc:438: 
#>   If you are loading a serialized model (like pickle in Python, RDS in R) generated by
#>   older XGBoost, please export the model by calling `Booster.save_model` from that version
#>   first, then load it back in current version. See:
#> 
#>     https://xgboost.readthedocs.io/en/latest/tutorials/saving_model.html
#> 
#>   for more details about differences between saving model and serializing.
#> 
#> [16:20:44] WARNING: amalgamation/../src/learner.cc:1040: 
#>   If you are loading a serialized model (like pickle in Python, RDS in R) generated by
#>   older XGBoost, please export the model by calling `Booster.save_model` from that version
#>   first, then load it back in current version. See:
#> 
#>     https://xgboost.readthedocs.io/en/latest/tutorials/saving_model.html
#> 
#>   for more details about differences between saving model and serializing.
#> 
#> [16:20:44] WARNING: amalgamation/../src/learner.cc:749: Found JSON model saved before XGBoost 1.6, please save the model using current version again. The support for old JSON model will be discontinued in XGBoost 2.3.
#> [16:20:44] WARNING: amalgamation/../src/learner.cc:438: 
#>   If you are loading a serialized model (like pickle in Python, RDS in R) generated by
#>   older XGBoost, please export the model by calling `Booster.save_model` from that version
#>   first, then load it back in current version. See:
#> 
#>     https://xgboost.readthedocs.io/en/latest/tutorials/saving_model.html
#> 
#>   for more details about differences between saving model and serializing.
#> 
#> 1
#> 
#> Finished generating read classes from genomic alignments.
#> iteration: 
#> 1
#> 
#> iteration: 
#> 1
#> 
#> iteration: 
#> 1
#> 
#> iteration: 
#> 1
#> 
#> Less than 50 TRUE or FALSE read classes for precision stabilization. 
#>           Filtering by prediction score instead
#> -- Predicting annotation completeness to determine NDR threshold --
#> Calculated NDR: 0.005
#> all detect novel transcripts are already present in the annotations, try a higher NDR
#> GRangesList object of length 105:
#> $ENST00000190165
#> GRanges object with 2 ranges and 2 metadata columns:
#>       seqnames        ranges strand | exon_rank exon_endRank
#>          <Rle>     <IRanges>  <Rle> | <integer>    <integer>
#>   [1]        9 976964-977455      + |         1            2
#>   [2]        9 990041-991731      + |         2            1
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
#> 
#> $ENST00000305248
#> GRanges object with 2 ranges and 2 metadata columns:
#>       seqnames      ranges strand | exon_rank exon_endRank
#>          <Rle>   <IRanges>  <Rle> | <integer>    <integer>
#>   [1]        9 34965-35264      - |         2            1
#>   [2]        9 35504-35871      - |         1            2
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
#> 
#> $ENST00000314367
#> GRanges object with 16 ranges and 2 metadata columns:
#>        seqnames        ranges strand | exon_rank exon_endRank
#>           <Rle>     <IRanges>  <Rle> | <integer>    <integer>
#>    [1]        9 121060-121573      - |        16            1
#>    [2]        9 121961-122090      - |        15            2
#>    [3]        9 123217-123282      - |        14            3
#>    [4]        9 123386-123454      - |        13            4
#>    [5]        9 134979-135030      - |        12            5
#>    ...      ...           ...    ... .       ...          ...
#>   [12]        9 172081-172172      - |         5           12
#>   [13]        9 173270-173366      - |         4           13
#>   [14]        9 175698-175784      - |         3           14
#>   [15]        9 177723-177820      - |         2           15
#>   [16]        9 178816-179047      - |         1           16
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
#> 
#> ...
#> <102 more elements>
#> DataFrame with 105 rows and 10 columns
#>                          TXNAME          GENEID      txid
#>                     <character>     <character> <logical>
#> ENST00000190165 ENST00000190165 ENSG00000064218        NA
#> ENST00000305248 ENST00000305248 ENSG00000218839        NA
#> ENST00000314367 ENST00000314367 ENSG00000172785        NA
#> ENST00000354485 ENST00000354485 ENSG00000107104        NA
#> ENST00000356521 ENST00000356521 ENSG00000172785        NA
#> ...                         ...             ...       ...
#> ENST00000620292 ENST00000620292 ENSG00000172785        NA
#> ENST00000620326 ENST00000620326 ENSG00000277631        NA
#> ENST00000621255 ENST00000621255 ENSG00000277631        NA
#> ENST00000622412 ENST00000622412 ENSG00000277631        NA
#> ENST00000628764 ENST00000628764 ENSG00000227155        NA
#>                                eqClass eqClassById readCount  newTxClass
#>                            <character>   <logical> <integer> <character>
#> ENST00000190165        ENST00000190165          NA        NA  annotation
#> ENST00000305248 ENST00000305248.ENST..          NA        NA  annotation
#> ENST00000314367        ENST00000314367          NA        NA  annotation
#> ENST00000354485 ENST00000354485.ENST..          NA        NA  annotation
#> ENST00000356521 ENST00000356521.ENST..          NA        22  annotation
#> ...                                ...         ...       ...         ...
#> ENST00000620292 ENST00000498044.ENST..          NA        NA  annotation
#> ENST00000620326        ENST00000620326          NA        NA  annotation
#> ENST00000621255        ENST00000621255          NA        NA  annotation
#> ENST00000622412        ENST00000622412          NA        NA  annotation
#> ENST00000628764        ENST00000628764          NA        NA  annotation
#>                     txNDR relReadCount relSubsetCount
#>                 <numeric>    <numeric>      <numeric>
#> ENST00000190165        NA           NA             NA
#> ENST00000305248        NA           NA             NA
#> ENST00000314367        NA           NA             NA
#> ENST00000354485        NA           NA             NA
#> ENST00000356521 0.0654418     0.468085              1
#> ...                   ...          ...            ...
#> ENST00000620292        NA           NA             NA
#> ENST00000620326        NA           NA             NA
#> ENST00000621255        NA           NA             NA
#> ENST00000622412        NA           NA             NA
#> ENST00000628764        NA           NA             NA
#> Finished extending annotations.
#> Start isoform quantification
#> iteration: 
#> 1
#> 
#> Finished isoform quantification.