Description

This tool extracts reads from a BAM file based on alignment intervals. For example, if you are interested in a specific genomic location, the tool extracts the full reads aligned to that location. It is also useful for creating test data sets.

Installation

ExtractAlignedFastq requires Java 8 to be installed on your device. Download Java 8 here or install via your distribution's package manager.

Download the latest version of ExtractAlignedFastq here. To print the usage, run:

java -jar <ExtractAlignedFastq_jar> --help

Manual

This tool creates FASTQ file(s) containing reads mapped to the given alignment intervals. The FASTQ files that were used to create the BAM file are also required, because the tool uses them to retrieve the full FASTQ records of reads that map to the given region. This is useful because some records may have been modified before alignment, for example by quality trimming. In that case, the aligned SAM records would contain only the modified sequence.

  • This tool works for both single-end and paired-end data. The example below can easily be extended to paired-end data by adding: --in2 myFastqR2.fastq --out2 myOutFastqR2.fastq
  • An interval is a genomic position, or one of several genomic positions, from which you want to extract reads.
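
The paired-end variant described above can be sketched as a dry run. The jar path `ExtractAlignedFastq.jar` is a placeholder for wherever you saved the download, and the file names are the illustrative ones used in this document. The function only prints the command so that it can be checked before anything is executed.

```shell
# Placeholder path (an assumption); point this at the jar you downloaded.
JAR="ExtractAlignedFastq.jar"

# Print the paired-end command instead of running it (dry run).
paired_end_cmd() {
  echo java -jar "$JAR" \
    --input_file myBam.bam \
    --in1 myFastq_R1.fastq \
    --in2 myFastqR2.fastq \
    --interval chr5:100-200 \
    --out1 output.fastq \
    --out2 myOutFastqR2.fastq
}

paired_end_cmd
```

Remove the `echo` to run the command once the jar and the input files are in place.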

Example

To extract reads from myBam.bam that originate from myFastq_R1.fastq and align to chr5 at positions 100-200:

java -jar <ExtractAlignedFastq_jar> \
--input_file myBam.bam \
--in1 myFastq_R1.fastq \
--interval chr5:100-200 \
--out1 output.fastq
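
Since the usage table lists --interval as an option that can occur multiple times, several regions can presumably be requested in one run by repeating the flag. The sketch below assumes exactly that. The second region `chr5:500-600` and the jar path are made up for illustration, and the command is only printed, not executed.

```shell
# Placeholder path (an assumption); point this at the jar you downloaded.
JAR="ExtractAlignedFastq.jar"

# Print a command with one --interval flag per region (dry run).
multi_interval_cmd() {
  # Fixed arguments first.
  set -- --input_file myBam.bam --in1 myFastq_R1.fastq --out1 output.fastq
  # Append one --interval per region; the second region is hypothetical.
  for region in chr5:100-200 chr5:500-600; do
    set -- "$@" --interval "$region"
  done
  echo java -jar "$JAR" "$@"
}

multi_interval_cmd
```

Remove the `echo` to run the command for real.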

Usage

Usage for ExtractAlignedFastq:

| Option | Required | Can occur multiple times | Description |
| --- | --- | --- | --- |
| --log_level, -l | no | no | Level of log information printed. Possible levels: 'debug', 'info', 'warn', 'error' |
| --help, -h | no | no | Print usage |
| --version, -v | no | no | Print version |
| --input_file, -I | yes | no | Input BAM file |
| --interval, -r | yes | yes (unlimited) | Interval strings (e.g. chr1:1-100) |
| --in1, -i | yes | no | Input FASTQ file 1 |
| --in2, -j | no | no | Input FASTQ file 2 (default: none) |
| --out1, -o | yes | no | Output FASTQ file 1 |
| --out2, -p | no | no | Output FASTQ file 2 (default: none) |
| --min_mapq, -Q | no | no | Minimum MAPQ of reads in target region to remove (default: 0) |
| --read_suffix_length, -s | no | no | Length of suffix mark from each read pair (default: 0). This is used for distinguishing read pairs with different suffixes. For example, if your FASTQ records end with `/1` for the first read of a pair and `/2` for the second, the value of `read_suffix_length` should be 2. |
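
To see why a suffix length of 2 fits the `/1` and `/2` case, the sketch below removes the last two characters from two mate names and shows that both reduce to the same name. This only illustrates the idea; how the tool handles names internally is not described here, and `strip_suffix` and the read name `read42` are hypothetical.

```shell
# Hypothetical helper: drop the last N characters of a read name.
# $1 = read name, $2 = suffix length
strip_suffix() {
  name=$1
  n=$2
  while [ "$n" -gt 0 ]; do
    name=${name%?}   # remove one trailing character
    n=$((n - 1))
  done
  printf '%s\n' "$name"
}

# Both mates print the same name once the 2-character suffix is removed.
strip_suffix "read42/1" 2
strip_suffix "read42/2" 2
```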

About

ExtractAlignedFastq is part of the BIOPET tool suite, which is developed at LUMC by the SASC team. Each tool in the BIOPET tool suite is meant to offer a standalone function that can be used to perform a dedicated data analysis task or be added as part of a pipeline, for example the SASC team's biowdl pipelines.

All tools in the BIOPET tool suite are Free/Libre and Open Source Software.

Contributing

The source code of ExtractAlignedFastq can be found here. We welcome any contributions. Bug reports, feature requests and feedback can be submitted at our issue tracker.

ExtractAlignedFastq is built using sbt. Before submitting a pull request, make sure all tests pass by running sbt test from the project's root. We recommend using an IDE to work on ExtractAlignedFastq. We have had good results with this IDE.

Contact

For any questions related to ExtractAlignedFastq, please use the GitHub issue tracker or contact the SASC team directly at: sasc@lumc.nl.