Description

SeqStat is a package that contains tools to generate stats from a FastQ file, merge those stats for multiple samples, and validate the generated stats files.

Mode - Generate

Generate outputs several stats on a FASTQ file.

Outputted stats:

Bases
    Total number
    Base qualities, with the number of bases having that quality
    Number of each nucleotide

Reads
    Total number
    Minimum length
    Maximum length
    A histogram of the average base qualities
    The quality encoding (Sanger, Solexa, etc.)
    A histogram of the read lengths
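
To make the listed stats concrete, here is a minimal Python sketch that computes most of them from a FASTQ stream. It is not SeqStat's implementation: the function names and the layout of the returned dictionary are hypothetical, the input is assumed to use four lines per record, and Sanger encoding (offset 33) is assumed rather than detected.

```python
from collections import Counter
from io import StringIO


def fastq_records(handle):
    """Yield (sequence, quality string) pairs from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header.strip():
            return  # end of file (or a blank line) ends the stream
        seq = handle.readline().strip()
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().strip()
        yield seq, qual


def fastq_stats(handle, offset=33):
    """Collect base and read stats; offset=33 assumes Sanger-encoded qualities."""
    nucleotides = Counter()      # number of each nucleotide
    base_qualities = Counter()   # quality value -> number of bases
    read_lengths = Counter()     # read length -> number of reads
    average_qualities = Counter()  # integer mean quality -> number of reads
    for seq, qual in fastq_records(handle):
        nucleotides.update(seq)
        quals = [ord(char) - offset for char in qual]
        base_qualities.update(quals)
        read_lengths[len(seq)] += 1
        if quals:
            average_qualities[sum(quals) // len(quals)] += 1
    return {
        "bases": {
            "total": sum(nucleotides.values()),
            "qualities": dict(base_qualities),
            "nucleotides": dict(nucleotides),
        },
        "reads": {
            "total": sum(read_lengths.values()),
            "min_length": min(read_lengths) if read_lengths else 0,
            "max_length": max(read_lengths) if read_lengths else 0,
            "average_qualities": dict(average_qualities),
            "lengths": dict(read_lengths),
        },
    }


# Two tiny reads: 'I' decodes to quality 40 and '!' to quality 0 at offset 33.
example = StringIO("@r1\nACGT\n+\nIIII\n@r2\nAC\n+\n!!\n")
stats = fastq_stats(example)
print(stats["bases"]["total"], stats["reads"]["min_length"], stats["reads"]["max_length"])
```

The real tool also reports the quality encoding; the sketch leaves that out and simply takes the offset as a parameter.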

Mode - Merge

This module merges seqstat files together and keeps the sample/library/readgroup structure. If required, this structure can also be collapsed; the output file then does not have any sample/library/readgroup structure.
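
The difference between the structured and the collapsed output can be illustrated with a small Python sketch. The data layout here (sample -> library -> readgroup -> read length histogram) and the function names are assumptions made for illustration, not the actual seqstat file format. Summing counts when the same readgroup appears in two inputs is also an assumption.

```python
from collections import Counter


def merge_nested(first, second):
    """Merge two sample -> library -> readgroup -> histogram mappings.

    The structure is kept; a readgroup present in both inputs has its counts
    summed (assumed behaviour for this sketch).
    """
    merged = {
        sample: {library: dict(readgroups) for library, readgroups in libraries.items()}
        for sample, libraries in first.items()
    }
    for sample, libraries in second.items():
        for library, readgroups in libraries.items():
            target = merged.setdefault(sample, {}).setdefault(library, {})
            for readgroup, histogram in readgroups.items():
                combined = Counter(target.get(readgroup, {}))
                combined.update(histogram)
                target[readgroup] = dict(combined)
    return merged


def collapse(nested):
    """Sum every histogram into one, dropping the sample/library/readgroup levels."""
    total = Counter()
    for libraries in nested.values():
        for readgroups in libraries.values():
            for histogram in readgroups.values():
                total.update(histogram)
    return dict(total)


file_a = {"s1": {"lib1": {"rg1": {100: 2, 150: 1}}}}
file_b = {"s1": {"lib1": {"rg2": {100: 3}}}, "s2": {"lib1": {"rg1": {150: 4}}}}

merged = merge_nested(file_a, file_b)
collapsed = collapse(merged)
print(collapsed)  # {100: 5, 150: 5}
```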

Mode - Validate

Validate checks a file generated by SeqStat. If the aggregation values cannot be regenerated, the file is considered corrupt. This should only happen when the user has edited the seqstat file manually.
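
As an illustration of what "regenerating aggregation values" could mean, the Python sketch below recomputes a read total and the minimum and maximum length from a read length histogram and compares them with the stored values. The field names are hypothetical and the check is a simplification; the checks SeqStat actually performs may differ.

```python
def aggregates_consistent(stats):
    """Return True if the stored aggregates match values recomputed from the histogram."""
    histogram = stats["read_length_histogram"]  # hypothetical field name
    if not histogram:
        return stats["total_reads"] == 0
    return (
        stats["total_reads"] == sum(histogram.values())
        and stats["min_length"] == min(histogram)
        and stats["max_length"] == max(histogram)
    )


intact = {
    "total_reads": 3,
    "min_length": 100,
    "max_length": 150,
    "read_length_histogram": {100: 2, 150: 1},
}
# Simulate a manual edit that changes the total but not the histogram.
edited = dict(intact, total_reads=4)

print(aggregates_consistent(intact))  # True
print(aggregates_consistent(edited))  # False
```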

Installation

SeqStat requires Java 8 to be installed on your device. Download Java 8 here or install via your distribution's package manager.

Download the latest version of SeqStat here. To print the usage, run:

java -jar <SeqStat_jar> --help

Manual

Mode - Generate

By default, stats are written to stdout in JSON format. If an output file is specified, the JSON is written to that file instead.

Mode - Merge

When merging, SeqStat validates both the input files and the output files. If the aggregation values cannot be regenerated, the file is considered corrupt.

Mode - Validate

See example.

Example

Mode - Generate

To run SeqStat and save the output in a JSON file:

java -jar <SeqStat_jar> generate \
-i input.fastq \
-o output.json \
--sample <sample_name> \
--library <library name> \
--readgroup <readgroup name>
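
When the generate command has to be run for many samples from a script, it can help to build the argument list programmatically. The Python sketch below only assembles the command shown above; the helper name, the jar path, and the file names are placeholders, and the call is left to `subprocess.run`.

```python
def generate_command(jar, fastq, output, sample, library, readgroup):
    """Build the argument list for the generate mode, using the flags shown above."""
    return [
        "java", "-jar", jar, "generate",
        "-i", fastq,
        "-o", output,
        "--sample", sample,
        "--library", library,
        "--readgroup", readgroup,
    ]


# Placeholder paths and names, for illustration only.
command = generate_command(
    "SeqStat.jar", "input.fastq", "output.json", "sample1", "lib1", "rg1"
)
print(" ".join(command))
```

To execute it, pass the list to `subprocess.run(command, check=True)`; passing a list rather than a single string avoids shell quoting problems with unusual file names.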

Mode - Merge

Merging multiple files:

java -jar <SeqStat_jar> merge \
-i <seqstat file> \
-i <seqstat file> \
-o <output file>

Merging multiple files as collapsed format:

java -jar <SeqStat_jar> merge \
-i <seqstat file> \
-i <seqstat file> \
--combinedOutputFile <output file>

Both output formats at the same time:

java -jar <SeqStat_jar> merge \
-i <seqstat file> \
-i <seqstat file> \
-o <output file> \
--combinedOutputFile <output file>
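
Because `-i` is repeated once per input file, a script that merges a variable number of seqstat files has to expand the list itself. A minimal Python sketch, with a hypothetical helper name and placeholder file names, that covers the three variants above:

```python
def merge_command(jar, inputs, output=None, combined_output=None):
    """Build the argument list for the merge mode; -i is repeated for every input."""
    if not inputs:
        raise ValueError("at least one input file is needed")
    command = ["java", "-jar", jar, "merge"]
    for path in inputs:
        command += ["-i", path]
    if output is not None:
        command += ["-o", output]
    if combined_output is not None:
        command += ["--combinedOutputFile", combined_output]
    return command


# Placeholder file names, for illustration only.
both = merge_command(
    "SeqStat.jar",
    ["a.json", "b.json", "c.json"],
    output="merged.json",
    combined_output="collapsed.json",
)
print(" ".join(both))
```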

Mode - Validate

Default:

java -jar <SeqStat_jar> validate \
-i <input file>

Usage

Usage for SeqStat:

Option	Required	Can occur multiple times	Description
--log_level, -l	no	no	Level of log information printed. Possible levels: 'debug', 'info', 'warn', 'error'
--help, -h	no	no	Print usage
--version, -v	no	no	Print version
toolName	no	no	Name of the tool to execute
tool args	no	yes (unlimited)	Arguments for the tool

Usage for Mode - Generate:

Option	Required	Can occur multiple times	Description
--log_level, -l	no	no	Level of log information printed. Possible levels: 'debug', 'info', 'warn', 'error'
--help, -h	no	no	Print usage
--version, -v	no	no	Print version
--fastqR1, -i	yes	no	FastQ file to generate stats from
--fastqR2, -j	no	no	FastQ file to generate stats from
--output, -o	yes	no	File to write output to; if not supplied, output goes to stdout
--sample	yes	no	Sample name
--library	yes	no	Library name
--readgroup	yes	no	Readgroup name

Usage for Mode - Merge:

Option	Required	Can occur multiple times	Description
--log_level, -l	no	no	Level of log information printed. Possible levels: 'debug', 'info', 'warn', 'error'
--help, -h	no	no	Print usage
--version, -v	no	no	Print version
--inputFile, -i	yes	yes (unlimited)	Files to merge into a single file
--outputFile, -o	no	no	Output file
--combinedOutputFile	no	no	Combined output file

Usage for Mode - Validate:

Option	Required	Can occur multiple times	Description
--log_level, -l	no	no	Level of log information printed. Possible levels: 'debug', 'info', 'warn', 'error'
--help, -h	no	no	Print usage
--version, -v	no	no	Print version
--inputFile, -i	yes	no	File whose schema is validated

About

SeqStat is part of the BIOPET tool suite that is developed at LUMC by the SASC team. Each tool in the BIOPET tool suite is meant to offer a standalone function that can be used to perform a dedicated data analysis task or added as part of a pipeline, for example the SASC team's biowdl pipelines.

All tools in the BIOPET tool suite are Free/Libre and Open Source Software.

Contributing

The source code of SeqStat can be found here. We welcome any contributions. Bug reports, feature requests and feedback can be submitted at our issue tracker.

SeqStat is built using sbt. Before submitting a pull request, make sure all tests pass by running sbt test from the project's root. We recommend using an IDE to work on SeqStat. We have had good results with this IDE.

Contact

For any questions related to SeqStat, please use the GitHub issue tracker or contact the SASC team directly at: sasc@lumc.nl.