Description

SeqStat is a package that contains tools to generate stats from a FastQ file, merge those stats for multiple samples, and validate the generated stats files.

Mode - Generate

Generate outputs several stats on a FASTQ file.

Outputted stats:

  • Bases
      • Total number
      • Base qualities, with the number of bases having that quality
      • Number of each nucleotide
  • Reads
      • Total number
      • Minimum length
      • Maximum length
      • A histogram of the average base qualities
      • The quality encoding (Sanger, Solexa, etc.)
      • A histogram of the read lengths
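
To make the list above concrete, here is a minimal Python sketch that collects the same kinds of counts from a FASTQ stream. It is an illustration only, not SeqStat's implementation: the function names, the dictionary keys, and the fixed quality offset of 33 (Sanger) are all assumptions made for this example, and it does not attempt to detect the quality encoding.

```python
from collections import Counter
from io import StringIO


def read_fastq(handle):
    """Yield (sequence, quality) pairs from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header.strip():
            return  # end of input (or a trailing blank line)
        sequence = handle.readline().strip()
        handle.readline()  # '+' separator line
        quality = handle.readline().strip()
        yield sequence, quality


def fastq_stats(handle, offset=33):
    """Collect base- and read-level counts; offset=33 assumes Sanger encoding."""
    nucleotides = Counter()
    base_qualities = Counter()
    read_lengths = Counter()
    avg_qualities = Counter()
    for sequence, quality in read_fastq(handle):
        nucleotides.update(sequence)
        scores = [ord(char) - offset for char in quality]
        base_qualities.update(scores)
        read_lengths[len(sequence)] += 1
        if scores:
            # Integer mean quality of the read, used as the histogram bin
            avg_qualities[sum(scores) // len(scores)] += 1
    # Hypothetical key names, not SeqStat's JSON schema
    return {
        "bases": {
            "total": sum(nucleotides.values()),
            "qualities": dict(base_qualities),
            "nucleotides": dict(nucleotides),
        },
        "reads": {
            "total": sum(read_lengths.values()),
            "min_length": min(read_lengths, default=0),
            "max_length": max(read_lengths, default=0),
            "avg_quality_histogram": dict(avg_qualities),
            "length_histogram": dict(read_lengths),
        },
    }


# Two tiny reads: one of length 4 with quality 'I', one of length 2 with quality '!'
example = StringIO("@r1\nACGT\n+\nIIII\n@r2\nAC\n+\n!!\n")
stats = fastq_stats(example)
```
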

Mode - Merge

This mode merges seqstat files and keeps the sample/library/readgroup structure. If required, the result can also be collapsed; the output file then does not have any sample/library/readgroup structure.
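
The difference between a structured and a collapsed merge can be sketched in Python as follows. This is a conceptual illustration only: the nested sample -> library -> readgroup -> histogram dictionaries and the function names are assumptions for this example and do not reflect the actual seqstat file layout.

```python
from collections import Counter


def merge_counts(a, b):
    """Add two histograms (mappings of value -> count) key by key."""
    return dict(Counter(a) + Counter(b))


def merge_nested(files):
    """Merge sample -> library -> readgroup -> histogram trees, keeping the structure."""
    merged = {}
    for tree in files:
        for sample, libraries in tree.items():
            for library, readgroups in libraries.items():
                for readgroup, histogram in readgroups.items():
                    slot = merged.setdefault(sample, {}).setdefault(library, {})
                    slot[readgroup] = merge_counts(slot.get(readgroup, {}), histogram)
    return merged


def collapse(tree):
    """Sum every readgroup histogram into one, dropping all three levels."""
    total = Counter()
    for libraries in tree.values():
        for readgroups in libraries.values():
            for histogram in readgroups.values():
                total.update(histogram)
    return dict(total)


# Hypothetical inputs: nucleotide counts per readgroup
file_a = {"s1": {"lib1": {"rg1": {"A": 3, "C": 1}}}}
file_b = {"s1": {"lib1": {"rg2": {"A": 2}}}, "s2": {"lib1": {"rg1": {"G": 5}}}}

merged = merge_nested([file_a, file_b])   # structure is kept
collapsed = collapse(merged)              # structure is dropped
```
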

Mode - Validate

This mode validates a file generated by SeqStat. If the aggregation values cannot be regenerated, the file is considered corrupt. This should only happen when the seqstat file has been edited manually.
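
The idea of regenerating aggregation values can be sketched in Python as below. The field names (total_bases, total_reads, nucleotides, length_histogram) are hypothetical and chosen for this example; the checks SeqStat actually performs may differ.

```python
def is_consistent(stats):
    """Return True when stored totals match values recomputed from the detailed counts."""
    nucleotides = stats["nucleotides"]
    lengths = stats["length_histogram"]
    recomputed_bases = sum(nucleotides.values())
    recomputed_reads = sum(lengths.values())
    # JSON object keys are strings, so lengths are converted back to integers
    bases_from_lengths = sum(int(length) * count for length, count in lengths.items())
    return (
        stats["total_bases"] == recomputed_bases
        and stats["total_reads"] == recomputed_reads
        and bases_from_lengths == recomputed_bases
    )


# Hypothetical stats for two reads of length 4 and 2
good = {
    "total_bases": 6,
    "total_reads": 2,
    "nucleotides": {"A": 2, "C": 2, "G": 1, "T": 1},
    "length_histogram": {"4": 1, "2": 1},
}
# Simulates a manual edit of an aggregated value
edited = dict(good, total_bases=7)

good_ok = is_consistent(good)
edited_ok = is_consistent(edited)
```
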

Installation

SeqStat requires Java 8 to be installed on your device. Download Java 8 here or install via your distribution's package manager.

Download the latest version of SeqStat here. To print the usage, run:

java -jar <SeqStat_jar> --help

Manual

Mode - Generate

By default, stats are written to stdout in JSON format. If an output file is specified, the stats are written to that file in JSON format instead.

Mode - Merge

When merging files, SeqStat validates both the input files and the output files. If the aggregation values cannot be regenerated, the file is considered corrupt.

Mode - Validate

See example.

Example

Mode - Generate

To run SeqStat and save the output in a JSON file:

java -jar <SeqStat_jar> generate \
-i input.fastq \
-o output.json \
--sample <sample_name> \
--library <library name> \
--readgroup <readgroup name>
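
When SeqStat is called from a script, the same command can be assembled as an argument list. The Python sketch below only builds and prints the command using the options shown above; the jar path and the sample, library, and readgroup values are placeholders, and the helper name generate_command is an assumption for this example.

```python
import shlex


def generate_command(jar, fastq, output, sample, library, readgroup):
    """Build the argument list for SeqStat's generate mode."""
    return [
        "java", "-jar", jar, "generate",
        "-i", fastq,
        "-o", output,
        "--sample", sample,
        "--library", library,
        "--readgroup", readgroup,
    ]


# Placeholder values; replace them with your own paths and names
command = generate_command(
    "seqstat.jar", "input.fastq", "output.json", "sample1", "lib1", "rg1"
)
print(shlex.join(command))
# To actually run it: subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems when file or sample names contain spaces.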

Mode - Merge

Merging multiple files:

java -jar <SeqStat_jar> merge \
-i <seqstat file> \
-i <seqstat file> \
-o <output file>

Merging multiple files as collapsed format:

java -jar <SeqStat_jar> merge \
-i <seqstat file> \
-i <seqstat file> \
--combinedOutputFile <output file>

Both output formats at the same time:

java -jar <SeqStat_jar> merge \
-i <seqstat file> \
-i <seqstat file> \
-o <output file> \
--combinedOutputFile <output file>

Mode - Validate

Default:

java -jar <SeqStat_jar> validate \
-i <input file>

Usage

Usage for SeqStat:

| Option | Required | Can occur multiple times | Description |
| --- | --- | --- | --- |
| --log_level, -l | no | no | Level of log information printed. Possible levels: 'debug', 'info', 'warn', 'error' |
| --help, -h | no | no | Print usage |
| --version, -v | no | no | Print version |
| toolName | no | no | Name of the tool to execute |
| tool args | no | yes (unlimited) | Arguments for the tool |

Usage for Mode - Generate:

| Option | Required | Can occur multiple times | Description |
| --- | --- | --- | --- |
| --log_level, -l | no | no | Level of log information printed. Possible levels: 'debug', 'info', 'warn', 'error' |
| --help, -h | no | no | Print usage |
| --version, -v | no | no | Print version |
| --fastqR1, -i | yes | no | FastQ file to generate stats from |
| --fastqR2, -j | no | no | FastQ file to generate stats from |
| --output, -o | yes | no | File to write output to; if not supplied, output goes to stdout |
| --sample | yes | no | Sample name |
| --library | yes | no | Library name |
| --readgroup | yes | no | Readgroup name |

Usage for Mode - Merge:

| Option | Required | Can occur multiple times | Description |
| --- | --- | --- | --- |
| --log_level, -l | no | no | Level of log information printed. Possible levels: 'debug', 'info', 'warn', 'error' |
| --help, -h | no | no | Print usage |
| --version, -v | no | no | Print version |
| --inputFile, -i | yes | yes (unlimited) | Files to merge into a single file |
| --outputFile, -o | no | no | Output file |
| --combinedOutputFile | no | no | Combined output file |

Usage for Mode - Validate:

| Option | Required | Can occur multiple times | Description |
| --- | --- | --- | --- |
| --log_level, -l | no | no | Level of log information printed. Possible levels: 'debug', 'info', 'warn', 'error' |
| --help, -h | no | no | Print usage |
| --version, -v | no | no | Print version |
| --inputFile, -i | yes | no | File to validate schema |

About

SeqStat is part of the BIOPET tool suite, which is developed at LUMC by the SASC team. Each tool in the BIOPET tool suite is meant to offer a standalone function that can be used to perform a dedicated data analysis task or be added as part of a pipeline, for example the SASC team's biowdl pipelines.

All tools in the BIOPET tool suite are Free/Libre and Open Source Software.

Contributing

The source code of SeqStat can be found here. We welcome any contributions. Bug reports, feature requests and feedback can be submitted at our issue tracker.

SeqStat is built using sbt. Before submitting a pull request, make sure all tests pass by running sbt test from the project's root. We recommend using an IDE to work on SeqStat. We have had good results with this IDE.

Contact

For any questions related to SeqStat, please use the GitHub issue tracker or contact the SASC team directly at: sasc@lumc.nl.