Description

VcfStats is a tool that generates metrics from a VCF file. It can produce the following statistics:

  • General stats (default, can be disabled)
  • Genotype stats (default, can be disabled)
  • Sample compare (default, can be disabled)
  • Sample distributions (default, can be disabled)
  • Field histograms

This tool can run locally single-threaded, but also on an Apache Spark cluster.

Installation

VcfStats requires Java 8 to be installed on your device. Download Java 8 here or install via your distribution's package manager.

Download the latest version of VcfStats here. To print the usage, run:

java -jar <VcfStats_jar> --help

Manual

Almost all stats are enabled by default but can be disabled if required. With a large number of samples, the sample compare step is very computationally intensive.

The reference fasta should have at least a dict file next to it.

For the field histograms there are several methods to choose from. The selected method determines which value is reported in the histogram when a field contains multiple values. Some methods require the values to be an Int or a Double; if this is not the case, an exception is thrown.

  • Min, takes the minimum value (Int or Double required)
  • Max, takes the maximum value (Int or Double required)
  • Avg, calculates the average (Int or Double required)
  • All, returns all values as separated values
  • Unique, returns all unique values as separated values
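
The methods above can be sketched in a few lines of Python. This is an illustration only, not the actual VcfStats implementation: the function name reduce_values and the sample values are hypothetical, and numeric values are parsed as floats for simplicity.

```python
from statistics import mean


def reduce_values(values, method):
    """Collapse the values of one field into the value(s) counted in a histogram.

    Hypothetical helper that mirrors the documented methods; it is not
    taken from the VcfStats source code.
    """
    numeric_methods = {"Min": min, "Max": max, "Avg": mean}
    if method in numeric_methods:
        # float() raises ValueError when a value is not numeric
        numbers = [float(v) for v in values]
        return [numeric_methods[method](numbers)]
    if method == "All":
        return list(values)
    if method == "Unique":
        # dict.fromkeys drops duplicates while keeping first-seen order
        return list(dict.fromkeys(values))
    raise ValueError(f"Unknown method: {method}")


values = ["3", "7", "7", "10"]
for method in ("Min", "Max", "Avg", "All", "Unique"):
    print(method, reduce_values(values, method))
```

With the sample values, Min, Max and Avg each yield a single number, All keeps the duplicate 7, and Unique reports it once. Passing a non-numeric value to Min, Max or Avg raises an error in this sketch, analogous to the exception described above.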

Example

Default VcfStats run:

java -jar <VcfStats_jar> \
-I <input_vcf> \
-o <output_dir> \
-R <reference_fasta>

Run only specific regions:

java -jar <VcfStats_jar> \
-I <input_vcf> \
-o <output_dir> \
-R <reference_fasta> \
--intervals <bed_file>

Create a histogram of an info field (for the available methods, see the Manual section):

java -jar <VcfStats_jar> \
-I <input_vcf> \
-o <output_dir> \
-R <reference_fasta> \
--infoTag <tag_id>:All

Create a histogram of a genotype field (for the available methods, see the Manual section):

java -jar <VcfStats_jar> \
-I <input_vcf> \
-o <output_dir> \
-R <reference_fasta> \
--genotypeTag <tag_id>:All

Run VcfStats on Spark:

spark-submit <spark arguments> <VcfStats_jar> \
-I <input_vcf> \
-o <output_dir> \
-R <reference_fasta> \
--sparkMaster <spark_master>

Usage

Usage for VcfStats:

Option                     Required  Can occur multiple times  Description
--log_level, -l            no        no                        Level of log information printed. Possible levels: 'debug', 'info', 'warn', 'error'
--help, -h                 no        no                        Print usage
--version, -v              no        no                        Print version
--inputFile, -I            yes       no                        Input VCF file (required)
--referenceFile, -R        yes       no                        Fasta reference which was used to call input VCF (required)
--outputDir, -o            yes       no                        Path to directory for output (required)
--intervals                no        no                        Path to interval (BED) file (optional)
--infoTag                  no        yes (unlimited)           Summarize these info tags
--genotypeTag              no        yes (unlimited)           Summarize these genotype tags
--sampleToSampleMinDepth   no        no                        Minimum depth required to consider a sample-to-sample comparison
--binSize                  no        no                        Bin size in estimated base pairs
--maxContigsInSingleJob    no        no                        Max number of bins to be combined, default is 250
--writeBinStats            no        no                        Write bin statistics, default is false
--localThreads, -t         no        no                        Number of local threads to use
--notWriteContigStats      no        no                        Do not write contig statistics
--skipGeneral              no        no                        Skip general stats
--skipGenotype             no        no                        Skip genotype stats
--skipSampleDistributions  no        no                        Skip sample distributions stats
--skipSampleCompare        no        no                        Skip sample compare
--repartition              no        no                        Repartition after reading records (only in Spark mode)
--executeModulesAsJobs     no        no                        Execute modules as jobs (only in Spark mode)
--sparkMaster              no        no                        Spark master to use
--sparkExecutorMemory      no        no                        Spark executor memory to use
--sparkConfigValue         no        yes (unlimited)           Add values to the Spark config
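
Because --infoTag and --genotypeTag can occur multiple times, a command line with several histograms can be assembled programmatically. The following Python sketch only builds the argument list from options listed above. The helper name build_vcfstats_args, the file names and the tag names DP and AF are hypothetical placeholders, not values prescribed by VcfStats.

```python
import shlex


def build_vcfstats_args(jar, input_vcf, output_dir, reference,
                        info_tags=(), genotype_tags=(), skip=()):
    """Build a VcfStats command line as a list of arguments.

    info_tags and genotype_tags are (tag_id, method) pairs; skip holds
    module names such as "SampleCompare", which map to --skip<Module>.
    """
    args = ["java", "-jar", jar,
            "-I", input_vcf, "-o", output_dir, "-R", reference]
    for tag, method in info_tags:
        # one --infoTag flag per histogram
        args += ["--infoTag", f"{tag}:{method}"]
    for tag, method in genotype_tags:
        # one --genotypeTag flag per histogram
        args += ["--genotypeTag", f"{tag}:{method}"]
    for module in skip:
        args.append(f"--skip{module}")
    return args


# Hypothetical file and tag names, for illustration only
args = build_vcfstats_args(
    "VcfStats.jar", "input.vcf.gz", "stats_out", "reference.fasta",
    info_tags=[("DP", "Max"), ("AF", "All")],
    genotype_tags=[("DP", "Avg")],
    skip=["SampleCompare"],
)
print(shlex.join(args))
```

The resulting list can be passed to subprocess.run(args, check=True) to start the run. Skipping the sample compare step, as in this example, may be worthwhile when the VCF contains many samples.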

About

VcfStats is part of the BIOPET tool suite, which is developed at LUMC by the SASC team. Each tool in the BIOPET tool suite is meant to offer a standalone function that can be used to perform a dedicated data analysis task or be added as part of BIOPET pipelines.

All tools in the BIOPET tool suite are Free/Libre and Open Source Software.

Contributing

The source code of VcfStats can be found here. We welcome any contributions. Bug reports, feature requests and feedback can be submitted at our issue tracker.

VcfStats is built using sbt. Before submitting a pull request, make sure all tests pass by running sbt test from the project's root. We recommend using an IDE to work on VcfStats. We have had good results with this IDE.

Contact

For any questions related to VcfStats, please use the GitHub issue tracker or contact the SASC team directly at: sasc@lumc.nl.