VcfStats is a tool that generates metrics from a VCF file.
It can run locally, single-threaded, but also on an Apache Spark cluster.
VcfStats requires Java 8 to be installed on your device. Download Java 8 here or install via your distribution's package manager.
Download the latest version of VcfStats here. To print the usage, run:
java -jar <VcfStats_jar> --help
Almost all stats are enabled by default but can be disabled if required. With a large number of samples, the sample compare step is very intensive; it can be skipped with `--skipSampleCompare`.
The reference fasta should have at least a dict file next to it.
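As a rough sketch of the requirement above, the following shell snippet derives the path where a sequence dictionary is conventionally expected (the FASTA name with its extension replaced by `.dict`) and checks that the file exists. The helper names `dict_path_for` and `check_reference` are hypothetical, and the naming convention is an assumption based on common Picard/htsjdk practice, not something VcfStats documents here.

```shell
# Hypothetical helper: map a FASTA path to the conventional .dict path.
dict_path_for() {
  fasta="$1"
  case "$fasta" in
    *.fasta) printf '%s\n' "${fasta%.fasta}.dict" ;;
    *.fa)    printf '%s\n' "${fasta%.fa}.dict" ;;
    *)       printf '%s\n' "${fasta}.dict" ;;
  esac
}

# Hypothetical helper: fail early if the dictionary is missing.
check_reference() {
  dict="$(dict_path_for "$1")"
  if [ ! -f "$dict" ]; then
    echo "Missing sequence dictionary: $dict" >&2
    return 1
  fi
}
```

If the dictionary is missing, it can typically be created with a tool such as `samtools dict` or Picard's `CreateSequenceDictionary`.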
For the field histograms there are several methods to choose from. The method determines which value is reported in the histogram when a field has multiple values. Some methods require the value to be an Int or Double. If this is not the case, an exception is thrown.
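To illustrate the idea of a method selecting what to report from a multi-valued field, here is a minimal shell sketch. It is not VcfStats code: `pick_value` is a hypothetical function, and only `All` appears in the examples in this document; the `Min` and `Max` method names are assumptions used for illustration, so check `--help` for the methods that are actually supported.

```shell
# Hypothetical illustration: choose what to report from comma-separated values.
pick_value() {
  method="$1"
  values="$(printf '%s\n' "$2" | tr ',' '\n')"
  case "$method" in
    All)
      # Report every value.
      printf '%s\n' "$values"
      ;;
    Min|Max)
      # Numeric methods reject non-numeric input, mirroring the exception
      # described above.
      if printf '%s\n' "$values" | grep -Eqv '^-?[0-9]+(\.[0-9]+)?$'; then
        echo "non-numeric value for method $method" >&2
        return 1
      fi
      if [ "$method" = "Min" ]; then
        printf '%s\n' "$values" | LC_ALL=C sort -n | head -n 1
      else
        printf '%s\n' "$values" | LC_ALL=C sort -n | tail -n 1
      fi
      ;;
    *)
      echo "unknown method: $method" >&2
      return 1
      ;;
  esac
}
```

For example, `pick_value Max 3,10,7` reports a single value, whereas `pick_value All 3,10,7` reports all three.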
Default VcfStats run:
java -jar <VcfStats_jar> \
-I <input_vcf> \
-o <output_dir> \
-R <reference_fasta>
Run only specific regions:
java -jar <VcfStats_jar> \
-I <input_vcf> \
-o <output_dir> \
-R <reference_fasta> \
--intervals <bed_file>
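For reference, a BED file is a tab-separated text file with one region per line: contig name, 0-based start, and exclusive end. The sketch below writes a small example; the file name `regions.bed`, the contig names, and the coordinates are placeholders and must match the contigs in your reference.

```shell
# Write a two-region BED file (tab-separated: contig, 0-based start, exclusive end).
# Contig names and coordinates are placeholders for illustration only.
printf 'chr1\t0\t1000000\n'     >  regions.bed
printf 'chr2\t500000\t750000\n' >> regions.bed
```

The resulting file can then be passed as `--intervals regions.bed`.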
Create a histogram of an INFO field; for the available methods, see the manual section:
java -jar <VcfStats_jar> \
-I <input_vcf> \
-o <output_dir> \
-R <reference_fasta> \
--infoTag <tag_id>:All
Create a histogram of a genotype field; for the available methods, see the manual section:
java -jar <VcfStats_jar> \
-I <input_vcf> \
-o <output_dir> \
-R <reference_fasta> \
--genotypeTag <tag_id>:All
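Because `--infoTag` and `--genotypeTag` can each occur multiple times, several fields can be summarized in one run. The sketch below uses a hypothetical helper, `tag_args`, to expand a list of tag IDs into repeated options with the `All` method, and prints the resulting command rather than running it. The tag IDs `DP`, `AF` and `GQ` are only examples and must be defined in the header of your VCF.

```shell
# Hypothetical helper: expand tag IDs into repeated "<flag> <tag>:All" options.
tag_args() {
  flag="$1"; shift
  out=""
  for tag in "$@"; do
    out="$out $flag $tag:All"
  done
  # Drop the leading space before printing.
  printf '%s\n' "${out# }"
}

# Print the command that would be run; placeholders are kept as-is.
echo "java -jar <VcfStats_jar> -I <input_vcf> -o <output_dir> -R <reference_fasta>" \
  "$(tag_args --infoTag DP AF)" "$(tag_args --genotypeTag GQ)"
```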
Run VcfStats on Spark:
spark-submit <spark arguments> <VcfStats_jar> \
-I <input_vcf> \
-o <output_dir> \
-R <reference_fasta> \
--sparkMaster <spark_master>
Usage for VcfStats:
Option | Required | Can occur multiple times | Description |
---|---|---|---|
--log_level, -l | no | no | Level of log information printed. Possible levels: 'debug', 'info', 'warn', 'error' |
--help, -h | no | no | Print usage |
--version, -v | no | no | Print version |
--inputFile, -I | yes | no | Input VCF file (required) |
--referenceFile, -R | yes | no | Fasta reference which was used to call input VCF (required) |
--outputDir, -o | yes | no | Path to directory for output (required) |
--intervals | no | no | Path to interval (BED) file (optional) |
--infoTag | no | yes (unlimited) | Summarize these info tags |
--genotypeTag | no | yes (unlimited) | Summarize these genotype tags |
--sampleToSampleMinDepth | no | no | Minimum depth required to consider a sample-to-sample comparison |
--binSize | no | no | Binsize in estimated base pairs |
--maxContigsInSingleJob | no | no | Max number of bins to be combined, default is 250 |
--writeBinStats | no | no | Write bin statistics. Default False |
--localThreads, -t | no | no | Number of local threads to use |
--notWriteContigStats | no | no | Do not write contig statistics |
--skipGeneral | no | no | Skipping general stats |
--skipGenotype | no | no | Skipping genotype stats |
--skipSampleDistributions | no | no | Skipping sample distributions stats |
--skipSampleCompare | no | no | Skipping sample compare |
--repartition | no | no | Repartition after reading records (only in spark mode) |
--executeModulesAsJobs | no | no | Execute modules as jobs (only in spark mode) |
--sparkMaster | no | no | Spark master to use |
--sparkExecutorMemory | no | no | Spark executor memory to use |
--sparkConfigValue | no | yes (unlimited) | Add values to the spark config |
VcfStats is part of the BIOPET tool suite, which is developed at LUMC by the SASC team. Each tool in the BIOPET tool suite is meant to offer a standalone function that can be used to perform a dedicated data analysis task or be added as part of BIOPET pipelines.
All tools in the BIOPET tool suite are Free/Libre and Open Source Software.
The source code of VcfStats can be found here. We welcome any contributions. Bug reports, feature requests and feedback can be submitted at our issue tracker.
VcfStats is built using sbt. Before submitting a pull request, make sure all tests pass by
running sbt test
from the project's root. We recommend using an IDE to work on VcfStats. We have had
good results with this IDE.
For any questions related to VcfStats, please use the GitHub issue tracker or contact the SASC team directly at: sasc@lumc.nl.