FindOverlapMatch looks for overlapping samples in a sample matrix. An example of a tool that creates such a matrix is VcfStats.
It compares samples and lists similar samples based on a cutoff point. It can also check if columns in a sample matrix match a certain regex.
FindOverlapMatch requires Java 8 to be installed on your device. Download Java 8 here or install via your distribution's package manager.
Download the latest version of FindOverlapMatch here. To print the usage information, run:
java -jar <FindOverlapMatch_jar> --help
The input is a text file containing a sample matrix, for example:
sample1 sample2 sample3
sample1 1.0 0.5 0.9
sample2 0.5 1.0 0.5
sample3 0.9 0.5 1.0
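To make the layout of the matrix explicit, here is a minimal Python sketch of how such a file could be read. It is not part of FindOverlapMatch. The function name `read_sample_matrix` is hypothetical, and the sketch assumes the fields are separated by whitespace (tabs or spaces) and that the header row lists only the column IDs.

```python
import io


def read_sample_matrix(handle):
    """Parse a whitespace-delimited sample matrix.

    Returns (column_ids, rows), where rows maps each row ID to its list of values.
    """
    lines = [line for line in handle if line.strip()]
    # The header holds only column IDs; a leading empty cell is dropped by split().
    column_ids = lines[0].split()
    rows = {}
    for line in lines[1:]:
        fields = line.split()
        row_id = fields[0]
        values = [float(v) for v in fields[1:]]
        if len(values) != len(column_ids):
            raise ValueError(
                f"row {row_id!r} has {len(values)} values, expected {len(column_ids)}"
            )
        rows[row_id] = values
    return column_ids, rows


# The example matrix from above, written here with tab separators (an assumption).
example = (
    "\tsample1\tsample2\tsample3\n"
    "sample1\t1.0\t0.5\t0.9\n"
    "sample2\t0.5\t1.0\t0.5\n"
    "sample3\t0.9\t0.5\t1.0\n"
)
column_ids, rows = read_sample_matrix(io.StringIO(example))
print(column_ids)
print(rows["sample1"])
```

Row IDs and column IDs are kept separate because, as the option table notes, they are allowed to differ.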
To check the above example with a cutoff of 0.9:
java -jar <FindOverlapMatch_jar> \
-i input.txt \
-c 0.9 \
-o output.txt
This will yield the following output file:
sample1 (sample3,0.9)
sample2
sample3 (sample1,0.9)
With --use_same_names set, the output should be:
sample1 (sample1,1.0) (sample3,0.9)
sample2 (sample2,1.0)
sample3 (sample1,0.9) (sample3,1.0)
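The two outputs above can be reproduced with a short Python sketch of the cutoff logic. This is an illustration under stated assumptions, not the tool's actual implementation: the names `find_overlaps` and `format_line` are hypothetical, the comparison is taken to be inclusive (a value equal to the cutoff is reported, as the 0.9 pairs above suggest), and fields in the output are joined with a single space.

```python
def find_overlaps(column_ids, rows, cutoff, use_same_names=False):
    """Return {row_id: [(column_id, value), ...]} for all values >= cutoff."""
    result = {}
    for row_id, values in rows.items():
        matches = []
        for column_id, value in zip(column_ids, values):
            # Pairs with identical names are skipped unless explicitly requested.
            if column_id == row_id and not use_same_names:
                continue
            if value >= cutoff:
                matches.append((column_id, value))
        result[row_id] = matches
    return result


def format_line(row_id, matches):
    """Render one output line, e.g. 'sample1 (sample3,0.9)'."""
    return " ".join([row_id] + [f"({name},{value})" for name, value in matches])


column_ids = ["sample1", "sample2", "sample3"]
rows = {
    "sample1": [1.0, 0.5, 0.9],
    "sample2": [0.5, 1.0, 0.5],
    "sample3": [0.9, 0.5, 1.0],
}

default_lines = [
    format_line(row_id, matches)
    for row_id, matches in find_overlaps(column_ids, rows, 0.9).items()
]
same_name_lines = [
    format_line(row_id, matches)
    for row_id, matches in find_overlaps(column_ids, rows, 0.9, use_same_names=True).items()
]
print("\n".join(default_lines))
print("\n".join(same_name_lines))
```

A row with no value at or above the cutoff, such as sample2 in the default run, is still listed, just without any pairs.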
Usage for FindOverlapMatch:
Option | Required | Can occur multiple times | Description |
---|---|---|---|
--log_level, -l | no | no | Level of log information printed. Possible levels: 'debug', 'info', 'warn', 'error' |
--help, -h | no | no | Print usage |
--version, -v | no | no | Print version |
--input, -i | yes | no | Input table in which the first row and the first column contain the sample IDs; row and column IDs may differ |
--shouldMatchRegexFile | no | no | File with regexes describing the expected matches. The first column is the regex for the row samples, the second column the regex for the column samples. When no second column is given, the first column is used for both. |
--output, -o | no | no | Output file; defaults to stdout |
--cutoff, -c | yes | no | Minimum value for a pair to be reported |
--use_same_names | no | no | Also compare samples that have the same name; by default such pairs are skipped |
--showBestMatch | no | no | Show best match, even when it's below cutoff |
--rowSampleRegex | no | no | Samples in the row should match this regex |
--columnSampleRegex | no | no | Samples in the column should match this regex |
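As an illustration of the --shouldMatchRegexFile format described in the table, the following Python sketch parses such a file and checks whether a reported pair is an expected match. It is only a sketch: the function names are hypothetical, the columns are assumed to be tab-separated, and regexes are assumed to match the full sample ID. The tool itself may differ on each of these points.

```python
import re


def parse_should_match(lines):
    """Turn lines of 'row_regex[<TAB>column_regex]' into compiled regex pairs."""
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        row_regex = fields[0]
        # Fall back to the first column when no second column is given.
        column_regex = fields[1] if len(fields) > 1 and fields[1] else row_regex
        pairs.append((re.compile(row_regex), re.compile(column_regex)))
    return pairs


def is_expected_match(pairs, row_id, column_id):
    """True if some regex pair fully matches both the row ID and the column ID."""
    return any(
        row_re.fullmatch(row_id) and column_re.fullmatch(column_id)
        for row_re, column_re in pairs
    )


# Hypothetical file content: one line with two columns, one with a single column.
pairs = parse_should_match(["sample1\tsample3", "sample2"])
print(is_expected_match(pairs, "sample1", "sample3"))
print(is_expected_match(pairs, "sample2", "sample2"))
print(is_expected_match(pairs, "sample1", "sample2"))
```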
FindOverlapMatch is part of the BIOPET tool suite, which is developed at LUMC by the SASC team. Each tool in the BIOPET tool suite is meant to offer a standalone function that can be used to perform a dedicated data analysis task or be added as part of a pipeline, for example the SASC team's biowdl pipelines.
All tools in the BIOPET tool suite are Free/Libre and Open Source Software.
The source code of FindOverlapMatch can be found here. We welcome any contributions. Bug reports, feature requests and feedback can be submitted at our issue tracker.
FindOverlapMatch is built using sbt. Before submitting a pull request, make sure all tests pass by
running sbt test
from the project's root. We recommend using an IDE to work on FindOverlapMatch. We have had
good results with this IDE.
For any questions related to FindOverlapMatch, please use the GitHub issue tracker or contact the SASC team directly at: sasc@lumc.nl.