joinx - a tool for manipulating genomic data
joinx [--version|-v] SUBCOMMAND [SUBCOMMAND-ARGS]
Joinx is a lightweight tool for manipulating genomic data contained in BED or VCF files. It also provides some limited analysis functions (concordance reports) and can generate contigs from variant files and a reference sequence. Joinx assumes that its input data is always sorted, which allows it to compute its results efficiently.
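To illustrate why sorted input matters, the following Python sketch intersects two position-sorted interval lists with a forward-only scan. It is not joinx's implementation. The function name intersect_sorted, the half-open (chrom, start, end) tuples, and the plain string ordering of chromosome names are assumptions made for this example.

```python
from typing import Iterator, List, Tuple

Interval = Tuple[str, int, int]  # (chrom, start, end); half-open coordinates assumed


def intersect_sorted(a: List[Interval], b: List[Interval]) -> Iterator[Tuple[Interval, Interval]]:
    """Yield overlapping (a, b) pairs; both lists must be sorted by (chrom, start, end)."""
    first = 0  # index of the first B entry that can still overlap the current or a later A entry
    for chrom_a, start_a, end_a in a:
        # Because A is sorted, B entries lying entirely before this A entry
        # can never overlap a later one, so they are skipped for good.
        while first < len(b) and (b[first][0], b[first][2]) <= (chrom_a, start_a):
            first += 1
        j = first
        # Scan forward only while B entries start before this A entry ends.
        while j < len(b) and b[j][0] == chrom_a and b[j][1] < end_a:
            if b[j][2] > start_a:
                yield (chrom_a, start_a, end_a), b[j]
            j += 1


a = [("1", 10, 20), ("1", 30, 40), ("2", 5, 6)]
b = [("1", 15, 35), ("1", 50, 60), ("2", 0, 10)]
hits = list(intersect_sorted(a, b))
```

With unsorted input the early skip would be invalid, and every A entry would have to be compared against every B entry.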
joinx intersect [OPTIONS] a.bed b.bed
This subcommand is used to perform set operations (intersect, difference, symmetric difference) on BED files.
-h, --help Display usage information
-a, --file-a <path> (or first positional argument) Path to sorted input .bed file "A" (- for stdin)
-b, --file-b <path> (or second positional argument) Path to sorted input .bed file "B" (- for stdin, note: not both "A" and "B" can be stdin)
-o, --output-file <path> Output bed file for intersection results
--miss-a <path> Write entries in A that are not in B (set difference, A\B) to <path>.
--miss-b <path> Write entries in B that are not in A (set difference, B\A) to <path>.
-F, --format Specify a format string for intersection output. The string consists of a space separated list of column specifiers:
    I - the actual intersection of the two bed entries (3 columns)
    A - the complete entry from file 'A'
    B - the complete entry from file 'B'
Additionally, you can select individual columns from A or B like so:
    A3 - column 3 from A
    A1,3 - columns 1 and 3 from A
    A2-4,6 - columns 2 to 4, and 6 from A
A complete example is:
    --format "I A B4"
which outputs the range of the intersection, followed by the complete entry from A, and the 4th column from B.
-f, --first-only Report only the first entry in A that hits records in B, not the full intersection
--output-both Concatenate the columns of A and B in the intersection output
--exact-pos For indels, define intersection to mean an exact position match. The default considers any overlap a match.
--exact-allele Require exact allele matching for intersection.
--iub-match When using exact-allele, treat any overlap between an allele and an IUB code as a match (e.g., G & K match).
--dbsnp-match Undocumented. Should not be used.
--adjacent-insertions When specified, insertions (specifically, features with 0 length) will be counted as intersecting adjacent features, e.g.: position 1 1 2 (snv or 1bp deletion at 1) will intersect position 1 2 2 (insertion at 2).
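The column specifiers accepted by --format can be read as a small grammar. The Python sketch below parses a format string and applies it to one pair of intersecting entries. It is an illustration only. The function names, the representation of entries as lists of strings, and the computation of "I" as (chrom, larger start, smaller end) are assumptions, not taken from joinx.

```python
from typing import List, Optional, Tuple


def parse_column_spec(spec: str) -> Tuple[str, Optional[List[int]]]:
    """Parse one specifier such as 'I', 'A', 'B4' or 'A2-4,6' (columns are 1-based)."""
    source, cols = spec[0], spec[1:]
    if source not in ("I", "A", "B"):
        raise ValueError("unknown column specifier: " + spec)
    if not cols:
        return source, None  # None means "the complete entry"
    selected: List[int] = []
    for part in cols.split(","):
        if "-" in part:
            low, high = part.split("-")
            selected.extend(range(int(low), int(high) + 1))
        else:
            selected.append(int(part))
    return source, selected


def parse_format(fmt: str) -> List[Tuple[str, Optional[List[int]]]]:
    """Split a space-separated format string into parsed specifiers."""
    return [parse_column_spec(spec) for spec in fmt.split()]


def format_hit(fmt: str, a: List[str], b: List[str]) -> List[str]:
    """Build the output columns for one intersecting pair of entries."""
    out: List[str] = []
    for source, cols in parse_format(fmt):
        if source == "I":
            # Assumed definition of the intersection range for this sketch.
            out += [a[0], str(max(int(a[1]), int(b[1]))), str(min(int(a[2]), int(b[2])))]
        else:
            entry = a if source == "A" else b
            out += entry if cols is None else [entry[c - 1] for c in cols]
    return out


entry_a = ["1", "10", "20", "A/T"]
entry_b = ["1", "15", "35", "C/G"]
row = format_hit("I A B4", entry_a, entry_b)
```

Here row holds the intersection range, then all of entry_a, then the 4th column of entry_b, mirroring the --format "I A B4" example.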
joinx check-ref [OPTIONS] -f ref.fasta -b variants.bed
This subcommand can be used to check the reference information embedded in variant files against a FASTA file. This is useful when trying to determine whether or not a particular set of features is actually described relative to a particular reference sequence.
-h, --help Display usage information
-b, --bed <path> Specifies the input .bed file to check
-f, --fasta <path> Specifies the fasta file to check against
-o, --report-file <path> Output file (empty or - means stdout, which is the default)
-m, --miss-file <path> Output entries which do not match the reference to this file
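The idea behind check-ref can be sketched in a few lines of Python. This is not joinx's code. The example assumes the reference is already loaded into a dict of sequences, that coordinates are 0-based and half-open, and that each entry carries its reference allele as a separate value; how joinx extracts that allele from a .bed line is not covered here.

```python
from typing import Dict, List, Tuple

Entry = Tuple[str, int, int, str]  # (chrom, start, end, ref allele); hypothetical layout


def check_ref(reference: Dict[str, str], entries: List[Entry]) -> Tuple[List[Entry], List[Entry]]:
    """Split entries into those that agree with the reference and those that do not."""
    matches: List[Entry] = []
    misses: List[Entry] = []
    for entry in entries:
        chrom, start, end, ref = entry
        # An unknown sequence name yields an empty string, which counts as a miss.
        actual = reference.get(chrom, "")[start:end]
        if actual.upper() == ref.upper():
            matches.append(entry)
        else:
            misses.append(entry)
    return matches, misses


reference = {"1": "ACGTACGT"}
entries = [("1", 0, 1, "A"), ("1", 2, 4, "GT"), ("1", 4, 5, "C")]
matches, misses = check_ref(reference, entries)
```

In this sketch, misses plays the role of the file named by --miss-file.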
joinx create-contigs [OPTIONS] -r ref.fasta -v variants.bed
Generates FASTA format contigs from a variant file and a reference FASTA file.
-h, --help Display usage information
-r, --reference <path> Specifies the reference sequence to generate against (FASTA format)
-v, --variants <path> Specifies the variants file to generate contigs from (.bed format)
-o, --output-file <path> Output file (empty or - means stdout, which is the default)
-f, --flank <n> Flank size on either end of the variant (default=99)
-q, --quality <n> Minimum quality cutoff for variants (default=0)
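As a rough illustration of what the flank and quality options control, the Python sketch below builds a contig by surrounding a variant allele with reference sequence. The function names, the tuple layout of a variant, the 0-based half-open coordinates, and the reading of the cutoff as "keep variants with quality >= cutoff" are all assumptions for this example, not a description of joinx internals.

```python
from typing import Iterator, List, Tuple

Variant = Tuple[int, int, str, int]  # (start, end, variant allele, quality); hypothetical layout


def make_contig(reference: str, start: int, end: int, allele: str, flank: int = 99) -> str:
    """Replace reference[start:end] with allele and keep up to `flank` bases on each side."""
    left = reference[max(0, start - flank):start]  # clipped at the start of the sequence
    right = reference[end:end + flank]             # slicing clips at the end of the sequence
    return left + allele + right


def make_contigs(reference: str, variants: List[Variant],
                 flank: int = 99, min_quality: int = 0) -> Iterator[str]:
    """Yield one contig per variant that passes the quality cutoff."""
    for start, end, allele, quality in variants:
        if quality >= min_quality:
            yield make_contig(reference, start, end, allele, flank)


reference = "AAAACCCCGGGGTTTT"
variants = [(8, 9, "T", 30), (1, 2, "G", 5)]
contigs = list(make_contigs(reference, variants, flank=3, min_quality=10))
```

Only the first variant passes the cutoff of 10, so contigs contains a single sequence with three reference bases on either side of the substituted allele.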
joinx snv-concordance [OPTIONS] a.bed b.bed
Generates basic concordance reports given 2 .bed files.
-h, --help Display usage information
-a, --file-a <path> (or first positional argument) Path to sorted input .bed file "A" (- for stdin)
-b, --file-b <path> (or second positional argument) Path to sorted input .bed file "B" (- for stdin, note: not both "A" and "B" can be stdin)
-d, --depth Use read depth (bed column 5) instead of quality (column 4) in report
--output-file <path> Output the report to the given path (stdout by default)
--hits <path> Output the resulting intersection of "A" and "B" to "path" (discarded by default)
--miss-a <path> Write entries in A that are not in B (set difference, A\B) to <path>.
--miss-b <path> Write entries in B that are not in A (set difference, B\A) to <path>.
joinx sort [OPTIONS] a.bed b.bed c.bed ...
Sort BED or VCF files by chromosome/sequence, start, end position.
-h, --help Display usage information
-i, --input-file <path> Input file(s) to be sorted (positional arguments work also)
-m, --merge-only Assume all input files are pre-sorted, merge only
-o, --output-file <path> The path of the sorted output file to write
-M, --max-mem-lines (default=1000000) The maximum number of lines to hold in memory while sorting (before writing to tmp files)
-s, --stable Perform a "stable" sort (equivalent records maintain their original order)
-u, --unique Print only unique entries (applies to BED format only)
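The options above map onto a conventional external sort: sort bounded chunks, then merge the sorted chunks. The Python sketch below shows that shape for tab-separated BED lines. It is a simplified illustration, not joinx's code: chunks are kept in memory instead of being written to tmp files, chromosome names are compared as plain strings, and "unique" is taken to mean dropping lines identical to one already emitted at the same position. All function names are made up for this example.

```python
import heapq
from typing import Iterable, Iterator, List, Tuple


def sort_key(line: str) -> Tuple[str, int, int]:
    """Order records by chromosome/sequence name, then start, then end."""
    chrom, start, end = line.rstrip("\n").split("\t")[:3]
    return (chrom, int(start), int(end))


def merge_sorted(*inputs: Iterable[str]) -> Iterator[str]:
    """Merge inputs that are already sorted (the --merge-only case)."""
    return heapq.merge(*inputs, key=sort_key)


def sort_lines(lines: List[str], max_mem_lines: int = 1000000, unique: bool = False) -> List[str]:
    """Sort at most max_mem_lines lines at a time, then merge the sorted chunks."""
    chunks = [sorted(lines[i:i + max_mem_lines], key=sort_key)  # sorted() is stable
              for i in range(0, len(lines), max_mem_lines)]
    result: List[str] = []
    last_key = None
    seen = set()  # lines already emitted for the current position
    for line in merge_sorted(*chunks):
        key = sort_key(line)
        if key != last_key:
            last_key, seen = key, set()
        if unique and line in seen:
            continue
        seen.add(line)
        result.append(line)
    return result


lines = ["2\t5\t6", "1\t30\t40", "1\t10\t20", "1\t10\t20"]
sorted_lines = sort_lines(lines, max_mem_lines=2)
unique_lines = sort_lines(lines, max_mem_lines=2, unique=True)
```

With max_mem_lines=2 the four input lines are sorted as two chunks and then merged, which is the same path a large input would take with the default limit.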
joinx vcf-merge [OPTIONS] file1.vcf file2.vcf [file3.vcf ...]
This subcommand performs multi-sample merging of features contained in VCF files. By default, no two input files are allowed to contain data for the same sample. This behavior can be changed with the -s flag, which will attempt to merge per-sample data, favoring data from the first file.
Merging of INFO fields can be controlled by specifying a merge strategy file. This is a simple KEY=VALUE file that specifies actions for particular info fields. Valid actions are:
    ignore          strip this info field out
    first           retain the first value seen for this field
    uniq-concat     (variable length fields only) unique values will be merged
    enforce-equal   if an attempt is made to merge two records with different values for this field, an error is raised
    sum             (numeric fields only) sum the values for this field
An example merge strategy that by default takes info fields from the first entry, sums up DP (depth) values, and concatenates information in a field called CALLER is:
    default=first
    info.DP=sum
    info.CALLER=uniq-concat
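To make the merge actions concrete, here is a small Python sketch that parses a KEY=VALUE strategy and applies it to the values collected for one INFO field. It is an illustration under assumptions, not joinx's implementation: the function names are invented, values are passed in as an already parsed list, uniq-concat returns a list rather than a formatted string, and a "default" entry is required.

```python
from typing import Dict, List


def parse_strategy(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines into a mapping, skipping blank lines."""
    strategy: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, _, action = line.partition("=")
        strategy[key.strip()] = action.strip()
    return strategy


def merge_info(field: str, values: List, strategy: Dict[str, str]):
    """Merge the values seen for one INFO field according to the strategy."""
    key = "info." + field
    action = strategy[key] if key in strategy else strategy["default"]
    if action == "ignore":
        return None  # the field is stripped out
    if action == "first":
        return values[0]
    if action == "uniq-concat":
        unique: List = []
        for value in values:
            if value not in unique:
                unique.append(value)  # keep first-seen order
        return unique
    if action == "enforce-equal":
        if any(value != values[0] for value in values):
            raise ValueError("conflicting values for " + field)
        return values[0]
    if action == "sum":
        return sum(values)
    raise ValueError("unknown merge action: " + action)


strategy = parse_strategy("default=first\ninfo.DP=sum\ninfo.CALLER=uniq-concat\n")
depth = merge_info("DP", [10, 20], strategy)
callers = merge_info("CALLER", ["samtools", "varscan", "samtools"], strategy)
```

Fields without their own entry, such as a hypothetical AF field, fall back to the default action and keep the first value.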
-h, --help Display usage information
-i, --input-file <path> Input file(s) (positional arguments work also)
-o, --output-file <path> Output file (empty or - means stdout, which is the default)
-M, --merge-strategy-file <path> Merge strategy file for info fields.
-s, --merge-samples Allow input files with overlapping sample data.
-c, --clear-filters When set, merged entries will have FILTER data stripped out
-P, --sample-priority <o|u|f> (default=o) Set the sample priority when merging samples (-s). When multiple files have data for the same sample at the same site, the source with the highest priority is chosen. The valid values, o, u, and f are short for order, unfiltered, and filtered. Prioritization by order grants files listed earlier on the command line a higher priority, while unfiltered and filtered rank sources first by filter status, and then by order.
-R, --require-consensus "ratio,FilterName,Filter description" When merging samples, require a certain degree of consensus among the sample sources, filtering sites that fail with FilterName (which will be added to the vcf header if needed). For example, -R "1.0,ConsensusFilter,Sample consensus filter" means that for any given sample/site, unless all files containing the sample agree, the ConsensusFilter will be applied to the sample/site.
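The source selection described for -P can be pictured as choosing the minimum under a sort key. The Python sketch below is one reading of that description, not joinx's code: the Source type and pick_source function are hypothetical, and it assumes that "u" prefers unfiltered sources while "f" prefers filtered ones, with command-line order breaking ties.

```python
from typing import List, NamedTuple


class Source(NamedTuple):
    order: int      # position of the file on the command line (0 = listed first)
    filtered: bool  # True if this sample/site is filtered in that file


def pick_source(sources: List[Source], priority: str = "o") -> Source:
    """Choose which file supplies the data for one sample at one site."""
    if priority == "o":
        def key(s: Source):
            return (s.order,)
    elif priority == "u":
        def key(s: Source):
            return (s.filtered, s.order)      # False sorts first: unfiltered wins
    elif priority == "f":
        def key(s: Source):
            return (not s.filtered, s.order)  # filtered wins
    else:
        raise ValueError("priority must be one of o, u, f")
    return min(sources, key=key)


sources = [Source(0, True), Source(1, False), Source(2, True)]
by_order = pick_source(sources, "o")
by_unfiltered = pick_source(sources, "u")
by_filtered = pick_source(sources, "f")
```

With these three sources, order and filtered priority both select the first file, while unfiltered priority selects the second.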
Joinx was written by Travis Abbott <tabbott@genome.wustl.edu>, and is maintained on Github at: https://github.com/genome/joinx/.
Joinx is free software, distributed under the terms of the GNU GPL v3 or later: http://gnu.org/licenses/gpl.html.