Sie sind auf Seite 1von 4

05/11/2018 GitHub - shenwei356/datakit: CSV/TSV file manipulation and more.

Please use my another tool: csvtk, https://gith…

datakit
CSV file manipulation and more.

Please use my another tool: csvtk, Another cross-platform, efficient and practical CSV/TSV
tool kit

intersection
Intersecion of multiple (>=2) files.

unique
uniq with no need pre-sorting.

csv2tab

usage: csv2tab [-h] [-f F] [-q Q] [csvfile [csvfile ...]]

csv2tab

positional arguments:
csvfile Input file(s)

optional arguments:
-h, --help show this help message and exit
-f F Field separator [,]
-q Q Quote char["]

csv_grep.py
** Please use golang version of csv_grep**

Grepping CSV file, tab-delimited file by default, by exactly matching or query by regluar
expression, multiple keys (indice) supported. The query patterns could be given from command
line or file.

Usage:

usage: csv_grep [-h] [-v] [-o OUTFILE] [-k KEY] [-H] [-F FS] [-Fo FS_OUT]
[-Q QC] [-t] [-p [PATTERN]] [-pf [PATTERNFILE]] [-pk [PK]]
[-r] [-d] [-i]
[csvfile [csvfile ...]]

Grep CSV file. Multiple keys supported.

positional arguments:
https://github.com/shenwei356/datakit 1/4
05/11/2018 GitHub - shenwei356/datakit: CSV/TSV file manipulation and more. Please use my another tool: csvtk, https://gith…
positional arguments:
csvfile Input file(s)

optional arguments:
-h, --help show this help message and exit
-v, --verbose Verbosely print information
-o OUTFILE, --outfile OUTFILE
Output file [STDOUT]
-k KEY, --key KEY Column number of key in csvfile. Multiple values shoud
be separated by comma
-H, --ignoretitle Ignore title
-F FS, --fs FS Field separator [,]
-Fo FS_OUT, --fs-out FS_OUT
Field separator of ouput [same as --fs]
-Q QC, --qc QC Quote char["]
-t Field separator is "\t". Quote char is "\t"
-p [PATTERN], --pattern [PATTERN]
Query pattern
-pf [PATTERNFILE], --patternfile [PATTERNFILE]
Pattern file
-pk [PK] Column number of key in pattern file. Multiple values
shoud be separated by comma
-r, --regexp Pattern is regular expression
-d, --speedup Delete matched pattern when matching one record
-i, --invert Invert match (do not match)

https://github.com/shenwei356/datakit

Examples
1. For a table file. Note that the 3rd column of 4th line contains "\t".

$ cat testdata/data.tab column1 column 2 3rd c str 123 abde 123 134 我 245 135 "string with
tab"

Find lines of which the 2nd column are digitals, ignoring title

$ cat testdata/data.tab | csv_grep -H -t -k 2 -r -p '^\d+$'


str 123 abde
123 134 我
245 135 "string with tab"

Find lines that have ID (first column, by default) in (or NOT in) a given ID files.

$ cat testdata/data.tab | csv_grep -t -pf testdata/data.pattern.tab


123 134 我
245 135 "string with tab"

$ cat testdata/data.tab | csv_grep -H -t -pf testdata/data.pattern.tab -i


str 123 abde

https://github.com/shenwei356/datakit 2/4
05/11/2018 GitHub - shenwei356/datakit: CSV/TSV file manipulation and more. Please use my another tool: csvtk, https://gith…

2. Find common records with same headers in two fasta files. fasta2tab transforms the FASTA
fromat to two-column table, fist column is the header and the second is
sequence. tab2fasta just tranform the table back to FASTA format.

fasta2tab seq1.fa | csv_grep -t -pf <(fasta2tab seq.fa) | tab2fasta

Records with same sequence (second column).

fasta2tab seq1.fa | csv_grep -t -pf <(fasta2tab seq.fa) -pk 2 -k 2 | tab2fasta

3. Find common records of two GTF file. The columns 1,4,5,7 together make up the key of a
record.

cat a.gff | csv_grep -t -k 1,4,5,7 -pk 1,4,5,7 -pf b.gff > commom.gff

csv_grep
Golang version. Faster than python version with concurrency.

You can download the executable files here.

Usage:

NAME:
csv_grep - grep for csv format

USAGE:
csv_grep [global options] command [command options] [arguments...]

VERSION:
1.0

AUTHOR(S):
Wei Shen <https://github.com/shenwei356/datakit>

COMMANDS:
help, h Shows a list of commands or help for one command

GLOBAL OPTIONS:
-k, --key "1" column number of key in csvfile. Multiple values sho
-H, --ignoretitle ignore title
--fs "," field separator [,]
--fs-out field separator of ouput [same as --fs]
-t, --tab field separator is "\t". Quote char is "\t"
-p, --pattern query pattern
--pf, --patternfile pattern file
--pk "1" column number of key in pattern file. Multiple value
--pfs "," field separator of pattern file [,]
-r, --use-regexp use regular expression
-d, --speedup delete matched pattern when matching one record
-i, --invert invert match (do not match)
-j, --ncpus "4" CPU number [4]

https://github.com/shenwei356/datakit 3/4
05/11/2018 GitHub - shenwei356/datakit: CSV/TSV file manipulation and more. Please use my another tool: csvtk, https://gith…

-c, --chunksize "1000" chunk size [1000]


-o, --outfile output file [stdout]
--vv, --verbose verbosely print information
--help, -h show help
--version, -v print the version

csv_join v2.0
Merge CSV files. Multiple keys supported. v2.0

Usage

usage: csv_join [-h] [-k [KEY [KEY ...]]] [-f F] [-q Q] [-of OF] [-t] [-s]
[-keep]
csvfile [csvfile ...]

Merge CSV files. Multiple keys supported. v2.0

positional arguments:
csvfile CSV files

optional arguments:
-h, --help show this help message and exit
-k [KEY [KEY ...]], --key [KEY [KEY ...]]
column number of key in csvfile. [1 for all files]
-f F field separator [,]
-q Q quote char ["]
-of OF field separator [,]
-t quote char in all files are "\t"
-s, --simplify simplify the result, by removing keys
-keep, --keep-unmatched
keep unmatched record in PREVIOUS files

https://github.com/shenwei356/datakit

Examples
1. for a lot of tab-delimited files in two-column key-value format

for f in testdata/*.tsv; do echo "----" $f "----"; cat $f; done


---- testdata/d1.tsv ---- key value1 1 123 2 abc 3 ccc ---- testdata/d2.tsv ---- key value2 1 234 2
opq 4 hello ---- testdata/d3.tsv ---- key value3 5 abc 2 jjj 1 what

csv_join -t testdata/*.tsv 1 123 1 234 1 what 2 abc 2 opq 2 jjj key value1 key value2 key value3

csv_join -t testdata/*.tsv -keep 1 123 1 234 1 what 2 abc 2 opq 2 jjj 3 ccc key value1 key
value2 key value3

csv_join -t testdata/*.tsv -s 1 123 234 what 2 abc opq jjj key value1 value2 value3

https://github.com/shenwei356/datakit 4/4

Das könnte Ihnen auch gefallen