For my diploma thesis (on clustering of high-dimensional data, especially correlation clustering, i.e. clustering data by properties such as data correlation) I needed an easy way to filter data out of the CSV files used by gnuplot (actually, they are whitespace-separated).

I hacked together a small tool that allows me to easily ‘grep’ out certain parts of the datasets. If you know of a similar tool, please send me an email at erich AT debian DOT org.

Currently, the tool lets you run commands such as:

gpgrep "1~5$" "1<200" < out/3d-2lin-noise.variances

to select (‘grep’) all sets where the first column ends in a 5 (a regular expression match) and where the first column is less than 200.
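To make the semantics concrete, here’s a minimal Python sketch of what those two filters boil down to (this is my reading of the syntax, not gpgrep’s actual code; note that the syntax counts columns from 1, Python from 0):

import re
import sys

for line in sys.stdin:
    cols = line.split()
    if not cols:
        continue  # skip empty lines for simplicity here
    # "1~5$": regular-expression match against column 1
    if not re.search(r"5$", cols[0]):
        continue
    # "1<200": numeric comparison against column 1
    if not float(cols[0]) < 200:
        continue
    sys.stdout.write(line)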

The tool is still in early development. The syntax may change, and I’ll probably add some more filters. But you get the idea: the focus is on a very compact syntax.

Future filters I’m considering include selecting e.g. 20 random rows for sampling (with a fixed seed value, so it’s reproducible!) and a modulo match to select e.g. every 7th row.
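In Python terms, the behaviour I have in mind is roughly this (a sketch only; the seed value is made up, and blocking — see the P.S. — is ignored):

import random
import sys

rows = [line for line in sys.stdin if line.strip()]

# modulo match: keep every 7th row
every_7th = rows[::7]

# sampling: 20 random rows, drawn with a fixed seed so reruns give the same sample
rng = random.Random(1234)  # hypothetical seed value
sample = rng.sample(rows, min(20, len(rows)))

sys.stdout.writelines(sample)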

Maybe I’ll also add some output processing later, such as averaging values, calculating variances and mean deviations, or stripping away columns (but you can do that in gnuplot already). I don’t want to overdo it though - it’s just meant as a mini filter you can plug into scripted output visualization. It’s not meant to replace a full statistics toolkit.

So if you know of a simple tool that can do that already, please tell me.

P.S. A few people have pointed out awk. Yes, it can do most of this. One thing I also need, though, and that I’m not sure how to do easily in awk, is preserving ‘blocking’, i.e. empty lines. The data set

1 1
2 2

1 2
2 1

is two lines with two points each in gnuplot, not one line with four points. I guess you could just add a “/^$/ {print}” rule next to the filter rule, though… hmm… looks like I finally have to learn awk. So far I’ve always refused to learn it; I was happy with sed, perl and python…
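Until then, here’s the Python equivalent of what I need - pass empty lines through untouched so the blocking survives, and apply the filter (a stand-in “first column < 200” test below) to everything else:

import sys

for line in sys.stdin:
    if not line.strip():
        sys.stdout.write(line)  # keep the block separator
    elif float(line.split()[0]) < 200:  # stand-in for the actual filter
        sys.stdout.write(line)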