docs/src/data-diving-examples.md
The flins.csv file is some sample data obtained from https://support.spatialkey.com/spatialkey-sample-csv-data.
Vertical-tabular format is good for a quick look at CSV data layout -- seeing what columns you have to work with, as this is a file big enough that we can't just see it on a single screenful:
<pre class="pre-highlight-in-pair"> <b>wc -l data/flins.csv</b> </pre> <pre class="pre-non-highlight-in-pair"> 36635 data/flins.csv </pre> <pre class="pre-highlight-in-pair"> <b>mlr --c2x --from data/flins.csv head -n 2</b> </pre> <pre class="pre-non-highlight-in-pair"> policyID 119736 statecode FL county CLAY COUNTY eq_site_limit 498960 hu_site_limit 498960 fl_site_limit 498960 fr_site_limit 498960 tiv_2011 498960 tiv_2012 792148.9 eq_site_deductible 0 hu_site_deductible 9979.2 fl_site_deductible 0 fr_site_deductible 0 point_latitude 30.102261 point_longitude -81.711777 line Residential construction Masonry point_granularity 1 policyID 448094 statecode FL county CLAY COUNTY eq_site_limit 1322376.3 hu_site_limit 1322376.3 fl_site_limit 1322376.3 fr_site_limit 1322376.3 tiv_2011 1322376.3 tiv_2012 1438163.57 eq_site_deductible 0 hu_site_deductible 0 fl_site_deductible 0 fr_site_deductible 0 point_latitude 30.063936 point_longitude -81.707664 line Residential construction Masonry point_granularity 3 </pre>A few simple queries:
<pre class="pre-highlight-in-pair"> <b>mlr --c2p --from data/flins.csv count-distinct -f county | head</b> </pre> <pre class="pre-non-highlight-in-pair"> county count CLAY COUNTY 363 SUWANNEE COUNTY 154 NASSAU COUNTY 135 COLUMBIA COUNTY 125 ST JOHNS COUNTY 657 BAKER COUNTY 70 BRADFORD COUNTY 31 HAMILTON COUNTY 35 UNION COUNTY 15 </pre> <pre class="pre-highlight-in-pair"> <b>mlr --c2p --from data/flins.csv count-distinct -f line</b> </pre> <pre class="pre-non-highlight-in-pair"> line count Residential 30838 Commercial 5796 </pre>Categorization of total insured value:
<pre class="pre-highlight-in-pair"> <b>mlr --c2x --from data/flins.csv stats1 -a min,mean,max -f tiv_2012</b> </pre> <pre class="pre-non-highlight-in-pair"> tiv_2012_min 73.37 tiv_2012_mean 2571004.0973420837 tiv_2012_max 1701000000 </pre> <pre class="pre-highlight-in-pair"> <b>mlr --c2p --from data/flins.csv \</b> <b> stats1 -a min,mean,max -f tiv_2012 -g construction,line</b> </pre> <pre class="pre-non-highlight-in-pair"> construction line tiv_2012_min tiv_2012_mean tiv_2012_max Masonry Residential 261168.07 1041986.1292168079 3234970.92 Wood Residential 73.37 113493.01704925536 649046.12 Reinforced Concrete Commercial 6416016.01 20212428.681839883 60570000 Reinforced Masonry Commercial 1287817.34 4621372.981117158 16650000 Steel Frame Commercial 29790000 133492500 1701000000 </pre> <pre class="pre-highlight-in-pair"> <b>mlr --c2x --from data/flins.csv \</b> <b> stats1 -a p0,p10,p50,p90,p95,p99,p100 -f hu_site_deductible</b> </pre> <pre class="pre-non-highlight-in-pair"> hu_site_deductible_p0 0 hu_site_deductible_p10 0 hu_site_deductible_p50 0 hu_site_deductible_p90 76.5 hu_site_deductible_p95 6829.2 hu_site_deductible_p99 126270 hu_site_deductible_p100 7380000 </pre> <pre class="pre-highlight-in-pair"> <b>mlr --c2p --from data/flins.csv \</b> <b> stats1 -a p95,p99,p100 -f hu_site_deductible -g county \</b> <b> then sort -f county | head</b> </pre> <pre class="pre-non-highlight-in-pair"> county hu_site_deductible_p95 hu_site_deductible_p99 hu_site_deductible_p100 ALACHUA COUNTY 30630.6 107312.4 1641375 BAKER COUNTY 0 0 0 BAY COUNTY 26131.5 181912.5 630000 BRADFORD COUNTY 3355.2 8163 8163 BREVARD COUNTY 5360.4 78975 1973461.5 BROWARD COUNTY 0 148500 3258900 CALHOUN COUNTY 0 33339.6 33339.6 CHARLOTTE COUNTY 5400 52650 250994.7 CITRUS COUNTY 1332.9 79974.9 483785.1 </pre> <pre class="pre-highlight-in-pair"> <b>mlr --c2x --from data/flins.csv \</b> <b> stats2 -a corr,linreg-ols,r2 -f tiv_2011,tiv_2012</b> </pre> <pre class="pre-non-highlight-in-pair"> tiv_2011_tiv_2012_corr 0.9730497632351701 tiv_2011_tiv_2012_ols_m 0.9835583980337732 tiv_2011_tiv_2012_ols_b 433854.6428968301 tiv_2011_tiv_2012_ols_n 36634 tiv_2011_tiv_2012_r2 0.9468258417320204 </pre> <pre class="pre-highlight-in-pair"> <b>mlr --c2x --from data/flins.csv --ofmt '%.4f' \</b> <b> stats2 -a corr,linreg-ols,r2 -f tiv_2011,tiv_2012 -g county \</b> <b> then head -n 5</b> </pre> <pre class="pre-non-highlight-in-pair"> county CLAY COUNTY tiv_2011_tiv_2012_corr 0.9627 tiv_2011_tiv_2012_ols_m 1.0901 tiv_2011_tiv_2012_ols_b 46450.5313 tiv_2011_tiv_2012_ols_n 363 tiv_2011_tiv_2012_r2 0.9268 county SUWANNEE COUNTY tiv_2011_tiv_2012_corr 0.9892 tiv_2011_tiv_2012_ols_m 1.0747 tiv_2011_tiv_2012_ols_b 36253.0032 tiv_2011_tiv_2012_ols_n 154 tiv_2011_tiv_2012_r2 0.9785 county NASSAU COUNTY tiv_2011_tiv_2012_corr 0.9731 tiv_2011_tiv_2012_ols_m 1.2963 tiv_2011_tiv_2012_ols_b -45369.2427 tiv_2011_tiv_2012_ols_n 135 tiv_2011_tiv_2012_r2 0.9470 county COLUMBIA COUNTY tiv_2011_tiv_2012_corr 0.9995 tiv_2011_tiv_2012_ols_m 0.9314 tiv_2011_tiv_2012_ols_b 117183.5484 tiv_2011_tiv_2012_ols_n 125 tiv_2011_tiv_2012_r2 0.9990 county ST JOHNS COUNTY tiv_2011_tiv_2012_corr 0.9662 tiv_2011_tiv_2012_ols_m 1.2301 tiv_2011_tiv_2012_ols_b -596.6239 tiv_2011_tiv_2012_ols_n 657 tiv_2011_tiv_2012_r2 0.9335 </pre>The data/colored-shapes.dkvp file is some sample data produced by the mkdat2 script. The idea is:
flag field is 0 or 1, with probability dependent on coloru field is plain uniform on the unit interval.v field is the same, except tightly correlated with u for red circles.w field is autocorrelated for each color/shape pair.x field is boring Gaussian with mean 5 and standard deviation about 1.2, with no dependence on color or shape.Peek at the data:
<pre class="pre-highlight-in-pair"> <b>wc -l data/colored-shapes.dkvp</b> </pre> <pre class="pre-non-highlight-in-pair"> 10078 data/colored-shapes.dkvp </pre> <pre class="pre-highlight-in-pair"> <b>head -n 6 data/colored-shapes.dkvp | mlr --opprint cat</b> </pre> <pre class="pre-non-highlight-in-pair"> color shape flag i u v w x yellow triangle 1 56 0.632170 0.988721 0.436498 5.798188 red square 1 80 0.219668 0.001257 0.792778 2.944117 red circle 1 84 0.209017 0.290052 0.138103 5.065034 red square 0 243 0.956274 0.746720 0.775542 7.117831 purple triangle 0 257 0.435535 0.859129 0.812290 5.753095 red square 0 322 0.201551 0.953110 0.771991 5.612050 </pre>Look at uncategorized stats (using creach for spacing).
Here it looks reasonable that u is unit-uniform; something's up with v but we can't yet see what:
The histogram shows the different distribution of 0/1 flags:
<pre class="pre-highlight-in-pair"> <b>mlr --opprint histogram -f flag,u,v --lo -0.1 --hi 1.1 --nbins 12 data/colored-shapes.dkvp</b> </pre> <pre class="pre-non-highlight-in-pair"> bin_lo bin_hi flag_count u_count v_count -0.1 0.000000000000000013877787807814457 6058 0 36 0.000000000000000013877787807814457 0.10000000000000003 0 1062 988 0.10000000000000003 0.20000000000000004 0 985 1003 0.20000000000000004 0.30000000000000004 0 1024 1014 0.30000000000000004 0.40000000000000013 0 1002 991 0.40000000000000013 0.5000000000000001 0 989 1041 0.5000000000000001 0.6000000000000002 0 1001 1016 0.6000000000000002 0.7000000000000002 0 972 962 0.7000000000000002 0.8000000000000002 0 1035 1070 0.8000000000000002 0.9000000000000002 0 995 993 0.9000000000000002 1 4020 1013 939 1 1.1 0 0 25 </pre>Look at univariate stats by color and shape. In particular, color-dependent flag probabilities pop out, aligning with their original Bernoulli probabilities from the data-generator script:
<pre class="pre-highlight-in-pair"> <b>mlr --opprint stats1 -a min,mean,max -f flag,u,v -g color \</b> <b> then sort -f color \</b> <b> data/colored-shapes.dkvp</b> </pre> <pre class="pre-non-highlight-in-pair"> color flag_min flag_mean flag_max u_min u_mean u_max v_min v_mean v_max blue 0 0.5843537414965987 1 0.000044 0.5177171537414964 0.999969 0.001489 0.4910564278911574 0.999576 green 0 0.20919747520288548 1 0.000488 0.5048610595130744 0.999936 0.000501 0.49908475924256035 0.999676 orange 0 0.5214521452145214 1 0.001235 0.49053241584158375 0.998885 0.002449 0.4877637788778878 0.998475 purple 0 0.09019264448336252 1 0.000266 0.49400496322241666 0.999647 0.000364 0.4970507127845888 0.999975 red 0 0.3031674208144796 1 0.000671 0.49255964641241273 0.999882 -0.092709 0.4965350941607402 1.0725 yellow 0 0.8924274593064402 1 0.0013 0.4971291160651098 0.999923 0.000711 0.5106265987261144 0.999919 </pre> <pre class="pre-highlight-in-pair"> <b>mlr --opprint stats1 -a min,mean,max -f flag,u,v -g shape \</b> <b> then sort -f shape \</b> <b> data/colored-shapes.dkvp</b> </pre> <pre class="pre-non-highlight-in-pair"> shape flag_min flag_mean flag_max u_min u_mean u_max v_min v_mean v_max circle 0 0.3998456194519491 1 0.000044 0.498554505982246 0.999923 -0.092709 0.49552416171362396 1.0725 square 0 0.39611178614823817 1 0.000188 0.4993854558930749 0.999969 0.000089 0.49653825929526124 0.999975 triangle 0 0.4015421115065243 1 0.000881 0.49685854240806604 0.999661 0.000717 0.5010495260972719 0.999995 </pre>Look at bivariate stats by color and shape. In particular, u,v pairwise correlation for red circles pops out: