I’ve been working with some governmental data that is available as huge (>50 GB) CSV files.
While there are workarounds for working with large files, I wanted to keep the stream processing I do with JSON files.
However, this was not a JSON file. Stream processing with CSV is hard.
jq is so much easier.
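    #!/usr/bin/env python3
    import csv
    import json
    import sys

    # Raise the per-field size limit so very large fields don't abort the parse.
    csv.field_size_limit(sys.maxsize)

    # The first row supplies the keys; each later row becomes one JSON object per line.
    for row in csv.DictReader(sys.stdin):
        sys.stdout.write(json.dumps(row) + "\n")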
This tiny Python script reads CSV from STDIN row by row, uses the header row as keys, and prints each data row as one JSON object per line.
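To make the row-to-object mapping concrete, here is a minimal sketch that runs the same conversion on a small in-memory sample instead of STDIN. The helper name `csv_to_json_lines` and the sample data (including the `note` column) are made up for illustration; only `csv.DictReader` and `json.dumps` come from the script above.

```python
import csv
import io
import json


def csv_to_json_lines(src, dst):
    """Write one JSON object per CSV data row; the header row supplies the keys."""
    for row in csv.DictReader(src):
        dst.write(json.dumps(row) + "\n")


# Hypothetical sample: the quoted field contains a comma and a line break.
sample = "\n".join([
    "type,url,note",
    '049,http://example.com/a,"first, with',
    'newline"',
    "050,http://example.com/b,plain",
]) + "\n"

out = io.StringIO()
csv_to_json_lines(io.StringIO(sample), out)

# Each output line parses on its own, which is what jq expects.
records = [json.loads(line) for line in out.getvalue().splitlines()]
```

Two things are worth noticing: every value comes out as a string (so `type` is `"049"`, not a number), and a quoted field that spans two input lines still ends up as a single JSON line because `json.dumps` escapes the embedded newline.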
I can then use my standard tooling to continue chewing on the file:
    python /tmp/convert.py </tmp/big_file.csv \
      | jq 'select(.type=="049" or .type=="048") | .url' -r \
      | head -n20 \
      | xargs wget
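For comparison, the `jq` and `head` steps of that pipeline could be approximated in Python as well. This is only a rough sketch, not a replacement for jq: the function name `select_urls` and the demo records are hypothetical, and the `type` and `url` keys are simply the ones used in the filter above.

```python
import io
import json
from itertools import islice


def select_urls(lines, wanted_types=("048", "049"), limit=20):
    """Return an iterator over at most `limit` urls whose record "type" matches."""
    # Generators keep this lazy: lines are only parsed as results are consumed.
    records = (json.loads(line) for line in lines if line.strip())
    # Assumes every matching record has a "url" key.
    matches = (r["url"] for r in records if r.get("type") in wanted_types)
    return islice(matches, limit)


# Hypothetical JSON-lines input, shaped like the converter's output.
demo = io.StringIO(
    '{"type": "049", "url": "http://example.com/a"}\n'
    '{"type": "050", "url": "http://example.com/b"}\n'
    '{"type": "048", "url": "http://example.com/c"}\n'
)
urls = list(select_urls(demo))
```

In a real pipeline you would pass `sys.stdin` instead of `demo` and print each URL. Because everything is a generator, reading stops once `limit` matches have been found, much like `head` ending the shell pipeline early.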