CSV to JSON stream converter in Python
I’ve been working with some governmental data that is available as huge (>50 GB) CSV files.
While there are workarounds for working with large files, I wanted to keep the stream processing workflow I use with JSON files.
However, this was not a JSON file, and stream processing with CSV is hard. jq
is so much easier.
#!/usr/bin/env python3
import csv
import json
import sys

# Raise the per-field size limit (the default of 128 KB is too small for some rows).
csv.field_size_limit(sys.maxsize)

# DictReader uses the first row as the header and yields one dict per row.
for row in csv.DictReader(sys.stdin):
    sys.stdout.write(json.dumps(row) + "\n")
This tiny Python script reads CSV rows from STDIN (using the first row as the header) and prints each one as a JSON object on its own line.
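As an illustration (the column names match the jq filter below, but the values are made up), a CSV like

type,url
048,https://example.org/one.csv
049,https://example.org/two.csv

comes out as

{"type": "048", "url": "https://example.org/one.csv"}
{"type": "049", "url": "https://example.org/two.csv"}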
I can then use my standard tooling to continue chewing on the file:
python /tmp/convert.py </tmp/big_file.csv \
| jq 'select(.type=="049" or .type=="048") | .url' -r \
| head -n20 \
| xargs wget