Processing Israeli FOIA calendars, Part 1



I’m a volunteer at the Israeli Public Knowledge Workshop (wiki), a nonprofit working to increase the transparency of Israel’s governing bodies (e.g. the government, municipal authorities, and the courts).
I recently picked up a nice project that I thought would be worth a read.

Under the Israeli equivalent of FOIA (the Freedom Of Information Act), citizens can request information from public authorities.
Some nonprofits are using this law to try and acquire calendars/diaries/journals of various public servants, and publish them to the public.

Aside from the obvious advantage of being able to verify that a public servant actually spends their time serving the public, there are instances where these calendars have helped expose conspiracies and hints of corruption (article in Hebrew).

We now have a lot of these calendars as individual files. We’d like to do some more advanced analytics on them, including cross-referencing them with one another, and for that we’d like to load them all up into a nice database.
This is my mission.

I called this project “Dear Diary”, as the Hebrew words for “Calendar” and “Diary” are identical.

Series Logistics

Instead of making a huge post, I’m going to divide it into a series to make it easier to read (and to write).
I’ll also try to make it readable for people who are not Linux / Python / whatever pros. Hopefully my family and friends will be able to read it and understand what I’m working on 😀

If you’re not a programmer, feel free to skip the code bits.

This is part 1

Data exploration and what we’re targeting

The first thing I wanted to do was understand what data I’m dealing with. I went to the website that holds all of the FOIA-collected documents we have, and searched for “יומן” (Hebrew for “calendar”). The URL looks like this:

Looks promising. I then tried to find an API equivalent, and it wasn’t that hard. Reading the docs and poking at some URLs gave me this: יומן

You can try the URL in your browser, and you’ll get a huge JSON object detailing all of the files relating to calendars.

Using cURL (which fetches the list) and feeding the result to jq (which allows us to dive into the JSON), we can see what a single item looks like:

$ curl -s 'יומן' | jq '.result.results[0]'
{
  "mimetype": null,
  "cache_url": null,
  "state": "active",
  "hash": "",
  "description": "",
  "format": "PDF",
  "url": "",
  "datastore_active": false,
  "created": "2017-06-19T11:48:30.996554",
  "cache_last_updated": null,
  "package_id": "39092605-8599-470a-ba76-36e434acd2f7",
  "mimetype_inner": null,
  "last_modified": "2017-06-19T11:48:30.969395",
  "position": 0,
  "revision_id": "6ed049da-8e71-4c9f-a8e2-25315290412c",
  "size": null,
  "url_type": "upload",
  "id": "81d2cc0b-beae-46a3-81d7-e7d529ae9fb8",
  "resource_type": null,
  "name": "יומן ראש עיריית באר שבע 2013.pdf"
}
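Since the actual API URL isn’t shown here, a toy version of the same jq access pattern looks like this (the JSON below is made up, just mimicking the API’s shape):

```shell
# Made-up JSON shaped like the API response
json='{"result":{"results":[{"name":"calendar-2013.pdf","mimetype":"application/pdf"}]}}'

# Same pattern as the real command: take the first element of .result.results
echo "$json" | jq '.result.results[0].mimetype'
# → "application/pdf"
```

`jq` treats its argument as a path expression into the JSON, so `.result.results[0]` means “the first element of the `results` array inside `result`”.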

There are a lot of interesting fields, but for now we’ll look at “mimetype” (wiki), a web standard indicating what kind of file this is.
Unfortunately, this file has “null” (meaning unknown), but most of the files don’t.
To see what we’re dealing with, we’ll run a slightly more complex command that collects all the mimetypes, groups them, and shows us the most common ones.
The command and result look like this:

$ curl -s 'יומן' | jq '.result.results[].mimetype' | sort | uniq -c | sort -rn | head
    799 "application/pdf"
    357 "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
     78 "text/csv"
     63 null
     47 "application/"
      9 "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
      6 "image/tiff"
      4 "application/zip"
      3 "text/calendar"
      2 "application/vnd.openxmlformats-officedocument.spreadsheetml.template"
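The group-and-count part is a classic shell idiom: `sort` brings identical lines together, `uniq -c` counts each run, and `sort -rn` orders the counts from most to least common. On a toy input:

```shell
# Count duplicate lines, most common first
printf '%s\n' pdf xlsx pdf csv pdf xlsx | sort | uniq -c | sort -rn
# → 3 pdf, 2 xlsx, 1 csv (one per line, counts left-padded)
```

`uniq -c` only collapses *adjacent* duplicates, which is why the first `sort` is required.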

We can see that most of the files are in PDF format, with “officedocument.spreadsheetml.sheet” (Excel 2007+ spreadsheets) in second place. From manually looking at some of the PDF files, I saw that they’re hard to read:

  1. One is a printed-and-scanned document, which is completely unintelligible to a machine without a lot of guesswork.
  2. Another, while made of actual text, is laid out in a way that’s nontrivial for a computer to understand.

Because the PDFs seemed to be problematic, I decided to start looking at Excel files as a first step.

Now that we know that we want the Excel files, we should download them locally so we can experiment on them.
For this, we’re going to get all of the files, filter out only the Excel ones, and download them all to a specific directory.
Let’s do this:

$ curl -s 'יומן' | jq -r '.result.results[] | select(.mimetype == "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet") | "mkdir -p output/reapings/\(.id) && curl -s \"\(.url)\" -o \"output/reapings/\(.id)/\(.name)\""' | parallel
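What this one-liner does is use jq as a command generator: for each matching resource it emits a `mkdir && curl` shell command as a raw string (`-r`), and `parallel` executes those lines concurrently. Here’s a minimal sketch of the same pattern, with a made-up two-item JSON in place of the API response, piping to `sh` instead of `parallel` so it runs anywhere:

```shell
# Made-up response with two resources; "demo" is just an example directory name
json='{"result":{"results":[{"id":"aaa","name":"one.xlsx"},{"id":"bbb","name":"two.xlsx"}]}}'

# jq -r emits one raw shell command per item; sh executes them one by one
echo "$json" \
  | jq -r '.result.results[] | "mkdir -p demo/\(.id)"' \
  | sh

ls demo   # the two ID-named directories now exist
```

The `\(.id)` syntax is jq string interpolation, which is what lets the command template pull fields from each JSON item.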

The result in output/reapings is a bunch of directories, each named after a file ID. Inside each one is the downloaded file:

$ ls output/reapings/ | head

$ ls output/reapings/00181757-0c0d-4f82-b389-b3624bdeec87
'יומן מנכל העירייה 1.1.22-30.6.22.xlsx'

$ ls output/reapings/0616d2bf-f8e3-4a09-ab77-ed13071ac9e9
'יומן מרב מיכאלי.xlsx'

Let’s do a quick check of how many files we have:

$ find -type f | wc -l
2013

Let’s also ask how much space this directory takes:

$ du -hs output/reapings/
34M	output/reapings/

So we now have 2,013 Excel files, weighing 34 megabytes. Neat.

Next episode

Chewing, or how to take these Excel files and extract specific calendar events.