Wikidata cache
Roughly a year ago I used cachelib to cache Wikidata requests. But by now I make far too many requests to keep hitting live Wikidata. So I decided to use the Wikidata dump.
The issue with the dump is that it is one big bzipped JSONL file. Keeping the file compressed while still jumping to a specific entry is hard. Processing the file with something like indexed-bzip2 could work, but for me it doesn't feel worth it.
So I decided on a different solution: Converting the jsonl.bz2 file to SQLite. The data structure I decided on is
```sql
CREATE TABLE entities (
    entity_id TEXT,
    label_en TEXT,
    label_de TEXT,
    data BLOB NOT NULL,
    modified TEXT
);
```
The data field stores the bz2-compressed JSON of a Wikidata entry.
The modified timestamp is copied from that JSON, just like the entity_id and the labels.
The two label columns are in there mainly for debugging reasons.
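Extracting those columns from a parsed dump entry can be sketched like this. The helper name `make_row` is my own invention, not from the original script; the `id`, `labels`, and `modified` keys are part of the Wikidata entity JSON format:

```python
def make_row(entity: dict, compressed: bytes) -> dict:
    """Build one row for the entities table from a parsed dump entry.

    `make_row` is a hypothetical helper; the column names follow the
    schema above, the JSON keys follow the Wikidata entity format.
    """
    labels = entity.get("labels", {})
    return {
        "entity_id": entity["id"],
        "label_en": labels.get("en", {}).get("value"),
        "label_de": labels.get("de", {}).get("value"),
        "data": compressed,
        "modified": entity.get("modified"),
    }


# tiny usage example with a stripped-down entry
sample = {
    "id": "Q42",
    "labels": {"en": {"language": "en", "value": "Douglas Adams"}},
    "modified": "2024-01-01T00:00:00Z",
}
row = make_row(sample, b"<bz2 data>")
```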
The main bottleneck of the processing is the bzip2 decompression and compression. So the first speed improvement is to install lbzip2 for decompressing the Wikidata export on all cores. The other improvement is to split the processing across as many processes as there are cores.
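The splitting can be driven by GNU parallel, which also sets the PARALLEL_JOBSLOT variable the worker script reads. A sketch of the pipeline; the dump filename, job count, and block size are assumptions:

```shell
# Decompress with lbzip2 (multi-threaded), then fan the lines out to
# one process.py instance per jobslot with GNU parallel --pipe.
# Filename, --jobs, and --block values are illustrative.
lbzip2 -dc wikidata-latest-all.json.bz2 \
  | parallel --pipe --jobs 8 --block 100M python process.py
```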
The interesting parts of the process.py code look like this:
```python
import bz2
import json
import os
import sys

# get the jobslot from parallel
worker_id = os.environ.get("PARALLEL_JOBSLOT", "1")
db_path = f"wikidata-cache-worker-{worker_id}.db"

for line in sys.stdin:
    line = line.strip()
    # each entity line in the dump ends with a comma
    if line.endswith(","):
        line = line[:-1]
    # skip the enclosing JSON array brackets and empty lines
    if line.startswith(("[", "]")) or not line:
        continue
    data = json.loads(line)
    data_compressed = bz2.compress(
        json.dumps(data, separators=(",", ":")).encode("utf-8")
    )
    # save batch to db
```
I use sqlite-utils to insert a list of entities with .insert_all().
After the full processing of the Wikidata dump is finished, another Python script merges the worker databases.
The fastest way here was to drop the indexes first and then insert like this:
```python
import sqlite3

conn = sqlite3.connect(main_db_path)
for worker_path in worker_dbs:
    conn.execute(f"ATTACH DATABASE '{worker_path}' AS worker")
    conn.execute(
        """
        INSERT INTO entities
        SELECT * FROM worker.entities
        """
    )
    conn.commit()
    conn.execute("DETACH DATABASE worker")
```
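Since the indexes were dropped before the merge, they have to be recreated afterwards. A sketch; the index name and the choice of a unique index on entity_id are assumptions (as is the in-memory database, which stands in for the merged file):

```python
import sqlite3

# ":memory:" stands in for the merged wikidata-cache.db; the schema
# matches the CREATE TABLE shown earlier.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE IF NOT EXISTS entities (
        entity_id TEXT, label_en TEXT, label_de TEXT,
        data BLOB NOT NULL, modified TEXT
    );
    -- hypothetical index name; a unique index on entity_id makes the
    -- later single-entity lookups fast
    CREATE UNIQUE INDEX IF NOT EXISTS idx_entities_entity_id
        ON entities (entity_id);
    """
)
```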
The whole conversion takes roughly 1.5 days on a 10-year-old i7 with 8 cores. There is obviously a tradeoff between using all cores for compression/decompression and the time sink of merging the databases. So I benchmarked this on the same machine using only one SQLite database. I stopped that single-job experiment after 2 days with 40% finished.
Now I have a 400GB SQLite database generated from a 100GB wikidata-*-all.json.bz2.
To retrieve the data I added a small FastAPI app:
```python
import asyncio
import bz2
import json

from fastapi import FastAPI, HTTPException
from sqlite_utils import Database

app = FastAPI()
db = Database("wikidata-cache.db", check_same_thread=False)


def fetch_entity(entity_id: str) -> dict | None:
    rows = db["entities"].rows_where("entity_id = ?", [entity_id], limit=1)
    row = next(rows, None)
    if row is None:
        return None
    return json.loads(bz2.decompress(row["data"]))


@app.get("/{entity_id}.json")
async def get_entity(entity_id: str):
    entity = await asyncio.to_thread(fetch_entity, entity_id)
    if not entity:
        raise HTTPException(status_code=404, detail="Entity not found")
    return {"entities": {entity_id: entity}}
```
The format is intentionally the same as the Wikidata JSON special page, e.g. for Q42. Now I can process a lot of Wikidata entries without hitting the Wikidata servers.