Wikidata cache

Roughly a year ago I used cachelib to cache Wikidata requests. But by now I make way too many requests to keep hitting live Wikidata. So I decided to use the Wikidata dump.

The issue with the dump is that it is one big bzipped JSONL file. Keeping the file compressed while still being able to jump to a specific entry is hard. Processing the file with something like indexed-bzip2 might be possible, but for me it doesn't feel worth it.

So I decided on a different solution: Converting the jsonl.bz2 file to SQLite. The data structure I decided on is

CREATE TABLE entities (
    entity_id TEXT,
    label_en TEXT,
    label_de TEXT,
    data BLOB NOT NULL,
    modified TEXT
);

The data field stores the bz2-compressed JSON of a Wikidata entry. The modified column is copied out of that JSON, the same as the entity_id and the labels. The two label columns are in there mainly for debugging.
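Extracting those columns from one parsed dump line could look like this (a sketch; the helper name is mine, the label layout follows the dump's labels → language → value structure):

```python
def extract_columns(entity: dict) -> dict:
    # pull out the columns stored next to the compressed blob
    labels = entity.get("labels", {})
    return {
        "entity_id": entity["id"],
        "label_en": labels.get("en", {}).get("value"),
        "label_de": labels.get("de", {}).get("value"),
        "modified": entity.get("modified"),
    }

# a heavily trimmed example entity
example = {
    "id": "Q42",
    "labels": {"en": {"language": "en", "value": "Douglas Adams"}},
    "modified": "2024-01-01T00:00:00Z",
}
row = extract_columns(example)
```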

The main bottleneck of the processing is the bzip2 decompression and compression. So the first speed improvement is to install lbzip2, which decompresses the Wikidata export on multiple cores. The other improvement is to split the processing across as many worker processes as there are cores:

pv "$1" | lbzip2 -dc | parallel -j 8 --pipe --block 200M -N 100 uv run process.py

The interesting parts of the process.py code look like this:

# get the jobslot from parallel
worker_id = os.environ.get("PARALLEL_JOBSLOT", "1")
db_path = f"wikidata-cache-worker-{worker_id}.db"

for line in sys.stdin:
    line = line.strip()
    if line.endswith(","):
        line = line[:-1]
    if line.startswith(("[", "]")) or not line:
        continue

    data = json.loads(line)
    data_compressed = bz2.compress(json.dumps(data, separators=(",", ":")).encode("utf-8"))

# save batch to db
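A minimal sketch of that batch save, using the stdlib sqlite3 module for illustration (the helper name and the batch layout are mine; the actual script uses sqlite-utils, as described next):

```python
import sqlite3


def save_batch(db_path: str, batch: list[dict]) -> None:
    # create the per-worker table on first use, then append the batch
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entities ("
        "entity_id TEXT, label_en TEXT, label_de TEXT, "
        "data BLOB NOT NULL, modified TEXT)"
    )
    conn.executemany(
        "INSERT INTO entities VALUES "
        "(:entity_id, :label_en, :label_de, :data, :modified)",
        batch,
    )
    conn.commit()
    conn.close()
```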

I use sqlite-utils to insert a list of entities with .insert_all(). After the full processing of the Wikidata dump is finished, another Python script merges the databases. The fastest way here was to drop the indexes first and then insert like this:

conn = sqlite3.connect(main_db_path)
for worker_path in worker_dbs:
    conn.execute(f"ATTACH DATABASE '{worker_path}' AS worker")
    conn.execute(
        """
        INSERT INTO entities
        SELECT * FROM worker.entities
    """
    )
    conn.commit()
    conn.execute("DETACH DATABASE worker")
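Since the lookups later filter on entity_id, the dropped index has to be recreated once after the merge; a sketch (the index name is hypothetical, and an in-memory database stands in for main_db_path):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stands in for main_db_path
conn.execute(
    "CREATE TABLE IF NOT EXISTS entities ("
    "entity_id TEXT, label_en TEXT, label_de TEXT, "
    "data BLOB NOT NULL, modified TEXT)"
)
# recreate the index that was dropped before the bulk inserts
conn.execute("CREATE INDEX IF NOT EXISTS idx_entity_id ON entities (entity_id)")
conn.commit()
```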

The whole conversion takes roughly 1.5 days on a 10-year-old i7 with 8 cores. There is obviously a tradeoff between using all cores for compression/decompression and the time sunk into merging the dbs. So I benchmarked this on the same machine using only one SQLite db. I stopped that single-job experiment after 2 days with 40% finished.

Now I have a 400GB SQLite database generated out of a 100GB wikidata-*-all.json.bz2.

To retrieve the data I added a small FastAPI app:

import asyncio
import bz2
import json
import sqlite3

from fastapi import FastAPI, HTTPException
from sqlite_utils import Database

app = FastAPI()
# wrap a connection that may be shared across threads (read-only access)
db = Database(sqlite3.connect("wikidata-cache.db", check_same_thread=False))

def fetch_entity(entity_id: str) -> dict | None:
    rows = db["entities"].rows_where("entity_id = ?", [entity_id], limit=1)
    row = next(rows, None)
    if row is None:
        return None
    return json.loads(bz2.decompress(row["data"]))

@app.get("/{entity_id}.json")
async def get_entity(entity_id: str):
    entity = await asyncio.to_thread(fetch_entity, entity_id)
    if not entity:
        raise HTTPException(status_code=404, detail="Entity not found")
    return {"entities": {entity_id: entity}}

The format is on purpose the same as for the Wikidata JSON Special page, e.g. for Q42. Now I can process a lot of Wikidata entries without hitting the Wikidata servers.

Linux Thinkpad Learnings

A few weeks ago the USB-C power supply of my work notebook died. As a replacement I ordered a UGreen one that can power multiple USB-C devices -- resulting in fewer power plugs on my desk at home.

After this I looked into not killing the battery of this Thinkpad by configuring the charging. Additionally, the tool I installed can manage the temperature by not pushing the notebook to its limit.

The Thinkpad I got from my employer is a Lenovo T14 Gen3 with a Ryzen 7. Because I work from home most of the time, the notebook is plugged in nearly all the time. It is not healthy for the battery to always be charged a tiny bit until it is full again.

To fix this I installed TLP. The default battery charging thresholds are perfect:

# Battery charge level below which charging will begin.
START_CHARGE_THRESH_BAT0=75
# Battery charge level above which charging will stop.
STOP_CHARGE_THRESH_BAT0=80

These values are the defaults, but I still uncommented them to make explicit that I want them like that.

Additionally I unplugged the notebook a few times and recharged it, to give the battery a bit of "normal" charging. I am aware that mistreated batteries need to be observed in case they inflate. Mine seems to be okay.

So as I already said, the Thinkpad is on AC most of the time. And when using a lot of CPU the notebook gets warm and loud. I was not aware that there are multiple platform profiles to manage this. When plugged into AC the performance profile is active, which results in more heat and fan spinning when all cores are busy. But I actually don't need the notebook running at maximum performance when connected to AC. So I changed the platform profile for AC from performance to balanced with this line in the config:

PLATFORM_PROFILE_ON_AC=balanced

I can still change it back to performance when needed. Or even to low-power when I don't want the fan spinning.

This resolved the two ThinkPad issues I wasn't aware I needed to fix. 🎉

Deutschlandticket WorthIt Analysis

Some years ago I aggregated my travel costs in a post to show the savings I had because of the 9€-Ticket. Since then I have tracked all my Deutschlandticket savings using hledger virtual entries.

Tracking in hledger

The "WorthIt Sum" in the table below is based on hledger virtual entries. For example, one trip without a Deutschlandticket:

2025-12-30 VVS Ticket (Stuttgart -> Schorndorf)
    Expenses:ÖPNV:VVS                          €6.80
    Assets:Girokonto
    (worthIt:Deutschlandticket)                €6.80

And one trip where I only track the usage, because it is paid via Deutschlandticket:

2025-10-10 Regional train (Göppingen -> Stuttgart)
    (worthIt:Deutschlandticket)                €-8.3

The value tracked in the virtual entry is negative, because it is the amount saved.

The Deutschlandticket itself in my ledger looks like this:

2025-10-01 Deutschlandticket
    Expenses:ÖPNV:Deutschlandticket              €58
    Assets:Girokonto
    (worthIt:Deutschlandticket)                  €58

To get the worthIt data from the ledger files into Python I use subprocess and call this:

hledger register ^worthIt:Deutschlandticket -O csv

The first lines of my data:

"txnidx","date","code","description","account","amount","total"
"4928","2023-05-01","","Deutschlandticket bei der SSB","(worthIt:Deutschlandticket)","€49.00","€49.00"
"4970","2023-05-04","","Bahn: Stgt -> Reutlingen","(worthIt:Deutschlandticket)","€-11.00","€38.00"
"4970","2023-05-04","","Bahn: Tübingen -> Herrenberg","(worthIt:Deutschlandticket)","€-6.00","€32.00"
...
"5011","2023-05-30","","VVS: Office day","(worthIt:Deutschlandticket)","€-2.75","€-52.55"

The total column from the last entry of a month is the value you see in the table below (€-52.55 for 2023-05). So the virtual entry sums per month are calculated by hledger.
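Aggregating that CSV into per-month sums can be sketched like this (the function name is mine; instead of reading hledger's running total it recomputes the sums from the amount column, which gives the same month-end values):

```python
import csv
import io
from decimal import Decimal


def monthly_worthit_sums(csv_text: str) -> dict:
    # sum the virtual "amount" column per year-month
    sums = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        year_month = row["date"][:7]
        amount = Decimal(row["amount"].replace("€", ""))
        sums[year_month] = sums.get(year_month, Decimal("0")) + amount
    return sums


sample = '''"txnidx","date","code","description","account","amount","total"
"4928","2023-05-01","","Deutschlandticket bei der SSB","(worthIt:Deutschlandticket)","€49.00","€49.00"
"4970","2023-05-04","","Bahn: Stgt -> Reutlingen","(worthIt:Deutschlandticket)","€-11.00","€38.00"
'''
sums = monthly_worthit_sums(sample)
```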

Data analysis

The Python script analyses the CSV and returns a table with additional columns. A second hledger CSV export is also needed: the one returning all Expenses:ÖPNV:Deutschlandticket entries, to know whether I bought a Deutschlandticket that month.

For the months where I didn't buy a Deutschlandticket, the script still needs to know what it would have cost. The ticket price increased over the last years, so I have this in my Python code to look up the price for a given month:

def get_ticket_price_for_month(year_month):
    if year_month >= "2026-01":
        return Decimal("63")
    elif year_month >= "2025-01":
        return Decimal("58")
    return Decimal("49")

With all this data a table is generated. The "WorthIt Sum" is exactly the virtual value at the end of the month, as described above. The "Actual Cost" is either the Deutschlandticket price or, in months without one, the virtual entry sum. Those virtual entries are always positive, because there was no Deutschlandticket to save with. And finally there is a "Not WorthIt" column. It is set when the actual costs exceed the cost of a Deutschlandticket (I never had such a month, yet), or when the "WorthIt Sum" is positive, meaning not enough was saved to cover the cost of the Deutschlandticket. The latter happened 4 times, as you can see in the table below. Because of those months where I didn't save enough with the Deutschlandticket, I started skipping the ticket in winter.
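That "Not WorthIt" rule can be written down compactly (a sketch; the function name is mine):

```python
from decimal import Decimal


def is_not_worth_it(
    ticket_bought: bool,
    worthit_sum: Decimal,
    actual_cost: Decimal,
    ticket_price: Decimal,
) -> bool:
    if not ticket_bought:
        # without a ticket: the single tickets would have to
        # exceed the Deutschlandticket price
        return actual_cost > ticket_price
    # with a ticket: a positive sum means the savings
    # did not cover the ticket
    return worthit_sum > 0


# 2025-03: ticket bought, sum €20.01 -> not worth it
assert is_not_worth_it(True, Decimal("20.01"), Decimal("58.00"), Decimal("58"))
# 2025-12: no ticket, €42.67 spent, below the ticket price -> fine
assert not is_not_worth_it(False, Decimal("42.67"), Decimal("42.67"), Decimal("58"))
```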

Year-Month | Ticket Bought | WorthIt Sum | Actual Cost | Not WorthIt
2025-12    | No            | €42.67      | €42.67      |
2025-11    | No            | €11.70      | €11.70      |
2025-10    | Yes           | €-5.35      | €58.00      |
2025-09    | Yes           | €-141.15    | €58.00      |
2025-08    | Yes           | €-219.87    | €58.00      |
2025-07    | Yes           | €-55.03     | €58.00      |
2025-06    | Yes           | €-234.21    | €58.00      |
2025-05    | Yes           | €-184.15    | €58.00      |
2025-04    | Yes           | €-21.47     | €58.00      |
2025-03    | Yes           | €20.01      | €58.00      | x
2025-02    | No            | €45.56      | €45.56      |
2025-01    | No            | €13.22      | €13.22      |
2024-12    | Yes           | €-3.63      | €49.00      |
2024-11    | Yes           | €-80.49     | €49.00      |
2024-10    | Yes           | €-184.65    | €49.00      |
2024-09    | Yes           | €-76.91     | €49.00      |
2024-08    | Yes           | €-31.20     | €49.00      |
2024-07    | Yes           | €-135.47    | €49.00      |
2024-06    | Yes           | €-135.79    | €49.00      |
2024-05    | Yes           | €-109.42    | €49.00      |
2024-04    | Yes           | €-143.54    | €49.00      |
2024-03    | Yes           | €-16.84     | €49.00      |
2024-02    | Yes           | €39.95      | €49.00      | x
2024-01    | Yes           | €7.86       | €49.00      | x
2023-12    | Yes           | €-51.02     | €49.00      |
2023-11    | Yes           | €9.65       | €49.00      | x
2023-10    | Yes           | €-4.88      | €49.00      |
2023-09    | Yes           | €-154.49    | €49.00      |
2023-08    | Yes           | €-158.42    | €49.00      |
2023-07    | Yes           | €-144.95    | €49.00      |
2023-06    | Yes           | €-227.04    | €49.00      |
2023-05    | Yes           | €-52.55     | €49.00      |


I am a bit torn about not having the Deutschlandticket. On the one hand it is not worth it for me in winter. On the other hand I limit my public transport use, because I have to buy a single-trip or daily ticket.