Passage's Homepage

¡Live Wikidata: Correct and Complete Results for Everyone!

Wikidata is a well-known knowledge graph available to everyone. It stores tens of billions of statements, directly accessible through SPARQL queries on the web. However, to enforce its fair-use policies, Wikidata sets a 60-second timeout on query execution. Once this threshold is reached, the query execution is aborted and no results are returned to the end-user.

To alleviate this issue, end-users must either rewrite the query to make it more efficient, or download the dump themselves to run the query without limitations.

We argue that end-users can get complete and correct results if SPARQL query engines enable continuation queries, i.e., queries that allow retrieving the missing results when the computation is stopped before completion.

Downloading

The first step to create a live Wikidata SPARQL endpoint is to retrieve the data. Since Passage supports Blazegraph, the engine and data structure that Wikidata relies on, retrieving the .jnl file would suffice. However, these files are huge and not publicly available, so we download the dump instead:

While creating our first journal file is time-consuming, later replications will be easier.

We use axel to speed up the download and to provide error handling.

axel -a -n 10 --verbose \
     http://dumps.wikimedia.your.org/wikidatawiki/entities/latest-all.ttl.gz
Downloaded 109.696 Gigabyte(s) in 1:57:53 hour(s). (16261.77 KB/s)
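
As a sanity check (our own addition, not part of the original recipe), the integrity of the 100GB+ archive can be verified before spending hours ingesting it:

# test the gzip stream without decompressing it to disk
gzip -t latest-all.ttl.gz && echo "archive OK"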

Ingesting

Once the compressed turtle file containing all of Wikidata's data is downloaded, we must ingest it into our database, which is Blazegraph, as we want to remain close to the original endpoint.

Issues

Despite being impressive in many respects, official support for Blazegraph was abandoned years ago. Wikimedia had to maintain its own version of Blazegraph.

Until the team finds a suitable replacement for Blazegraph, that is.

To remain as close as possible to Wikidata's database, we use the tools they provide to ingest the data. We favor an approach based on the CLI. Although it cannot ingest multiple files at a time, it is easy to monitor and feels more reliable.

Only one process at a time can hold the read/write lock on the .jnl file.

However, we had trouble compiling the tools: Lombok was not properly generating constructors, and the Scala compiler dependency threw unintelligible errors. Instead, we used the munge.sh provided online, and restricted our goal to creating a self-contained blazegraph.jar that contains Wikidata's home-made vocabulary.

  • We added a shade target so DataLoader is part of the exported jar;
  • We added a dependency on log4j 1.2.17, which the old DataLoader requires.
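
To check that the shaded jar actually contains both, one can list its contents; the class names below are the standard Blazegraph and log4j ones, assumed unchanged in the Wikimedia fork:

# look for the DataLoader entry point and the log4j classes inside the jar
jar tf blazegraph-0.3.154-shaded.jar | grep -E 'com/bigdata/rdf/store/DataLoader|org/apache/log4j/Logger'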

Comments

Wikimedia recommends running munge.sh on the compressed turtle file. From the original file, it creates a large number of smaller compressed turtle files:

The original file still exists afterwards, so if you are running short on disk space (around 100GB), remember to remove it.

./munge.sh -c 50000 \
           -f /DATA/datasets/latest-all.ttl.gz \
           -d /DATA/datasets/wikidata-munged/
…
04:22:45.108 … - Processed 102100000 entities at (2899, 2652, 2354)
04:22:45.108 … - Switching to /DATA/datasets/wikidata-munged//wikidump-000002043.ttl.gz
04:22:47.697 … - Processed 102110000 entities at (2979, 2672, 2363)
04:22:50.619 … - Processed 102120000 entities at (2979, 2672, 2363)
04:22:54.360 … - Processed 102130000 entities at (3022, 2686, 2369)
04:22:57.630 … - Processed 102140000 entities at (2989, 2685, 2371)
04:23:00.375 … - Processed 102150000 entities at (2989, 2685, 2371)
04:23:00.376 … - Switching to /DATA/datasets/wikidata-munged//wikidump-000002044.ttl.gz
04:23:02.485 … - Processed 102160000 entities at (3036, 2700, 2377)

It creates 2044 files named like wikidump-000000001.ttl.gz. To plan the ingestion of all these files, we checked the idempotency of the operation, i.e., ingesting a statement multiple times has no more effect than ingesting it once. Thus, the ingestion can be paused and resumed easily if needed.
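
As a quick sanity check (our own addition), the number of produced files can be confirmed before planning the batches:

# count the munged files; we expect 2044 of them
ls /DATA/datasets/wikidata-munged/wikidump-*.ttl.gz | wc -l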

Timelapse

Finally, we can start the sequential ingestion of the 100GB of downloaded data. The following script runs the ingester on each file in the targeted $SOURCE_FOLDER. FROM_FILE and TO_FILE allow resuming the ingestion if needed. RWStore.properties is the property file defining Blazegraph's engine and storage. To keep a log of the operation, we redirect all output to ingestion_log.dat. We allocate -Xmx32G of memory to the JVM, although it does not seem to require more than 10GB.

SOURCE_FOLDER="/DATA/datasets/wikidata-munged"
FROM_FILE="/DATA/datasets/wikidata-munged/wikidump-000000257.ttl.gz" # included
TO_FILE="/DATA/datasets/wikidata-munged/wikidump-000003000.ttl.gz" # excluded

for file in "$SOURCE_FOLDER"/*.ttl.gz; do
    if ([[ "$file" > "$FROM_FILE" ]] || [[ "$file" == "$FROM_FILE" ]]) && [[ "$file" < "$TO_FILE" ]]; then
      java -Xmx32g -cp blazegraph-0.3.154-shaded.jar com.bigdata.rdf.store.DataLoader RWStore.properties "$file" &>> ingestion_log.dat
    fi
done
…
Will load from: /DATA/datasets/wikidata-munged/wikidump-000000589.ttl.gz
Journal file: wikidata.jnl
loading: 15264095 stmts added in 1745.302 secs, rate= 8745, commitLatency=0ms, {failSet=0,goodSet=0}
Load: 15264095 stmts added in 1745.302 secs, rate= 8745, commitLatency=0ms, {failSet=0,goodSet=1}
Total elapsed=1800641ms
…

[Figure: ingestion.svg] Using the output log, we plot the number of statements and the ingestion time of each compressed turtle file.
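
For reference, here is a minimal sketch of how such points can be extracted from ingestion_log.dat, assuming every summary line keeps the Load: format shown above:

# keep the "Load:" summary lines; print the statement count (2nd field) and the duration in seconds (6th field)
grep '^Load:' ingestion_log.dat | awk '{print $2, $6}' > ingestion_points.dat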

The top figure shows that the number of statements differs widely from one file to another, but remains roughly constant on average over the whole process. Put in relation with the bottom figure, a large number of statements often means a long ingestion time (as expected).

The bottom figure, about ingestion times, is more informative: ingestion times increase over the course of the load. We suspect this comes from the balanced tree data structures Blazegraph uses for its indexes: not only does the depth of the trees increase, but the binary searches to find each insertion location take longer and longer.

Unfortunately, even for us, this is too slow. Resources are largely underused: CPU usage remains very low and RAM usage stays below 10GB. To avoid reopening the journal for every file, which now takes 150s, and to hopefully benefit from better multithreading, we load folders of 100 compressed turtle files at a time.
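
Here is a minimal sketch of that batching, under two assumptions: DataLoader accepts a directory as input, and a hypothetical wikidata-munged-batches folder holds symlinks so the 100GB of munged files are not copied:

BATCH_ROOT="/DATA/datasets/wikidata-munged-batches"
i=0
for file in /DATA/datasets/wikidata-munged/*.ttl.gz; do
    # group files by 100 into batch-0000, batch-0001, ...
    batch="$BATCH_ROOT/batch-$(printf %04d $((i / 100)))"
    mkdir -p "$batch"
    ln -s "$file" "$batch/"   # symlink instead of copying the data
    i=$((i + 1))
done

for batch in "$BATCH_ROOT"/batch-*; do
    # DataLoader is assumed to load every file found in the given folder
    java -Xmx32g -cp blazegraph-0.3.154-shaded.jar \
         com.bigdata.rdf.store.DataLoader RWStore.properties "$batch" &>> ingestion_log.dat
done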

Running

Once the Blazegraph journal exists, we must start the Passage service that accepts SPARQL queries, executes them on the journal, times out when execution reaches the 60-second threshold, but sends back a SPARQL continuation query along with the partial results.

The architecture is simple. End-users connect through a GLiCID URL; an nginx forwards the request to our virtual machine, where an ha-proxy forwards it to the SPARQL endpoint in charge of the database:

               ┌ @https://10-54-2-226.gcp.glicid.fr/
user 1 <-----> │                  ┌ @http://localhost:8080/
user 2 <-----> │ ha-proxy <-----> │ passage-server <-> wikidata.jnl
user 3 <-----> │

The Passage server is started directly on a journal file (here, a smaller WatDiv journal):

java -jar passage-server.jar \
     -d /DATA/datasets/watdiv10m-blaze/watdiv10M.jnl \
     --port 8080 \
     --timeout=60000

A simple curl request then checks that the endpoint answers:

curl -v -X GET --http1.1 -G \
     --data-urlencode "query=SELECT * WHERE {?s ?p ?o} LIMIT 10" \
     "https://10-54-2-226.gcp.glicid.fr/watdiv10M.jnl/passage"

TODO Deploy ha-proxy in front of Passage.

TODO Automatic restart when the service is down.
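
One possible way to get this, sketched here only as an assumption since nothing is deployed yet (a systemd unit with Restart=on-failure would be cleaner), is a plain supervision loop; restart_log.dat is a hypothetical log file:

# relaunch the server whenever it exits, with the same options as above
while true; do
    java -jar passage-server.jar \
         -d /DATA/datasets/watdiv10m-blaze/watdiv10M.jnl \
         --port 8080 \
         --timeout=60000
    echo "$(date) passage-server exited, restarting in 5s" >> restart_log.dat
    sleep 5
done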

Updating

Wikidata's data is constantly evolving: anyone can add new statements to the knowledge graph. Therefore, ingesting a dump is not enough for a faithful Wikidata mirror. It must be updated regularly to follow the changes made by the community.

Fortunately, Wikidata provides deployers with tools to update their database, provided it is a SPARQL endpoint that accepts UPDATE. Our mirror must be able to accept updates from, and only from, a Wikidata source.
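
For illustration only, an incoming update would follow the standard SPARQL 1.1 Update protocol; the update path below is hypothetical, since this part is not implemented yet:

# hypothetical update path; Q42 is Wikidata's entity for Douglas Adams
curl -X POST \
     --data-urlencode 'update=INSERT DATA { <http://www.wikidata.org/entity/Q42> <http://www.w3.org/2000/01/rdf-schema#label> "Douglas Adams"@en }' \
     "http://localhost:8080/wikidata.jnl/update"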

TODO Make the endpoint accept updates.

Ⓒ 2017–2025 GDD Team, LS2N, University of Nantes