¡Live Wikidata: Correct and Complete Results for Everyone!
Wikidata is a well-known knowledge graph available to everyone. It stores tens of billions of statements, directly accessible through SPARQL queries on the web. However, to enforce its fair-use policy, Wikidata sets a 60-second timeout on query execution. Once this threshold is reached, the query execution is aborted, returning no results to the end-user.
To alleviate this issue, end-users must either rewrite the query to make it more efficient, or download the dump themselves and run their queries locally without limitations.
We argue that end-users can get complete and correct results if SPARQL query engines enable continuation queries, i.e., queries that allow retrieving the missing results when the computation is stopped before completion.
Downloading
The first step to create a live Wikidata SPARQL endpoint is to retrieve the data. Since Passage supports Blazegraph, which is the engine and data structure that Wikidata relies on, retrieving the .jnl file would suffice. However, these files are huge and not publicly available, so we download the dump instead.
We use axel to speed up the download and to handle download errors:
axel -a -n 10 --verbose \
http://dumps.wikimedia.your.org/wikidatawiki/entities/latest-all.ttl.gz
Downloaded 109.696 Gigabyte(s) in 1:57:53 hour(s). (16261.77 KB/s)
Ingesting
Once the compressed Turtle file containing all of Wikidata's data is downloaded, we must ingest it into our database. That database is Blazegraph, since we want to remain close to the original endpoint.
Issues
Despite being impressive in many respects, Blazegraph's official support was abandoned years ago. Wikimedia therefore has to maintain its own version of Blazegraph, at least until the team finds a suitable replacement for it.
To remain as close as possible to Wikidata's database, we use their provided tools to ingest the data. We favor an approach based on the CLI: although it cannot ingest multiple files at a time, it is easy to monitor and feels more reliable, since only one process holds the read/write lock on the .jnl file.
However, we had trouble compiling the tools: Lombok was not properly generating constructors, and the Scala compiler dependency threw unintelligible errors. So instead, we used the munge.sh script provided online, and restricted our goal to creating a self-contained blazegraph.jar that contains Wikidata's home-made vocabulary:
- We added a shade target so that DataLoader is part of the exported jar;
- We added the dependency on log4j 1.2.17, which the old DataLoader depends on.
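With these two changes in place, a regular Maven build should produce the self-contained jar. A minimal sketch, assuming a standard Maven layout (the exact module and artifact names depend on the repository; the jar name below is the one used in our ingestion script):
# Build the project after adding the shade configuration, then locate the shaded jar.
# Assumption: the shade plugin is bound to the package phase.
mvn -DskipTests package
find . -name "blazegraph-*-shaded.jar"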
Comments
Wikimedia recommends running munge.sh on the compressed Turtle file. From the original file, it creates a large number of smaller compressed Turtle files.
Note that the original file still exists afterwards, so if you are falling short of 100GB of disk space, remember to clean it up.
./munge.sh -c 50000 \
    -f /DATA/datasets/latest-all.ttl.gz \
    -d /DATA/datasets/wikidata-munged/
…
04:22:45.108 … - Processed 102100000 entities at (2899, 2652, 2354)
04:22:45.108 … - Switching to /DATA/datasets/wikidata-munged//wikidump-000002043.ttl.gz
04:22:47.697 … - Processed 102110000 entities at (2979, 2672, 2363)
04:22:50.619 … - Processed 102120000 entities at (2979, 2672, 2363)
04:22:54.360 … - Processed 102130000 entities at (3022, 2686, 2369)
04:22:57.630 … - Processed 102140000 entities at (2989, 2685, 2371)
04:23:00.375 … - Processed 102150000 entities at (2989, 2685, 2371)
04:23:00.376 … - Switching to /DATA/datasets/wikidata-munged//wikidump-000002044.ttl.gz
04:23:02.485 … - Processed 102160000 entities at (3036, 2700, 2377)
It creates 2044 files, named wikidump-000000001.ttl.gz, wikidump-000000002.ttl.gz, and so on. To plan the ingestion of all these files, we checked that the operation is idempotent, i.e., ingesting a statement multiple times has no more effect than ingesting it once. Thus, the ingestion can be paused and resumed easily if needed.
Timelapse
Finally, we can start the sequential ingestion of the 100GB of downloaded data. The following script runs the ingester for each file in the target $SOURCE_FOLDER. FROM_FILE and TO_FILE allow resuming the ingestion if needed. RWStore.properties is the property file defining Blazegraph's engine and storage. To keep a log of the operation, we redirect every output to ingestion_log.dat. We allocate -Xmx32g of memory to the JVM, although it does not seem to require more than 10GB.
SOURCE_FOLDER="/DATA/datasets/wikidata-munged"
FROM_FILE="/DATA/datasets/wikidata-munged/wikidump-000000257.ttl.gz" # included
TO_FILE="/DATA/datasets/wikidata-munged/wikidump-000003000.ttl.gz"   # excluded

for file in "$SOURCE_FOLDER"/*.ttl.gz; do
    if ([[ "$file" > "$FROM_FILE" ]] || [[ "$file" == "$FROM_FILE" ]]) && [[ "$file" < "$TO_FILE" ]]; then
        java -Xmx32g -cp blazegraph-0.3.154-shaded.jar \
            com.bigdata.rdf.store.DataLoader \
            RWStore.properties "$file" &>> ingestion_log.dat
    fi
done
…
Will load from: /DATA/datasets/wikidata-munged/wikidump-000000589.ttl.gz
Journal file: wikidata.jnl
loading: 15264095 stmts added in 1745.302 secs, rate= 8745, commitLatency=0ms, {failSet=0,goodSet=0}
Load: 15264095 stmts added in 1745.302 secs, rate= 8745, commitLatency=0ms, {failSet=0,goodSet=1}
Total elapsed=1800641ms
…
Using the output log, we plot the number of statements and ingestion
time of each compressed turtle file.
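A minimal sketch of how such numbers can be extracted from the log, assuming the "Load: … stmts added in … secs" format shown above (the output file name is ours); plotting the resulting pairs is then straightforward:
# Extract "statements<TAB>seconds" pairs from the ingestion log.
grep "^Load:" ingestion_log.dat \
    | sed -E 's/^Load: ([0-9]+) stmts added in ([0-9.]+) secs.*/\1\t\2/' \
    > ingestion_stats.tsv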
The top figure shows that the number of statements differs widely from one file to another, but its average remains stable during the whole process. Put in relation with the bottom figure, a large number of statements often means a long ingestion (as expected).
The bottom figure about ingestion times is more informative: ingestion times increase over the course of the process. We suspect this comes from the underlying balanced tree data structure used by Blazegraph for its indexes: not only does the depth of the tree increase, but the dichotomic searches to find the insertion location take longer and longer.
Unfortunately, even for us, this is too slow. Resources are largely underexploited: CPU usage remains very low and RAM usage stays below 10GB. To avoid reopening the journal, which now takes 150s per invocation, and to hopefully benefit from better multithreading, we load folders of 100 compressed Turtle files at a time.
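A minimal sketch of this batching, assuming DataLoader accepts a directory as argument (the batch folder layout and names are ours):
# Group the munged files into folders of 100 via symbolic links,
# then run one DataLoader invocation per folder.
BATCH_ROOT="/DATA/datasets/wikidata-batches"
i=0
for file in /DATA/datasets/wikidata-munged/*.ttl.gz; do
    batch="$BATCH_ROOT/batch-$(printf '%04d' $((i / 100)))"
    mkdir -p "$batch" && ln -s "$file" "$batch/"
    i=$((i + 1))
done

for batch in "$BATCH_ROOT"/batch-*; do
    java -Xmx32g -cp blazegraph-0.3.154-shaded.jar \
        com.bigdata.rdf.store.DataLoader \
        RWStore.properties "$batch" &>> ingestion_log.dat
done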
Running
Once the Blazegraph journal exists, we must start the Passage service that accepts SPARQL queries, executes them on the journal, times out when their execution reaches the 60-second threshold, and sends back a SPARQL continuation query along with the partial results.
The architecture is basic. End-users connect through a GLiCID URL; an nginx redirects the request to our virtual machine, where an ha-proxy forwards it to the SPARQL endpoint in charge of the database:
               ┌ @https://10-54-2-226.gcp.glicid.fr/
user 1 <-----> │                  ┌ @http://localhost:8080/
user 2 <-----> │ ha-proxy <-----> │ passage-server <-> wikidata.jnl
user 3 <-----> │
java -jar passage-server.jar \
    -d /DATA/datasets/watdiv10m-blaze/watdiv10M.jnl \
    --port 8080 \
    --timeout=60000
curl -v -X GET --http1.1 -G \
    --data-urlencode "query=SELECT * WHERE {?s ?p ?o} LIMIT 10" \
    "https://10-54-2-226.gcp.glicid.fr/watdiv10M.jnl/passage"
TODO Deploy ha-proxy in front of Passage.
TODO Automatic restart when the service is down.
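In the meantime, a minimal sketch of an automatic restart, reusing the server command above; a proper service manager (e.g., a systemd unit with Restart=always) would be the cleaner long-term solution:
# Naive supervision loop: restart the server whenever its process exits.
while true; do
    java -jar passage-server.jar \
        -d /DATA/datasets/watdiv10m-blaze/watdiv10M.jnl \
        --port 8080 \
        --timeout=60000
    echo "passage-server exited; restarting in 5s" >&2
    sleep 5
done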
Updating
Wikidata's data is constantly evolving: anyone can add new statements to the knowledge graph. Therefore, ingesting a dump once is not enough for a legitimate Wikidata mirror. It must be updated regularly to follow the changes made by the community.
Fortunately, Wikidata provides deployers with tools to update their database, granted it is a SPARQL endpoint that accepts UPDATE queries. Our mirror must be able to accept updates from, and only from, a Wikidata source.
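For reference, such updates follow the standard SPARQL 1.1 Update protocol over HTTP. A hedged sketch of the kind of request our endpoint will eventually have to accept (the endpoint path mirrors the query example above and is an assumption; the inserted triple is a placeholder, not real Wikidata data):
# A SPARQL 1.1 Update sent over HTTP POST; path and triple are illustrative only.
curl -v -X POST \
    --data-urlencode "update=INSERT DATA { <http://example.org/s> <http://example.org/p> <http://example.org/o> }" \
    "http://localhost:8080/wikidata.jnl/passage"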
TODO Make the endpoint accept updates.
References
- https://addshore.com/2019/10/your-own-wikidata-query-service-with-no-limits/
- https://www.mediawiki.org/wiki/Wikidata_Query_Service/Implementation/Standalone
- https://github.com/wmde/wikibase-release-pipeline
- https://github.com/wikimedia/wikidata-query-rdf
- https://wikidataworkshop.github.io/
- https://wiki.bitplan.com/index.php/Get_your_own_copy_of_WikiData
- Antoine Willerval, Dennis Diefenbach, and Pierre Maret. Easily setting up a local Wikidata SPARQL endpoint using the qEndpoint. 2022.
- Antoine Willerval, Dennis Diefenbach, and Angela Bonifati. qEndpoint: A Wikidata SPARQL endpoint on commodity hardware. 2023.
- Wolfgang Fahl, Tim Holzheim, Andrea Westerinen, Christoph Lange, and Stefan Decker. Getting and hosting your own copy of Wikidata. 2022.
Fahl et al.'s paper constitutes our best entry point to the state of the art.
While creating our first journal file is time-consuming, later replications will be easier.