Ontolog Forum
OntologySummit2014_Hackathon - Project
Optimized SPARQL performance management via native API
Project roster page: http://ontolog.cim3.net/cgi-bin/wiki.pl?OntologySummit2014_Hackathon_OptimizedSPARQLviaNativeAPI (this page).
Team: Victor Chernov (MSK, UTC+4) vchernov at nitrosbase.com (lead), Vladislav Golovkov (MSK, UTC+4) vgolovkov at nitrosbase.com
Post-event updates
The "Optimized SPARQL performance management via native API" hackathon took place on March 29, 2014, 14:00 - 18:00 MSK (virtual session), with follow-up activities during the next day.
Four people participated in the event:
- Victor Chernov (team lead), General Manager at NitrosData Rus, Russia, vchernov@nitrosbase.com;
- Vladislav Golovkov, System Architect at NitrosData Rus, Russia, vgolovkov@nitrosbase.com;
- Andrej Andrejev, Ph.D. student at Uppsala University, Sweden, andrej.andrejev@it.uu.se;
- Vladimir Salnikov, Head of QA Department at Compile Group, Russia, vladimir.salnikov@compilesoft.ru.
During the event the triplestores were installed, benchmark queries were prepared, and experiments were set up and run. The results were discussed and published (see the link below).
The main conclusions are:
- Performance is a bottleneck for ontological technologies.
- It is desirable to have direct access to the database rather than access through a TCP protocol.
- Sometimes it is worth simplifying the queries as much as possible and doing some of the processing on the client.
- Very often what is difficult to do with a single large query is easy to implement with a set of small ones. In such cases the triplestore should be able to execute small queries quickly. A further performance gain can be achieved by giving users direct access to the database, bypassing SPARQL processing.
- The myth that RDF databases are slower than SQL databases no longer holds. RDF stores perform fast and can compete with SQL databases.
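The "many small queries" conclusion can be sketched in Python. The `authors_by_article` and `year_by_article` dicts below are stand-ins for a triplestore's native point-lookup API; the data and helper names are hypothetical, for illustration only.

```python
# Instead of one large join query, fetch candidate keys first, then look up
# details per key on the client. Plain dicts stand in for a triplestore's
# fast native point-lookup calls (an assumption for illustration).
authors_by_article = {"a1": ["Erdoes"], "a2": ["Erdoes", "Renyi"]}
year_by_article = {"a1": 1950, "a2": 1959}

def articles_by_author(author):
    # small query 1: scan for matching articles
    return [a for a, authors in authors_by_article.items() if author in authors]

def year_of(article):
    # small query 2: point lookup per article (fast via a native API)
    return year_by_article[article]

# client-side composition replaces a single large join query
result = {a: year_of(a) for a in articles_by_author("Erdoes")}
# result == {"a1": 1950, "a2": 1959}
```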
The report can be downloaded from
http://nitrosbase.com/wp-content/uploads/OptimizedSPARQLreportV13.zip
Event announcement
Participation: You are welcome to participate, please send an E-mail to vchernov at nitrosbase.com.
The event starts on March 29, 2014, at 14:00 MSK / 10:00 UTC / 03:00 PST.
Communication:
- Google Hangout - we will connect you via the e-mail address you provide;
- Skype - will be used as an additional tool if the number of Google Hangout connections is exceeded. Please add vladislav.golovkov to your Skype contact list.
A Caveat Concerning the IPR Policy Conformance:
- To allow us to gain insights into relevant "open" as well as "non-open" products and technologies, this session will be featured under a special waiver to the prevailing IPR Policy. We may be working on, and discussing, some "non-open" technologies. Vendors of such technologies are welcome to participate (on the condition that proprietary portions of their presentations are specifically stated as such). However, despite the waiver, do note that we will (as usual) be making the results of the session available to the community and the public at large under our prevailing Open IPR Policy.
The goal of the project is to study the kinds of queries that reveal the advantages of one RDF database or another. This implies:
- Selecting a SPARQL query subset from SP2Bench
- Forming a dataset and loading it into all triplestores
- Implementing measurement aids and testing them
- Measuring time accurately; computing min, max, average and median times
- Reflecting on the results: the advantages and disadvantages of each triplestore on each selected query
The following triplestores will be compared: Virtuoso, Stardog and NitrosBase.
The triplestores have the following important advantages:
- Very high performance demonstrated on the SP2Bench benchmark
- Linux and Windows versions
- Native API for fast query processing
It is important to use a native API for fast query execution. All three tools provide a native API:
- Virtuoso
  - Jena, Sesame and Virtuoso ODBC RDF Extensions for SPASQL
- Stardog
  - the core SNARL (Stardog Native API for the RDF Language) classes and interfaces
- NitrosBase
  - C++ and .NET native API
We expect to write additional code needed for accurate testing:
- Accurate time measurement;
- Functions for computing min, max, average and median times;
- Functions for measuring the time of scanning through the whole query result;
- Functions for measuring the time of retrieving the first few records (for example, the first page of a web grid);
- Etc.
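The measurement aids above can be sketched as follows. This is a minimal Python sketch; `run_query` and `fetch_iter` are placeholders for whatever call executes the benchmark query through a store's native API, not part of any vendor's actual interface.

```python
import statistics
import time

def time_query(run_query, repetitions=5):
    """Run `run_query` several times and report min, max, average and
    median wall-clock times. `run_query` is a placeholder for a native-API
    query call (an assumption for illustration)."""
    timings = []
    for _ in range(repetitions):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    return {
        "min": min(timings),
        "max": max(timings),
        "average": statistics.mean(timings),
        "median": statistics.median(timings),
    }

def time_first_n(fetch_iter, n=20):
    """Time retrieval of only the first n records (e.g. the first page of
    a web grid). `fetch_iter` returns an iterator over result records."""
    start = time.perf_counter()
    for i, _ in enumerate(fetch_iter()):
        if i + 1 >= n:
            break
    return time.perf_counter() - start
```

Timing the scan through the whole result is the same pattern with the early `break` removed.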
The following steps are needed for loading the test dataset:
- Selecting a data subset from the SP2Bench benchmark
- Measuring the data loading time
Note: Data are considered loaded as soon as the system is ready to perform the simplest search query. This is done to eliminate the influence of background processes (e.g. indexing).
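The load-time rule in the note can be sketched as a polling loop. `start_load` and `is_queryable` below are hypothetical placeholders; every store exposes its own loading and query calls.

```python
import time

def measure_load_time(start_load, is_queryable, poll_interval=0.5, timeout=3600.0):
    """Measure load time as the interval from the start of the bulk load
    until the store answers a simplest search query.

    `start_load` kicks off the load; `is_queryable` returns True once a
    trivial query succeeds. Both are store-specific placeholders. Stopping
    the clock at "ready to answer a trivial query" excludes background
    work such as index building, as the note above requires.
    """
    start = time.perf_counter()
    start_load()
    while time.perf_counter() - start < timeout:
        if is_queryable():
            return time.perf_counter() - start
        time.sleep(poll_interval)
    raise TimeoutError("store did not become queryable within the timeout")
```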
We are going to explore query execution performance for the databases under consideration (Virtuoso, Stardog, NitrosBase).
The queries should be fairly simple and cover different techniques, for example:
- Searching a small range of values
- Searching a big range of values
- Sorting
- Aggregation
- Several different join queries
- Retrieving part of the result
- Retrieving the whole result
- Etc.
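One illustrative query shape per category might look as follows. The prefixes follow the SP2Bench vocabulary, but the patterns themselves are hypothetical sketches, not the queries actually used in the experiments.

```python
# SPARQL prefixes for the SP2Bench vocabulary.
PREFIXES = """\
PREFIX rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX bench:   <http://localhost/vocabulary/bench/>
PREFIX dc:      <http://purl.org/dc/elements/1.1/>
PREFIX dcterms: <http://purl.org/dc/terms/>
"""

# One illustrative shape per query category listed above (sketches only).
QUERY_SHAPES = {
    "small_range": "SELECT ?a WHERE { ?a dcterms:issued ?yr . FILTER (?yr = 1950) }",
    "big_range": "SELECT ?a WHERE { ?a dcterms:issued ?yr . FILTER (?yr >= 1940 && ?yr <= 2005) }",
    "sorting": "SELECT ?a ?yr WHERE { ?a dcterms:issued ?yr } ORDER BY ?yr",
    "aggregation": "SELECT (COUNT(?a) AS ?n) WHERE { ?a rdf:type bench:Article }",
    "join": "SELECT ?a ?c WHERE { ?a rdf:type bench:Article . ?a dc:creator ?c }",
    "first_page": "SELECT ?a WHERE { ?a rdf:type bench:Article } LIMIT 20",
    "whole_result": "SELECT ?a WHERE { ?a rdf:type bench:Article }",
}

# Complete queries ready to submit to a SPARQL endpoint or native API.
full_queries = {name: PREFIXES + body for name, body in QUERY_SHAPES.items()}
```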
Note: During testing, each database may allocate a lot of resources, which can affect the performance of the other databases. That is why each test should be started from a system reboot.