squall2sparql: a Translator from Controlled
Campus de Beaulieu, 35042 Rennes cedex, France
Abstract. This paper reports on the participation of the system squall2sparql in the QALD-3 question answering challenge for DBpedia. squall2sparql is a translator from SQUALL, a controlled natural language for English, to SPARQL 1.1, a standard expressive query and update language for linked open data. It covers nearly all features of SPARQL 1.1, and is directly applicable to any SPARQL endpoint.
An open challenge of the Semantic Web [7] is semantic search, i.e., the ability for users to browse and search semantic data according to their needs. Semantic search systems can be classified according to their usability, the expressive power they offer, their compliance with Semantic Web standards, and their scalability. The most expressive approach by far is to use SPARQL [11], the standard RDF query language. SPARQL 1.1 features graph patterns, filters, unions, differences, optionals, aggregations, expressions, subqueries, ordering, etc. However, SPARQL is also the least usable approach, as it is defined at a low level in terms of logic (e.g., variables, Boolean conditions) and relational algebra (e.g., UNION, GROUP BY). A more usable approach is question answering, where users can express their questions in natural language. Natural language is used in search engines in various forms, going from full natural language (e.g., FREyA [2], Aqualog [9]) to mere keywords (e.g., NLP-Reduce [8]) through controlled natural languages (e.g., Ginseng [1]). Questions in natural language are generally translated to SPARQL queries, but usually only a small fragment of SPARQL is used. This means that even if full natural language is allowed, expressiveness is in fact strongly limited.
The objective of QALD challenges (Question Answering over Linked Data) is to challenge, evaluate, and compare Semantic Web systems on the task of question answering on large and real linked datasets: DBpedia and MusicBrainz. This paper reports on the participation of squall2sparql in the question answering task (in English) for DBpedia in the QALD-3 open challenge.
SQUALL (Semantic Query and Update High-Level Language) is a Controlled Natural Language (CNL) for English that has full compliance with Linked Open Data (LOD), and covers nearly all features of SPARQL 1.1, for both queries and updates. The advantage of CNLs is to provide a natural language syntax while retaining the precision and lack of ambiguity of formal languages like SPARQL [6]. The main drawback of CNLs is that users have to learn the language and its disambiguation rules. The system supports the translation from SQUALL to SPARQL, as well as the direct querying of SPARQL endpoints, like DBpedia.
The objective of our participation in the QALD-3 question answering task was to evaluate the capability of SQUALL to express English questions in a natural and precise way, and the capability of squall2sparql to correctly translate SQUALL questions into SPARQL. Therefore, the measures of precision and recall are not enough to evaluate our approach, and the naturalness of SQUALL questions must also be assessed.
Our system squall2sparql, as its name suggests, is a translator from SQUALL to SPARQL. Given a SQUALL sentence, the system first translates it into an intermediate logical representation using a Montague grammar [3]. The intermediate representation is then translated into SPARQL, simply mapping logical constructs to combinations of SPARQL constructs. The produced query can then be sent to any SPARQL endpoint, and results returned. We have shown that SQUALL covers most features of SPARQL 1.1, including aggregations, expressions, updates, and named graphs. The only missing features are graph-level updates (e.g., LOAD), federated queries (i.e., SERVICE), and transitive closures of complex property paths (e.g., (author/^author)+).
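The second stage of this pipeline, from a logical form to SPARQL, can be illustrated with a toy sketch. It is not the squall2sparql implementation: the names `Var`, `ClassAtom`, `PropAtom`, and `to_sparql` are hypothetical, only conjunctions of class and property atoms are handled, and the Montague-grammar parsing stage is omitted entirely. The prefixes dbo: and res:, and the underscore spelling of the entity URI, are assumptions.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class Var:
    """A logical variable, rendered in SPARQL as ?name."""
    name: str


@dataclass(frozen=True)
class ClassAtom:
    """Unary predicate: the subject is an instance of a class URI."""
    subject: Var
    cls: str


@dataclass(frozen=True)
class PropAtom:
    """Binary predicate: subject and object are related by a property URI."""
    subject: Union[Var, str]
    prop: str
    obj: Union[Var, str]


def term(t: Union[Var, str]) -> str:
    """Render a variable as ?name and leave prefixed URIs untouched."""
    return f"?{t.name}" if isinstance(t, Var) else t


def to_sparql(answer: Var, atoms: List[Union[ClassAtom, PropAtom]]) -> str:
    """Map a conjunction of atoms to a SELECT query with one triple per atom."""
    lines = []
    for atom in atoms:
        if isinstance(atom, ClassAtom):
            lines.append(f"  {term(atom.subject)} a {atom.cls} .")
        else:
            lines.append(f"  {term(atom.subject)} {atom.prop} {term(atom.obj)} .")
    body = "\n".join(lines)
    return f"SELECT DISTINCT {term(answer)} WHERE {{\n{body}\n}}"


# Hand-written logical form for a question of the shape
# "Give me all Film-s whose director is <entity>."
x = Var("x")
query = to_sparql(
    x,
    [
        ClassAtom(x, "dbo:Film"),
        PropAtom(x, "dbo:director", "res:Francis_Ford_Coppola"),
    ],
)
print(query)
```

Keeping the logical form separate from the SPARQL rendering mirrors the role the intermediate representation plays in the description above: the surface syntax can change without touching the query generator.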
We now briefly describe SQUALL as a controlled natural language. Content words are nouns, verbs, and proper nouns. Nouns (e.g., Person) and intransitive verbs are interpreted as class URIs or built-in unary predicates. Relation nouns (e.g., child, birthPlace) and transitive verbs (e.g., matches) are interpreted as property URIs or built-in binary predicates. Proper nouns (e.g., res:Germany) are interpreted as entity URIs. Grammatical words are determiners (e.g., a, the, every, at least 10), auxiliary verbs (e.g., is, has), predefined verbs (e.g., shares, relates), imperative verbs (e.g., give me, return), comparative and superlative adjectives (e.g., higher, later, most), aggregation nouns and adjectives (e.g., number, average), interrogative determiners and pronouns (e.g., what, which, how many), coordinations (e.g., and, or, not), and others. Boolean coordinations can be applied to most types of phrases: noun phrases, verb phrases, relative propositions, and sentences. Every proposition has a subject, a verb, and also an object if the verb is transitive. A sentence can be an open question (e.g., starting with Which or What), a closed question (e.g., starting with Whether or using auxiliary verbs and inversion), an imperative-style question (e.g., starting with Give me or Return), or an assertion (for updates).
2 Web forms, examples, and source code can be found from the SQUALL homepage:
http://www.irisa.fr/LIS/softwares/squall.
Complete examples of SQUALL questions are given in the following sections. More details and examples about the SQUALL language and its translation to SPARQL can be found in previous papers [4,5].
The use of squall2sparql in QALD-3 assumes that English questions are reformulated in SQUALL, i.e., Controlled English. Its syntax is regular and sufficiently similar to English so that it can be learned without too much effort. Many examples are available on the SQUALL Web page. Its vocabulary (i.e., nouns and verbs) is made of URIs because there is so far no lexical treatment in squall2sparql. This has the obvious drawback that SQUALL queries look less natural, and that URIs have to be known or discovered manually. However, the advantage is that squall2sparql is directly applicable to any LOD dataset, because no linguistic resource is required. If such a linguistic resource is available, like those produced by the lexicon task of the QALD-3 challenge, it could be combined with SQUALL, using words instead of URIs.
From the training phase, we already learned some of the DBpedia vocabulary, and other URIs were found manually with Google searches and DBpedia browsing. We spent on average a few minutes per question for the reformulation phase. The automatic translation to SPARQL takes much less time than SPARQL query evaluation, and is therefore not an issue.
For illustration purposes, we list below a few original questions along with their SQUALL reformulation. The full list of SQUALL questions can be found in the official results of the QALD-3 open challenge.
1 Which German cities have more than 250000 inhabitants?
Which Town that has country res:Germany has a populationTotal greater than 250000?
2 Who was the successor of John F. Kennedy?
Who is the successor of res:John F. Kennedy?
4 How many students does the Free University in Amsterdam have?
What is the numberOfStudents of res:Vrije Universiteit?
5 What is the second highest mountain on Earth?
Which Mountain has the 2nd highest elevation?
What is the dbp:admittancedate of res:Alberta?
9 Give me a list of all trumpet players that were bandleaders.
Give me all Person-s whose instrument is res:Trumpet and whose occupation is res:Bandleader.
12 Give me all world heritage sites designated within the past five years.
Give me all WorldHeritageSite whose dbp:year is between 2008 and 2013.
23 Do Prince Harry and Prince William have the same mother?
Has ’Prince Harry’ the same dbp:mother as ’Prince William’ ?
26 How many official languages are spoken on the Seychelles?
How many officialLanguage-s of res:Seychelles are there?
28 Give me all movies directed by Francis Ford Coppola.
Give me all Film-s whose director is res:Francis Ford Coppola.
How many spouse-s of res:Nicole Kidman are there?
What is the deathDate of res:Michael Jackson?
Out of the 99 questions, we got the right answers for 80 questions (including the three OUT OF SCOPE questions), and partial answers for 13. Recall is 0.88, precision is 0.93, and the F-measure is 0.90. Errors come:
– from heterogeneity in data (12 errors, questions 1, 6, 17, 19, 29, 33, 39, 60,
– from the user reformulation in SQUALL (2 errors, questions 14, 43),
– from squall2sparql (2 errors, questions 49, 59),
– from the gold standard (2 errors, questions 16, 75),
– from the endpoint (1 error, question 92).
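As a quick consistency check on the figures above, the following sketch recomputes the F-measure and the error total. It assumes that the reported F-measure is the harmonic mean of the reported precision and recall, which the text does not state explicitly; the function name `f_measure` is introduced here for illustration.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the usual F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures reported in the text.
f1 = f_measure(precision=0.93, recall=0.88)
print(round(f1, 2))

# Error counts per source, as listed above.
errors_by_source = {
    "heterogeneity in data": 12,
    "user reformulation": 2,
    "squall2sparql": 2,
    "gold standard": 2,
    "endpoint": 1,
}
total_errors = sum(errors_by_source.values())

# 99 questions in total, 80 of them answered fully correctly.
print(total_errors, 99 - 80)
```

Under that assumption the value rounds to the reported 0.90, and the listed errors add up to the 19 questions (99 minus 80) that were not fully answered.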
Looking at heterogeneity errors in detail, it appears that most of them could be solved simply either by adding generic super-properties in the DBpedia ontology, or by expanding common words (e.g., location, date) into UNION graph patterns. For example, in question 39 “Give me all companies in Munich.”, the implicit relation “has location” can be translated into any of the three RDF properties: dbo:location, dbo:headquarter, dbo:locationCity. This explains why our reformulation in SQUALL “Give me all Company-es whose location is res:Munich.” has recall 0.6 only (the default prefix was used for the DBpedia ontology, so that location stands for dbo:location). If location, or another property, were defined as a super-property of the other properties, the same SQUALL question would have recall 1. Alternatively, assuming linguistic knowledge, the word “location” could be mapped to the graph pattern
{ ?x dbo:location ?y }
UNION { ?x dbo:headquarter ?y }
UNION { ?x dbo:locationCity ?y }
where ?x and ?y respectively stand for the subject and object of the relation. Such graph patterns could easily be exploited in the translation from the intermediate representation to SPARQL without the need to change the SQUALL language and its parsing.
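The expansion of a word such as “location” into a UNION graph pattern can be sketched as follows. This is only an illustration under stated assumptions: the `LEXICON` dictionary and the function `union_pattern` are hypothetical, and squall2sparql does not currently include such a lexicon.

```python
from typing import Dict, List, Sequence

# Hypothetical lexicon: a word maps to the RDF properties it may denote.
LEXICON: Dict[str, List[str]] = {
    "location": ["dbo:location", "dbo:headquarter", "dbo:locationCity"],
}


def union_pattern(properties: Sequence[str], subj: str = "?x", obj: str = "?y") -> str:
    """Build a SPARQL UNION of one triple pattern per candidate property."""
    if not properties:
        raise ValueError("at least one property is required")
    alternatives = [f"{{ {subj} {p} {obj} }}" for p in properties]
    return "\nUNION ".join(alternatives)


pattern = union_pattern(LEXICON["location"])
print(pattern)
```

Because the expansion happens on the property of a single triple pattern, it could be applied when generating SPARQL from the intermediate representation, leaving the SQUALL grammar untouched.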
Another problem related to heterogeneity is that some expected domain and range axioms are not verified in some cases. For example, in question 19 “Give me all people that were born in Vienna and died in Berlin.”, 2 out of the 6 expected answers are not instances of the class Person. This is why our reformulation “Give me all Person-s whose birthPlace is res:Vienna and whose deathPlace is res:Berlin.” missed 2 answers, even though it is arguably equivalent to the original formulation.
The errors coming from the user reformulation of questions are due to misspelling or misunderstanding of URIs. In question 14, “res:Prodigy” was used instead of “res:The Prodigy”. In question 43, the property “dbp:breed” was used in the wrong direction.
The errors coming from squall2sparql are due to an incorrect translation of the special verb “share”. For example, question 49 “Which other weapons did the designer of the Uzi develop?” was reformulated as “Which Weapon shares the dbp:designer with res:Uzi?”, which returns “Uzi” itself as an answer. Another possible reformulation is “Which Weapon has the same dbp:designer as res:Uzi?”, but it exhibits the same error.
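One plausible way to keep the reference entity out of the answers is an inequality FILTER in the generated query. The sketch below is an assumption about what such a query could look like, not the output of squall2sparql nor a fix confirmed by the text; the function `shares_query` is hypothetical, and the dbo: prefix for Weapon follows the default-prefix convention mentioned earlier.

```python
def shares_query(cls: str, prop: str, entity: str, exclude_self: bool = True) -> str:
    """Build a query for instances of `cls` that share a `prop` value with `entity`.

    With exclude_self=True, an inequality FILTER removes `entity` itself,
    which would otherwise trivially share its own property values.
    """
    lines = [
        f"  ?x a {cls} .",
        f"  ?x {prop} ?v .",
        f"  {entity} {prop} ?v .",
    ]
    if exclude_self:
        lines.append(f"  FILTER (?x != {entity})")
    return "SELECT DISTINCT ?x WHERE {\n" + "\n".join(lines) + "\n}"


query = shares_query("dbo:Weapon", "dbp:designer", "res:Uzi")
print(query)
```

Without the FILTER, the pattern is satisfied by binding ?x to res:Uzi, which matches the behaviour reported for question 49.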
The error from the endpoint is because the BIND construct of SPARQL is not (yet) supported by the QALD-3 endpoint. It is possible to write the SPARQL query to avoid it, but squall2sparql relies on it to simplify the translation from SQUALL. Note that the correct answers are returned when using the official DBpedia endpoint.
Regarding the naturalness of SQUALL sentences, most of them are not much
longer than the original ones, and can be understood without learning SQUALL. Most differences fall into three categories:
1. reformulating the question to make it agree with SQUALL’s grammar,
2. replacing a word by another (e.g., movie → Film),
3. making explicit some relations (e.g., “is a chemist” → “has profession
We discuss here a few directions for improving the usability and performance of our approach.
Lexicons. The data independence of SQUALL is valuable as it allows querying all LOD with neither preparation nor linguistic resources. However, when such linguistic resources are available [12], it is a shame not to use them, as they could improve recall, and make SQUALL sentences much more natural at the lexical level. The useful format of lexicons for squall2sparql would be mappings from words to graph patterns. Nouns and intransitive verbs would be mapped to RDF classes or mono-dimensional graph patterns (one free variable), and relational nouns and transitive verbs would be mapped to RDF properties or bi-dimensional graph patterns (two free variables). Such lexicons may be extended to adjectives, adverbs, and prepositions by corresponding extensions of SQUALL’s syntax. A candidate format of lexicons is lemon (lexicon model for ontologies) [10].
Multilinguality. This is an aspect of the QALD-3 challenge that we did not address. A priori, it suffices to define a different concrete syntax for each language, keeping the intermediate representation unchanged. However, for some languages, it may be more difficult than for English, whose morphology is less complex than that of many other languages.
Guidance. Writing questions in a controlled natural language is easier than in a formal language like SPARQL. However, it is still error-prone, and may be frustrating for users. A possible solution that has already been used for CNLs is an auto-completion mechanism that suggests possible completions for the sentence, based on the grammar (e.g., Ginseng [1]). However, this is mostly useful to avoid grammatical errors, but not so helpful to find the right content words. Another participant of the challenge, Scalewelis, proposes a content-based guided approach, where users can build in a flexible way complex queries without the need to know the grammar or the content words.
1. Bernstein, A., Kaufmann, E., Kaiser, C.: Querying the semantic web with Ginseng: A guided input natural language search engine. In: Work. Information Technology and Systems (WITS) (2005)
2. Damljanovic, D., Agatonovic, M., Cunningham, H.: Identification of the question focus: Combining syntactic analysis and ontology-based lookup through the user interaction. In: Language Resources and Evaluation Conference (LREC). ELRA (2010)
3. Dowty, D.R., Wall, R.E., Peters, S.: Introduction to Montague Semantics. D. Reidel
4. Ferré, S.: SQUALL: a controlled natural language for querying and updating RDF graphs. In: Kuhn, T., Fuchs, N. (eds.) Controlled Natural Languages. pp. 11–25. LNCS 7427, Springer (2012)
5. Ferré, S.: SQUALL: a controlled natural language as expressive as SPARQL 1.1. In: Métais, E. (ed.) Int. Conf. Application of Natural Language to Information Systems (NLDB). pp. 114–125. LNCS 7934, Springer (2013)
6. Fuchs, N.E., Kaljurand, K., Schneider, G.: Attempto Controlled English meets the challenges of knowledge representation, reasoning, interoperability and user interfaces. In: Sutcliffe, G., Goebel, R. (eds.) FLAIRS Conference. pp. 664–669. AAAI Press (2006)
7. Hitzler, P., Krötzsch, M., Rudolph, S.: Foundations of Semantic Web Technologies.
8. Kaufmann, E., Bernstein, A.: Evaluating the usability of natural language query languages and interfaces to semantic web knowledge bases. J. Web Semantics 8(4), 377–393 (2010)
9. Lopez, V., Uren, V., Motta, E., Pasin, M.: Aqualog: An ontology-driven question answering system for organizational semantic intranets. Journal of Web Semantics 5(2), 72–105 (2007)
10. McCrae, J., Spohr, D., Cimiano, P.: Linking lexical resources and ontologies on the semantic web with lemon. In: Extended Semantic Web Conference (ESWC). pp. 245–259. LNCS 6643, Springer (2011)
11. Pérez, J., Arenas, M., Gutierrez, C.: Semantics and complexity of SPARQL. In: et al, I.F.C. (ed.) Int. Semantic Web Conf. pp. 30–43. LNCS 4273, Springer (2006)
12. Walter, S., Unger, C., Cimiano, P.: A corpus-based approach for the induction of ontology lexica. In: Int. Conf. Applications of Natural Languages to Information Systems (NLDB). pp. 102–113. LNCS 7934, Springer (2013)