
HACKING DBPEDIA WITH SPARQL

BASICS

The general form of a SPARQL query is:


PREFIX ...
...
SELECT ...
FROM ...
WHERE {
triples and filters
}
GROUP BY ...
HAVING ...
ORDER BY ...
LIMIT ...
An RDF database consists of a set of interconnected triples.
A triple has three elements, in this specific order:
SUBJECT -> PROPERTY -> OBJECT
We can say that a triple states that a subject has a property whose value is
the object.
Examples:
CAT IS BLACK, TOM (IS AN) ACTOR, PETER (IS A) NAME.
An RDF URL identifies a resource through which we can access every property of
a subject.
In DBpedia each subject and property is an actual URL. Objects can be URLs or
literal values such as strings.
Example url of a subject:
http://dbpedia.org/resource/JavaScript
Opening this URL in a browser shows the properties and values of this
subject.

Counting all triples in a specific RDF graph database can be done with:

SELECT COUNT(*) {
?subject ?property ?object.
}

Here ?subject, ?property and ?object are variables that match any possible
value in an RDF graph.
So there are four possible kinds of selection on an RDF graph:
1. With three variables: ?subject ?property ?object (filtering the result
as needed)
2. With two variables
3. With one variable
4. With no variables, when we know every value. (In this case we probably just
want to know whether the triple exists in the graph.)
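
The four kinds of selection can be illustrated outside SPARQL with a small sketch. It assumes a toy in-memory list of triples (the data and the match helper are made up for illustration and are not part of DBpedia or any library); None plays the role of a ?variable.

```python
# A toy in-memory triple store; the data below is made up for illustration.
TRIPLES = [
    ("cat", "color", "black"),
    ("tom", "type", "actor"),
    ("tom", "name", "Tom"),
    ("peter", "type", "name"),
]


def match(subject=None, prop=None, obj=None):
    """Return the triples matching the pattern; None acts like a ?variable."""
    pattern = (subject, prop, obj)
    return [
        triple
        for triple in TRIPLES
        if all(want is None or want == have for want, have in zip(pattern, triple))
    ]


all_triples = match()                          # 1. three variables
about_tom = match(subject="tom")               # 2. two variables
tom_types = match(subject="tom", prop="type")  # 3. one variable
exists = bool(match("cat", "color", "black"))  # 4. no variables: existence check
```

A real SPARQL engine does the same kind of matching against the whole graph, usually with indexes instead of a linear scan.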

DEFINING AND USING PREFIXES

For convenience, SPARQL lets us create an alias for a URL.
This alias is called a prefix.
DBpedia has a lot of predefined prefixes, listed here:
http://dbpedia.org/sparql?nsdecl
In SPARQL a prefix is created with the following syntax:
PREFIX alias:<URL>
For example, the previous JavaScript subject URL can be given a prefix with the
following definition:
PREFIX javascript:<http://dbpedia.org/resource/JavaScript>
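
As a rough sketch of what a prefix means, a prefixed name is expanded by concatenating the prefix URL and the part after the colon. The expand helper and the PREFIXES table below are hypothetical names used only for illustration; the rdfs URL is the standard RDF Schema namespace.

```python
# Prefix table: the javascript entry comes from the definition above,
# the rdfs entry is the standard RDF Schema namespace.
PREFIXES = {
    "javascript": "http://dbpedia.org/resource/JavaScript",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
}


def expand(prefixed_name):
    """Expand 'prefix:local' to a full URL by plain string concatenation."""
    prefix, _, local = prefixed_name.partition(":")
    return PREFIXES[prefix] + local


subject_url = expand("javascript:")  # empty local part: the prefix URL itself
label_url = expand("rdfs:label")
```

This is why a prefix that already holds the complete subject URL is written with nothing after the colon.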
Without a prefix, the JavaScript resource is queried like this:
SELECT ?js_label WHERE {
<http://dbpedia.org/resource/JavaScript> rdfs:label ?js_label.
}

Using the prefix, the same query can be written like this:

PREFIX javascript: <http://dbpedia.org/resource/JavaScript>
SELECT ?js_label WHERE {
javascript: rdfs:label ?js_label.
}

This query returns the label in all available languages. If we want to see only
the English version, we can add a filter for it:
PREFIX javascript: <http://dbpedia.org/resource/JavaScript>
SELECT ?js_label WHERE {
javascript: rdfs:label ?js_label.
FILTER (LANG(?js_label)='en')
}

The result is now only the English label. Notice that an @en language tag is
appended to it. If we don't want to display that, we can use the STR function
to cast the result into a plain string.
PREFIX javascript: <http://dbpedia.org/resource/JavaScript>
SELECT str(?js_label) AS ?js_label WHERE {
javascript: rdfs:label ?js_label.
FILTER (LANG(?js_label)='en')
}

I used the AS operator to give the result column a custom name.
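
The effect of LANG and STR can be mimicked with a short sketch. It assumes that a language-tagged literal is modeled as a (value, language) pair, and the German label is made-up sample data; lang and to_str are hypothetical stand-ins for the SPARQL functions.

```python
# Language-tagged literals modeled as (value, language) pairs.
# The German label is made-up sample data.
labels = [
    ("JavaScript", "en"),
    ("JavaScript (Sprache)", "de"),
]


def lang(literal):
    """Counterpart of LANG(): the language tag of a literal."""
    return literal[1]


def to_str(literal):
    """Counterpart of STR(): the plain value without the language tag."""
    return literal[0]


english = [lit for lit in labels if lang(lit) == "en"]  # FILTER (LANG(...)='en')
plain = [to_str(lit) for lit in english]                # SELECT str(...)
```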

QUERIES FOR EXACT TERMS

finding subjects by label for a specific language
finding types and type labels for subjects
casting labels into strings
searching labels case insensitively
excluding some DBpedia paths like YAGO

Most of the time, we don't know the subject, but we know the label name. In
this case, the subject can be searched using the query:
SELECT ?subject WHERE {
?subject rdfs:label "JavaScript"@en.
}

This finds every subject that has the English label JavaScript. Note that we
must add the language tag after our string, otherwise we won't get any
result.
Most subjects have one or more TYPE properties. These types can be
accessed with multi-line queries for a given label.
Multiple triple patterns are separated by a . (dot).
Listing all the types of the JavaScript subject can be done with:
SELECT ?subject ?types WHERE {
?subject rdfs:label "JavaScript"@en.
?subject rdf:type ?types.
}

When the same subject is used in two or more consecutive triple patterns,
the query can be written in this short form:
SELECT ?subject ?types WHERE {
?subject rdfs:label "JavaScript"@en;
rdf:type ?types.
}

Note that the first pattern has a ; (semicolon) at the end, and the next line
has only two elements. Its subject is the same as on the first line.
This selection shows the links to each type. If we want to see the LABEL of
these types, we must add two more lines to the query:

SELECT ?subject ?types STR(?type_label) WHERE {
?subject rdfs:label "JavaScript"@en;
rdf:type ?types.
?types rdfs:label ?type_label.
FILTER (LANG(?type_label)='en')
}

Most of the time we don't know the exact letter case of the term we want to
find in DBpedia. We can convert the labels to lowercase and filter on that to
find the subjects.
SELECT ?subject {
?subject rdfs:label ?subject_label.
FILTER(LANG(?subject_label)='en')
filter(lcase(str(?subject_label)) = 'javascript')
}

Sometimes we want to exclude specific URLs from the result. This selection
shows the types of the term Barcode. The result contains a lot of classes from
the YAGO domain.
SELECT DISTINCT ?type ?label {
<http://dbpedia.org/resource/Barcode> rdf:type ?type.
<http://dbpedia.org/resource/Barcode> rdfs:label ?label.
FILTER (LANG(?label)='en')
}

To exclude these YAGO classes, a filter should be added:

SELECT DISTINCT ?type ?label {
<http://dbpedia.org/resource/Barcode> rdf:type ?type.
<http://dbpedia.org/resource/Barcode> rdfs:label ?label.
FILTER (LANG(?label)='en')
FILTER (!STRSTARTS(STR(?type), "http://dbpedia.org/class/yago/") )
}

FINDING TERMS IN CATEGORIES


use of categories
use of recursive search on categories
DBpedia has a lot of categories, and sometimes we want to get the subjects of a
specific category.
Let's say we want to show every subject in the Occupations category.
We can achieve that with:
SELECT ?subject {
?subject dcterms:subject category:Occupations.
}

And to show the labels instead of the URLs:

SELECT str(?subject_label) {
?subject dcterms:subject category:Occupations;
rdfs:label ?subject_label.
}

We can observe that this selection returns a very small list of occupations.
This is because DBpedia has a lot of subcategories for occupations, and not
every subject is related directly to category:Occupations; most of them are
connected to it through a path of other category nodes.
If we want more occupations listed, we can specify a recursive selection
of the subcategories:
SELECT str(?subject_label) str(?category_label) {
?subject dcterms:subject ?category;
rdfs:label ?subject_label.
?category skos:broader{,1} category:Occupations;
rdfs:label ?category_label.
FILTER(LANG(?category_label)='en').
}

This lists every term related to category:Occupations directly or through one
level of subcategories, as specified by skos:broader{,1}.
Unlimited depth can be selected with skos:broader* (or skos:broader+ to require
at least one step), but it will most probably be very slow.
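
The depth limit can be pictured as a breadth-first walk over the category hierarchy. The sketch below assumes a made-up hierarchy stored in a dictionary; the category names and the narrower_within helper are hypothetical and only illustrate the idea, not how the endpoint evaluates the path.

```python
from collections import deque

# Made-up category hierarchy: child -> list of broader (parent) categories.
BROADER = {
    "Engineering_occupations": ["Occupations"],
    "Software_engineers": ["Engineering_occupations"],
    "Medical_occupations": ["Occupations"],
}


def narrower_within(root, max_depth=None):
    """Return the categories that reach `root` through at most `max_depth`
    broader steps (the root itself counts as zero steps).
    max_depth=None means unlimited depth."""
    found = {root: 0}  # category -> number of steps to the root
    queue = deque([root])
    while queue:
        current = queue.popleft()
        depth = found[current]
        if max_depth is not None and depth >= max_depth:
            continue  # do not descend below the depth limit
        for child, parents in BROADER.items():
            if current in parents and child not in found:
                found[child] = depth + 1
                queue.append(child)
    return set(found)


one_level = narrower_within("Occupations", max_depth=1)  # like skos:broader{,1}
unlimited = narrower_within("Occupations")               # like skos:broader*
```

The unlimited walk has to visit every reachable subcategory, which hints at why the unbounded path is slow on a hierarchy as large as DBpedia's.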

SEARCHING TERMS IN ONTOLOGIES

limiting search results
searching in ontologies
using a regexp filter on results
using full-text search

What about listing the first 1000 universities from DBpedia?

PREFIX univ_ontology: <http://dbpedia.org/ontology/University>
SELECT str(?university_label) {
?univ_subject rdf:type univ_ontology:;
rdfs:label ?university_label.
FILTER(LANG(?university_label)='en')
} Limit 1000
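
To run such a query from a script, a request URL could be built as sketched below. The endpoint address is inferred from the prefix-list URL mentioned earlier, and the format parameter is an assumption about what the endpoint accepts; build_request_url is a hypothetical helper, and no network request is made.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Endpoint inferred from the prefix-list URL in the text (an assumption).
ENDPOINT = "http://dbpedia.org/sparql"

QUERY = """
PREFIX univ_ontology: <http://dbpedia.org/ontology/University>
SELECT str(?university_label) {
  ?univ_subject rdf:type univ_ontology:;
                rdfs:label ?university_label.
  FILTER(LANG(?university_label)='en')
} LIMIT 1000
"""


def build_request_url(query, result_format="application/sparql-results+json"):
    """Return a GET URL carrying the query; nothing is sent over the network."""
    # "query" is the SPARQL protocol parameter; "format" is assumed to be
    # honored by the endpoint and may need adjusting.
    return ENDPOINT + "?" + urlencode({"query": query, "format": result_format})


url = build_request_url(QUERY)

# Decoding the URL again shows that the query text survives the encoding.
params = parse_qs(urlsplit(url).query)
```

The resulting URL can then be fetched with any HTTP client.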

Most of the time we know only part of the label, and we want to check whether
it belongs to a term of a certain kind.
For instance, let's say we have the term Petru Maior and we want to check
whether it is part of the name of a university.
One solution is to use a regexp filter with this string on the result.
PREFIX univ_ontology: <http://dbpedia.org/ontology/University>
SELECT str(?university_label) {
?univ_subject rdf:type univ_ontology:;
rdfs:label ?university_label.
FILTER(LANG(?university_label)='en')
FILTER(REGEX(?university_label, "Petru Maior", "i"))
}

Notice the i parameter of REGEX, which makes the match case insensitive.

The result is Petru Maior University of Târgu Mureș, which is correct, but
notice that the search was very slow. For better performance we can use a
full-text search.
PREFIX univ_ontology: <http://dbpedia.org/ontology/University>
SELECT str(?university_label) {
?univ_subject rdf:type univ_ontology:;
rdfs:label ?university_label.
?university_label bif:contains "'Petru Maior'".
FILTER(LANG(?university_label)='en')
}

Instead of filtering the result with a regexp, we used
?university_label bif:contains "'Petru Maior'", which does a full-text search
on each university label. Notice the dramatic speed increase with the same
result.
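
The speed difference can be understood with a rough analogy: a regexp filter has to test every label, while a full-text search looks words up in a prebuilt index. The sketch below uses made-up labels and a very simple word index; it only illustrates the idea and makes no claim about how the endpoint implements bif:contains.

```python
import re
from collections import defaultdict

# Made-up sample labels; a real dataset would hold far more.
LABELS = [
    "Petru Maior University of Târgu Mureș",
    "University of Bucharest",
    "Babeș-Bolyai University",
]


def regex_search(pattern):
    """Scan every label, as a REGEX filter does."""
    compiled = re.compile(pattern, re.IGNORECASE)
    return [label for label in LABELS if compiled.search(label)]


def build_index(labels):
    """Map each lowercased word to the set of labels containing it."""
    index = defaultdict(set)
    for label in labels:
        for word in re.findall(r"\w+", label.lower()):
            index[word].add(label)
    return index


INDEX = build_index(LABELS)


def indexed_search(phrase):
    """Find candidate labels by word lookup, then confirm the exact phrase."""
    words = re.findall(r"\w+", phrase.lower())
    if not words:
        return []
    candidates = set.intersection(*(INDEX.get(word, set()) for word in words))
    return sorted(c for c in candidates if phrase.lower() in c.lower())


by_regex = regex_search("petru maior")
by_index = indexed_search("Petru Maior")
```

Both calls return the same label, but the indexed version only touches the labels that contain the searched words.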

DBPEDIA DISAMBIGUATES
getting the types of subjects (a kind of tagging)
working with disambiguated terms
using UNION
Sometimes a subject can have more meanings than the first result from DBpedia
suggests.
For example, let's query the types of the label Apache.
SELECT STR(?type_label) WHERE {
?subject rdfs:label "Apache"@en;
rdf:type ?types.
?types rdfs:label ?type_label.
FILTER (LANG(?type_label)='en')
}

As a result, we get concept and ethnic group. But we know that Apache can
also be an HTTP server. How can we include that in the result as well?
This is where we can use DBpedia's disambiguation feature.
For this we need to find the disambiguation URL for this term. This is done by:
SELECT DISTINCT ?subject_disambiguation_url{
?subject rdfs:label "Apache"@en;
rdf:type ?subject_type.
?subject_disambiguation_url dbpedia-owl:wikiPageDisambiguates ?subject.
}

Now that we have this URL, we can use it to select the subjects that share the
same disambiguation URL.
SELECT DISTINCT ?disamb_subjects{
?subject rdfs:label "Apache"@en;
rdf:type ?subject_type.
?subject_disambiguation_url dbpedia-owl:wikiPageDisambiguates ?subject.
?subject_disambiguation_url dbpedia-owl:wikiPageDisambiguates ?disamb_subjects.
}

We see that there are several, and the one we need is among them:

http://dbpedia.org/resource/Apache_HTTP_Server
Now we can list each type for each of these disambiguated subjects related to
Apache.
SELECT DISTINCT ?disamb_subjects_types{
?subject rdfs:label "Apache"@en;
rdf:type ?subject_type.
?subject_disambiguation_url dbpedia-owl:wikiPageDisambiguates ?subject.
?subject_disambiguation_url dbpedia-owl:wikiPageDisambiguates ?disamb_subjects.
?disamb_subjects rdf:type ?disamb_subjects_types.
}

And of course we can show only the labels of these types, in English.
SELECT DISTINCT str(?disamb_subjects_labels){
?subject rdfs:label "Apache"@en;
rdf:type ?subject_type.
?subject_disambiguation_url dbpedia-owl:wikiPageDisambiguates ?subject.
?subject_disambiguation_url dbpedia-owl:wikiPageDisambiguates ?disamb_subjects.
?disamb_subjects rdf:type ?disamb_subjects_types.
?disamb_subjects_types rdfs:label ?disamb_subjects_labels.
FILTER(LANG(?disamb_subjects_labels)='en')
}

What if we have multiple terms and want to show the disambiguations for
each of them in one query? For this we can use UNION in this way:
SELECT DISTINCT ?subject str(?disamb_subjects_labels){
{?subject rdfs:label "Apache"@en.} UNION {?subject rdfs:label "Java"@en.}
?subject rdf:type ?subject_type.
?subject_disambiguation_url dbpedia-owl:wikiPageDisambiguates ?subject.
?subject_disambiguation_url dbpedia-owl:wikiPageDisambiguates ?disamb_subjects.
?disamb_subjects rdf:type ?disamb_subjects_types.
?disamb_subjects_types rdfs:label ?disamb_subjects_labels.
FILTER(LANG(?disamb_subjects_labels)='en')
}

We used curly braces to group the two alternatives and combined them with
UNION:

{?subject rdfs:label "Apache"@en.} UNION {?subject rdfs:label "Java"@en.}

The result shows all the disambiguations for Apache and then for Java.
Of course, the subject URLs can be turned into labels by adding another
line, ?subject rdfs:label ?subject_label, and selecting this instead of ?subject.
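
What UNION does can be sketched in a few lines: each branch is evaluated on its own and the solutions are put together. The (subject, label) pairs and both helpers below are made up for illustration.

```python
# Made-up (subject, label) pairs standing in for rdfs:label triples.
LABEL_TRIPLES = [
    ("resource/Apache", "Apache"),
    ("resource/Java", "Java"),
    ("resource/JavaScript", "JavaScript"),
]


def subjects_with_label(label):
    """Solutions of a single branch: ?subject rdfs:label "<label>"@en."""
    return [subject for subject, text in LABEL_TRIPLES if text == label]


def union(*solution_lists):
    """Concatenate the solutions of every branch, as UNION does."""
    combined = []
    for solutions in solution_lists:
        combined.extend(solutions)
    return combined


subjects = union(subjects_with_label("Apache"), subjects_with_label("Java"))
```

The remaining triple patterns of the query are then matched against each subject in the combined list.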

MULTIPLE PURPOSE CHECKING


how to check which specific domains a term is related to
This example shows how we can use SELECT and UNION to find out whether a term
is related to a PROFESSION, CITY, TOWN, COUNTRY, or EDUCATIONAL INSTITUTION.
select
(count(?is_profession) as ?profession_count)
(count(?is_city) as ?city_count)
(count(?is_town) as ?town_count)
(count(?is_country) as ?country_count)
(count(?is_edu_inst) as ?edu_inst_count)
where{
{
?is_profession rdfs:label "Sovata"@en.
FILTER(EXISTS { ?is_profession dbpprop:type dbpedia:Profession })
}
UNION
{
?is_city rdfs:label "Sovata"@en.
FILTER(EXISTS { ?is_city rdf:type dbpedia-owl:City })
}
UNION
{
?is_country rdfs:label "Sovata"@en.
FILTER(EXISTS { ?is_country rdf:type dbpedia-owl:Country })
}
UNION {
?is_town rdfs:label "Sovata"@en.
FILTER(EXISTS { ?is_town rdf:type dbpedia-owl:Town })
}
UNION {
?is_edu_inst rdfs:label ?lab.
filter(lang(?lab)='en')
filter(regex(?lab, "Petru Maior", 'i'))
FILTER(EXISTS { ?is_edu_inst rdf:type dbpedia-owl:EducationalInstitution})
}
}
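
The idea behind this query, one check per domain with a count for each, can be sketched as follows. The type data is made up, and domain_counts is a hypothetical helper; a real check would read the types from DBpedia.

```python
# Made-up type assertions: label -> set of type names of the matching resource.
TYPES_BY_LABEL = {
    "Sovata": {"Town", "Settlement"},
    "Romania": {"Country"},
}

DOMAINS = ["Profession", "City", "Town", "Country"]


def domain_counts(label):
    """Return 1 for every domain the labelled resource is typed with, else 0."""
    types = TYPES_BY_LABEL.get(label, set())
    return {"is_" + domain.lower(): int(domain in types) for domain in DOMAINS}


sovata = domain_counts("Sovata")
```

A count above zero in a column means the term was found in that domain.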
