## Umaka Score

August 22, 2017

Abstract

Umaka Score is an index to help assess the datasets and SPARQL endpoints. Umaka Score is calculated on the basis of the evaluation from the 6 aspects; Availability, Freshness, Operation, Usefulness, Validity and Performance. We also rank the endpoints on a scale of A to E according to the Umaka score. In this document, we show how to calculate Umaka score.

## 1. Umaka Score

Umaka Score represents how valuable endpoints are. We believe there are six aspects, Availability, Freshness, Operation, Usefulness, Validity and Performance, for valuable endpoints. We evaluate and score endpoints from these aspects. Then Umaka Score is average of these score:

Umaka Score = $\frac{\displaystyle \sum_{aspects}score}{\displaystyle 6}$

where

$\begin{array}{llll} {\rm aspects} = [ & {\rm Availability}, & {\rm Freshness}, & {\rm Operation}, \\ & {\rm Usefulness}, & {\rm Validity}, & {\rm Performance} ] \end{array}$

We rank the endpoint as shown in Table 1

Table 1. Umaka Rank
Umaka Score Umaka rank

81 - 100

A

61 - 80

B

41 - 60

C

21 - 40

D

0 - 20

E

In Section 2, we show how to score endpoints from each aspect.

## 2. Metrics of Umaka Score

### 2.1 Availability

Availability represents the degree of ready for use. High availability value means we can access the endpoint most of all the time. Low availability value means the endpoint is often down. We measure the following metrics for availability:

• Alive

We send HTTP request to the endpoint URI daily. If the endpoint return 200 HTTP response, alive is true, otherwise false. Note that we assume the endpoint as dead when the endpoint returns 302 Found.

• Alive Rate

Alive rate is the percentage of alive in 30 days.

The Availability score is calculated as:

Availability = Alive Rate

### 2.2 Freshness

Freshness represents how often data in the endpoint is updated. We measure the following metrics for freshness:

• Last Updated

We retrieve Service Description and VoID and get the literals specified by dcterms:modified. Then we assume the last updated as the latest date among those literals.

Unfortunately, most of all endpoints provide Service Description and VoID without dcterms:modified statement.

Thus, we determine the data modification thought the following adhoc procedure:

First, we retrieve a number of statements using a query described in Listing 1.

Listing 1. A query for retrieving a number of statements
SELECT (COUNT(*) AS ?count)
WHERE {?s ?p ?o}

We retrieve the first and the last statements using queries described in Listing 2 and 3. Then we compare them with those of a previous day. If one of them is different, we assume the endpoint seems to be updated today.

Listing 2. A query for the first statement
SELECT *
WHERE {?s ?p ?o}
OFFSET 0 LIMIT 1
Listing 3. A query for the last statement
SELECT *
WHERE {?s ?p ?o}
OFFSET ($COUNT - 1) LIMIT 1 Note that$COUNT represents the number of "nearly total statements" which is subtracted the number of count error in the following note from the number of total statements retrieved by Listing 1.

Note that Virtuoso has trouble to count the number of statements. So we round off the count error beyond the digit of 0.05% of the number of statemnts.

• Update Interval

Update Interval is average of the interval between last updated. Update Interval is N/A if there are less than two last updated dates for the endpoint.

Currently, freshness score takes value from 30 to 100 as follows.

 Freshness = 30 if update interval is N/A or more than 365 = 100 if update interval is less than 30 = 100 - 70 * (update interval -30) / 335 otherwise

### 2.3 Operation

Operation represents the degree of the maintenance. We send HTTP request to the endpoint URI with using the accept request-header field to specify both Turtle and RDF/XML, and validate the format of its response. We measure the two metrics:

• Service Description

true if Service Description can be retrieved in Turtle format or RDF/XML format, otherwise false. We access the endpoint URI via HTTP with the following HTTP Request Header:

Accept: text/turtle,application/rdf+xml

• VoID

true if VoID can be retrieved from well-known URI[1] in Turtle format or RDF/XML format, otherwise false. We access the endpoint URI via HTTP with the following HTTP Request Header:

Accept: text/turtle,application/rdf+xml

We calculate Operation score as follows:

Operation = $\left\{ \begin{array}{ll} 0 & {\rm if~both~of~them~are~false } \\ 50 & {\rm if~one~of~them~is~false } \\ 100 & {\rm if~both~of~them~are~true } \end{array} \right.$

### 2.4 Usefulness

Usefulness represents the degree how easily we can link data in the endpoint. We measure the three metrics:

Metadata Score represents how much the endpoint contains metadata defined in [3].

If there is GRAPH clause being applied, we retrieve a list of graphs in the endpoint using a query described in Listing 4, otherwise we use one graph which does not have a name, called the background graph.

Listing 4. Obtain graph URIs on a SPARQL endpoint
SELECT DISTINCT ?g
WHERE{
GRAPH ?g{ ?s ?p ?o.}
}

Then we try to retrieve the metadata for each graph except for Table 2 as follows:

Table 2. List of Ignore Graphs
Graph URI

1. Classes

we retrieve a list of classes using a query described in Listing 5 and 6 if there is GRAPH clause being applied; otherwise Listing 7 and 8.

Listing 5. Obtain the classes on a graph g
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT DISTINCT ?c
FROM <g>
WHERE {
{ ?c rdf:type rdfs:Class. }
UNION
{ [] rdf:type ?c. }
UNION
{ [] rdfs:domain ?c. }
UNION
{ [] rdfs:range ?c. }
UNION
{ ?c rdfs:subclassOf []. }
UNION
{ [] rdfs:subclassOf ?c. }
}
LIMIT 100
Listing 6. Obtain the classes having instances on a graph g
PREFIX rdf:
SELECT DISTINCT ?c
FROM <g>
WHERE{
[] rdf:type ?c.
}
Listing 7. Obtain the classes on the background graph
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT DISTINCT ?c
WHERE {
{ ?c rdf:type rdfs:Class. }
UNION
{ [] rdf:type ?c. }
UNION
{ [] rdfs:domain ?c. }
UNION
{ [] rdfs:range ?c. }
UNION
{ ?c rdfs:subclassOf []. }
UNION
{ [] rdfs:subclassOf ?c. }
}
LIMIT 100
Listing 8. Obtain the classes having instances on the background graph
PREFIX rdf:
SELECT DISTINCT ?c
WHERE{
[] rdf:type ?c.
}
2. Labels

We retrieve a list of labels using a query described in Listing 9 if there is GRAPH clause being applied; otherwise Listing 10.

Listing 9. Obtain labels of the classes c1 c2 …​ cn from a graph g
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?c ?label
WHERE {
graph <g> {
?c rdfs:label ?label.
filter (
?c IN (<c1>, <c2>, ..., <cn>)
)
}
}
Listing 10. Obtain labels of the classes c1 c2 …​ cn from the background graph
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?c ?label
WHERE {
?c rdfs:label ?label.
filter (
?c IN (<c1>, <c2>, ..., <cn>)
)
}
}
3. Datatypes

We retrieve a list of datatypes using a query described in Listing 11 if there is GRAPH clause being applied; otherwise Listing 12.

Listing 11. Obtain the datatypes on a graph g
SELECT DISTINCT (datatype(?o) AS ?ldt)
FROM <g>
WHERE{
[] ?p ?o.
FILTER(isLiteral(?o))
}
Listing 12. Obtain the datatypes on the background graph
SELECT DISTINCT (datatype(?o) AS ?ldt)
WHERE{
[] ?p ?o.
FILTER(isLiteral(?o))
}
4. Properties

We retrieve a list of properties using a query described in Listing 13 if there is GRAPH clause being applied; otherwise Listing 14.

Listing 13. Obtain the properties on a graph g
SELECT DISTINCT ?p
FROM <g>
WHERE{
?s ?p ?o.
}
Listing 14. Obtain the properties on the background graph
SELECT DISTINCT ?p
WHERE{
?s ?p ?o.
}

We evaluate Metadata score as follows:

Metadata Score = $\frac{\displaystyle \sum_{graphs}^{g}(c(g) + l(g) + p(g) + d(g))}{\displaystyle N}$

where

$N$ = Number of Graphs

$c(g) = \left\{ \begin{array}{ll} 0 & {\rm if~g~does~not~contains~any~classes} \\ 25 & {\rm if~g~contains~more~than~zero~classes} \end{array} \right.$

$l(g) = \left\{ \begin{array}{ll} 0 & {\rm if~g~does~not~contains~any~labels} \\ 25 & {\rm if~g~contains~more~than~zero~labels} \end{array} \right.$

$p(g) = \left\{ \begin{array}{ll} 0 & {\rm if~g~does~not~contains~any~properties} \\ 25 & {\rm if~g~contains~more~than~zero~properties} \end{array} \right.$

$d(g) = \left\{ \begin{array}{ll} 0 & {\rm if~g~does~not~contains~any~datatypes} \\ 25 & {\rm if~g~contains~more~than~zero~datatypes} \end{array} \right.$

• Ontology Score

Ontology Score, which is calculated based on metadata, represents how much ontologies data used among endpoints handled in the UmakaData system or used in Linked Open Vocabularies (LOV). We can obtain ontologies in LOV via LOV API.

Ontology Score is calculated as follows:

Ontology Score = $50.0 * \frac{\displaystyle NOE}{\displaystyle NO} + 50.0 * \frac{\displaystyle NOLOV}{\displaystyle NO}$

where

$NO$ = Number of Ontologies used for Properties

$NOE$ = Number of Ontologies used for Properties among other endpoints

$NOLOV$ = Number of Ontologies used for Properties in LOV

At last, we evaluate Usefulness Score as follows:

$\begin{array}{lll} {\rm Usefulness} & = & 50.0 * {\rm Metadata~Score} \\ & + & 50.0 * {\rm Ontology~Score} \end{array}$

### 2.5 Validity

Validity represents how endpoint and data in it obey the rules. We measure the two metrics:

• Cool URI

The URI of endpoints is preferred to be Cool URI[5], [4].

We check four criteria:
1. A host of URI of endpoints should not be specified by IP address

2. A port of URI of endpoints should be 80

3. A URI of endpoints should not contain query parameters

4. A length of URI of endpoints should be less than 30 characters

Cool URI Score is a percentage of the satisfied rules.

Though the endpoints are preferred to be satisfied with the four rules of linked data[2], we omit "1. Use URIs as names for things". This is because the first rule is natural for RDF and it is meaningless for umakadata score.

We check the four rules of Linked Data except "1. Use URIs as names for things":
1. Use HTTP URIs so that people can look up those names

We assume all subjects of statements are things. We search invalid statement using a query described in Listing 15 if there is GRAPH clause being applied; otherwise Listing 16. If nothing is found the endpoint satisfied this rule.

Note that we ignore Virtuoso specific graphs since Virtuoso contains a graph which contains invalid statements.

Listing 15. A Query for searching non-HTTP-URI subjects on a graph g
SELECT
*
WHERE {
GRAPH ?g { ?s ?p ?o } .
filter (!regex(STR(?s), "^http://", "i") && !isBLANK(?s) && ?g NOT IN (
))
}
LIMIT 1
Listing 16. A Query for searching non-HTTP-URI subjects on the background graph
SELECT
*
WHERE {
filter (!regex(STR(?s), "^http://", "i") && !isBLANK(?s))
}
LIMIT 1
2. When someone looks up a URI, provide useful information, using the standards (RDF*, SPARQL)

We assess this rule by obtaining a subject (URI) using a query described in Listing 17 if there is GRAPH clause being applied otherwise Listing 18 , and accessing the URI via HTTP protocol. We assume that the endpoint is satisfied with the rule if the URI returns any data.

Note that we ignore Virtuoso specific graphs since Virtuoso contains a graph which contains invalid statements.

Listing 17. A Query for a Subject on a graph g
SELECT
?s
WHERE {
GRAPH ?g { ?s ?p ?o } .
filter (isURI(?s) && !regex(STR(?s), "^http://localhost", "i") && ?g NOT IN (
))
}
LIMIT 1
OFFSET 100
Listing 18. A Query for a Subject on the background graph
SELECT
?s
WHERE {
{ ?s ?p ?o } .
filter (isURI(?s) && !regex(STR(?s), "^http://localhost", "i") && !regex (STR(?s), "^http://www.openlinksw.com", "i"))
}
LIMIT 1
OFFSET 100
3. Include links to other URIs. so that they can discover more things

We assume the statement representing the link to other URI uses the vocabularies owl:sameAs or rdfs:seeAlso. We think if there are any statement of which property is owl:sameAs or rdfs:seeAlso, the endpoint is satisfied with the rule. We check the feasibility of the rule by using queries described in Listing 19, 20 if there is GRAPH clause being applied; otherwise Listing 21, 22

Listing 19. A Query for a Same AS Statement on a graph g
PREFIX owl:<http://www.w3.org/2002/07/owl#>
SELECT
*
WHERE {
GRAPH ?g { ?s owl:sameAs ?o } .
}
LIMIT 1
Listing 20. A Query for a See Also Statement on a graph g
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT
*
WHERE {
GRAPH ?g { ?s rdfs:seeAlso ?o } .
}
LIMIT 1
Listing 21. A Query for a Same AS Statement on the background graph
PREFIX owl:<http://www.w3.org/2002/07/owl#>
SELECT
*
WHERE {
?s owl:sameAs ?o .
}
LIMIT 1
Listing 22. A Query for a See Also Statement on the background graph
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT
*
WHERE {
?s rdfs:seeAlso ?o .
}
LIMIT 1

Linked Data Score is a percentage of the satisfied rules.

We evaluate Validity as follows:

Validity = 40 * Cool URI Score + 60.0 * Linked Data Rule Score

### 2.6 Performance

Performace suggests how powerful the endpoint is.

We measure the response times of the two queries, Listing 23, 24. The former query is a most simple query and we use this query to estimate the transfer time. The latter query requires a little computations for endpoints. We believe the execution cost of this query does not differ very much according to the size of data.

Listing 23. A Most Simple Query
Listing 24. A Query for retrieving the number of classes
SELECT DISTINCT
(COUNT(?class) AS ?c)
WHERE {
{ [] a ?class .}
}

We assume the execution time as:

Execution Time = Differences of the response time for those queries.

After that, we evaluate Performance as:

Performance = $\left\{ \begin{array}{ll} 100.0 * (1.0 - (({\rm Execution~Time} {\rm ~ / ~} N) * 1000000)) & {\rm if~Execution~Time~is~less~than~1~second} \\ 0 & {\rm Otherwise} \end{array} \right.$

where

N = Number of statements

## References

[1] Keith Alexander, Richard Cyganiak, Michael Hausenblas, and Jun Zhao. Describing linked datasets with the void vocabulary. https://www.w3.org/TR/void/, March 2011.