Umaka Score

August 22, 2017

Abstract

Umaka Score is an index for assessing SPARQL endpoints and the datasets they provide. The Umaka Score is calculated from evaluations of six aspects: Availability, Freshness, Operation, Usefulness, Validity, and Performance. We also rank endpoints on a scale from A to E according to the Umaka Score. This document describes how the Umaka Score is calculated.

1. Umaka Score

Umaka Score represents how valuable an endpoint is. We consider six aspects of a valuable endpoint: Availability, Freshness, Operation, Usefulness, Validity, and Performance. We evaluate and score each endpoint on these aspects, and the Umaka Score is the average of the six scores:

Umaka Score = \(\frac{\displaystyle \sum_{aspects}score}{\displaystyle 6}\)

where

\(\begin{array}{llll} {\rm aspects} = [ & {\rm Availability}, & {\rm Freshness}, & {\rm Operation}, \\ & {\rm Usefulness}, & {\rm Validity}, & {\rm Performance} ] \end{array}\)

We rank endpoints as shown in Table 1.

Table 1. Umaka Rank

Umaka Score    Umaka Rank
81 - 100       A
61 - 80        B
41 - 60        C
21 - 40        D
0 - 20         E
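As a concrete illustration, the following Python sketch computes the Umaka Score as the average of the six aspect scores and maps it to the rank of Table 1. The function names are ours, and the handling of non-integer scores at the band boundaries is our reading of the table.

def umaka_score(scores):
    """Average of the six aspect scores (each on a 0-100 scale)."""
    aspects = ["availability", "freshness", "operation",
               "usefulness", "validity", "performance"]
    return sum(scores[aspect] for aspect in aspects) / len(aspects)

def umaka_rank(score):
    """Map an Umaka Score to a rank according to Table 1."""
    if score >= 81:
        return "A"
    if score >= 61:
        return "B"
    if score >= 41:
        return "C"
    if score >= 21:
        return "D"
    return "E"

scores = {"availability": 95, "freshness": 70, "operation": 50,
          "usefulness": 60, "validity": 80, "performance": 90}
print(umaka_score(scores), umaka_rank(umaka_score(scores)))  # about 74.17 -> rank B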

In Section 2, we show how to score endpoints on each aspect.

2. Metrics of Umaka Score

2.1 Availability

Availability represents the degree to which an endpoint is ready for use. A high availability value means the endpoint can be accessed most of the time; a low value means the endpoint is often down. We measure the following metrics for availability:

  • Alive

    We send the SPARQL query described in Listing 1 to the endpoint URI daily. If the response status is not 2xx, we try the query described in Listing 2. If the endpoint returns a 2xx HTTP response to either query, Alive is true; otherwise it is false.

    Listing 1. A query for checking liveness of an endpoint that supports the GRAPH keyword
    CONSTRUCT {
      ?s ?p ?o .
    }
    WHERE {
      GRAPH ?g {
        ?s ?p ?o .
      }
    }
    LIMIT 1
    Listing 2. A query for checking liveness of an endpoint that does not support the GRAPH keyword
    CONSTRUCT {
      ?s ?p ?o .
    }
    WHERE {
      ?s ?p ?o .
    }
    LIMIT 1
  • Alive Score

    Alive Score is a monitoring score that takes a value from 0 to 100. If the crawler fails to access the endpoint, the score drops from the previous day's score; if the crawler successfully accesses the endpoint, the score rises from the previous day's score.

The Availability score is calculated as:

Availability = Alive Score
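The following Python sketch illustrates this check and score update. It assumes a SPARQL endpoint that accepts queries via HTTP GET with a query parameter; the exact day-by-day increments of the Alive Score are not specified in this document, so the smoothing rate below is only an illustrative assumption.

import requests

# Listing 1 and Listing 2, as single-line query strings.
GRAPH_QUERY = "CONSTRUCT { ?s ?p ?o . } WHERE { GRAPH ?g { ?s ?p ?o . } } LIMIT 1"
PLAIN_QUERY = "CONSTRUCT { ?s ?p ?o . } WHERE { ?s ?p ?o . } LIMIT 1"

def is_alive(endpoint_uri):
    """True if either liveness query gets a 2xx HTTP response."""
    for query in (GRAPH_QUERY, PLAIN_QUERY):
        try:
            response = requests.get(endpoint_uri, params={"query": query}, timeout=30)
        except requests.RequestException:
            continue
        if 200 <= response.status_code < 300:
            return True
    return False

def update_alive_score(previous_score, alive, rate=0.1):
    """Illustrative update rule (an assumption, not the exact UmakaData rule):
    move the previous score a little toward 100 on success and toward 0 on failure."""
    target = 100.0 if alive else 0.0
    return previous_score + rate * (target - previous_score)

# Availability is simply the current Alive Score.
availability = update_alive_score(80.0, is_alive("http://example.org/sparql"))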

2.2 Freshness

Freshness represents how often data in the endpoint is updated. We measure the following metrics for freshness:

  • Last Updated

    We retrieve the Vocabulary of Interlinked Datasets (VoID) description and the SPARQL Service Description and collect the literals specified by dcterms:modified or dcterms:issued. We then take the latest date among those literals as the last updated date.

    Listing 3. Expected triples in VoID for checking freshness (see https://www.w3.org/TR/void/#dublin-core)
    :DBpedia a void:Dataset;
      dcterms:title "DBPedia";
      dcterms:description "RDF data extracted from Wikipedia";
      dcterms:contributor :FU_Berlin;
      dcterms:contributor :University_Leipzig;
      dcterms:contributor :OpenLink_Software;
      dcterms:contributor :DBpedia_community;
      dcterms:source <http://dbpedia.org/resource/Wikipedia>;
      dcterms:modified "2008-11-17"^^xsd:date .
    
    Listing 4. Expected triples in Service Description for checking freshness (see https://www.w3.org/TR/void/#sparql-sd)
    <#service> a sd:Service;
      sd:url <http://example.org/geopedia/sparql>;
      sd:defaultDatasetDescription [
        a sd:Dataset;
        dcterms:title "GeoPedia";
        dcterms:description "A mirror of DBpedia and Geonames";
        void:triples 1100000100;
        sd:defaultGraph [
          a sd:Graph, void:Dataset;
          dcterms:title "GeoPedia SPARQL Endpoint Description";
          dcterms:description "Contains a copy of this SD+VoID file!";
          void:triples 100;
          dcterms:issued "yyyy-mm-dd"^^xsd:date;
        ];
        sd:namedGraph [
          sd:name <http://dbpedia.org/>;
          sd:graph [
            a sd:Graph, void:Dataset;
            dcterms:title "DBpedia";
            foaf:homepage <http://dbpedia.org/>;
            void:triples 1000000000;
            dcterms:issued "yyyy-mm-dd"^^xsd:date;
          ];
        ];
        sd:namedGraph [
          sd:name <http://geonames.org/>;
          sd:graph [
            a sd:Graph, void:Dataset;
            dcterms:title "Geonames";
            foaf:homepage <http://www.geonames.org/ontology/>;
            void:triples 100000000;
            dcterms:issued "yyyy-mm-dd"^^xsd:date;
          ];
        ];
      ] .
  • Update Interval

    Update Interval is the average interval, in days, between consecutive last updated dates observed for the endpoint. Update Interval is N/A if fewer than two last updated dates are available.

Currently, the Freshness score takes a value from 30 to 100, as follows:

Freshness = \(\left\{ \begin{array}{ll} 30 & {\rm if~Update~Interval~is~N/A~or~more~than~365~days} \\ 100 & {\rm if~Update~Interval~is~less~than~30~days} \\ 100 - 70 * ({\rm Update~Interval} - 30) / 335 & {\rm otherwise} \end{array} \right.\)
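The following Python sketch computes the Update Interval from a list of observed last updated dates and applies the formula above; the function and variable names are ours.

from datetime import date

def update_interval(last_updated_dates):
    """Average interval in days between consecutive last updated dates,
    or None (N/A) when fewer than two dates are available."""
    days = sorted(last_updated_dates)
    if len(days) < 2:
        return None
    gaps = [(later - earlier).days for earlier, later in zip(days, days[1:])]
    return sum(gaps) / len(gaps)

def freshness(interval):
    """Freshness score in the range 30-100."""
    if interval is None or interval > 365:
        return 30.0
    if interval < 30:
        return 100.0
    return 100.0 - 70.0 * (interval - 30) / 335

# Example: updates observed roughly every two months give a high Freshness.
print(freshness(update_interval([date(2017, 1, 1), date(2017, 3, 1), date(2017, 5, 1)])))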

2.3 Operation

Operation represents the degree of maintenance of the endpoint. We send an HTTP request to the endpoint URI with an Accept request header that specifies both Turtle and RDF/XML, and validate the format of the response. We measure the following two metrics:

  • Service Description

    True if the Service Description can be retrieved in Turtle or RDF/XML format; otherwise false. We access the endpoint URI via HTTP with the following request header:

    Accept: text/turtle, application/rdf+xml

  • VoID

    True if VoID can be retrieved from the well-known URI [1] (/.well-known/void on the endpoint's host) in Turtle or RDF/XML format; otherwise false. We access that URI via HTTP with the following request header:

    Accept: text/turtle, application/rdf+xml

We calculate Operation score as follows:

Operation = \(\left\{ \begin{array}{ll} 0 & {\rm if~both~are~false} \\ 50 & {\rm if~exactly~one~is~true} \\ 100 & {\rm if~both~are~true} \end{array} \right.\)
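The following Python sketch illustrates these checks. It assumes the Service Description is served from the endpoint URI itself and VoID from the host's /.well-known/void path, and it simplifies the format validation to a content-type check (the real check would also parse the body as RDF).

import requests
from urllib.parse import urlparse

ACCEPT = {"Accept": "text/turtle, application/rdf+xml"}
RDF_TYPES = ("text/turtle", "application/rdf+xml")

def retrievable(url):
    """True if the URL answers 2xx with a Turtle or RDF/XML content type."""
    try:
        response = requests.get(url, headers=ACCEPT, timeout=30)
    except requests.RequestException:
        return False
    content_type = response.headers.get("Content-Type", "").split(";")[0].strip()
    return 200 <= response.status_code < 300 and content_type in RDF_TYPES

def operation(endpoint_uri):
    parsed = urlparse(endpoint_uri)
    void_url = f"{parsed.scheme}://{parsed.netloc}/.well-known/void"
    service_description = retrievable(endpoint_uri)   # Service Description metric
    void = retrievable(void_url)                      # VoID metric
    return 50.0 * service_description + 50.0 * void   # 0, 50, or 100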

2.4 Usefulness

Usefulness represents how easily we can link the data in the endpoint. We measure the following two metrics:

  • Metadata Score

    Metadata Score represents how much of the metadata defined in [3] the endpoint contains.

    If the endpoint supports the GRAPH keyword, we retrieve a list of graphs in the endpoint using the query described in Listing 5; otherwise we use the single unnamed graph, called the background graph.

    Listing 5. Obtain graph URIs on a SPARQL endpoint
    SELECT DISTINCT ?g
    WHERE{
      GRAPH ?g {
        ?s ?p ?o .
      }
    }

    We then try to retrieve the metadata for each graph, excluding the graphs listed in Table 2, as follows:

    Table 2. List of Ignored Graphs
    Graph URI

    http://www.openlinksw.com/schemas/virtrdf#

    1. Classes

      We retrieve a list of classes using the queries described in Listings 6 and 7 if the endpoint supports the GRAPH keyword; otherwise Listings 8 and 9.

      Listing 6. Obtain the classes on a graph g
      PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
      PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
      
      SELECT DISTINCT ?c
      FROM <g>
      WHERE {
        {
          ?c rdf:type rdfs:Class .
        } UNION {
          [] rdf:type ?c .
        } UNION {
          [] rdfs:domain ?c .
        } UNION {
          [] rdfs:range ?c .
        } UNION {
          ?c rdfs:subClassOf [] .
        } UNION {
          [] rdfs:subClassOf ?c .
        }
      }
      LIMIT 100
      Listing 7. Obtain the classes having instances on a graph g
      PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
      
      SELECT DISTINCT ?c
      FROM <g>
      WHERE{
        [] rdf:type ?c .
      }
      Listing 8. Obtain the classes on the background graph
      PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
      PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
      
      SELECT DISTINCT ?c
      WHERE {
        {
          ?c rdf:type rdfs:Class .
        } UNION {
          [] rdf:type ?c .
        } UNION {
          [] rdfs:domain ?c .
        } UNION {
          [] rdfs:range ?c .
        } UNION {
          ?c rdfs:subClassOf [] .
        } UNION {
          [] rdfs:subClassOf ?c .
        }
      }
      LIMIT 100
      Listing 9. Obtain the classes having instances on the background graph
      PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
      
      SELECT DISTINCT ?c
      WHERE {
        [] rdf:type ?c .
      }
    2. Labels

      We retrieve a list of labels using the query described in Listing 10 if the endpoint supports the GRAPH keyword; otherwise Listing 11.

      Listing 10. Obtain labels of the classes c1, c2, ..., cn from a graph g
      PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
      
      SELECT DISTINCT ?c ?label
      WHERE {
        GRAPH <g> {
          ?c rdfs:label ?label .
          FILTER (?c IN (<c1>, <c2>, ..., <cn>))
        }
      }
      Listing 11. Obtain labels of the classes c1, c2, ..., cn from the background graph
      PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
      
      SELECT DISTINCT ?c ?label
      WHERE {
        ?c rdfs:label ?label .
        FILTER (?c IN (<c1>, <c2>, ..., <cn>))
      }
    3. Datatypes

      We retrieve a list of datatypes using the query described in Listing 12 if the endpoint supports the GRAPH keyword; otherwise Listing 13.

      Listing 12. Obtain the datatypes on a graph g
      SELECT DISTINCT (datatype(?o) AS ?ldt)
      FROM <g>
      WHERE{
        [] ?p ?o .
        FILTER (isLiteral(?o))
      }
      Listing 13. Obtain the datatypes on the background graph
      SELECT DISTINCT (datatype(?o) AS ?ldt)
      WHERE{
        [] ?p ?o .
        FILTER (isLiteral(?o))
      }
    4. Properties

      We retrieve a list of properties using the query described in Listing 14 if the endpoint supports the GRAPH keyword; otherwise Listing 15.

      Listing 14. Obtain the properties on a graph g
      SELECT DISTINCT ?p
      FROM <g>
      WHERE{
        ?s ?p ?o .
      }
      Listing 15. Obtain the properties on the background graph
      SELECT DISTINCT ?p
      WHERE{
        ?s ?p ?o .
      }

      We evaluate Metadata score as follows:

      Metadata Score = \(\frac{\displaystyle \sum_{g \in {\rm graphs}}(c(g) + l(g) + p(g) + d(g))}{\displaystyle N}\)

      where

      \(N\) = Number of Graphs

      \(c(g) = \left\{ \begin{array}{ll} 0 & {\rm if~g~does~not~contain~any~classes} \\ 25 & {\rm if~g~contains~at~least~one~class} \end{array} \right.\)

      \(l(g) = \left\{ \begin{array}{ll} 0 & {\rm if~g~does~not~contain~any~labels} \\ 25 & {\rm if~g~contains~at~least~one~label} \end{array} \right.\)

      \(p(g) = \left\{ \begin{array}{ll} 0 & {\rm if~g~does~not~contain~any~properties} \\ 25 & {\rm if~g~contains~at~least~one~property} \end{array} \right.\)

      \(d(g) = \left\{ \begin{array}{ll} 0 & {\rm if~g~does~not~contain~any~datatypes} \\ 25 & {\rm if~g~contains~at~least~one~datatype} \end{array} \right.\)

  • Ontology Score

    Ontology Score, which is calculated from the retrieved metadata, represents how many of the ontologies used in the endpoint are also used by other endpoints handled by the UmakaData system or listed in Linked Open Vocabularies (LOV). We obtain the ontologies in LOV via the LOV API.

    Ontology Score is calculated as follows:

    Ontology Score = \(50.0 * \frac{\displaystyle NOE}{\displaystyle NO} + 50.0 * \frac{\displaystyle NOLOV}{\displaystyle NO}\)

    where

    \(NO\) = Number of ontologies used for properties in the endpoint

    \(NOE\) = Number of those ontologies that are also used by other endpoints

    \(NOLOV\) = Number of those ontologies that are also listed in LOV

    Finally, we evaluate the Usefulness score as follows (a computation sketch is given below):

    \(\begin{array}{lll} {\rm Usefulness} & = & 0.5 * {\rm Metadata~Score} \\ & + & 0.5 * {\rm Ontology~Score} \end{array}\)
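The following Python sketch combines the two metrics into the Usefulness score. It assumes that the per-graph class, label, property, and datatype lists and the ontology sets have already been obtained with the queries above; the function names are ours, NOE and NOLOV are read as the subsets of the endpoint's ontologies that also appear among other endpoints and in LOV, and both sub-scores are treated as values on a 0-100 scale weighted equally.

def metadata_score(graphs):
    """graphs: mapping from graph name to a dict holding the lists retrieved
    by the queries above (keys: classes, labels, properties, datatypes)."""
    total = 0.0
    for g in graphs.values():
        total += 25 * bool(g["classes"])     # c(g)
        total += 25 * bool(g["labels"])      # l(g)
        total += 25 * bool(g["properties"])  # p(g)
        total += 25 * bool(g["datatypes"])   # d(g)
    return total / len(graphs)

def ontology_score(ontologies, other_endpoint_ontologies, lov_ontologies):
    """ontologies: set of ontologies used for properties in this endpoint (NO)."""
    no = len(ontologies)
    noe = len(ontologies & other_endpoint_ontologies)  # NOE
    nolov = len(ontologies & lov_ontologies)           # NOLOV
    return 50.0 * noe / no + 50.0 * nolov / no

def usefulness(metadata, ontology):
    # Both sub-scores are on a 0-100 scale and are weighted equally.
    return 0.5 * metadata + 0.5 * ontology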

2.5 Validity

Validity represents how well the endpoint and the data in it obey the rules. We measure two metrics:

  • Cool URI

    The URI of an endpoint is preferred to be a Cool URI [5], [4].

    We check four criteria:
    1. The host of the endpoint URI should not be specified by an IP address

    2. The port of the endpoint URI should be 80

    3. The endpoint URI should not contain query parameters

    4. The length of the endpoint URI should be less than 30 characters

    Cool URI Score is the fraction of the satisfied criteria (a value between 0 and 1); a computation sketch is given at the end of this section.

  • Linked Data Rule

    Endpoints are preferred to satisfy the four rules of Linked Data [2], but we omit the first rule, "Use URIs as names for things", because it holds naturally for RDF and is therefore meaningless for the Umaka Score.

    We check the remaining three rules of Linked Data:
    1. Use HTTP URIs so that people can look up those names

      We assume that all subjects of statements are things. We search for invalid statements using the query described in Listing 16 if the endpoint supports the GRAPH keyword; otherwise Listing 17. If nothing is found, the endpoint satisfies this rule.

      Note that we ignore Virtuoso-specific graphs, since Virtuoso provides a built-in graph that contains invalid statements.

      Listing 16. A query for searching non-HTTP-URI subjects on a graph g
      SELECT *
      WHERE {
        GRAPH ?g {
          ?s ?p ?o .
        }
        FILTER (!REGEX(STR(?s), "^http://", "i")
                && !isBLANK(?s)
                && ?g NOT IN (<http://www.openlinksw.com/schemas/virtrdf#>))
      }
      LIMIT 1
      Listing 17. A query for searching non-HTTP-URI subjects on the background graph
      SELECT *
      WHERE {
        ?s ?p ?o .
        FILTER (!REGEX(STR(?s), "^http://", "i")
                && !isBLANK(?s))
      }
      LIMIT 1
    2. When someone looks up a URI, provide useful information, using the standards (RDF*, SPARQL)

      We assess this rule by obtaining a subject URI using the query described in Listing 18 if the endpoint supports the GRAPH keyword, otherwise Listing 19, and then dereferencing that URI via HTTP. We assume that the endpoint satisfies this rule if the URI returns any data.

      Note that we ignore Virtuoso-specific graphs, since Virtuoso provides a built-in graph that contains invalid statements.

      Listing 18. A query for a subject on a graph g
      SELECT ?s
      WHERE {
        GRAPH ?g {
          ?s ?p ?o .
        }
        FILTER (isURI(?s)
                && !REGEX(STR(?s), "^http://localhost", "i")
                && ?g NOT IN (<http://www.openlinksw.com/schemas/virtrdf#>))
      }
      LIMIT 1
      OFFSET 100
      Listing 19. A query for a subject on the background graph
      SELECT ?s
      WHERE {
        ?s ?p ?o .
        FILTER (isURI(?s)
                && !REGEX(STR(?s), "^http://localhost", "i")
                && !REGEX(STR(?s), "^http://www.openlinksw.com", "i"))
      }
      LIMIT 1
      OFFSET 100
    3. Include links to other URIs, so that they can discover more things

      We assume that a statement representing a link to another URI uses the vocabulary owl:sameAs or rdfs:seeAlso. If there is any statement whose predicate is owl:sameAs or rdfs:seeAlso, we regard the endpoint as satisfying this rule. We check this using the queries described in Listings 20 and 21 if the endpoint supports the GRAPH keyword; otherwise Listings 22 and 23.

      Listing 20. A query for an owl:sameAs statement on a graph g
      PREFIX owl: <http://www.w3.org/2002/07/owl#>
      
      SELECT *
      WHERE {
        GRAPH ?g {
          ?s owl:sameAs ?o .
        }
      }
      LIMIT 1
      Listing 21. A query for an rdfs:seeAlso statement on a graph g
      PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
      
      SELECT *
      WHERE {
        GRAPH ?g {
          ?s rdfs:seeAlso ?o .
        }
      }
      LIMIT 1
      Listing 22. A query for an owl:sameAs statement on the background graph
      PREFIX owl: <http://www.w3.org/2002/07/owl#>
      
      SELECT *
      WHERE {
        ?s owl:sameAs ?o .
      }
      LIMIT 1
      Listing 23. A query for an rdfs:seeAlso statement on the background graph
      PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
      
      SELECT *
      WHERE {
        ?s rdfs:seeAlso ?o .
      }
      LIMIT 1

    Linked Data Rule Score is the fraction of the satisfied rules (a value between 0 and 1).

We evaluate Validity as follows:

Validity = 40 * Cool URI Score + 60 * Linked Data Rule Score
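The following Python sketch implements the Cool URI criteria and the Validity score; the Linked Data Rule Score is assumed to have been computed from the query-based checks above. Treating an absent port as the default port 80 and detecting only IPv4 addresses are simplifying assumptions of this sketch.

import re
from urllib.parse import urlparse

def cool_uri_score(endpoint_uri):
    """Fraction of the four Cool URI criteria satisfied by the endpoint URI."""
    parsed = urlparse(endpoint_uri)
    checks = [
        re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", parsed.hostname or "") is None,  # 1. host is not an IP address (IPv4 only)
        parsed.port in (None, 80),                                              # 2. port is 80 (an absent port is assumed to mean 80)
        parsed.query == "",                                                     # 3. no query parameters
        len(endpoint_uri) < 30,                                                 # 4. shorter than 30 characters
    ]
    return sum(checks) / len(checks)

def validity(cool_uri, linked_data_rule):
    # Both sub-scores are fractions between 0 and 1.
    return 40.0 * cool_uri + 60.0 * linked_data_rule

print(validity(cool_uri_score("http://example.org/sparql"), 2 / 3))  # 80.0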

2.6 Performance

Performance represents how powerful the endpoint is.

We measure the response times of the two queries described in Listings 24 and 25. The former is the simplest possible query, and we use it to estimate the transfer time. The latter requires a small amount of computation by the endpoint; we believe its execution cost does not vary much with the size of the data.

Listing 24. The simplest query
ASK {}
Listing 25. A query for retrieving classes
SELECT DISTINCT ?c
WHERE {
  GRAPH ?g {
    [] a ?c .
  }
}
LIMIT 100

We define the execution time as:

Execution Time = the difference between the response times of the two queries.

The final value is obtained by averaging the results of three measurements.

After that, we evaluate Performance as:

Performance = \(\left\{ \begin{array}{ll} 100.0 * (1.0 - (({\rm Execution~Time} {\rm ~ / ~} N) * 1000000)) & {\rm if~Execution~Time~is~less~than~1~second} \\ 0 & {\rm Otherwise} \end{array} \right.\)

where

\(N\) = Number of statements
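The following Python sketch illustrates the measurement, assuming a SPARQL endpoint queried via HTTP GET with a query parameter; it measures the two queries three times, averages the differences, and applies the formula above as written.

import time
import requests

ASK_QUERY = "ASK {}"                                                           # Listing 24
CLASS_QUERY = "SELECT DISTINCT ?c WHERE { GRAPH ?g { [] a ?c . } } LIMIT 100"  # Listing 25

def response_time(endpoint_uri, query):
    """Wall-clock time of one HTTP round trip for the given query."""
    start = time.monotonic()
    requests.get(endpoint_uri, params={"query": query}, timeout=60)
    return time.monotonic() - start

def execution_time(endpoint_uri, runs=3):
    """Average, over three runs, of the difference between the two response times."""
    diffs = [response_time(endpoint_uri, CLASS_QUERY) - response_time(endpoint_uri, ASK_QUERY)
             for _ in range(runs)]
    return sum(diffs) / len(diffs)

def performance(exec_time, n_statements):
    """Performance score; n_statements is N, the number of statements."""
    if exec_time >= 1.0:
        return 0.0
    return 100.0 * (1.0 - (exec_time / n_statements) * 1_000_000)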

References

[1] Keith Alexander, Richard Cyganiak, Michael Hausenblas, and Jun Zhao. Describing Linked Datasets with the VoID Vocabulary. https://www.w3.org/TR/void/, March 2011.

[2] Tim Berners-Lee. Linked Data - Design Issues. https://www.w3.org/DesignIssues/LinkedData.html, 2006.

[3] DBCLS. SPARQL queries for SPARQL Builder Metadata. http://www.sparqlbuilder.org/doc/sparql-queries-for-sparql-buildermetadata/.

[4] Leigh Dodds and Ian Davis. Linked Data Patterns - a pattern catalogue for modelling, publishing, and consuming Linked Data. http://patterns.dataincubator.org, 2012.

[5] Leo Sauermann and Richard Cyganiak. Cool URIs for the Semantic Web. https://www.w3.org/TR/cooluris/, December 2008.