RDF-izing Tabular Data

Ontotext Refine is designed to facilitate the production of RDF from tabular data.

Depending on the particular use case and requirements, the user can choose either the visual RDF mapping interface or directly write your mapping as a SPARQL query.

Ontotext Refine also exposes a virtual SPARQL endpoint on top of the tabular data, which can be used in more complex data integration scenarios.

Connecting to a GraphDB instance

The user can connect a GraphDB instance using the Setup page in the Refine Menu or by setting the corresponding Refine Configuration Parameters.

Connecting to a GraphDB instance is not required in order to produce RDF. However it will further facilitate your work by allowing:

  • Auto-completion of classes and properties from an existing ontology
  • Access to prefixes
  • Easy access to the virtual SPARQL endpoint directly from the GraphDB workbench

Important

If using GraphDB to autocomplete classes and properties make sure that autocomplete is enabled

Using Тhe Visual RDF Mapper

The Visual RDF Mapper allows mapping the tabular data progressively and interactively, one triple at a time.

The Mapper User interface

The headers of all columns in the tabular data are displayed as boxes that you can drag and drop into a mapping cell to configure them.

_images/visual-mapper-columns.png

Following are the BASE and PREFIX definitions. New prefixes can be added using SPARQL syntax. Prefixes can be edited and copied.

_images/visual-mapper-base-and-prefix.png

The Mapping Table contains the actual mapping.

Each row in the represents an RDF triple constructed from the tabular data. If two (or more) triples have the same subject, they will be displayed as a triple with one subject and two (or more) predicates. Analogically, if two (or more) triples have the same subject + predicate, but different objects, they will be displayed as a triple with one subject + predicate and multiple different objects.

_images/visual-mapper-mapping-table.png

The mapping is applied consecutively to each row in the input table.

Note

The table is modeled after the Turtle language syntax. If you look closely you will see the familiar use of the . , ; and , separators at the end of each row.

Creating a mapping

Creating a new mapping is done one step at a time by configuring the values at the Subject, Predicate and Object positions of each triple.

For each position in the mapping table the user defines a value mapping, i.e he describes how to create the RDF value based on the input tabular data. In order to determine the value, the user should specify the value source, the value type and the value transformation.

The Value Source

This is the source of the value in the associated position, it can be one of the following:

  • Column: the value of the cell in the mentioned column
  • Constant: a constant value
  • Record ID: The record identifier from the OpenRefine project
  • Row Index: The index of the row
  • GREL: a GREL expression
_images/visual-mapper-value-source.png

Tip

You can also refer to a column directly from a cell in the mapping table by typing @ followed by the name of the column.

The Value Type

The value type defines the type of RDF value that will be produced by the value mapping. The possible types are:

  • Resource (abstract): An RDF resource. This is an abstract type that specifies the common features of IRIs and blank nodes. A resource value type may have type mappings and property mappings on its own.
  • IRI: An RDF IRI. The transformed value is the IRI with illegal characters escaped automatically. This is a subtype of the abstract Resource value type.
  • Blank node based on value: (value Bnode) An RDF blank node. The transformed value is used to calculate a reasonable blank node identifier such that identical transformed values produce the same blank node. This is a subtype of the abstract Resource value type.
  • Unique blank node: (unique Bnode) An RDF blank node. The transformed value is ignored and a unique blank node is created every time. This is a subtype of the abstract Resource value type.
  • Any literal (abstract): Any kind of RDF literal (plain, language, or datatype). This is an abstract type that unifies all literal value types.
  • Literal: An RDF plain literal. The transformed value is the literal’s label.
  • Literal with a language: An RDF literal with a language. The transformed value is the literal’s label. The language is a simple literal value mapping, i.e., identical to a value mapping with type literal.
  • Literal with a datatype: An RDF literal with a datatype. The transformed value is the literal’s label. The datatype is a simple IRI value mapping.

This window shows how to produce a xsd:gYear typed literal from the values of the startYear column.

_images/visual-mapper-literal-gyear.png

The Value Transformation

Each value mapping may have an optional transformation, applied to the data received from the value source, before the value is created.

Refine supports two types of transformations:

  • GREL: A GREL expression defines the transformation
  • Prefixing: a known prefix is applied to the value and the result is given the IRI data type

This window shows a GREL expression being used to construct a URL from the values of the tconst column.

Note the instant preview of the result of the GREL expression.

_images/visual-mapper-grel-uri.png

Prefixes

The mapping tree contains a set of prefixes that are used in the cell configuration. They are defined in the prefix area and can be of three types:

  • Default prefixes from commonly used RDF schemas, such as foaf, geo, rdf, rdfs, skos, xsd. You can select entities from these schemas without importing them in your repository.
  • Prefixes that you select from the GraphDB repository,
  • Prefixes that you create yourself.

Tip

It is possible to extend prefixes when using them in the visual mapper. Entering movie:episode/ will extend the predefined movie: prefix with episode/.

Preview Mode

_images/visual-mapper-preview-mode-controls.png

Besides the default Configuration mode, the RDF mapper can be switched to Preview mode. This mode shows the RDF resulting from mapping the first row of the table. The URIs are clickable and can be used to validate if they are well constructed.

_images/visual-mapper-preview-mode.png

Hint

For more advanced validation and feedback, the user can export well formatted turtle using the :guilabel:RDF button at any time and inspect it in a suitable text editor.

Nesting Triples and Blank Nodes

It is possible to nest triples directly from the Visual RDF Mapper.

Clicking on the arrow symbol of a the cell in the object position will start a nested set of triples, where that resource is implicitly in the subject position. This is indicated by the green frame around the nested triples.

The following mapping:

_images/visual-mapper-blank-node.png

Produces the following RDF:

movie:tt0093296 mdbo:starring [
   mdbo:actor actor:nm0141056;
   mdbo:ordering 3;
   mdbo:character "Heinrich", "Herman", "Hans"
] .

from the following source:

_images/visual-mapper-blank-node-source.png

Note

The transformation above is not trivial.

Note how we used a GREL forEach to parse the JSON string and and produce three triples with the mdbo:character predicate .

Reusing a Mapping

Once created, a mapping can be saved using the Save button.

It will be saved as part of the overall project. A mapping can be exported as a JSON file and saved using the Download JSON button. Respectively such a file can be restored with the Upload JSON button.

The user can also reuse a mapping by:

  • Convert it to SPARQL open it in a GraphDB query editor (see below)
  • Convert it to SPARQL from the Refine SPARQL editor (see below)

Using the SPARQL Query Editor

The SPARQL Query Editor allows directly creating and editing mappings аs SPARQL queries.

_images/sparql-mapper-editor.png

The input tabular data is read one row at a time and the value in each cell is bound as a variable, named after the column header.

The user is then free to manipulate the values with all the expressiveness that SPARQL allows.

The output RDF is shaped, using either a CONSTRUCT query or a INSERT query, from a remote endpoint, using SPARQL federation.

Mapping Syntax and Dedicated SPARQL Variables

The mapping query uses standard SPARQL syntax with a few special variables and statements.

The Mapping Variables

Values from the input table are mapped to variables, named after the corresponding columns, prefixed with a c_.

These variables are not readily available and they must be mentioned in a BIND() statement in order to be activated.

The values are always initially typed as strings and we recommend using the same BIND() statement to handle any data type conversions. If the desired datatype is string, the variable still needs to be bound as a variable with a different name.

Here is an example with the following conversions:

  • the contents of the startYear is converted to a literal with a xsd:gYear data type,
  • the runtimeMinutes column is converted to an integer,
  • the primaryTitle column is not converted and remains a string.
BIND(STRDT(?c_startYear,xsd:year) AS ?startYear )
BIND(STRDT(?c_runtimeMinutes,xsd:integer) AS ?runtimeMinutes )
BIND( ?c_primaryTitle AS ?primaryTitle )

Tip

Click on the name of the columns on top of the window!

The generic BIND() statement corresponding to that column will be generated in the WHERE clause of the query.

The Row and Record index

Two more variables are reserved by the SPARQL mapping. They correspond to OpenRefine project metadata:

  • ?row_index contains the index of the row being processed
  • ?record_id contains the id of the record being processed (if in record mode)

The mapper:grel Magic Predicate

GREL expressions can be used directly in the SPARQL query, using the mapper:grel magic predicate.

The subject part of the triple contains an expression with two arguments, the ?rowIndex and a string corresponding to the GREL expression itself. The variable in the object position contains the result.

(?row_index "replace(cells['column_A'].value,'foo','bar'") mapper:grel ?result .

Hint

A lot of the tasks for which we can use GREL can also be achieved by SPARQL 1.1 string functions, with much less syntactic complexity.

Making mappings using SPARQL

While it is possible to just start from scratch and write a query, Ontotext Refine has several features for bootstrapping this process.

Generating such mapping queries can be done either from the Visual RDF Mapper or from a predefined template.

Generate a Query from the Visual RDF Mapper

If the user has defined a mapping by using the Visual RDF Mapper, the Generate Query ‣ From Mapping menu item will translate the mapping to a SPARQL CONSTRUCT query.

This SPARQL query:

CONSTRUCT {
  ?s1 a mdbo:Movie ;
      mdbo:title ?o_title ;
      mdbo:year ?o_year .
} WHERE {
  BIND(IRI(mapper:encode_iri(movie:, ?c_tconst)) as ?s1)
  BIND(STR(?c_primaryTitle) as ?o_title)
  BIND(STRDT(?c_startYear, xsd:gYear) as ?o_year)
}

Has been automatically generated from the following visual mapping:

_images/sparql-mapper-visual-source.png

Important

Any modifications to the query are not reflected back to the Visual RDF Mapper.

Generate a Query from Template

If a user wants to start directly working with SPARQL, the Generate Query ‣ Standard Template option is the easiest way to start.

This feature will produce a SELECT query with all the BIND() statements, corresponding to the columns of the tabular data.

The recommended workflow in this scenario is the following:

  • First the user chooses which columns he needs and removes the unneeded BIND() statements
  • Second, he fixes any data type issues, directly in the BIND() statements using STRDT() .
  • He finally crafts the URIs so they fit the URI format of his project.

This is an iterative process the user can execute the query many times in order to inspect the results.

Once satisfied, the user changes the SELECT to a CONSTRUCT query and defines the shape of the output RDF graph.

The result in Turtle format can then be downloaded using the Download ‣ Result menu item.

Tip

The SPARQL Mapper can be a powerful data exploration tool! You can use it to check for inconsistencies, to count occurrences, and in general to get a better understanding of the data you are transforming. Also remember that you can use an additional BIND() or FILTER() statements, to fix a certain value and constrain result set to a small subset of the data, centered around a (potentially problematic) value.

Alternatively if the project has too many columns, the individual BIND() statements can be generated one by one by clicking on the names of the columns in the top band.

Important

Values from the tabular data always come typed as strings. It is up to the user to convert them to the desired data type using the STRDT() SPARQL function.

Saving the Mapping Queries

All the mapping queries in the SPARQL Query Editor are saved as part of the Refine project. This is done after clicking on the Save button but is only effective for queries after they have been executed.

Warning

Modifications in a query are not saved if the query has not been executed after having been modified.

The queries can also be saved using the Download ‣ Query menu item.

Tip

The tabs can be renamed by double-clicking on their names.

Data Integration Using the Virtual SPARQL Endpoint

Thanks to SPARQL federation, advanced users can produce RDF using Ontotext Refine by using the SPARQL endpoint directly from a GraphDB instance.

This is the recommended method for integrating data using Ontotext Refine for a number of reasons:

  • A user can use SPARQL INSERT to directly write triples in the GraphDB repository
  • Federated queries allow joining values from the Refine project and the existing data in the graph. This allows for much finer control on the shape of the resulting RDF graph.
  • No intermediary files are created
  • The user can specify the context using a SPARQL GRAPH clause and can even define it dynamically from values in the data.

Accessing the Virtual Endpoint

Ontotext Refine exposes a SPARQL endpoint for each project.

The URL of the endpoint corresponds to the Refine base URL, followed by repositories/ontorefine:, followed by the identifier of the project.

If a Project Alias is set it will be used in place of the project identifier to form the virtual endpoint URL.

Aliases are useful when the mapping queries are stored externally and need to be persistent, because they will be used on newly created projects.

In order to run the federated queries, Refine can generate the queries with the corresponding SERVICE clause, wrapping the statements in the WHERE clause. This is done by choosing the corresponding menu items in the Generate query menu:

The Open in GraphDB button also inserts the SERVICE clause in the current query before opening it in the associated GraphDB instance.

Note

The URL of the virtual SPARQL endpoint might need to be changed, to fit the particular user’s architecture. Such are cases where Refine runs inside a docker container or needs to be accessed from another machine.

This window shows a INSERT query, ready to write triples in a GraphDB repository.

_images/virtual-endpoint-sparql-example.png