31 December 2014

A Round-Trip to RDF

This is my first post on this blog, and I would like to start off by sharing an application I developed as part of my studies at École Centrale Paris. Its name is RDF Usine and it is meant to convert delimited plain text files (e.g. CSV) into RDF format.

I have published not only the executable library but also the complete source code.

Without further ado, let's see what the application looks like:

RDF Usine


Application Interface


As the figure above illustrates, the interface is made up of three main components. Each of them is briefly described below:
  • The "Configuration Pane" is on the left. It is used to define the parameters that are specific to each input file format.
  • The "Input File Preview Pane" is on the top right. It shows a raw preview of the input file.
  • The "Output RDF Files Preview Pane" is on the bottom right. This pane shows a tabular representation of the input file (in the "Table View" tab) and a preview of what the RDF output files will look like (in the "Turtle" and "N-TRIPLE" tabs).

Main Features


The key features of RDF Usine are enumerated as follows:
  • The application is multilingual! It supports English, French and Spanish. You can easily change the language from the "Language" menu:

    Language Menu

    If you speak other languages and would like to contribute translations, your collaboration will be much appreciated! Please contact me for more details.
  • RDF Usine was developed using JavaFX, so it can run not only on Windows but also on Linux, Mac OS X and other operating systems.
  • RDF Usine also "speaks" two RDF formats. Information can be exported in:
    • Turtle
    • N-Triples
  • The range of accepted file encodings is quite broad, including UTF-8, ISO-8859 and US-ASCII, among others.
  • Multi-line fields are supported. For example:

    Multi-line fields example

  • It is possible to preview the complete input and output files. Depending on your needs, the preview can also be restricted to the first 10, 100 or 1000 rows (for big files, it is advisable to use one of these filters). Use the "Preview" menu to do so:

    Preview Menu

  • Configurations can be saved and loaded so that they can be applied again later. This includes all settings specified in the "General settings", "Fields" and "Prefixes" tabs:

    Configuration Menu

  • Possible field delimiters:
    • Semicolon: ;
    • Pipe: |
    • Comma: ,
    • Space
    • Tabulation
    • Dollar Sign: $

    Field delimiters

  • Escape characters may also be configured:

    Escape char.

  • Headers can be:
    • Read from a user-defined row number of the file.
    • Defined manually in the "2) Fields" tab.

    Headers

  • Entity classes can be defined as free text.

    Entity type

  • Need to specify where processing of the input files should start and end?
    That is quite easy with the file "Boundaries" options!

    Boundaries

  • Subjects can optionally have a text prefix and they may be:
    • Read from a user-defined column number of the input file.
    • Defined as auto numeric values.

    Subjects

  • Need to work on multiple files or entire folders?
    That is not a problem! Use the File(s)... button to select one or more input files or the Folder... button to select a directory and look for input files recursively in all the subdirectories.
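
    To make the delimiter and escape character settings more concrete, here is a minimal Java sketch of how a record could be split into fields. It is only an illustration and not the code used in RDF Usine: the class name DelimitedLineSplitter is hypothetical, and it assumes that the escape character simply makes the next character literal, so an escaped delimiter does not end a field.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not taken from the RDF Usine sources. */
final class DelimitedLineSplitter {

    /**
     * Splits one record into fields on the given delimiter.
     * Assumption: a character preceded by the escape character is copied
     * literally, so an escaped delimiter stays inside the current field.
     */
    static List<String> split(String record, char delimiter, char escape) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean escaped = false;
        for (int i = 0; i < record.length(); i++) {
            char c = record.charAt(i);
            if (escaped) {
                current.append(c);          // take the escaped character as-is
                escaped = false;
            } else if (c == escape) {
                escaped = true;             // drop the escape character itself
            } else if (c == delimiter) {
                fields.add(current.toString());
                current.setLength(0);       // start the next field
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());     // last field has no trailing delimiter
        return fields;
    }

    public static void main(String[] args) {
        // Semicolon as delimiter, backslash as escape character.
        System.out.println(split("1;Paris\\;France;2014", ';', '\\')); // prints [1, Paris;France, 2014]
    }
}
```

    Under the same assumption, an escaped line break would also stay inside the field, which is one possible way to represent multi-line fields.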

Preview and Export

  • While working with multiple files, you can use the "Preview" combo box situated on the "General Settings" tab to choose the file that you would like to preview in the "Input File Preview Pane" and the "Output RDF Files Preview Pane":

    File selection

  • The application will allow you to experiment with different parameters and see results interactively.

    In addition to the "Table View", you can also see the RDF previews for the Turtle and N-Triples formats. Both text windows are read-only, but they let you select text fragments and copy them to the clipboard through the context menu (right-click):

    Example of Turtle Output


    Example of N-Triples RDF Output

    Bear in mind that the number of records shown can be changed through the "Preview" menu.
  • Once the results look as expected, they can be exported in two ways. The first one is through the "Export RDF" menu:

    Export RDF Menu

  • Alternatively, if you click on the "Output folder" field, you can define a directory to which results are exported in batch mode:

    Output folder

    Once an output directory has been chosen, you may click on the Run button. The application will then process all the files previously selected with the File(s)... or Folder... buttons and export them in Turtle and N-Triples formats to the output folder.
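
    For readers who have not seen the two output formats side by side, the following Java sketch shows how one input row could be written as N-Triples and as Turtle. It is a simplified illustration under stated assumptions, not the serialiser used in RDF Usine: the class TripleWriterSketch and the http://example.org/ IRIs are made up, every value is written as a plain string literal, and header names are assumed to be safe to append to an IRI.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch, not taken from the RDF Usine sources. */
final class TripleWriterSketch {

    static final String RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type";

    /** Escapes a value for use inside a double-quoted literal. */
    static String escapeLiteral(String value) {
        return value.replace("\\", "\\\\")   // backslash first, so later escapes are not doubled
                    .replace("\"", "\\\"")
                    .replace("\n", "\\n")
                    .replace("\r", "\\r")
                    .replace("\t", "\\t");
    }

    /** N-Triples: one complete triple per line, all IRIs written in full. */
    static String toNTriples(String subjectIri, String classIri, String predicateBase,
                             List<String> headers, List<String> values) {
        StringBuilder out = new StringBuilder();
        out.append('<').append(subjectIri).append("> <").append(RDF_TYPE)
           .append("> <").append(classIri).append("> .\n");
        for (int i = 0; i < headers.size(); i++) {
            out.append('<').append(subjectIri).append("> <")
               .append(predicateBase).append(headers.get(i)).append("> \"")
               .append(escapeLiteral(values.get(i))).append("\" .\n");
        }
        return out.toString();
    }

    /** Turtle: the subject is written once, "a" stands for rdf:type, ";" chains predicates. */
    static String toTurtle(String subjectIri, String classIri, String predicateBase,
                           List<String> headers, List<String> values) {
        StringBuilder out = new StringBuilder();
        out.append('<').append(subjectIri).append("> a <").append(classIri).append('>');
        for (int i = 0; i < headers.size(); i++) {
            out.append(" ;\n    <").append(predicateBase).append(headers.get(i)).append("> \"")
               .append(escapeLiteral(values.get(i))).append('"');
        }
        out.append(" .\n");
        return out.toString();
    }

    public static void main(String[] args) {
        List<String> headers = Arrays.asList("name", "country");
        List<String> row = Arrays.asList("Paris", "France");
        System.out.print(toNTriples("http://example.org/city/1", "http://example.org/City",
                "http://example.org/prop/", headers, row));
        System.out.print(toTurtle("http://example.org/city/1", "http://example.org/City",
                "http://example.org/prop/", headers, row));
    }
}
```

    The Turtle form is more compact because the subject is not repeated on every line, which is consistent with Turtle files being smaller than their N-Triples counterparts.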

Fields


As mentioned before, the "Fields" tab lets you define a set of headers in case the file does not have a header line or if you would like to customise the list:

Fields

  • To add a row, enter its "Name" and "Type". Optionally, you can also specify a prefix for the predicate and/or the associated objects. Once that is done, click on the Add Row button.
  • To update an existing row, first select it from the list with a mouse click. The "Prefix predicate", "Prefix object", "Name" and "Type" fields will be populated with the relevant data. Make your changes and then click on the Update Row button.
  • To delete a row, first select it from the list and then click on the Delete button.
  • To change the position of a field, first select it and then click on Up or Down to move it accordingly.
  • If you have a file with a myriad of fields and no header line, you can generate the list of columns automatically. To do so, just click on the Auto generate button. The application will count the number of fields (according to the delimiter specified in the "General Settings" tab) and name each of them automatically using one or more letters of the alphabet, in the same way as columns are labelled in spreadsheet applications.
  • Not sure which field types to use? No worries! Just click on the Detect types button and the application will suggest the most appropriate choice for each field from the list of 15 supported types.
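
The two automatic helpers described in the last two bullets can be sketched in a few lines of Java. This is only an illustration of the ideas, not the actual RDF Usine code: the class FieldListSketch is hypothetical, upper-case letters are assumed for the generated names, and the five XSD types below are a stand-in chosen for the example (the 15 types supported by the application are not listed in this post). The checks look at the textual format only.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch, not taken from the RDF Usine sources. */
final class FieldListSketch {

    /** Spreadsheet-style name for a zero-based column index: A..Z, AA, AB, ... */
    static String columnName(int index) {
        StringBuilder name = new StringBuilder();
        // Work in 1-based "bijective base 26": there is no zero digit.
        for (int n = index + 1; n > 0; n = (n - 1) / 26) {
            name.insert(0, (char) ('A' + (n - 1) % 26));
        }
        return name.toString();
    }

    /** Suggests the narrowest type that every sample value of a field fits. */
    static String detectType(List<String> values) {
        if (values.isEmpty()) {
            return "xsd:string";
        }
        if (allMatch(values, "[+-]?\\d+")) {
            return "xsd:integer";
        }
        if (allMatch(values, "[+-]?(\\d+(\\.\\d*)?|\\.\\d+)")) {
            return "xsd:decimal";
        }
        if (allMatch(values, "true|false")) {
            return "xsd:boolean";
        }
        if (allMatch(values, "\\d{4}-\\d{2}-\\d{2}")) {
            return "xsd:date";   // format check only, 2014-99-99 would also pass
        }
        return "xsd:string";     // fallback when nothing more specific fits
    }

    private static boolean allMatch(List<String> values, String regex) {
        return values.stream().allMatch(v -> v.matches(regex));
    }

    public static void main(String[] args) {
        System.out.println(columnName(0) + " " + columnName(26));   // prints A AA
        System.out.println(detectType(Arrays.asList("1", "2.5")));  // prints xsd:decimal
    }
}
```

Testing the types from the most specific to the most general is what makes a column of "1" and "2.5" come out as a decimal rather than a string.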

Prefixes


Would you like to improve the readability of your output files? Prefixes are the answer:

Prefixes

  • To add a prefix, just enter the "Name" and "Reference" of the prefix and click on the Add prefix button.
  • To delete a prefix, select it from the list and click on the Delete button.
  • If you need to update a prefix, you can simply delete it and add it again, as the settings are quite simple.
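
To show why prefixes improve readability, here is a small Java sketch of what a prefix table does in Turtle output: it is declared once at the top of the file and then used to shorten full IRIs. This is an illustration under assumptions rather than the RDF Usine implementation: the class PrefixSketch is hypothetical, "Name" and "Reference" are mapped to the key and value of a map, and only a conservative subset of the local names allowed by Turtle is abbreviated.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch, not taken from the RDF Usine sources. */
final class PrefixSketch {

    /** Writes one Turtle "@prefix" line per entry (name -> reference IRI). */
    static String declarations(Map<String, String> prefixes) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, String> e : prefixes.entrySet()) {
            out.append("@prefix ").append(e.getKey())
               .append(": <").append(e.getValue()).append("> .\n");
        }
        return out.toString();
    }

    /** Returns "name:local" when a prefix applies, otherwise the full "<iri>". */
    static String abbreviate(String iri, Map<String, String> prefixes) {
        for (Map.Entry<String, String> e : prefixes.entrySet()) {
            String reference = e.getValue();
            if (iri.startsWith(reference)) {
                String local = iri.substring(reference.length());
                // Conservative check: only simple local names are shortened.
                if (local.matches("[A-Za-z_][A-Za-z0-9_-]*")) {
                    return e.getKey() + ":" + local;
                }
            }
        }
        return "<" + iri + ">";
    }

    public static void main(String[] args) {
        Map<String, String> prefixes = new LinkedHashMap<>();
        prefixes.put("foaf", "http://xmlns.com/foaf/0.1/");
        System.out.print(declarations(prefixes));
        System.out.println(abbreviate("http://xmlns.com/foaf/0.1/name", prefixes)); // prints foaf:name
    }
}
```

Note that N-Triples has no prefix mechanism, so this kind of shortening only applies to the Turtle output.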

RDF Usine in Perspective

RDF Usine is not meant to compete with other open source or commercial tools. It is just the result of a small project I had to undertake during my studies at École Centrale Paris, and it should be considered a proof of concept. That being said, it still offers some interesting features that will hopefully serve as a source of inspiration for enhancing other open source projects.

In the following paragraphs, I will outline a brief comparison with two major open source projects: Any23 and Datalift.

Any23
  • It is an Apache Foundation project.
  • It enjoys the benefits of community contributions (bug fixes, enhancements, etc.).
  • It is a library, a RESTful web service and a command line tool that extracts structured data in RDF format from a variety of Web documents.
  • It can be embedded in another application but it does not provide a graphical user interface. Therefore, it does not offer interactivity or previews.
  • It supports a number of input formats including HTML, Turtle, N-Triples, N-Quads, RDF/XML, CSV (provided a header is available).
  • The following output formats are supported: Turtle, N-Triples, N-Quads, RDF/XML, JSON
Datalift
  • It is an open platform and is the result of an R&D project led by INRIA and sponsored by other academic, industry, institutional and innovation partners such as Montpellier Laboratory of Informatics, Eurecom, Atos, Mondeca, Institut National de l'Information Géographique et Forestière and L'Institut National de la Statistique et des Études Économiques.
  • It enjoys the benefits of community contributions (bug fixes, enhancements, etc.).
  • It is available in French.
  • It runs as a web application and may be used by means of a browser either locally or remotely.
  • It has an embedded RDF triplestore.
  • SPARQL queries may be issued through its web interface.
  • Imported and processed data is organised in Projects.
  • Once a file has been imported, there is not much flexibility when it comes to performing changes. For example, to change a column name or type, the file needs to be removed and imported again.
  • As input, it allows CSV, RDF, relational databases (through JDBC connectors), other SPARQL endpoints, XML, Shapefile and GML.
  • Among other features, it allows exports to RDF and CSV formats, publishing data towards other triple stores, etc.
  • It allows a broad range of chart visualisations on the basis of SPARQL queries.
  • A base URI and a Graph URI may be defined. This feature is not available in RDF Usine.
RDF Usine
  • It is conceived as a standalone application with which the user interacts directly through its graphical user interface.
  • It may be used in three languages: English, French and Spanish.
  • It only supports delimited field files as input, but it does not require headers to be present (unlike Apache Any23, for instance, which requires them for CSV).
  • It focuses on the CSV to RDF translation but it does not offer an embedded triple store or querying facilities. However, once the conversion has taken place, it is possible to export data as triples or SPARQL 1.1 Insert statements and upload them to a triple store such as Apache Fuseki.
  • Different settings may be tested interactively, as the application allows previews both in RDF and tabular formats. For the time being, this is not available in Any23 or Datalift.
  • Configurations can be saved and loaded, which makes them easy to reuse. It is also possible to process multiple files and/or entire folders (and their respective subfolders) with the same set of settings. This is not the case in Datalift: if you have to process many files, you have to tackle them individually even if they have identical formats.
  • Entity types (i.e., the object associated to "a" predicates) may be freely defined. This is not allowed in Datalift.
  • Object identifiers (i.e., subjects) may be autonumeric or read from a particular column. An optional prefix may also be specified. Datalift only allows autonumeric values.
  • It is possible to define Prefix names and URIs manually.
  • Automatic field type detection is a very nice feature that can save a lot of time while defining conversion parameters. It is also meant to help users who do not know the differences between the available types. This is not available in Datalift either.
Other RDF Conversion Tools
  • You may find information about other RDF conversion tools on the W3C website.

Summing-up

RDF Usine is a tool that converts text files into Turtle or N-Triples RDF files in an easy and interactive way. Both the executables and the source code are published and available to the general public.

If you would like to contribute to the project by fixing bugs or developing enhancements, do not hesitate to let me know.

Please note that this software is provided "as is" and any express or implied warranties are disclaimed. In no event shall the contributors be liable for any direct, indirect, incidental, special, exemplary or consequential damages arising in any way out of the use of this software.

In my next post, I will present an interesting real-world use case of heterogeneous data integration where RDF shows its full potential.

Please share this post with friends and colleagues and let me know what you think by using the Comments section below!


Bonus Track: Behind the Scenes

Some details for the technically curious readers:

  • The application is made up of about 3,500 lines of Java code.
  • The class for popups (windows to display errors, warnings or additional information) is a slight adaptation of code publicly shared by Mark Heckler. This accounts for about 300 lines of Java code.
  • The interface was defined as an FXML file of about 350 lines of code.
  • Internationalisation settings are stored as key-value files. There is one file per language and each of them contains almost 100 labels. Most of the labels represent single words or short sentences.
  • For String tokenisation purposes, the Apache Commons Lang library is used. It is governed by the terms of the Apache License version 2.0.
  • The initial version of the application used a third-party library for outputting RDF text but its usage was discontinued when performance problems were detected with big files.
  • Binaries are as slim as 88 KB. Considering the size of the aforementioned library (373 KB), the application total size is 461 KB.
  • Converting a file of almost 691,000 lines and 8 columns (33.6 MB) to Turtle takes about 30 seconds and results in an output RDF file of 216 MB. The same conversion to N-Triples takes about 36 seconds and outputs 256 MB.

About the Author

Maximiliano Ariel López is a Bachelor of Business Information Systems with 9 years of work experience in Information Technology and a background of almost 7 years of university teaching. This gave him exposure to different organisational environments and to a wide range of IT tools, thus testing his abilities to adapt and quickly learn.

At present, he is attending the third semester of the Erasmus Mundus Master's Programme in Information Technologies for Business Intelligence (IT4BI) at École Centrale Paris.

You may contact him through LinkedIn and Google+.
