WASP-Showcase: Repair Programs in the INFOMIX project

WASP (IST project IST-FET-2001-37004)
Official Project-Website

The INFOMIX System

INFOMIX is a novel system that supports powerful information integration using the ASP system DLV. While INFOMIX is based on solid theoretical foundations, it is also a user-friendly system, endowed with graphical user interfaces for both the average database user and the administrator. The main features of the INFOMIX system are:

  1. a comprehensive information model, through which the knowledge about the integration domain can be declaratively specified,

  2. the capability of dealing with data that may be incomplete and/or inconsistent with respect to global constraints,

  3. advanced information integration algorithms, which reduce (in a sound and complete way) query answering to cautious reasoning on disjunctive Datalog programs,

  4. sophisticated optimization techniques guaranteeing the effectiveness of query evaluation in INFOMIX,

  5. a rich data acquisition and transformation framework for accessing heterogeneous data in many formats including relational, XML, and HTML data.

Problem Description

A data integration system offers uniform access to a set of heterogeneous sources, so that the user does not need detailed knowledge of the underlying data. The main components of a data integration system are the Global Schema, the Source Schemas and the Mapping Assertions, as depicted below.



 

When the user issues a query over the global schema, the system:

All of the above tasks have to be carried out while accounting for the possibility that the data are both incomplete and inconsistent. In such cases, the retrieved data may be inconsistent and should be repaired.
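To make the notion of repair concrete, the following is a minimal Python sketch, not the INFOMIX implementation (which relies on DLV). It assumes a single key constraint on the first attribute of a relation, treats every way of keeping exactly one tuple per key value as a repair, and returns only the answers that hold in all repairs, in the spirit of cautious reasoning. The function names, the sample `enrollment` tuples and the two queries are hypothetical and chosen for illustration only.

```python
from collections import defaultdict
from itertools import product


def repairs(tuples, key_len=1):
    """Enumerate the repairs of a relation under a key constraint.

    Assumption: the first `key_len` attributes form the key, and a repair
    keeps exactly one tuple for every key value that occurs in the data.
    """
    groups = defaultdict(list)
    for t in sorted(set(tuples)):
        groups[t[:key_len]].append(t)
    # One choice per group of conflicting tuples yields one repair.
    return [set(choice) for choice in product(*groups.values())]


def cautious_answers(tuples, query, key_len=1):
    """Return the answers to `query` that are true in every repair."""
    answer_sets = [query(r) for r in repairs(tuples, key_len)]
    return set.intersection(*answer_sets)


# Hypothetical data: student s1 violates the key with two enrollments.
enrollment = [
    ("s1", "Engineering", 2003),
    ("s1", "Physics", 2003),
    ("s2", "Law", 2002),
]


def enrolled_students(db):
    """Which students are enrolled at all?"""
    return {(s,) for (s, _faculty, _year) in db}


def student_faculty(db):
    """Which student is enrolled in which faculty?"""
    return {(s, faculty) for (s, faculty, _year) in db}


print(len(repairs(enrollment)))                         # 2
print(sorted(cautious_answers(enrollment, enrolled_students)))
print(sorted(cautious_answers(enrollment, student_faculty)))
```

In this sketch, both students are certain answers to the first query, while only the faculty of `s2` is a certain answer to the second, because the two repairs disagree on the faculty of `s1`. Enumerating repairs explicitly is exponential in the number of conflicts and is used here only to show the semantics.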

Usage of Answer-Set Programming

In INFOMIX, the setting is such that:

Complexity results for query answering in such a setting (in the presence of incomplete and inconsistent data) are as shown in the table.

Therefore, the advanced reasoning capabilities of Answer-Set engines are really required and are exploited according to the following ideas:

Benefits of Using Answer-Set Programming

The proposed approach has some attractive features. An important one is that logic programs serve as executable logical specifications of repairs, and thus provide a language for expressing repair policies in a fully declarative rather than procedural manner. Furthermore, reasoning about specifications, their properties and their behavior is much easier than for procedural repair specifications, since reasoning about programs is one of the principal issues in logic programming and has been studied extensively. Finally, extensions to the logic programming language that allow, e.g., the handling of priorities and weight constraints provide a useful set of constructs for expressing more involved criteria that repairs should satisfy, which may have to be customized to a particular application scenario. Here, logic programming specifications may serve as a useful test-bed for development, since variants of repair can be quickly realized and experimented with.

Demo Scenario

We have tested the INFOMIX prototype system by means of a real-life application scenario, in which data from various legacy databases and web sources must be integrated for a university information system. In particular, we built our information integration system on top of the data sources available at the University of Rome “La Sapienza”. The data sources comprise information on students, professors, curricula and exams in various faculties of the university. Currently, this data is dispersed over several databases in various (autonomous) administration sources and many webpages at different servers. Given this setting, we have devised a global schema of 14 relations,

student(S_ID, FirstName, SecondName, CityOfResidence,
        Address, Telephone, HighSchoolSpecialization)
enrollment(S_ID, FacultyName, Year)
course(C_Code, Description)
. . .

and 29 integrity constraints, comprising KDs, IDs, and EDs. The application scenario includes 3 legacy databases in relational format, comprising about 25 relations in total. The relation sizes range from a few hundred to tens of thousands of tuples (e.g., exam data). Besides these legacy databases, there are numerous web pages, which provide information either explicitly or through simple query interfaces (e.g., members of a department, phone numbers, etc.). We have developed a number of wrappers using LiXto tools, which extract information from these web sources. In total, there are about 35 data sources in the application scenario, which are mapped to the global relations through about 20 UCQs. Each UCQ joins up to three different logical data sources.
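As an illustration of how a UCQ can populate a global relation, here is a small Python sketch. It is an assumption-laden example, not one of the actual mappings of the demo: the source relations `legacy_courses`, `web_course_pages` and `web_page_titles`, their attributes and their contents are all hypothetical. The global course relation is computed as the union of two conjunctive queries, one over a single legacy relation and one joining two wrapped web sources.

```python
def global_course(legacy_courses, web_course_pages, web_page_titles):
    """Union of two conjunctive queries defining the global course relation."""
    # CQ 1: a projection over a single (hypothetical) legacy relation.
    cq1 = {(code, description) for (code, description, _faculty) in legacy_courses}
    # CQ 2: a join of two (hypothetical) wrapped web sources on the page id.
    cq2 = {
        (code, title)
        for (page, code) in web_course_pages
        for (other_page, title) in web_page_titles
        if page == other_page
    }
    # The UCQ is the union of the answers of its conjunctive queries.
    return cq1 | cq2


# Hypothetical source contents.
legacy_courses = [("C01", "Databases", "Engineering")]
web_course_pages = [("p7", "C02"), ("p9", "C03")]
web_page_titles = [("p7", "Logic Programming")]

course = global_course(legacy_courses, web_course_pages, web_page_titles)
print(sorted(course))
```

Course `C03` does not appear in the result because its page has no matching title in the second web source, so the join in the second conjunctive query yields nothing for it.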

View DEMO Specification.

Finally, we have formulated 9 typical queries with peculiar characteristics, which model different use cases. Student data have been encrypted for privacy reasons.

Further Information




Page maintained by Gianluigi Greco and Luigi Granata.