Faceted browsing using Longwell - Towards semantic interoperability of cultural information sys

The main objective of the project at hand has been to map two specialized data models to a data model that conforms to the CIDOC Conceptual Reference Model, so that the data can be shared and processed within multiple contexts. However, this does not guarantee that tools exist which can process the data in a way that is meaningful and contributes to solving a specific scientific problem. One fundamental step has therefore been to visualize the data as early as possible, to get a better impression of how Semantic Web tools could deal with the data to be ingested.

1 SVG is maintained at the W3C (http://www.w3.org/TR/SVG/). The X3D standard is defined at http://www.web3d.org/x3d/specifications/x3d_specification.html.

2 A representation of “Berlin” in Disco as HTML can be retrieved at http://dbpedia.org/page/Berlin, the same resource as RDF/XML at http://dbpedia.org/data/Berlin. The Tabulator browser is available at http://www.w3.org/2005/ajar/tab.

Longwell is a Semantic Web browser that uses the faceted classification paradigm first introduced by the Flamenco Project at the University of California, Berkeley [46, 19]. This paradigm assigns each item several category terms drawn from one or more facets. A facet is a set of categories; for example, archaeological artefacts could be classified under a facet “material” with categories such as “marble” and “bronze.” Unfortunately, Flamenco has its own proprietary data model and mark-up format. Therefore, it cannot ingest RDF metadata that is published on the World Wide Web. Longwell is web-based and runs either standalone or within a Java servlet container like Apache Tomcat [22].
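The faceted paradigm described above can be illustrated with a minimal Python sketch. The records, facet names, and helper functions are hypothetical and only mimic the idea of counting and narrowing by category; they are not taken from Longwell or Flamenco.

```python
from collections import Counter

# Hypothetical records: each item carries one category per facet.
ITEMS = [
    {"id": "statue-1", "material": "marble", "type": "statue"},
    {"id": "statue-2", "material": "bronze", "type": "statue"},
    {"id": "relief-1", "material": "marble", "type": "relief"},
]


def facet_counts(items, facet):
    """Count how many items fall under each category of one facet."""
    return Counter(item[facet] for item in items if facet in item)


def select(items, **constraints):
    """Keep only the items that match every chosen facet category."""
    return [
        item
        for item in items
        if all(item.get(facet) == value for facet, value in constraints.items())
    ]


# Narrow the collection by one facet, then inspect the remaining categories.
marble = select(ITEMS, material="marble")
print(facet_counts(ITEMS, "material"))
print([item["id"] for item in marble])
print(facet_counts(marble, "type"))
```

Each selection narrows the result set, and the counts for the remaining facets are recomputed on that subset, which is what lets a user browse without formulating a query.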

Longwell is highly configurable in how it presents data to the user. The Fresnel display vocabulary (see footnote 3) can be used to change the appearance of items that are displayed within the browser [61]. Currently, most RDF browsers rely on their own methods to address two issues: selecting which information of an RDF graph is displayed and how that data is formatted. Fresnel can also be used to facilitate concept-oriented browsing by explicitly displaying links to related objects. Listing 6.1 shows an abbreviated example of how the Fresnel language was used to tailor the output.

Listing 6.1: Fresnel configuration code in Notation3 (N3).

    @prefix fresnel: <http://www.w3.org/2004/09/fresnel#> .
    @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    @prefix facets: <http://simile.mit.edu/2006/01/ontologies/fresnel-facets#> .
    @prefix crm: <http://cidoc.ics.forth.gr/rdfs/cidoc_v4.2.rdfs#> .

    @prefix : <#> .

    # Default facet set: offer rdf:type as a facet for all types.
    :facets a facets:FacetSet ;
        facets:types facets:allTypes ;
        facets:facets ( rdf:type ) .

    # Facets offered for physical man-made things.
    :cidocFactets rdf:type facets:FacetSet ;
        facets:types ( crm:E24_Physical_Man-Made_Thing ) ;
        facets:facets (
            crm:P67B_is_referred_to_by
            crm:P53F_has_former_or_current_location
            crm:P44F_has_condition
            crm:P46B_forms_part_of
            crm:P103F_was_intended_for
            crm:P45F_consists_of
        ) .

    # Lens: which properties of an object are shown, and in which order.
    :cidocObjectLens rdf:type fresnel:Lens ;
        fresnel:purpose fresnel:defaultLens ;
        fresnel:classLensDomain crm:E24_Physical_Man-Made_Thing ;
        fresnel:showProperties (
            crm:P3F_has_note
            crm:P103F_was_intended_for
            crm:P53F_has_former_or_current_location
            crm:P44F_has_condition
            crm:P45F_consists_of
            crm:P67B_is_referred_to_by
            crm:P46B_forms_part_of
            crm:P138B_has_representation
        ) ;
        fresnel:group :gr .

    # Format: render the values of P138B as images.
    :cidocOjectImageFormat rdf:type fresnel:Format ;
        fresnel:propertyFormatDomain crm:P138B_has_representation ;
        fresnel:value fresnel:image ;
        fresnel:label "Images" ;
        fresnel:group :gr .

    # Group: ties lens and format together and links the stylesheet.
    :gr rdf:type fresnel:Group ;
        fresnel:label "CIDOC CRM standard group" ;
        fresnel:stylesheetLink <http://pentheus.perseus.tufts.edu/crm.css> .

3 The term “Fresnel” refers to the French physicist Augustin-Jean Fresnel who constructed a

Figures 6.1 and 6.2 exemplify how the Fresnel language can be used to change the appearance of data objects within the browser, including the display of images.
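The effect of a lens on what is displayed can be sketched in a few lines of Python. This is only an illustration of the selection-and-ordering idea, not the Fresnel engine used by Longwell; the resource description and the spelling of the property names are assumptions.

```python
# Hypothetical lens: an ordered list of property names, in the spirit of
# fresnel:showProperties. This is not the Fresnel engine used by Longwell.
LENS = ["P3F_has_note", "P45F_consists_of", "P138B_has_representation"]

# Hypothetical resource description: property name -> list of values.
RESOURCE = {
    "P45F_consists_of": ["marble"],
    "P2F_has_type": ["statue"],  # not named in the lens, so it stays hidden
    "P3F_has_note": ["Torso of a youth"],
}


def apply_lens(resource, lens):
    """Return (property, values) pairs in lens order, skipping absent ones."""
    return [(prop, resource[prop]) for prop in lens if prop in resource]


for prop, values in apply_lens(RESOURCE, LENS):
    print(prop, "->", ", ".join(values))
```

The lens decides both which properties appear and their order, independently of how the data happens to be stored.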

Figure 6.1: The Longwell Semantic Web browser, unconfigured.

Longwell is an example of using the underlying data model to control the user interface component of an application. In this example, the CIDOC CRM is used both for internal information representation and for external user interface generation. If the underlying data model is changed or extended, the graphical user interface component will automatically reflect these changes without any additional effort.

Figure 6.2: The Longwell Semantic Web browser, configured with Fresnel.

Khusro and Tjoa found that Longwell is one of the more scalable browsing and visualization tools [38]. According to their experiments, Longwell is able to handle more than 500,000 triples. Currently, about 40 fields and one link to the picture database are mapped to RDF/XML for each of Perseus’ 6,000 database records. This results in about 401,000 RDF triples, an amount of data that was indexed within a couple of hours and could then be browsed with good performance. For Arachne, however, ten fields of the main object table with links to geographic entities and bibliographic information have been mapped, resulting in about 2,402,000 triples for about 60,000 archaeological objects, 6,000 bibliographic entries, and 5,000 records with place information. This amount of data could not be ingested into the Longwell browser: the ingestion process was stopped after 109 hours of computing time on a Mac Pro (3.0 GHz Quad-Core Intel Xeon 5300, 2 GB main memory). Performance experiments with a native in-memory store turned out to be promising.
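To put the two data-sets in relation, the figures quoted above can be turned into a rough back-of-the-envelope calculation. The sketch below uses only the numbers given in the text; the helper function is illustrative.

```python
# Figures quoted in the text above.
perseus_triples, perseus_records = 401_000, 6_000
arachne_triples = 2_402_000
arachne_records = 60_000 + 6_000 + 5_000  # objects + bibliography + places
tested_limit = 500_000  # triples handled in the cited experiments [38]


def triples_per_record(triples, records):
    """Average number of triples generated per database record."""
    return triples / records


print(round(triples_per_record(perseus_triples, perseus_records), 1))
print(round(triples_per_record(arachne_triples, arachne_records), 1))
print(round(arachne_triples / perseus_triples, 1))
print(round(arachne_triples / tested_limit, 1))
```

The Arachne test set is roughly six times larger than the Perseus set and several times larger than the amount of data reported as manageable in [38].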

But there are other alternatives that should be explored as well. A larger integration project for archaeological data would easily exceed 30 million RDF triples. Portwin and Parvatikar state that the Jena API scaled up to 200 million triples during their project [51] (see footnote 4). This number of triples is enough for a small cultural domain but not for the huge amounts of data that exist worldwide.

Unfortunately, Longwell does not support any inferencing on the underlying ontology. Even when the CIDOC CRM definitions were ingested together with Perseus’ and Arachne’s metadata, no links from data objects to their defining classes were discovered and indexed. As a result, Longwell completely ignores the concepts of generalization and inheritance. For example, Longwell does not allow all persistent physical items and non-material products of human activity to be displayed by selecting E71 Man-Made Stuff, the class under which they are subsumed. Instead, the user has to formulate a combined query that includes both classes. This prevents users from exploiting some of the most fundamental advantages of ontologies and thesauri.
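What subclass inferencing would add can be sketched as follows. The hierarchy fragment, the instance data, and the simplification to a single superclass per class are assumptions made for illustration; the class names follow the CIDOC CRM naming used in this chapter.

```python
# Hypothetical fragment of a class hierarchy: subclass -> direct superclass.
# Restricting each class to one superclass is a simplification.
SUBCLASS_OF = {
    "E24_Physical_Man-Made_Thing": "E71_Man-Made_Stuff",
    "E28_Conceptual_Object": "E71_Man-Made_Stuff",
    "E22_Man-Made_Object": "E24_Physical_Man-Made_Thing",
}

# Hypothetical instance data: item -> explicitly asserted class.
TYPES = {
    "vase-1": "E22_Man-Made_Object",
    "text-1": "E28_Conceptual_Object",
    "site-1": "E27_Site",
}


def superclasses(cls):
    """Yield cls itself and every class above it in the hierarchy."""
    while cls is not None:
        yield cls
        cls = SUBCLASS_OF.get(cls)


def instances_of(cls):
    """Return all items whose asserted class is cls or one of its subclasses."""
    return sorted(
        item for item, asserted in TYPES.items() if cls in superclasses(asserted)
    )


print(instances_of("E71_Man-Made_Stuff"))
print(instances_of("E24_Physical_Man-Made_Thing"))
```

With such a closure over the subclass relation, a single selection of the general class would return the instances of all its specializations, so the combined query over both classes would no longer be necessary.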

4 At “http://www.mkbergman.com/?p=227” M. K. Bergman states that 250 million

Chapter 7 Conclusion

After evaluating the functional requirements of digital scholarship, some building blocks of a future Cyberinfrastructure have been introduced. A conceptual distinction has been made between instances and entities. To conduct serious research, scientists need to refer to instances within primary sources as evidence for their arguments. They also need to make unambiguous assertions about, for example, historical places and persons. Thus, there is a need for a system that enables scientists to refer to specific entities. A complex software architecture including authority-naming services and institutional repositories that build upon Semantic Web concepts could provide the functionality needed. Additionally, standards that facilitate networked knowledge organization systems have been examined. To better understand the different conceptual and physical elements of the overall architecture, a basic mapping workflow has been established, ranging from data extraction through cleaning and mapping to visual presentation in the Longwell Semantic Web browser. Most current data models cannot readily deliver data in a form that can be processed for Semantic Web purposes. Common problems include dirty and unstructured data that cannot easily be extracted. Additionally, many Semantic Web concepts are still not well understood and are complicated to implement using state-of-the-art web-server technology.

The Perseus art and archaeology database contains approximately 6,000 data objects. Each object is described by a subset of 102 database fields in total. Some of these fields are administrative and only used internally, so that 94 fields qualify for mapping. Since the database hosts a high diversity of objects ranging from coins to buildings, only 34 fields were found to be relevant for all data objects. Therefore, the mapping experiment started with mapping those fields to the CIDOC CRM. Three fields contained structured bibliographic entities that could be easily extracted by trivial pre-processing. Four fields, however, contained mostly unstructured text with valuable information about places and people that has not been extracted. The mapping workflow certainly needs better automation and options for plugging in data quality and cleaning tools. The compilation of further and better mapping rules as well as pre-processing components will be an iterative and ongoing endeavor. The experience gained with mapping Perseus’ data will help to better map the more than 100,000 data objects of Arachne. As a test case, the most important fields of three central Arachne tables (objekt, literatur, ort) have also been mapped to the CIDOC CRM.
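The core of such a mapping step can be sketched as a table that relates database fields to CRM properties. The field names, the base namespace, and the mapping table below are hypothetical, and emitting plain literals is a simplification: in a full CRM mapping most of these properties would point to entities of their own rather than to strings.

```python
# Hypothetical field names and an assumed field-to-property mapping.
FIELD_TO_PROPERTY = {
    "material": "P45F_consists_of",
    "location": "P53F_has_former_or_current_location",
    "condition": "P44F_has_condition",
}
CRM = "http://cidoc.ics.forth.gr/rdfs/cidoc_v4.2.rdfs#"
BASE = "http://example.org/object/"  # placeholder namespace, not a real service


def record_to_triples(record):
    """Turn one database record into N-Triples lines, skipping empty fields."""
    subject = f"<{BASE}{record['id']}>"
    triples = []
    for field, prop in FIELD_TO_PROPERTY.items():
        value = (record.get(field) or "").strip()
        if value:
            # Escape backslashes and quotes so the literal stays well-formed.
            literal = value.replace("\\", "\\\\").replace('"', '\\"')
            triples.append(f'{subject} <{CRM}{prop}> "{literal}" .')
    return triples


for line in record_to_triples({"id": "1", "material": "bronze", "location": ""}):
    print(line)
```

Keeping the mapping in a table rather than in code makes it easier to extend the rules iteratively, and empty fields are dropped instead of producing meaningless triples.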

Some problems with extracting data from both databases originate from fields with an implicit internal structure. For bibliographic information, the structure could be discovered automatically and the items extracted. Because of poor data quality, some information had to be dropped and could not be mapped to the CIDOC CRM. Applying tools that fix common data quality problems would result in a more comprehensive mapping. A couple of such tools are freely available, and commercial solutions also exist. However, most problems require domain-specific knowledge and could probably be handled better by specialized software. In any case, investing in internal data cleaning and in the re-organization of data models would help with mapping cultural heritage databases to the CIDOC CRM.
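Extracting items from a field with implicit internal structure can be sketched with a regular expression. The citation pattern "Author, Title (Year) pages" and the semicolon separator are assumptions for illustration; the actual fields in Perseus and Arachne vary, which is why entries that do not match are collected separately instead of being silently dropped.

```python
import re

# Assumed citation pattern "Author, Title (Year) pages"; real fields vary.
CITATION = re.compile(
    r"^(?P<author>[^,]+),\s*(?P<title>.+?)\s*"
    r"\((?P<year>\d{4})\)\s*(?P<pages>[\d\-–, ]*)$"
)


def parse_citations(field):
    """Split a semicolon-separated field into parsed and rejected entries."""
    parsed, rejected = [], []
    for part in field.split(";"):
        part = part.strip()
        if not part:
            continue
        match = CITATION.match(part)
        if match:
            parsed.append({k: v.strip() for k, v in match.groupdict().items()})
        else:
            rejected.append(part)
    return parsed, rejected


ok, failed = parse_citations("Author A, Some Title (1950) 12-14; see museum files")
print(ok)
print(failed)
```

The rejected list is the part that needs manual review or better cleaning tools, and its size gives a simple measure of how much information a mapping run loses.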

The introduction of multilingual record-linkage tools could assist with automatically linking data objects that belong together. Perseus and Arachne have slightly overlapping collections. In this context, digital surrogates that refer to the same entity should be linked, which would accumulate multilingual metadata for these objects. Even if cross-language information retrieval tools are introduced, all metadata should also be available internally in one common language, for example, English.

Record linkage depends on entity identification. If two bundles of metadata can be identified as referring to the same entity, the records can be linked as belonging together. The objective has to be not only to identify that a specific string refers to a person or a place, but also to which specific person or place it refers. This is achieved by assigning a common global identifier. The overall aim is to automatically link data objects that conceptually belong together.
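A very simple form of this linkage step can be sketched as grouping records by a normalized name and assigning one shared identifier per group. The records, the identifier scheme, and the normalization rule are hypothetical; matching on normalized strings is a deliberately crude stand-in for real entity identification.

```python
import unicodedata
from collections import defaultdict


def normalize(name):
    """Lower-case, strip accents and punctuation to get a crude comparison key."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    cleaned = "".join(c if c.isalnum() else " " for c in no_accents.lower())
    return " ".join(cleaned.split())


def link_records(records):
    """Group (source, record_id, place) tuples and assign one identifier per group."""
    groups = defaultdict(list)
    for source, record_id, place in records:
        groups[normalize(place)].append((source, record_id))
    return {
        f"urn:example:place:{number}": members
        for number, (_, members) in enumerate(sorted(groups.items()), start=1)
    }


# Hypothetical records from two sources.
RECORDS = [
    ("perseus", "p1", "Athens"),
    ("arachne", "a7", " athens."),
    ("arachne", "a9", "Köln"),
]
for identifier, members in link_records(RECORDS).items():
    print(identifier, members)
```

String normalization alone cannot link names that differ between languages, such as a German and an English place name; that is the gap an authority-naming service would have to close.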

Entity identification will become more powerful if it is done with the assistance of authority-naming services. Hooking text parsers up to these services could extract information about people and places from full-text descriptions. In the course of the project, however, record linkage could only be established at a very basic level, so future research should concentrate on this area. Advanced record-linkage applications seem promising for contributing linked Semantic Web data.

For the time being, Perseus’ data has been published for harvesting at http://athena.perseus.tufts.edu/collection in three different representations (RDF/XML, HTML and Collection service XML). This data-set should be ingested into institutional repository software that is able to handle huge amounts of data, and each data object should be equipped with a persistent identifier. The repository could be the Fedora institutional repository software or a simple triple store with a separate publishing component. Fedora has the advantage of delivering many data management tools and facilitating long-term preservation. For publishing the data to a large audience, Fedora implements the OAI Protocol for Metadata Harvesting. Large repositories that facilitate discovery of RDF data are emerging (see footnote 1).

Through the development of new and flexible ways of knowledge representation, scholars in the historical cultural sciences will be enabled to refer to, access, and manage vast amounts of densely linked data objects as surrogates for existing cultural heritage objects. The most obvious benefit of putting granular data online is rapid and economic access not only to documents but also to granular metadata and large knowledge organization systems. To realize this vision, scientists will need to encode their documentation in a way that can be processed by machines and reused by other scientists. Moreover, the vast amount of material that has been published traditionally, in print or even handwritten, should be digitized in a way that contributes to the linked data idea. Named entity identification systems in addition to full-text parsers could be adapted to this task.

Although the Semantic Web is a rapidly emerging field, current frameworks and browsers leave many issues unaddressed. Each tool is targeted at a certain display paradigm and provides only limited scalability, being suitable for research in the lab but not for a large production environment. Because Longwell separates data and display, it provides a promising paradigm for future research, provided that it can overcome its current scalability issues. Frameworks like the Jena API also show promise concerning scalability. However, the individual communities need to choose a display paradigm that provides useful access and contributes to their research objectives.

All tools that have been described so far could also be applied to publications that exist in digital form either to link archaeological objects and ancient texts to secondary sources or to automatically create data objects in bulk. A fruitful area of research surely will be the development of tools that provide “Intelligent Information Access” for digital libraries. These are “technologies that make use of human knowledge or human-like intelligence to provide effective and efficient access to large, distributed, heterogeneous and multilingual (and at this time mainly [but not only] text-based) information resources and to satisfy users’ information needs [6].”

1 http://pingthesemanticweb.com/ is a service which acts as a concentrator for multiple

Bibliography

[1] A. Babeu, D. Bamman, G. Crane, R. Kummer, and G. Weaver. Named entity identification and cyberinfrastructure. In Proceedings of the 11th European Conference on Research and Advanced Technology for Digital Libraries (ECDL 2007)-to appear, pages 259–270. Springer Verlag, September 2007.

[2] M. Baca and P. Harpring. Categories for the Description of Works of Art. http://www.getty.edu/research/conducting_research/standards/cdwa/, August 2006.

[3] J. Bekaert, X. Liu, H. Van de Sompel, C. Lagoze, S. Payette, and S. Warner. Pathways core: a data model for cross-repository services. In JCDL ’06: Proceedings of the 6th ACM/IEEE-CS joint conference on Digital Libraries, pages 368–368, New York, NY, USA, 2006. ACM Press.

[4] O. Boonstra, L. Breure, and P. Doorn. Past, present and future of historical information science. Historical Social Research / Historische Sozialforschung, 29(2):4–131, 2004.

[5] D. Brickley and R. V. Guha. RDF Vocabulary Description Language 1.0: RDF Schema. http://www.w3.org/TR/rdf-schema/, February 2004.

[6] J. Chen, F. Li, and C. Xuan. A preliminary Analysis of the Use of resources in intelligent information access research. In Proceedings 69th Annual Meeting of the American Society for Information Science and Technology (ASIST), volume 43, 2006.

[7] Art Museum Image Consortium. AMICO Data Specification. http://www.amico.org/AMICOlibrary/dataspec.html, 2004.

[8] G. Crane, D. Bamman, L. Cerrato, A. Jones, D. Mimno, A. Packel, D. Sculley, and G. Weaver. Beyond digital incunabula: Modeling the next generation of digital libraries. In Proceedings of the 10th European Conference on Research and Advanced Technology for Digital Libraries (ECDL 2006), volume 4172 of Lecture Notes in Computer Science. Springer, 2006.

[9] G. Crane, C. E. Wulfman, L. M. Cerrato, A. Mahoney, T. L. Milbank, D. Mimno, J. A. Rydberg-Cox, D. A. Smith, and C. York. Towards a cultural heritage digital library. In Proceedings of the 3rd ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL 2003, pages 75–86, Houston, TX, June 2003.

[10] N. Crofts, M. Dörr, T. Gill, S. Stead, and M. Stiff. Definition of the CIDOC object-oriented conceptual reference model. Technical report, The CIDOC CRM Special Interest Group, 2005.

[11] DAI. Deutsches Archäologisches Institut. http://www.dainst.org, August 2007.

[12] H. Van de Sompel, C. Lagoze, J. Bekaert, X. Liu, S. Payette, and S. Warner. An Interoperable Fabric for Scholarly Value Chains. D-Lib Magazine, 12(10), October 2006.

[13] H. Van de Sompel, M. L. Nelson, C. Lagoze, and S. Warner. Resource Harvesting within the OAI-PMH Framework. D-Lib Magazine, 10(12), 2004.

[14] H. Van de Sompel, S. Payette, J. Erickson, C. Lagoze, and S. Warner. Rethinking Scholarly Communication. Building the System that Scholars Deserve. D-Lib Magazine, 10(9), September 2004.

[15] M. Dörr. The CIDOC conceptual reference module[sic!]: An ontological approach to semantic interoperability of metadata. AI Mag, 24(3):75–92, 2003.

[16] M. Dörr. The CIDOC CRM, a Standard for the Integration of Cultural Information. http://cidoc.ics.forth.gr/docs/crm_for_gothenburg.ppt, November 2005.

[17] M. Dörr and P. LeBoeuf. FRBR object-oriented definition and mapping to FRBR-ER. http://cidoc.ics.forth.gr/docs/frbr_oo/frbr_docs/FRBR_oo_V0.8.1c.pdf, May 2007.

[18] EPOCH. A Survey of Documentation Standards in the Archaeological and Museum Community. http://hdl.handle.net/2313/91, October 2006.

[19] Flamenco. The Flamenco Search Interface Project.

[20] R. Förtsch. ARACHNE - Datenbank und kulturelle Archive des Forschungsarchivs für Antike Plastik Köln und des Deutschen Archäologischen Instituts. http://arachne.uni-koeln.de/inhalt_text.html, August 2007.

[21] R. Förtsch. Forschungsarchiv für Antike Plastik. http://www.klassarchaeologie.uni-koeln.de/abteilungen/mar/forber.htm, August 2007.

[22] The Apache Software Foundation. Apache Tomcat. http://tomcat.apache.org/, May 2007.

[23] The Apache Software Foundation. The Apache HTTP Server Project. http://httpd.apache.org/, August 2007.

[24] H. Galhardas, D. Florescu, D. Shasha, and E. Simon. AJAX: An Extensible Data Cleaning Tool. In SIGMOD ’00: Proceedings of the 2000 ACM SIGMOD international conference on Management of data, page 590, New York, NY, USA, 2000. ACM Press.

[25] P. Galuzzi. The virtual museum of the Future. In Semantic Web for scientific and cultural organisations: results of some early experiments, June 2003.

[26] M. Genereux and F. Niccolucci. Extraction and mapping of CIDOC-CRM encodings from texts and other digital formats. In The 7th International Symposium on Virtual Reality, Archaeology and Cultural Heritage (VAST), Nicosia, Cyprus, 2006.

[27] V. Geroimenko and C. Chen. Visualizing Information Using SVG and X3D. XML Based Technologies for the XML Based Web. Springer, London [et al.], 2nd edition, 2004.

[28] P. Gietz, A. Aschenbrenner, S. Budenbender, F. Jannidis, M. W. Kuster, C. Ludwig, W. Pempe, T. Vitt, W. Wegstein, and A. Zielinski. TextGrid and eHumanities. In E-SCIENCE ’06: Proceedings of the Second IEEE International Conference on e-Science and Grid Computing, pages 133–141, Washington, DC, USA, 2006. IEEE Computer Society.

[29] T. R. Gruber. Towards Principles for the Design of Ontologies Used for Knowledge Sharing. In N. Guarino and R. Poli, editors, Formal Ontology in Conceptual Analysis and Knowledge Representation, Deventer, The

[30] I. Herman, R. Swick, and D. Brickley. Resource Description Framework (RDF) / W3C Semantic Web Activity. http://www.w3.org/RDF/, January 2007.

[31] ICS-FORTH. Partial Definition of the CIDOC Conceptual Reference Model version 4.2 in RDF. http://cidoc.ics.forth.gr/rdfs/cidoc_v4.2.rdfs, June 2005.

[32] IFLA Study Group on the Functional Requirements for Bibliographic Records. Functional Requirements for Bibliographic Records: Final Report, volume 19 of UBCIM Publications-New Series. K.G.Saur, München, 1998.

[33] Open Archives Initiative. Open Archives Initiative Protocol - Object Reuse and Exchange. http://www.openarchives.org/ore/, August 2007.

[34] The Text Encoding Initiative. TEI: Yesterday’s information tomorrow. http://www.tei-c.org/, August 2007.

[35] Getty Institute. The Getty Thesaurus of Geographic Names Online. http://www.getty.edu/research/conducting_research/vocabularies/tgn/index.html, August 2007.

[36] H. Kondylakis, M. Dörr, and D. Plexousakis. Mapping Language for Information Integration. Technical report, ICS-FORTH, December 2006.

[37] R. Kummer. Integrating Data from The Perseus Project and Arachne using the CIDOC CRM: An Examination from a Software Developer’s Perspective. In Exploring the Limits of Global Models for Integration and Use of Historical and Scientific Information-ICS Forth Workshop, Heraklion, Crete, October 2006. ICS-Forth.

[38] S. Khusro and A. Tjoa. Fulfilling the Needs of a Metadata Creator and Analyst – An Investigation of RDF Browsing and Visualization Tools. Canadian Semantic Web, pages 81–101, 2006.

[39] C. Lagoze and H. Van de Sompel. The Open Archives Initiative: Building a Low-Barrier Interoperability Framework. In ACM/IEEE Joint Conference on Digital Libraries, pages 54–62, 2001.

[40] C. Lagoze, S. Payette, E. Shin, and C. Wilper. Fedora: An Architecture for Complex Objects and their Relationships. http://arxiv.org/abs/cs.DL/0501012, August 2005.

[41] J. Maeda. The Laws of Simplicity (Simplicity: Design, Technology, Business, Life). The MIT Press, August 2006.

[42] D. L. McGuinness and F. van Harmelen. OWL Web Ontology Language Overview. http://www.w3.org/TR/owl-features/, February 2004.

[43] B. Metcalfe. Metcalfe’s Law: A Network Becomes More Valuable as it
