
A Developer’s Guide to the Semantic Web
Ebook, 927 pages, 11 hours


About this ebook

The Semantic Web represents a vision for how to make the huge amount of information on the Web automatically processable by machines on a large scale. For this purpose, a whole suite of standards, technologies and related tools have been specified and developed over the last couple of years, and they have now become the foundation for numerous new applications.

A Developer’s Guide to the Semantic Web helps the reader to learn the core standards, key components, and underlying concepts. It provides in-depth coverage of both the what-is and how-to aspects of the Semantic Web. From Yu’s presentation, the reader will not only obtain a solid understanding of the Semantic Web, but also learn how to combine all the pieces to build new applications on the Semantic Web.

Software developers in industry and students specializing in Web development or Semantic Web technologies will find in this book the most complete guide to this exciting field available today. Based on the step-by-step presentation of real-world projects, where the technologies and standards are applied, they will acquire the knowledge needed to design and implement state-of-the-art applications.

Language: English
Publisher: Springer
Release date: Jan 3, 2011
ISBN: 9783642159701


Reviews for A Developer’s Guide to the Semantic Web

Rating: 4.75 out of 5 stars (2 ratings, 1 review)


Rating: 5 out of 5 stars

This is a very solid introduction to the world of the semantic web for developers who need a practical guide. Even if you have not encountered the concepts related to the semantic web before, you will not have any trouble, because the author gently introduces every concept and provides you with practical examples and justifications. After having read the first half of the book you'll have a very good grasp of RDF, RDFS, N3, OWL, SPARQL, FOAF, Linked Data and other underlying concepts and technologies. But the story does not end there: in the second part of the book you'll proceed to develop concrete Semantic Web applications using established semantic frameworks such as Jena. You will also have learned how to utilize the most important semantic linked data resources such as DBpedia. The author shows a lot of examples with accompanying source code and does not refrain from explaining the intricate parts of the code, as well as pointing out potential improvements and pitfalls in real-life scenarios.

I would be willing to give the book five stars if only a little more care were taken in its editing. I'd expect better proofreading from Springer editors. And even though lots of references are provided within the text, it would be good to have a references or bibliography section at the end of the book. I was also surprised that the author never mentioned AllegroGraph, an important player in the field of graph databases / RDF stores.

My conclusion is that if you are going to read only one book on the semantic web, then it should be this book.

Book preview

A Developer’s Guide to the Semantic Web - Liyang Yu

Liyang Yu, A Developer’s Guide to the Semantic Web, DOI 10.1007/978-3-642-15970-1_1, © Springer-Verlag Berlin Heidelberg 2011

1. A Web of Data: Toward the Idea of the Semantic Web

Liyang Yu¹  

(1)

Delta Air Lines, Inc., Delta Blvd. 1030, Atlanta, GA 30354, USA

Liyang Yu

Email: liyang910@yahoo.com

1.1 A Motivating Example: Data Integration on the Web

1.1.1 A Smart Data Integration Agent

1.1.2 Is Smart Data Integration Agent Possible?

1.1.3 The Idea of the Semantic Web

1.2 A More General Goal: A Web Understandable to Machines

1.2.1 How Do We Use the Web?

1.2.2 What Stops Us from Doing More?

1.2.3 Again, the Idea of the Semantic Web

1.3 The Semantic Web: A First Look

1.3.1 The Concept of the Semantic Web

1.3.2 The Semantic Web, Linked Data, and the Web of Data

1.3.3 Some Basic Things About the Semantic Web

Reference

Abstract

If you are reading this book, chances are you are a software engineer who makes a living by developing applications on the Web – or, more precisely, on the Web that we currently have.


And yes, there is another kind of Web. It is built on top of the current Web and called the Semantic Web. As a Web application developer, your career on the Semantic Web will be more exciting and fulfilling.

This book will prepare you well for your development work on the Semantic Web. This chapter will tell you exactly what the Semantic Web is, and why it is important for you to learn everything about it.

We will get started by presenting a simple example to illustrate the difference between the current Web and the Semantic Web. You can consider this example as a development assignment for you. Once you start to ponder issues such as what exactly we would need to change on the current Web to finish this assignment, you are well on your way to seeing the basic picture of the Semantic Web.

With a basic understanding of the Semantic Web, we will continue to discuss how much more it can revolutionize the way we use the Web, and how it can further change the way we conduct our development work on the Web. We will then formally introduce the concept of the Semantic Web; hopefully, by then, this concept will seem much more intuitive to you.

This chapter will build a solid foundation for you to understand the rest of this book. When you have finished the whole book, come back and read this chapter again. You should be able to acquire a deeper appreciation of the idea of the Semantic Web. By then, I hope you have also formed your own opinion about the vision of the Semantic Web, and with what you have learned from this book, you are ready to start your own exciting journey of exploring more and creating more on the Semantic Web.

1.1 A Motivating Example: Data Integration on the Web

Data integration on the Web refers to the process of combining and aggregating information resources on the Web so they could be collectively useful to us. In this section, we will use a concrete example to see how data integration can be implemented on the Web, and why it is so interesting to us.

1.1.1 A Smart Data Integration Agent

Here is our goal: for a given resource (could be a person, an idea, an event, or a product, such as a digital camera), we would like to know everything that has been said about it. More specifically, we would like to accomplish this goal by collecting as much information as possible about this resource; we will then understand it by making queries against the collected information.

To make this more tangible, let us use myself as the resource. In addition, assume we have already built a smart agent, which will walk around the Web and try to find everything about me on our behalf.

To get our smart agent started, we feed it the URL of my personal home page as the starting point of its journey on the Web: http://www.liyangyu.com

Now, our agent downloads this page and tries to collect information from this page. This is also the point where everything gets more interesting:

If my Web page were a traditional Web document, our agent would not be able to collect anything that is of much use at all.

More specifically, the only things our agent is able to understand on this page would be the HTML language constructs, i.e., the markup tags themselves. Besides telling a Web browser how to present my Web page, these HTML constructs do not convey any useful information about the underlying resource. Therefore, other than these HTML tokens, to our agent, my whole Web page would simply represent a string of characters that look no different from any other Web document.

However, let us assume my Web page is not a traditional Web document: besides the HTML constructs, it actually contains some statements that can be collected by our agent.

More specifically, all these statements follow the same simple structure, and each one of them represents one aspect of the given resource. For example, List 1.1 shows some example statements that have been collected:

List 1.1 Some of the statements collected by our smart agent from my personal Web page

ns0:LiyangYu ns0:name Liyang Yu.
ns0:LiyangYu ns0:nickname LaoYu.
ns0:LiyangYu ns0:author <ns0:_x>.
ns0:_x ns0:ISBN 978-1584889335.
ns0:_x ns0:publisher <http://www.crcpress.com>.

    At this point, let us not worry about the issues such as how these statements are added to my Web page, and how our agent collects them. Let us simply assume when our agent visits my Web page, it can easily discover these statements.

    Note that ns0 represents a namespace, so that we know everything, with ns0 as its prefix, is collected from the same Web page. ns0:LiyangYu represents a resource that is described by my Web page; in this case, this resource is me.

    With this said, the first statement in List 1.1 can be read like this:

    resource ns0:LiyangYu has a ns0:name whose value is Liyang Yu.

    Or, like this:

    resource ns0:LiyangYu has a ns0:name property, whose value is Liyang Yu.

    The second way to read a given statement is perhaps more intuitive. With this in mind, each statement actually adds one property–value pair to the resource that is being described. For example, the second statement claims the ns0:nickname property of resource ns0:LiyangYu has a value given by LaoYu.

The third statement in List 1.1 is a little bit unusual. When specifying the value of the ns0:author property for resource ns0:LiyangYu, instead of using a simple character string as its value (as the first two statements in List 1.1 do), it uses another resource, and this resource is identified by ns0:_x. To make this fact more obvious, ns0:_x is enclosed in <>.

Statement 4 in List 1.1 specifies the value of the ns0:ISBN property of resource ns0:_x, and the last statement in List 1.1 specifies the value of the ns0:publisher property of the same resource. Note again that the value of this property is not a character string, but another resource identified by http://www.crcpress.com.

The statements in List 1.1 probably still make sense to our human eyes, no matter how ugly they look. The really interesting question is, how much does our agent understand these statements?

    Not much at all. However, without too much understanding about these statements, our agent can indeed organize them into a graph format, as shown in Fig. 1.1 (note that more statements have been added to the graph).


    Fig. 1.1

A graph generated by our agent after visiting my personal Web page
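Although the details come in later chapters, a minimal sketch may help make this concrete. The Java fragment below uses Apache Jena (one of the frameworks used for real development work later in this book) to store the statements of List 1.1 as an in-memory graph. The full namespace URI standing behind the ns0 prefix is a made-up placeholder, since the listing only shows the prefix.

import org.apache.jena.rdf.model.*;

public class AgentSketch {
    public static void main(String[] args) {
        // Hypothetical namespace URI standing in for the ns0 prefix in List 1.1
        String NS0 = "http://www.liyangyu.com/ns0#";

        Model graph = ModelFactory.createDefaultModel();
        Resource liyang = graph.createResource(NS0 + "LiyangYu");
        Resource book   = graph.createResource(NS0 + "_x");

        // Each addProperty() call records one statement (one property-value pair)
        liyang.addProperty(graph.createProperty(NS0, "name"), "Liyang Yu");
        liyang.addProperty(graph.createProperty(NS0, "nickname"), "LaoYu");
        liyang.addProperty(graph.createProperty(NS0, "author"), book);
        book.addProperty(graph.createProperty(NS0, "ISBN"), "978-1584889335");
        book.addProperty(graph.createProperty(NS0, "publisher"),
                graph.createResource("http://www.crcpress.com"));

        // Print every statement the agent has collected so far
        StmtIterator it = graph.listStatements();
        while (it.hasNext()) {
            System.out.println(it.nextStatement());
        }
    }
}

Running this simply prints the five statements back; the point is that the graph in Fig. 1.1 is nothing more than such a collection of statements held together by the resources they share.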

    After the graph shown in Fig. 1.1 is created, our agent declares its success on my personal Web site and moves on to the next one. It will repeat the same process again when it hits the next Web document.

    Let us say the next Web site our agent hits is http://www.amazon.com. Similarly, if Amazon were still the Amazon today, our agent could not do much either. In fact, it can retrieve information about this ISBN number, 978-1584889335, by using Amazon Web Services .¹ For now, let us say our agent does not know how to do that.

    However, again assume Amazon is a new Amazon already. Our agent can therefore collect lots of statements, which follow the same format as shown in List 1.1. Furthermore, among the statements that have been collected, some of them are shown in List 1.2.

List 1.2 Statements collected by our agent from Amazon.com

ns1:book-1584889330 ns1:ISBN 978-1584889335.
ns1:book-1584889330 ns1:price USD62.36.
ns1:book-1584889330 ns1:customerReview 4.5 star.

    Note that, similar to namespace prefix ns0, ns1 represents another namespace prefix. And now, our agent can again organize these statements into a graph form as shown in Fig. 1.2 (note that more statements have been added to the graph).


    Fig. 1.2

    The graph generated by our agent after visiting Amazon.com

For human eyes, one important fact is already quite obvious: ns0:_x, as a resource in Fig. 1.1 (the empty circle), represents exactly the same item denoted by the resource named ns1:book-1584889330 in Fig. 1.2. And once we make this connection, we start to see other facts easily. For example, a person who has a home page with its URL given by http://www.liyangyu.com has published a book, and the current price of that book on Amazon is US $62.36. Obviously, this fact is not explicitly stated on either one of the Web sites, but our human minds have integrated the information from both http://www.liyangyu.com and http://www.amazon.com to reach this conclusion.

For our agent, similar data integration is not difficult either. In fact, when our agent sees the ISBN number, 978-1584889335, showing up in both Figs. 1.1 and 1.2, it will make a connection between these two appearances, as shown in Fig. 1.3. It will then automatically add the following new statement to its original statement collection:

ns0:_x sameAs ns1:book-1584889330.


    Fig. 1.3

    Our agent can combine Figs. 1.1 and 1.2 automatically

    And once this is done, for our agent, Figs. 1.1 and 1.2 are already glued together by overlapping the ns0:_x node with the ns1:book-1584889330 node. This gluing process is exactly the data integration process on the Web.
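As a rough sketch of this gluing step (again with Jena, and again with made-up namespace URIs for ns0 and ns1), the agent can look for one resource in each graph carrying the same ISBN value and record that the two denote the same thing. The standard owl:sameAs property is used here as the concrete counterpart of the sameAs statement above; it is introduced properly later in the book.

import org.apache.jena.rdf.model.*;
import org.apache.jena.vocabulary.OWL;

// Sketch of the gluing step: 'personalPage' and 'amazon' are the two graphs
// built from List 1.1 and List 1.2; the namespace URIs are assumptions.
public class GlueByIsbn {
    public static Model glue(Model personalPage, Model amazon) {
        String NS0 = "http://www.liyangyu.com/ns0#";
        String NS1 = "http://www.amazon.com/ns1#";
        Property isbn0 = personalPage.createProperty(NS0, "ISBN");
        Property isbn1 = amazon.createProperty(NS1, "ISBN");

        Model merged = ModelFactory.createDefaultModel();
        merged.add(personalPage).add(amazon);

        // For every pair of resources that share an ISBN value,
        // add a statement saying they denote the same resource.
        StmtIterator it = personalPage.listStatements(null, isbn0, (RDFNode) null);
        while (it.hasNext()) {
            Statement s = it.nextStatement();
            StmtIterator it2 = amazon.listStatements(null, isbn1, s.getObject());
            while (it2.hasNext()) {
                merged.add(s.getSubject(), OWL.sameAs, it2.nextStatement().getSubject());
            }
        }
        return merged;
    }
}

The merged model now contains the statement ns0:_x owl:sameAs ns1:book-1584889330, which is exactly the extra statement our agent added above.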

    Now, without going into the details, it is not difficult to convince ourselves that our agent can answer lots of questions that we might have. For example, what is the price of the book written by a person whose home page is given by this URL, http://www.liyangyu.com?

    This is indeed very encouraging. And this is not all.

    Let us say now our agent hits www.linkedIn.com. Similarly, if LinkedIn were still the LinkedIn today, our agent could not do much. However, again assume LinkedIn is a new LinkedIn and our agent is able to collect quite a few statements from this Web site. Some of them are shown in List 1.3.

List 1.3 Statements collected by our agent from LinkedIn

ns2:LiyangYu ns2:email liyang910@yahoo.com.
ns2:LiyangYu ns2:workPlaceHomepage http://www.delta.com.
ns2:LiyangYu ns2:connectedTo .

    The graph created by the agent is shown in Fig. 1.4 (note that more statements have been added to the graph).


    Fig. 1.4

    A graph generated by our agent after visiting linkedIn.com

For human readers, we know ns0:LiyangYu and ns2:LiyangYu represent exactly the same resource, because both resources have the same e-mail address. For our agent, however, simply comparing the two identifiers (ns0:LiyangYu vs. ns2:LiyangYu) does not establish that these two resources are the same.

    However, if we can teach our agent the following fact:

    If the e-mail property of resource A has the same value as the e-mail property of resource B, then resources A and B are the same resource.

Then our agent will be able to automatically add the following new statement to its current statement collection:

ns0:LiyangYu sameAs ns2:LiyangYu.

    With the creation of this new statement, our agent has in fact integrated Figs. 1.1 and 1.4 by overlapping node ns0:LiyangYu with node ns2:LiyangYu. Clearly, this integration process is exactly the same as the one where Figs. 1.1 and 1.2 are connected together by overlapping node ns0:_x with node ns1:book-1584889330. And now, Figs. 1.1, 1.2, and 1.4 are all connected.
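The "teach our agent the following fact" step can also be written down quite literally as a rule. The sketch below uses Jena's generic rule reasoner for this; the email property URI is a hypothetical stand-in, and it assumes both graphs use that same property for e-mail addresses (rule-based inference itself is covered much later in this book).

import org.apache.jena.rdf.model.*;
import org.apache.jena.reasoner.Reasoner;
import org.apache.jena.reasoner.rulesys.GenericRuleReasoner;
import org.apache.jena.reasoner.rulesys.Rule;
import java.util.List;

public class EmailRuleSketch {
    public static InfModel withEmailRule(Model collected) {
        // "If two resources have the same e-mail value, they are the same resource."
        // The email property URI is a hypothetical stand-in for ns2:email.
        String rule =
            "[sameByEmail: " +
            "  (?a <http://www.linkedin.com/ns2#email> ?e) " +
            "  (?b <http://www.linkedin.com/ns2#email> ?e) " +
            "  notEqual(?a, ?b) " +
            "  -> (?a <http://www.w3.org/2002/07/owl#sameAs> ?b) ]";

        List<Rule> rules = Rule.parseRules(rule);
        Reasoner reasoner = new GenericRuleReasoner(rules);

        // The inference model contains the collected statements plus every
        // sameAs statement the rule can derive from matching e-mail values.
        return ModelFactory.createInfModel(reasoner, collected);
    }
}

Asking the resulting inference model for its statements would now also return ns0:LiyangYu owl:sameAs ns2:LiyangYu, derived by the rule rather than hard-coded.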

    At this point, our agent will be able to answer even more questions. The following are just some of them:

    What is Laoyu’s workplace home page?

    How much does it cost to buy Laoyu’s book?

    Which city does Liyang live in?

And clearly, to answer these questions, the agent has to depend on the integrated graph, not any single one. For example, to answer the first question, the agent has to go through the following link that runs across different graphs:

ns0:LiyangYu ns0:nickname LaoYu.
ns0:LiyangYu sameAs ns2:LiyangYu.
ns2:LiyangYu ns2:workPlaceHomepage http://www.delta.com.

    Once the agent reaches the last statement, it can present an answer to us. You should be able to understand how the other questions are answered by mapping out the similar path as shown above.
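For completeness, here is a sketch of how that first question could be asked against the integrated graph using SPARQL (the query language covered later in this book) through Jena. The prefix URIs are the same made-up placeholders as before, and the query simply walks the three-statement path listed above.

import org.apache.jena.query.*;
import org.apache.jena.rdf.model.Model;

public class WorkplaceQuery {
    public static void printWorkplace(Model integrated) {
        String q =
            "PREFIX ns0: <http://www.liyangyu.com/ns0#> " +
            "PREFIX ns2: <http://www.linkedin.com/ns2#> " +
            "PREFIX owl: <http://www.w3.org/2002/07/owl#> " +
            "SELECT ?homepage WHERE { " +
            "  ?person ns0:nickname \"LaoYu\" . " +        // from the personal page
            "  ?person owl:sameAs ?same . " +              // the glue statement
            "  ?same ns2:workPlaceHomepage ?homepage . " + // from LinkedIn
            "}";

        Query query = QueryFactory.create(q);
        try (QueryExecution exec = QueryExecutionFactory.create(query, integrated)) {
            ResultSet results = exec.execSelect();
            while (results.hasNext()) {
                System.out.println(results.nextSolution().get("homepage"));
            }
        }
    }
}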

    Obviously, the set of questions that our agent is able to answer grows when it hits more Web documents. We can continue to move on to another Web site so as to add more statements to our agent’s collection. However, the idea is clear: automatic data integration on the Web can be quite powerful and can help us a lot when it comes to information discovery and retrieval.

    1.1.2 Is Smart Data Integration Agent Possible?

The question is also clear: is it even possible to build a smart data integration agent such as the one we have just discussed? Humans do this kind of information integration on the Web on a daily basis, and now, we are in fact hoping to program a machine to do this for us.

    To answer this question, we have to understand what exactly has to be there to make our agent possible.

    Let us go back to the basics. Our agent, after all, is a piece of software that works on the Web. So to make it possible, we have to work on two parties: the Web and the agent.

    Let us start with the Web. Recall that we have assumed our agent is able to collect some statements from various Web sites (see Lists 1.1, 1.2, and 1.3). Therefore, each Web site has to be different from its traditional form, so changes have to be made. Without going into the details, here are some changes we need to have on each Web site:

    Each statement collected by our agent represents a piece of knowledge. Therefore, there has to be a way (a model) to represent knowledge on the Web. Furthermore, this model of representing knowledge has to be easily and readily processed (understood) by machines.

    This model has to be accepted as a standard by all the Web sites. Otherwise, statements contained in different Web sites will not share a common pattern.

    There has to be a way to create these statements on each Web site. For example, they can be either manually added or automatically generated.

    The statements contained in different Web sites cannot be completely arbitrary. For example, they should be created by using some common terms and relationships, at least for a given domain. For instance, to describe a person, we have some common terms such as name, birthday, and home page.

    There has to be a way to define these common terms and relationships, which specifies some kind of agreement on these common terms and relationships. Different Web sites, when creating their statements, will use these terms and relationships.

    Perhaps there are more to be included.

    With these changes on the Web, a new breed of Web will be available for our agent. And in order to take advantage of this new Web, our agent has to be changed as well. For example,

    Our agent has to be able to understand each statement that it collects. One way to accomplish this is by understanding the common terms and relationships that are used to create these statements.

    Our agent has to be able to conduct reasoning based on its understanding of the common terms and relationships. For example, knowing the fact that resources A and B have the same e-mail address and considering the knowledge expressed by the common terms and relationships, it should be able to conclude that A and B are in fact the same resource.

    Our agent should be able to process some common queries that are submitted against the statements it has collected. After all, without providing a query interface, the collected statements will not be of too much use to us.

    Perhaps there are more to be included as well.

    Therefore, here is our conclusion: yes, our agent is possible, provided that we can implement all the above (and possibly more).

    1.1.3 The Idea of the Semantic Web

At this point, the Semantic Web can be understood as follows: the Semantic Web provides the technologies and standards that we need to make our agent possible, including all the things we have listed in the previous section. It can be understood as a brand new layer built on top of the current Web, and it adds machine-understandable meanings (or semantics) to the current Web. Thus the name, the Semantic Web.

    The Semantic Web is certainly more than automatic data integration on a large scale. In the next section, we will position it in a more general setting. We will then summarize the concept of the Semantic Web, which will hopefully look more natural to you.

    1.2 A More General Goal: A Web Understandable to Machines

    1.2.1 How Do We Use the Web?

    In its early days, the Web could be viewed as a set of Web sites which offered a collection of Web documents, and the goal was to get the content pushed out to its audiences. It acted like a one-way traffic: people read whatever was out there, with the goal of getting information they could use in a variety of ways.

    Today, the Web has become much more interactive. First off, more and more of what is now known as user-generated content has emerged on the Web, and a host of new Internet companies were created around this trend as well. More specifically, instead of only reading the content, people are now using the Web to create content and also interact with each other by using social networking sites over Web platforms. And certainly, even if we don’t create any new content or participate in any social networking sites, we can still enjoy the Web a lot: we can chat with our friends, we can shop online, we can pay our bills online, and we can also watch a tennis game online, just to name a few.

Second, the life of a Web developer has changed a lot too. Instead of offering merely static Web content, today’s Web developer is capable of building Web sites that can execute complex business transactions, from paying bills online to booking a hotel room and airline tickets.

    Third, more and more Web sites have started to publish structured content so that different business entities can share their content to attract and accomplish more transactions online. For example, Amazon and eBay both are publishing structured data via Web service standards from their databases, so other applications can be built on top of these contents.

    With all these being said, let us summarize what we do on today’s Web at a higher level, and this will eventually reveal some critical issues we are having on the Web. Note that our summary will be mainly related to how we consume the available information on the Web, and we are not going to include activities such as chatting with friends or watching a tennis game online – these are nice things you can do on the Web, but not much related to our purpose here.

Now, to put it simply, searching, information integration, and Web data mining are the three main activities we conduct using the Web.

    1.2.1.1 Searching

    This is probably the most common usage of the Web. The goal is to locate some specific information or resources on the Web. For instance, finding different recipes for making margarita and locating a local agent who might be able to help buying a house are all good examples of searching.

    Quite often though, searching on the Web can be very frustrating. At the time of this writing, for instance, using a common search engine (Google, for example), let us search the word SOAP with the idea that SOAP is a W3C standard for Web services in our mind. Unfortunately, we will get about 63,200,000 listings back and will soon find this result hardly helpful: there are listings for dish detergents, facial soaps, and even soap operas! Only after sifting through multiple listings and reading through the linked pages are we able to find information about the W3C’s SOAP specifications.

    The reason for this situation is that search engines implement their search based on the core concept of which documents contain the given keyword – as long as a given document contains the keyword, it will be included in the candidate set and will be later presented back to the user as the search result. It is then up to the user to read and interpret the result to extrapolate any useful information.

    1.2.1.2 Information Integration

    We have seen an example of information integration in Sect. 1.1.1, and we have also created an imaginary agent to help us accomplish our goal. In this section, we will take a closer look at this common task on the Web.

    Let us say you decide to try some Indian food for your weekend dining out. You first search the Web to find a list of restaurants specialized in Indian cuisine, you then pick one restaurant, and you write down the address. Now you open up a new browser and go to your favorite map utility to get the driving direction from your house to the restaurant. This process is a simple integration process: you first get some information (the address of the restaurant) and you use it to get more information (the direction), and they collectively help you to enjoy a nice dining out.

Another similar but more complex example is creating a holiday plan, something pretty much every one of us has done. Today, it is safe to say that we all have to do this manually: search for a place that fits not only our interests but also our budget, then the hotel, then flights, and finally cars. The information from these different steps will then be combined to create a perfect plan for our vacation, hopefully.

    Clearly, to conduct this kind of information integration manually is a somewhat tedious process. It will be more convenient if we can finish the process with more help from the machine. For example, we can specify what we need to some application, which will then help us out by conducting the necessary steps for us. For the vacation example, the application can even help us more: after creating our itinerary (including the flights, hotels, and cars), it can search the Web to add related information such as daily weather, local maps, driving directions, city guides.

Another good example of information integration is the application that makes use of Web services. For a developer who mainly works on Web applications, Web services should not be a new concept. For example, company A can provide a set of Web services via its Web site, and company B can write Java code (or code in whatever language of their choice) to consume these services so as to search company A’s product database on the fly. For instance, when provided with several keywords that should appear in a book title, the service returns a list of books whose titles contain the given keywords.

    Clearly, company A is providing structured data for another application to consume. It does not matter which language company A has used to build its Web services and what platform these services are running on; it does not matter either which language company B is using and what platform company B is on; as long as company B follows the related standards, this integration can happen smoothly and nicely.

    In fact, our perfect vacation example can also be accomplished by using a set of Web services. More specifically, finding a vacation place, booking a hotel room, buying air tickets, and finally making a reservation at a car rental company can all be accomplished by consuming the right Web services.

On today’s Web, this type of integration often involves different Web services. Just like the case where you want to have dinner at an Indian restaurant, you have to manually locate and integrate these services together (and normally, this integration is implemented at the development phase). It would be much quicker, cleaner, more powerful, and more maintainable if we could have some application that helps us find the appropriate services on the fly and also invoke these services dynamically to accomplish our goal.

    1.2.1.3 Web Data Mining

Intuitively speaking, data mining is the non-trivial extraction of useful information from large (and normally distributed) datasets or databases. Given the fact that the Web can be viewed as a huge distributed database, the concept of Web data mining refers to the activity of getting useful information from the Web.

Web data mining might not sound as interesting as searching to a casual user, but it could be very important and could even be the daily work of those who work as analysts or developers for different companies and research institutes.

One example of Web data mining is as follows. Let us say that we are currently working as consultants for the air traffic control group at Hartsfield-Jackson Atlanta International Airport, reportedly the busiest airport in the nation. The management group in the control tower wants to understand how weather conditions may affect the take-off rate on the runways, with the take-off rate defined as the number of aircraft that take off in a given hour. Intuitively, a severe weather condition will force the control tower to shut down the airport so the take-off rate will go down to zero, and a moderate weather condition will just make the take-off rate low.

For a task like this, we suggest that we gather as much historical data as possible and analyze them to find the pattern of the weather effect. And we are told that historical data (the take-off rates at different major airports for the past, say, 5 years) do exist, but they are published on different Web sites. In addition, the data we need on these Web sites are normally mingled together with other data that we do not need.

To handle this situation, we will develop a specific application that acts like a crawler: it will visit these Web sites one by one, and once it reaches a Web site, it will identify the data we need and collect only the needed information (historical take-off rates) for us. After it collects these rates, it will store them in the data format that we want. Once it finishes with one Web site, it will move on to the next one until it has visited all the Web sites that we are interested in.
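To see why this kind of crawler is so brittle, here is a deliberately naive sketch of what such a site-specific collector might look like in Java. The URL and the text pattern are both hypothetical placeholders: each real site would need its own hand-written extraction rule, which is exactly the problem.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TakeOffRateCrawler {
    // Hypothetical pattern: works only for pages that happen to print
    // the rate as, e.g., "take-off rate: 87"; every other site needs its own rule.
    private static final Pattern RATE = Pattern.compile("take-off rate:\\s*(\\d+)");

    public static void main(String[] args) throws Exception {
        List<String> sites = List.of("http://www.example.org/airport-stats.html"); // placeholder URL
        HttpClient client = HttpClient.newHttpClient();
        for (String site : sites) {
            HttpRequest request = HttpRequest.newBuilder(URI.create(site)).build();
            String html = client.send(request, HttpResponse.BodyHandlers.ofString()).body();
            Matcher m = RATE.matcher(html);
            while (m.find()) {
                System.out.println(site + " -> " + m.group(1)); // store in whatever format we want
            }
        }
    }
}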

    Clearly, this application is a highly specialized piece of software that is normally developed on a case-by-case basis. Inspired by this example, you might want to code up your own agent, which will visit all the related Web sites to collect some specific stock information and report back to you, say, every 10 min. By doing so, you don’t have to open up a browser every 10 min to check the stock prices, risking the possibility that your boss will catch you visiting these Web sites, yet you can still follow the latest changes happening in the stock market.

    And this stock watcher application you have developed is yet another example of Web data mining. Again, it is a very specialized piece of software and you might have to re-code it if something important has changed on the Web sites that it routinely visits. And yes, it would be much nicer if the application could understand the meaning of the Web pages on the fly so you do not have to change your code so often.

    By now, we have talked about the three major activities that you can do and normally do with the Web. You might be a casual visitor to the Web, or you might be a highly trained professional developer, but whatever you do with the Web will more or less fall into one of these three categories.

    The next question then is, what are the common difficulties that we have experienced in these activities? Does any solution exist to these difficulties at all? What would we do if we had the magic power to change the way the Web is constructed so that we did not have to experience these difficulties at all?

    Let us talk about this in the next section.

    1.2.2 What Stops Us from Doing More?

    Let us go back to the first main activity: search. Among the three major activities, search is probably the most popular one, and it is also interesting that this activity in fact shows the difficulty of the current Web in a most obvious way: whenever we do a search, we want to get only relevant results; we want to minimize the human work that is required when trying to find the appropriate documents.

    However, the conflict also starts from here: the current Web is entirely aimed at human readers and it is purely display oriented. In other words, the Web has been constructed in such a way that it is oblivious to the actual information content on any given Web site. Web browsers, Web servers, and even search engines do not actually distinguish weather forecasts from scientific papers and cannot even tell a personal home page from a major corporate Web site. Search engines are therefore forced to do keyword-matching only: as long as a given document contains the keyword(s), it will be included in the candidate set that is later presented to the user as the search result.

    If we had the magic power, we would re-construct the whole Web such that computers not only can present the information that is contained in the Web documents, but can also understand the very information they are presenting so they can make intelligent decisions on our behalf. For example, search engines can filter the pages before they present them back to us, if not able to directly give us the answer back.

    For the second activity, integration, the main difficulty is that there is too much manual work involved in the integration process. If the Web were constructed in such a way that the meaning of each Web document can be retrieved from a collection of statements, our agent would be able to understand each page, and information integration would have become amazingly fun and easy, as we have shown in Sect. 1.1.1.

Information integration implemented by using Web services may seem quite different at first glance. However, to automatically compose and invoke the necessary Web services, the first step is to discover them in a more efficient and automated manner. Currently, this type of integration is difficult to implement mainly because the discovery process of its components is far from efficient.

    The reason, again, as you can guess, is that although all the services needed to be integrated do exist on the Web, the Web is, however, not programmed to understand and remember the meaning of any of these services. As far as the Web is concerned, all these components are created equal, and there is no way for us to teach our computers to understand the meaning of each component, at least on the current Web.

What about the last activity, namely, Web data mining? The truth is that Web data mining applications are not scalable, and they have to be implemented at a very high cost, if they can be implemented at all.

    More specifically, each Web data mining application is highly specialized and has to be specially developed for that application context. For a given project in a specific domain, only the development team knows the meaning of each data element in the data source and how these data elements should interact together to present some useful information. The developers have to program these meanings into the mining software before setting it off to work; there is no way to let the mining application learn and understand these meanings on the fly. In addition, the underlying decision tree has to be pre-programmed into the application as well.

    Also, even for a given specific task, if the meaning of the data element changes (this can easily happen given the dynamic nature of the Web documents), the mining application has to be changed accordingly since it cannot learn the meaning of the data element dynamically. All these practical concerns have made Web data mining a very expensive task to do.

    Now, if the Web were built to remember all the meanings of data elements, and in addition, if all these meanings could be understood by a computer, we would then program the mining software by following a completely different pattern. We can even build a generic framework for some specific domain so that once we have a mining task in that domain, we can reuse it all the time – Web data mining will not be as expensive as today.

    Now we finally reached some interesting point. Summarizing the above discussion, we have come to an understanding of an important fact: our Web is constructed in a way that its documents only contain enough information for a machine to present them, not to understand them.

    Now the question is the following: is it still possible to re-construct the Web by adding some information into the documents stored on the Web, so that machines can use this extra information to understand what a given document is really about?

    The answer is yes, and by doing so, we in fact change the current (traditional) Web into something we call the Semantic Web – the main topic of this chapter and this whole book.

    1.2.3 Again, the Idea of the Semantic Web

    At this point, the Semantic Web can be understood as follows: the Semantic Web provides the technologies and standards that we need to make the following possible:

    adds machine-understandable meanings to the current Web, so that

    computers can understand the Web documents and therefore can automatically accomplish tasks that have been otherwise conducted manually, on a large scale.

    With all the intuitive understanding of the Semantic Web, we are now ready for some formal definition of the Semantic Web, which will be the main topic of the next section.

    1.3 The Semantic Web: A First Look

    1.3.1 The Concept of the Semantic Web

First off, the word semantics is related to the word syntax. In most languages, syntax is how you say something, whereas semantics is the meaning behind what you have said. Therefore, the Semantic Web can be understood as the Web of meanings, which echoes what we have learned so far.

    At the time of this writing, there is no formal definition of the Semantic Web yet. And it often means different things to different groups of individuals. Nevertheless, the term Semantic Web was originally coined by World Wide Web Consortium² (W3C) director Sir Tim Berners-Lee and formally introduced to the world by the May 2001 Scientific American article The Semantic Web (Berners-Lee et al. 2001):

    The Semantic Web is an extension of the current Web in which information is given well-defined meaning, better enabling computers and people to work in cooperation.

    There has been a dedicated team of people at W3C working to improve, extend, and standardize the idea of the Semantic Web. This is now called W3C Semantic Web Activity.³ According to this group, the Semantic Web can be understood as follows:

    The Semantic Web provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries.

    For us, and for the purpose of this book, we can understand the Semantic Web as the following:

    The Semantic Web is a collection of technologies and standards that allow machines to understand the meaning (semantics) of information on the Web.

    To see this, recall that Sect. 1.1.2 has described some requirements for both the current Web and agents, in order to bring the concept of the Semantic Web into reality. If we understand the Semantic Web as a collection of technologies and standards, Table 1.1 summarizes how those requirements summarized in Sect. 1.1.2 can be mapped to the Semantic Web’s major technologies and standards.

    Table 1.1

    Requirements summarized in Sect. 1.1.2 can be mapped to the Semantic Web’s technologies and standards

    Table 1.1 may not make sense at this point, but all the related technologies and standards will be covered in detail in the upcoming chapters of this book. When you finish this book and come back to review Table 1.1, you should be able to understand it with much more ease.

    1.3.2 The Semantic Web, Linked Data, and the Web of Data

    Linked Data and the Web of Data are concepts that are closely related to the concept of the Semantic Web. We will take a brief look at these terms in this section. You will see more details about Linked Data and Web of Data in the later chapters.

The idea of Linked Data was originally proposed by Tim Berners-Lee, and his 2006 Linked Data principles⁴ are considered to be the official and formal introduction of the concept itself. At its current stage, Linked Data is a W3C-backed movement that focuses on connecting datasets across the Web, and it can be viewed as a subset of the Semantic Web concept, which is all about adding meanings to the Web.

To understand the concept of Linked Data, think about the motivating example presented in Sect. 1.1.1. More specifically, our agent has visited several pages and has collected a list of statements from each Web site. These statements, as we know now, represent the meaning added to that particular Web site. This has given us the impression that the added structured information for machines has to be associated with some hosting Web site and has to be either manually created or automatically generated.

In fact, there is no absolute need for machines and human beings to share the same Web site. Therefore, it is also possible to directly publish some structured information online (a collection of machine-understandable statements, for example) without relating it to any Web site at all. Once this is done, these statements, as a dataset, are ready to be harvested and processed by applications.

    Imaginably, such a dataset will not be of much use if it does not have any link to other datasets. Therefore, one important design principle is to make sure that each such dataset has outgoing links to other datasets.

Therefore, publishing structured data online and adding links among these datasets are the key aspects of the Linked Data concept. The Linked Data principles specify the steps for accomplishing these key aspects. Without going into the details of the Linked Data principles, understand that the term Linked Data refers to a set of best practices for publishing and connecting structured data on the Web.

What is the relationship between Linked Data and the Semantic Web? Once you finish this book (Linked Data is covered in Chap. 11), you should be able to get a much better understanding. For now, let us simply summarize their relationship without much discussion:

    Linked Data is published by using Semantic Web technologies and standards.

    Similarly, Linked Data is linked together by using Semantic Web technologies and standards.

    Finally, the Semantic Web is the goal, and Linked Data provides the means to reach the goal.

Another important concept is the Web of Data. At this point, we can understand Web of Data as an interchangeable term for the Semantic Web. In fact, the Semantic Web can be defined as a collection of standard technologies to realize a Web of Data. In other words, if Linked Data is realized by using the Semantic Web standards and technologies, the result would be a Web of Data.

    1.3.3 Some Basic Things About the Semantic Web

    Before we set off and get into the world of the Semantic Web, there are some useful information resources you should know about. Let us list them here in this section:

    http://www.w3.org/2001/sw/

This is the W3C Semantic Web activity’s official Web site. It has quite a lot of information, including a short introduction, latest publications (articles and interviews), presentations, links to specifications, and links to different working groups.

    http://www.w3.org/2001/sw/SW-FAQ

    This is the W3C Semantic Web frequently asked questions page. This is certainly very helpful to you if you have just started to learn the Semantic Web. Again, when you finish this whole book, come back to this FAQ, take yet another look, and you will find yourself having a much better understanding at that point.

    http://www.w3.org/2001/sw/interest/

This is the W3C Semantic Web interest group, which provides a public forum to discuss the use and development of the Semantic Web, with the goal of supporting developers. Therefore, this could be a useful resource for you when you start your own development work.

    http://www.w3.org/2001/sw/anews/

This is the W3C Semantic Web activity news Web site. From here, you can follow the news related to the Semantic Web Activity, especially news and progress about specifications and specification proposals.

    http://www.w3.org/2001/sw/wiki/Main_Page

This is the W3C Semantic Web community Wiki page. It has quite a lot of information for anyone interested in the Semantic Web. It has links to a set of useful sites, such as events in the Semantic Web community, Semantic Web tools, people in the Semantic Web community, and popular ontologies, just to name a few. Make sure to check this page if you are looking for related resources during the course of your study of this book.

    http://iswc.semanticweb.org/

There are a number of different conferences in the world of the Semantic Web. Among these conferences, the International Semantic Web Conference (ISWC) is a major international forum at which research on all aspects of the Semantic Web is presented, and the above is its official Web site. The ISWC started in 2001 and has been the major conference ever since. Check out this conference to follow the latest research activities.

    With all these being said, we are now ready to start the book. Again, you can find a roadmap of the whole book in the Preface section, and you can download all the code examples for this book from http://www.liyangyu.com.

    Reference

    Berners-Lee T, Hendler J, Lassila O (2001) The Semantic Web. Sci Am 284(5):34–43CrossRef

    Footnotes

    1

    http://aws.amazon.com/

    2

    http://www.w3.org/

    3

    http://www.w3.org/2001/sw/

    4

    http://www.w3.org/DesignIssues/LinkedData.html

Liyang Yu, A Developer’s Guide to the Semantic Web, DOI 10.1007/978-3-642-15970-1_2, © Springer-Verlag Berlin Heidelberg 2011

    2. The Building Block for the Semantic Web: RDF

    Liyang Yu¹  

    (1)

    Delta Air Lines, Inc., Delta Blvd. 1030, Atlanta, GA 30354, USA

    Liyang Yu

    Email: liyang910@yahoo.com

    2.1 RDF Overview

    2.1.1 RDF in Official Language

    2.1.2 RDF in Plain English

    2.2 The Abstract Model of RDF

    2.2.1 The Big Picture

    2.2.2 Statement

    2.2.3 Resource and Its URI Name

    2.2.4 Predicate and Its URI Name

    2.2.5 RDF Triples: Knowledge That Machine Can Use

    2.2.6 RDF Literals and Blank Node

    2.2.7 A Summary So Far

    2.3 RDF Serialization: RDF/XML Syntax

    2.3.1 The Big Picture: RDF Vocabulary

    2.3.2 Basic Syntax and Examples

    2.3.3 Other RDF Capabilities and Examples

2.4 Other RDF Serialization Formats

    2.4.1 Notation-3, Turtle, and N-Triples

    2.4.2 Turtle Language

    2.5 Fundamental Rules of RDF

    2.5.1 Information Understandable by Machine

    2.5.2 Distributed Information Aggregation

    2.5.3 A Hypothetical Real-World Example

    2.6 More About RDF

    2.6.1 Dublin Core: Example of Pre-defined RDF Vocabulary

    2.6.2 XML vs. RDF?

    2.6.3 Use an RDF Validator

    2.7 Summary

    Abstract

    This chapter is probably the most important chapter in this whole book: it covers RDF in detail, which is the building block for the Semantic Web. A solid understanding of RDF provides the key to the whole technical world that defines the foundation of the Semantic Web: once you have gained the understanding of RDF, all the rest of the technical components will become much easier to comprehend, and in fact, much more intuitive as well.


This chapter will cover all the main aspects of RDF, including its concept, its abstract model, its semantics, its language constructs, and its features, together with ample real-world examples. This chapter also introduces available tools you can use when creating or understanding RDF models. Make sure you understand this chapter well before you move on. In addition, be patient when reading this chapter: some concepts and ideas may look unnecessarily complex at first glance, but eventually, you will start to see the reasons behind them.

    Let us get started.

    2.1 RDF Overview

    2.1.1 RDF in Official Language

    RDF stands for Resource Description Framework, and it was originally created in early 1999 by W3C as a standard for encoding metadata. The name, Resource Description Framework, was formally introduced in the corresponding W3C specification document that outlines the standard.¹

    As we have discussed earlier, the current Web is built for human consumption, and it is not machine understandable at all. It is therefore very difficult to automate anything on the Web, at least on a large scale. Furthermore, given the huge amount of information the Web contains, it is impossible to manage it manually either. A solution proposed by W3C is to use metadata to describe the data contained on the Web, and the fact that this metadata itself is machine understandable enables automated processing of the related Web resources.

With the above consideration in mind, RDF was proposed in 1999 as a basic model and foundation for creating and processing metadata. Its goal is to define a mechanism for describing resources that makes no assumptions about a particular application domain (domain independence) and can therefore be used to describe information about any domain. The final result is that the RDF concept and model can directly help to promote interoperability between applications that exchange machine-understandable information on the Web.

As we have discussed in Chap. 1, the concept of the Semantic Web was formally introduced to the world in 2001, and the goal of the Semantic Web is to make the Web machine understandable. The obvious logical connection between the Semantic Web and RDF has greatly changed RDF: the scope of RDF has since evolved into something much greater. As we will see later throughout the book, RDF is not only used for encoding metadata about Web resources, but also used for describing any resources and their relations existing in the real world.

    This much larger scope of RDF has been summarized in the updated RDF specifications published in 2004 by the RDF Core Working Group² as part of the W3C Semantic Web activity.³ These updated RDF specifications contain altogether six documents as shown in Table 2.1. These six documents have since then jointly replaced the original Resource Description Framework specification (1999 Recommendation), and they together became the new RDF W3C Recommendation on 10 February 2004.

    Table 2.1

    RDF W3C recommendation, 10 February 2004

    Based on these official documents, RDF can be defined as follows:

    RDF is a language for representing information about resources in the World Wide Web (RDF Primer).

    RDF is a framework for representing information on the Web (RDF Concept).

    RDF is a general-purpose language for representing information in the Web (RDF Syntax, RDF Schema).

    RDF is an assertional language intended to be used to express propositions using precise formal vocabularies, particularly those specified using RDFS, for access and use over the World Wide Web, and is intended to provide a basic foundation for more advanced assertional languages with a similar purpose (RDF Semantics).

    At this point, it is probably not easy to truly understand what RDF is, based on these official definitions. Let us keep these definitions in mind, and once you have finished this chapter, review these definitions and you should find yourself having a better understanding of them.

    For now, let us move on to some more explanation in plain English about what exactly RDF is and why we need it. This explanation will be much easier to understand and will give you enough background and motivation to continue reading the rest of this chapter.

    2.1.2 RDF in Plain English

    Let us forget about RDF for a moment and consider those Web sites where we can find reviews of different products (such as Amazon.com, for example). Similarly, there are also Web sites that sponsor discussion forums where a group of people get together to discuss the pros and cons of a given product. The reviews published at these sites can be quite useful when you are trying to decide whether you should buy a specific product or not.

    For example, I am a big fan of photography, and I have recently decided to upgrade my equipment – to buy a Nikon SLR (single lens reflex) camera so I will have more control over how a picture is taken and therefore have more chance to show my creative side. Note the digital version of SLR camera is called DSLR (digital single lens reflex).

    However, pretty much all Nikon SLR models are quite expensive, so to spend money wisely, I have read quite a lot of reviews, with the goal of choosing the one particular model that fits my needs the best.

    You must have had the same experience, probably with some other product. Also, you will likely agree with me that reading these reviews does take a lot of time. In addition, even after reading quite a lot of reviews, you are still not sure: could it be true that I have missed some reviews that could be very useful?

Now, imagine you are a quality engineer who works for Nikon. Your assignment is to read all these reviews, summarize what people have said about Nikon SLR cameras, and report back to Nikon headquarters so the design department can make better designs based on these reviews.

Obviously, you can do your job by reading as many reviews as you can, manually creating a summary report, and submitting it to your boss. However, this is not only tedious, but also quite demanding: you spend the whole morning reading, and you have only covered a couple dozen reviews, with a couple hundred more to go!

One idea to solve this problem is to write an application that will read all these reviews for you and generate a report automatically, and all this will be done in a matter of a couple of minutes. Better yet, you can run this application as often as you want, just to gather the latest reviews. This is a great idea with only one flaw: such an application is not easy to develop, since the reviews published online are intended for human eyes to consume, not for machines to read.

    Now, in order to solve this problem once and for all so as to make sure you have a smooth and successful career path, you start to consider the following key issue:

    Assuming all the review publishers are willing to accept and follow some standard when they publish their reviews, what standard would make it easier to develop such an application?

    Note the words we used were standard and easier. Indeed, although writing such an application is difficult, there is in fact nothing stopping us from actually doing it, even on the given Web and without any standard. For example, screen scraping can be used to read reviews from Amazon.com’s review page, as shown in Fig. 2.1.


    Fig. 2.1

    Amazon’s review page for Nikon D300 SLR camera

    On this page, a screen-scraping agent can pick up the fact that 40 customers have assigned five stars to Nikon D300 (a DSLR camera by Nikon), and four attributes are currently used for reviewing this camera, and they are called Ease of use, Features, Picture quality, and Portability.

    Once we have finished coding the agent that understands the reviews from Amazon.com, we can move on to the next review site. It is likely that we have to add another new set of rules to our agent so it can understand the published reviews on that specific site. The same is true for the next site, so on and so forth.

    There is indeed quite a lot of work, and obviously, it is not a scalable way to develop an application either. In addition, when it comes to the maintenance of this application, it could be more difficult than the initial development work. For instance, a small change on any given review site can easily break the logic that is used to understand that site, and you will find yourself constantly being busy changing and fixing the code.

    And this is exactly why a standard is important: once we have a standard that all the review sites follow, it will be much easier to write an application to collect the distributed reviews and come up with a summary report.

    Now, what exactly is this standard? Perhaps it is quite challenging to come up with a complete standard right away, but it might not be too difficult to specify some of the things we would want such a standard to have:

    It should be flexible enough to express any information anyone can think of.

    Obviously, each reviewer has different things to say about a given Nikon camera, and whatever he/she wants to say, the standard has to provide a way to allow it. Perhaps the graph shown in Fig. 2.2 is a possible choice – any new information can be added to this graph freely: just grow it as you wish.

    And to represent this graph as structured information is not as difficult as you think: the tabular notation shown in Table 2.2 is exactly equivalent to the graph shown in Fig. 2.2.

    More specifically, each row in the table represents one
