In an enterprise, managing data across many people, departments, and processes raises several challenges. The current big data revolution is generating a mass of unruly data that is difficult to parse and understand. Much of that data is heterogeneous, meaning it arrives in a variety of different data structures, and some of it, such as emails or documents, is unstructured. Hence we need a flexible, common model. Semantic graph technology is shaping up to play a key role in how organizations access these growing stores of data.
Why Knowledge Graphs?
We need a structured and formal representation of knowledge. We are surrounded by entities, which are connected by relationships. Graphs are a natural way to represent entities and their relationships.
A Knowledge Graph represents a knowledge domain. It connects things of different types in a systematic way. Knowledge graphs encode knowledge arranged in a network of nodes and links rather than tables of rows and columns.
This way, people and machines can benefit from a dynamically growing semantic network of facts about things and can use it for data integration, knowledge discovery, and in-depth analysis.
A small representation of a Knowledge Graph
Let us take an example of a movie recommendation data-set which has data like this:
- A movies table of movie id and title.
- A rating table of user ids, ratings and movie id.
- A genre table of movie id and genre.
Now we want to connect these separate tables into a single graph that shows, for each movie, its genre and the ratings given by users.
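A minimal Python sketch of that step, assuming the three tables are already loaded as lists of tuples. The sample rows, the edge labels `has_genre`, `has_rating`, and `rated`, and the helper `build_graph` are illustrative assumptions, not part of the dataset.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for the three tables.
movies = [(1, "Toy Story (1995)"), (2, "Jumanji (1995)")]  # (movie_id, title)
ratings = [(10, 4, 1), (11, 3, 1), (10, 2, 2)]             # (user_id, rating, movie_id)
genres = [(1, "Animation"), (2, "Adventure")]              # (movie_id, genre)


def build_graph(movies, ratings, genres):
    """Return {node: [(edge_label, target_node), ...]} linking the tables."""
    graph = defaultdict(list)
    titles = dict(movies)  # movie_id -> title
    for movie_id, genre in genres:
        graph[titles[movie_id]].append(("has_genre", genre))
    for user_id, rating, movie_id in ratings:
        graph[titles[movie_id]].append(("has_rating", rating))
        graph[f"user:{user_id}"].append(("rated", titles[movie_id]))
    return dict(graph)


graph = build_graph(movies, ratings, genres)
for label, target in graph["Toy Story (1995)"]:
    print("Toy Story (1995)", "->", label, "->", target)
```

Each movie title becomes a node, and every row of the genre and rating tables becomes a labeled edge leaving it.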
Now let us see how some big shots like Google and Facebook use knowledge graphs.
Google knowledge graph:
Google’s core business is providing results to user queries. To do that, it doesn’t just present the result that most closely matches a search term; it also makes broader connections between data. Google therefore collects and analyzes massive amounts of data on people, places, things, and facts, and develops ways to present the findings in an accessible way.
The Knowledge Graph is the engine that powers the panel that’s officially called the Knowledge Graph Card. In this card, you’ll find the most visible result of the work the graph does. When there’s enough data about a subject, the card will be filled with all kinds of relevant facts, images, and related searches.
On the surface, information from the Knowledge Graph is used to augment search results and to enhance its AI when answering direct spoken questions in Google Assistant and Google Home voice queries. Behind the scenes, Google uses its KG to improve its machine learning.
For movie searches, we can directly see the show timings in our local theatres along with the facts and information of the movie.
For food searches, we can also view nutrition information.
Facebook Graph Search:
Here the data is linked based on four categories (People, Photos, Places, and Interests), which helps answer queries like “People who like cycling and are working in my company” or “My friends who like Quora”.
Knowledge Graph architecture
How is the knowledge graph built?
We need a common method, standardized yet flexible, to define relationships between things. This is where terms like RDF, RDFS, and OWL come into the picture.
RDF’s basic construct is the triple. This corresponds to a very simple English sentence and as the name suggests, it contains three components: a subject, a predicate, and an object.
Let us look into our movie data-set where we need a graph like the following for each movie:
Here we have 2 edges, or triples, from the movie title:
|Subject|Predicate|Object|
|---|---|---|
|Toy Story (1995)|Rating|4|
|Toy Story (1995)|Genre|Animation|
The underlying data structure of RDF is a directed graph. A triple represents a single edge (i.e. a line between two nodes) that is labeled with the predicate name. The edge has a single arrow pointing from the subject to the object. The predicate thus names a binary relation between the subject and the object.
In the above graph, there are 2 triples:
Movie_title -> has_rating -> 4.
Movie_title -> has_genre -> Animation.
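To make the direction of each edge concrete, here is a small Python sketch that stores triples as named tuples and walks them in both directions. The `Triple` type, the helpers `outgoing` and `incoming`, and the predicate name `has_genre` are assumptions made for illustration.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


triples = [
    Triple("Toy Story (1995)", "has_rating", "4"),
    Triple("Toy Story (1995)", "has_genre", "Animation"),
]


def outgoing(node, triples):
    """Edges that leave `node`, as (predicate, object) pairs."""
    return [(t.predicate, t.obj) for t in triples if t.subject == node]


def incoming(node, triples):
    """Edges that point at `node`, as (subject, predicate) pairs."""
    return [(t.subject, t.predicate) for t in triples if t.obj == node]


print(outgoing("Toy Story (1995)", triples))
print(incoming("Animation", triples))
```

Because the graph is directed, "Animation" has an incoming edge from the movie but no outgoing edges of its own.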
To represent these triples in RDF, several common serialization formats are in use, including:
- Turtle, a compact, human-friendly format.
- N-Triples, a very simple, easy-to-parse, line-based format that is not as compact as Turtle.
- N-Quads, a superset of N-Triples, for serializing multiple RDF graphs.
- JSON-LD, a JSON-based serialization.
- N3 or Notation3, a non-standard serialization that is very similar to Turtle, but has some additional features, such as the ability to define inference rules.
- RDF/XML, an XML-based syntax that was the first standard format for serializing RDF.
- RDF/JSON, an alternative syntax for expressing RDF triples using a simple JSON notation.
Our example in N-Triples format:
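As a rough sketch, the Python below prints the two triples as N-Triples lines. The namespace `http://example.org/`, the predicate names, and the helpers `iri`, `literal`, and `to_ntriples` are assumptions for illustration; the literal escaping covers only backslashes and double quotes, not the full grammar.

```python
from urllib.parse import quote

BASE = "http://example.org/"  # hypothetical namespace
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"


def iri(path):
    """Percent-encode `path` and wrap it, under BASE, in angle brackets."""
    return f"<{BASE}{quote(path)}>"


def literal(value):
    """Serialize an int as xsd:integer, anything else as a plain string literal."""
    if isinstance(value, int):
        return f'"{value}"^^<{XSD_INTEGER}>'
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_ntriples(triples):
    """Emit one 'subject predicate object .' statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


movie = iri("movie/Toy Story (1995)")
doc = to_ntriples([
    (movie, iri("has_rating"), literal(4)),
    (movie, iri("has_genre"), iri("genre/Animation")),
])
print(doc)
```

Every line is a complete statement ending in a dot, which is what makes the format easy to parse line by line.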
RDFS (RDF Schema) extends the RDF vocabulary to allow describing taxonomies of classes and properties. It also adds definitions for some elements of RDF; for example, it lets us set the domain and range of properties and relate RDF classes and properties into taxonomies.
For our example, if we want to add sub-genres of the genre “Animation” for the movie “Toy Story”, we can do so with RDFS by creating a genre class and making Fantasy a subclass of Animation.
In RDF/XML format:
In N-Triples format:
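What the subclass statement buys us can be sketched in plain Python: anything typed with a sub-genre is also a member of every genre above it. The `ex:` names, the choice to type the movie directly as `ex:Fantasy`, and the helpers `superclasses` and `inferred_types` are assumptions for illustration, not the output of an RDFS reasoner.

```python
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"

triples = {
    ("ex:Animation", SUBCLASS_OF, "ex:Genre"),
    ("ex:Fantasy", SUBCLASS_OF, "ex:Animation"),
    ("ex:ToyStory", RDF_TYPE, "ex:Fantasy"),  # hypothetical modelling choice
}


def superclasses(cls, triples):
    """All classes reachable from `cls` by following rdfs:subClassOf."""
    found, stack = set(), [cls]
    while stack:
        current = stack.pop()
        for s, p, o in triples:
            if s == current and p == SUBCLASS_OF and o not in found:
                found.add(o)
                stack.append(o)
    return found


def inferred_types(resource, triples):
    """Declared types of `resource` plus every superclass of those types."""
    declared = {o for s, p, o in triples if s == resource and p == RDF_TYPE}
    result = set(declared)
    for cls in declared:
        result |= superclasses(cls, triples)
    return result


print(sorted(inferred_types("ex:ToyStory", triples)))
```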
OWL is designed to represent rich and complex knowledge about things, groups of things, and relations between things. It is complementary to RDF: it provides explicit, machine-readable definitions that can be used to organize and connect data from multiple sources within a given domain. It categorizes properties (relationships) into object properties and data properties, and allows you to add restrictions on them.
OWL also restricts what can be stated. In RDFS, for example, we can say that “Fantasy” is a member of the class genre/Animation and, at the same time, that Animation is a member of the class genre.
Here Animation is both a class that has a member “Fantasy” and itself a member of the class “genre”. In OWL, by contrast, or at least in some flavors of OWL, the above statements are not allowed: you’re simply not allowed to say that something is both a class and an instance.
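A toy version of that check in Python, assuming the statements are available as triples: it flags any name that appears both as a class and as an instance. The `ex:` names and the helper `class_and_instance` are hypothetical, and this is a simplified illustration rather than what an OWL reasoner actually does.

```python
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"

triples = [
    ("ex:Fantasy", RDF_TYPE, "ex:Animation"),  # Animation used as a class
    ("ex:Animation", RDF_TYPE, "ex:Genre"),    # Animation used as an instance
]


def class_and_instance(triples):
    """Return the names that are used both as a class and as an instance."""
    classes, instances = set(), set()
    for s, p, o in triples:
        if p == RDF_TYPE:
            instances.add(s)   # the subject of rdf:type is an instance
            classes.add(o)     # the object of rdf:type is a class
        elif p == SUBCLASS_OF:
            classes.update((s, o))
    return classes & instances


print(sorted(class_and_instance(triples)))
```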
How different is the Knowledge Graph from traditional databases in terms of querying?
Graph databases are schema-less, allowing complex interactions between data to be represented in a much more natural form. They offer the flexibility of a document or key/value store while supporting relationships in a way similar to a traditional relational database. This doesn’t mean that there is no data model associated with the database, simply that there is more flexibility in how you define it, which can often lead to faster iteration on your projects.
This is all possible in other database solutions, but not always as elegantly as in a Graph Database and often involving link tables or nested documents to achieve the same level of expressiveness.
We can see how querying a knowledge graph can be easier than querying an RDBMS with a query example using SPARQL.
SPARQL is a query language and a protocol for accessing RDF, designed by the W3C RDF Data Access Working Group. The SPARQL query language can retrieve and manipulate data stored in RDF format.
An example of a SELECT query follows.
The first line defines a namespace prefix; the last two lines use the prefix to express an RDF graph to be matched.
Identifiers beginning with a question mark (?) are variables. In this query, we are looking for a resource ?x participating in triples with the predicates foaf:name and foaf:mbox, and we want the subjects of these triples.
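How such a pattern gets evaluated can be sketched in Python: two triple patterns share the variable ?x, so only resources with both a name and a mailbox match. The query text (written to fit the description above), the sample data, and the helpers `unify` and `match` are illustrative assumptions; a real SPARQL engine does far more.

```python
# Illustrative query text only; it is not parsed or executed here.
QUERY = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?x
WHERE { ?x foaf:name ?name .
        ?x foaf:mbox ?mbox }
"""

triples = [
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:alice", "foaf:mbox", "mailto:alice@example.org"),
    ("ex:bob", "foaf:name", "Bob"),  # no mbox, so ex:bob must not match
]

patterns = [("?x", "foaf:name", "?name"), ("?x", "foaf:mbox", "?mbox")]


def unify(pattern, triple, binding):
    """Extend `binding` so that `pattern` equals `triple`, or return None."""
    new = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            if new.setdefault(term, value) != value:
                return None  # variable already bound to something else
        elif term != value:
            return None      # constant term does not match
    return new


def match(patterns, triples):
    """Return every variable binding that satisfies all patterns."""
    bindings = [{}]
    for pattern in patterns:
        next_bindings = []
        for binding in bindings:
            for triple in triples:
                extended = unify(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings


subjects = [b["?x"] for b in match(patterns, triples)]
print(subjects)
```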
There are four query result forms. Besides returning the list of values found, a query can also construct an RDF graph or confirm whether a match was found.
- SELECT – returns the list of values of variables bound in a query pattern
- CONSTRUCT – returns an RDF graph constructed by substituting variables in the query pattern
- DESCRIBE – returns an RDF graph describing the resources that were found
- ASK – returns a boolean value indicating whether the query pattern matches or not
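Three of these forms can be imitated over in-memory triples with a single triple pattern, as in the Python sketch below. The function names, the `ex:` data, and the one-pattern restriction are assumptions for illustration; DESCRIBE is left out because what it returns is decided by the query service.

```python
triples = [
    ("ex:ToyStory", "ex:has_genre", "ex:Animation"),
    ("ex:Jumanji", "ex:has_genre", "ex:Adventure"),
]


def solutions(pattern, triples):
    """Bindings for one triple pattern; terms starting with '?' are variables."""
    results = []
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:  # no break: the whole pattern matched this triple
            results.append(binding)
    return results


def select(variables, pattern, triples):
    """SELECT: one tuple of values per solution."""
    return [tuple(b[v] for v in variables) for b in solutions(pattern, triples)]


def ask(pattern, triples):
    """ASK: True if at least one solution exists."""
    return bool(solutions(pattern, triples))


def construct(template, pattern, triples):
    """CONSTRUCT: fill `template` with each solution to build new triples."""
    return [tuple(b.get(t, t) for t in template) for b in solutions(pattern, triples)]


pattern = ("?m", "ex:has_genre", "?g")
print(select(["?m"], pattern, triples))
print(ask(("?m", "ex:has_genre", "ex:Horror"), triples))
print(construct(("?g", "ex:genre_of", "?m"), pattern, triples))
```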
To get all the movies with rating >= 3:
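One plausible shape for such a query is shown below as a string, next to a plain-Python equivalent over in-memory triples. The prefix `ex:`, the predicate `ex:has_rating`, the sample ratings, and the reading of the condition as "at least one rating of 3 or more" are all assumptions for illustration.

```python
# Illustrative SPARQL text only; it is not executed here.
QUERY = """
PREFIX ex: <http://example.org/>
SELECT DISTINCT ?movie
WHERE { ?movie ex:has_rating ?rating .
        FILTER (?rating >= 3) }
"""

triples = [
    ("Toy Story (1995)", "ex:has_rating", 4),
    ("Toy Story (1995)", "ex:has_rating", 3),
    ("Jumanji (1995)", "ex:has_rating", 2),
    ("Toy Story (1995)", "ex:has_genre", "Animation"),
]


def movies_rated_at_least(threshold, triples):
    """Distinct subjects with at least one rating >= threshold, sorted."""
    return sorted({
        s for s, p, o in triples
        if p == "ex:has_rating" and o >= threshold
    })


print(movies_rated_at_least(3, triples))
```

The graph pattern needs only one triple pattern plus a filter; no table has to be joined to another.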
The SQL map for the above query:
As we can see, even this simple query needs 2 joins in SQL (one self-join and one equi-join), and the number of self-joins grows as the query becomes more complex.
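Why joins pile up can be sketched with Python's built-in `sqlite3`, assuming the triples are kept in a single three-column table. The table layout, column names, and sample rows are assumptions, and the mapping may differ from the one described above: here, asking for the genre of well-rated movies already needs the table to be joined with itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("Toy Story (1995)", "has_rating", "4"),
        ("Toy Story (1995)", "has_genre", "Animation"),
        ("Jumanji (1995)", "has_rating", "2"),
        ("Jumanji (1995)", "has_genre", "Adventure"),
    ],
)

# Each extra triple pattern about the same movie adds another self-join.
SQL = """
SELECT DISTINCT r.s AS movie, g.o AS genre
FROM triples AS r
JOIN triples AS g ON g.s = r.s
WHERE r.p = 'has_rating'
  AND g.p = 'has_genre'
  AND CAST(r.o AS INTEGER) >= 3
"""
rows = conn.execute(SQL).fetchall()
print(rows)
conn.close()
```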
Graph databases, together with other semantic technologies, and the principles of Linked Data allow businesses across all industries to integrate data in order to analyze performance, plan resources and budgets, and optimize business processes. The amount of data and information is only set to rise in our increasingly digital and interconnected world. So the ability to achieve business transformation and have an impact on industries and society gives the adopters of semantic technology and graph databases the competitive edge to make more sense of data.
In our upcoming blogs, we will dig deeper into the semantic graph ecosystem and look in much more detail at when it’s a good idea to apply semantic web technologies.