Building the Business Knowledge Graph — Machine Reading Comprehension at Web Scale.

The world wide web is the largest information source that has ever existed and continues to grow at an exponential rate. It has become the first place people turn to when trying to find the answer to any question. Because most of the information online is in an unstructured format (e.g. text), this works well when the question is straightforward, has already been asked, and has a good answer already written online. If you have more complex research to carry out, something that needs to interpret, connect and consolidate information across multiple sources, then you will need to spend hours (maybe days) searching, reading and collecting the source materials manually.

At glass.ai, we have invented AI that is able to understand language at large scale and have applied this technology to turn the unstructured content found across the open web into structured datasets. This unique technology has enabled us to build the glass.ai Business Knowledge Graph, a dataset containing over 15 billion facts and relationships on businesses worldwide that maintain a web presence. It allows us to answer those complex questions that require the consolidation of knowledge across multiple sources.

The glass.ai Business Knowledge Graph is built and kept up to date by an intelligent crawler that has been trained to recognise key business entities on the pages it reads and to apply rules as it browses the web, so that it follows paths that are likely to lead to further relevant information. In this way it tries to act as a human would when researching a topic or area of interest, although it is able to do so at a much larger scale, 24 hours a day.
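To make the idea of rule-guided browsing concrete, here is a minimal sketch of a crawl frontier that visits the most promising links first. It is an illustration only, not the glass.ai crawler: the `LINK_RULES` keywords and weights, the `score_link` function and the `Frontier` class are all hypothetical names invented for this example.

```python
import heapq
from urllib.parse import urljoin, urlparse

# Hypothetical rules: keywords that suggest a link leads to business
# information, with illustrative weights. Not the real glass.ai rules.
LINK_RULES = {
    "about": 3, "team": 3, "contact": 3,
    "products": 2, "services": 2, "news": 2, "careers": 2,
}


def score_link(url, anchor_text):
    """Sum the weights of every rule keyword found in the URL path or anchor text."""
    haystack = (urlparse(url).path + " " + anchor_text).lower()
    return sum(weight for keyword, weight in LINK_RULES.items() if keyword in haystack)


class Frontier:
    """Priority queue of URLs to visit, highest score first."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = 0  # tie-breaker that keeps insertion order stable

    def add(self, base_url, href, anchor_text):
        url = urljoin(base_url, href)
        if url in self._seen:
            return
        self._seen.add(url)
        score = score_link(url, anchor_text)
        if score > 0:  # links that match no rule are never queued
            heapq.heappush(self._heap, (-score, self._counter, url))
            self._counter += 1

    def pop(self):
        """Return the best unvisited URL, or None when the frontier is empty."""
        return heapq.heappop(self._heap)[2] if self._heap else None
```

A link such as "/about-us" with the anchor text "About the team" matches two rules and is visited before "/products", while a link that matches no rule is skipped altogether. This is roughly how a human researcher ignores pages that are unlikely to help.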

The crawler behind the Business Knowledge Graph was trained to recognise business websites and extract business descriptions, people connected to each business, products and services, news, job listings and contact information, such as addresses. The classifier for each of these data types was bootstrapped with a small number of examples from which language models were built. These small but very accurate language models consist of dictionaries of words and phrases that are able to precisely target each type of content. This is important in the context of web-scale comprehension because other, statistical, methods lose precision as they try to generalise beyond the training set, something that is guaranteed to happen when trying to read a source as vast as the open web.
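As a rough illustration of a dictionary-based classifier of this kind, the sketch below labels a passage only when enough dictionary phrases match, and otherwise abstains. The phrase lists, the `classify` function and the `min_hits` threshold are assumptions made for this example; the actual glass.ai dictionaries and matching logic are not described in this article.

```python
import re

# Hypothetical phrase dictionaries, one per content type.
PHRASES = {
    "job_listing": ["we are hiring", "apply now", "job description", "vacancy"],
    "contact": ["contact us", "head office", "telephone", "postcode"],
    "news": ["press release", "announced today", "latest news"],
}


def compile_patterns(phrases):
    """Build one case-insensitive, word-bounded regex per content type."""
    return {
        label: re.compile(
            r"\b(?:" + "|".join(re.escape(term) for term in terms) + r")\b",
            re.IGNORECASE,
        )
        for label, terms in phrases.items()
    }


PATTERNS = compile_patterns(PHRASES)


def classify(text, min_hits=2):
    """Return the label with the most dictionary hits, or None if too few match."""
    counts = {label: len(pattern.findall(text)) for label, pattern in PATTERNS.items()}
    label, hits = max(counts.items(), key=lambda item: item[1])
    return label if hits >= min_hits else None
```

Returning `None` when fewer than `min_hits` phrases match is a deliberate choice in this sketch: it gives up some coverage to keep precision high, which is the trade-off described above.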

The language models are supported by a large-scale ontology that has been built from crowd-sourced knowledge resources such as Wikipedia, WordNet, GeoNames and online business glossaries. This enables further categorisation of the content by topic, business sector and location and, together with the intelligent crawler’s deep knowledge of website structures, enables the extraction of the rich knowledge base of facts from which the Business Knowledge Graph is constructed.

As the glass.ai Business Knowledge Graph is a live dataset, the quality of the content extracted is regularly checked through independent manual review of random samples of the key entities and attributes. This has consistently shown quality across the facts collected in the Business Knowledge Graph of 95% or better. If we compare this to other automatically collated knowledge graphs¹ then the only one that matches this quality mark is YAGO3 (see above table). However, YAGO3² was built from the structured info boxes that are present on some Wikipedia pages, so it addresses a significantly simpler problem than trying to interpret content from the open web. In terms of scale, the Google Knowledge Graph is slightly bigger than the Business Knowledge Graph, 18B vs. 15B facts. However, this broad knowledge base has not been tested for quality. A smaller subset in the Google Knowledge Vault³ has been tested, and a much smaller set of facts (271M) reaches 90% quality. Again, it should be noted that this is reading from a simpler source of structured content on the web: web tables and standard web markup. Looking at knowledge bases constructed from unstructured content on the open web, the glass.ai Business Knowledge Graph substantially outperforms the best of these in terms of quality, 95% vs. 85% for NELL⁴, and is significantly larger: NELL contains just 2M facts vs. 15B in the Business Knowledge Graph.
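A quality figure obtained from a reviewed random sample is an estimate, so it helps to report it with a confidence interval. The sketch below draws a random sample for review and computes a Wilson score interval for the share of facts judged correct. The function names and the sample figures in the usage note are illustrative assumptions and do not describe the actual glass.ai review process.

```python
import math
import random


def sample_for_review(fact_ids, sample_size, seed=0):
    """Draw a reproducible simple random sample of fact ids for manual review."""
    rng = random.Random(seed)
    return rng.sample(list(fact_ids), sample_size)


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95% confidence)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half_width, centre + half_width
```

For example, if reviewers judged 950 facts out of a hypothetical sample of 1,000 to be correct, `wilson_interval(950, 1000)` gives the range within which the true quality of the whole dataset plausibly lies. Larger samples give a narrower range.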

The glass.ai Business Knowledge Graph demonstrates the potential that the open web provides for extracting structured, queryable data through scalable and accurate machine language understanding. At glass.ai, it allows us to quickly answer those complex business questions that would require significant effort to research through other means. It also provides rich contextual intelligence that has been applied to use cases such as sector mapping, client targeting, competitor monitoring and discovering emerging trends, to name just a few.

—————————

[1]: Heiko Paulheim. Knowledge Graph Refinement: A Survey of Approaches and Evaluation Methods. http://semantic-web-journal.net/system/files/swj1167.pdf. 2016.

[2]: Farzaneh Mahdisoltani, Joanna Biega and Fabian M. Suchanek. YAGO3: A Knowledge Base from Multilingual Wikipedias. https://suchanek.name/work/publications/cidr2015.pdf. 2015.

[3]: Xin Luna Dong et al. Knowledge Vault: A Web-Scale Approach to Probabilistic Knowledge Fusion. https://storage.googleapis.com/pub-tools-public-publication-data/pdf/45634.pdf. 2014.

[4]: T. Mitchell et al. Never-Ending Learning. https://www.cs.cmu.edu/~tom/pubs/NELL_aaai15.pdf. 2015.
