70% of the internet isn’t there, and the useful internet is smaller than we think.

As part of our ongoing AI-driven economic and social research at Glass, our AI is continuously reading the internet to discover new businesses, understand the activities they are involved in and map those activities into their appropriate industries. We are digitally mapping the world’s economy. As well as enabling insights into markets and industry sectors, our robo-researcher has uncovered interesting insights about the makeup of the internet itself. Here is what we found:

According to the recently published Q4 2018 Verisign Domain Name Industry Report, there are 348.7 million domain names registered across the globe. The Glass AI intelligent web researcher has read just over half of those domains. What it discovered is that the vast majority of registered domains are not linked to regular working websites. Whilst a large number of these are simply dead domains without an active website (44%), we also discovered that a large number of active sites are not providing web content.

Parked domains constitute 22% of the total and fall into several categories. Often a parked domain is simply used as a link farm to serve ads for other sites or improve a site's visibility to search engine algorithms. A parked domain may also simply be there as a means to try to sell the domain name itself, or the site's hosting company may be using a lapsed domain registration as an opportunity to advertise its own services. Finally, the site on a domain might be presenting a holding page whilst the site is being developed or is under maintenance. Sometimes these sites can be in this state for years!

(A) Dead site, (B) Link farm, (C) For sale, (D) Host services, (E) In development, (F) Redirects.

There are also several reasons why a domain may redirect to a site on a different domain (6% of total). For example, a company may have changed its name but still want to keep its old address active to point to its new location, or it could have been acquired, in which case the new parent company wants the old web address to link into its own content. Or the redirected domain may just provide a name that is easier to remember for accessing content within some other platform, in particular linking to a social media presence. Verisign’s own analysis shows that a growing number of domains are used for this purpose, increasing by up to 50% in the previous year for certain social media sites. There are also drivers from the complex waters of search engine optimisation, where an organisation wants to present different access points for its site. We have seen that this can sometimes be taken to extremes. For example, we discovered a consulting company promoting a methodology they called S.W.I.M that had hijacked the expired domains of hundreds of swimming clubs in order to try to expand its reach. It is hard to imagine that anyone following a link to their local swim club, only to be redirected to a consulting firm, would have been particularly pleased!

(G) Duplicate services, (H) Duplicate locations.

The other category is duplicate sites. This occurs when multiple domains contain exactly the same or very similar content. Again, the driver for this is search engine optimisation: trying to improve the presence of an organisation on search engines. Where exactly the same content appears, it’s a similar picture to when multiple domains are using redirects, but there are other, more subtle cases. One example is where an organisation wants to use different domains to highlight different services it offers. The content is very similar on each site, but the landing page may differ. Alternatively, an organisation may offer a service in different areas and have created a domain for each specific town or area. We have observed cases where a local plumber has hundreds of web domains, one for each town or village in the area they cover. The sites are the same with the exception of a focus on a different location.

So how big is the ‘useful’ internet?

With each of these “missing” segments covering large chunks of registered domains, that leaves under 30% of the internet delivering meaningful web content. We are talking about 100 million live websites. At Glass, our AI is tuned to identify business, non-profit, government and education websites with the aim of drawing insights on markets, social and economic activity. Based on the sites that have already been read and categorised by the AI, we estimate about one-third of the remaining live sites contain organisation content that fits into these categories. Other content currently ignored by our AI includes personal sites, blogs, and certain sorts of consumer-oriented sites. So, to digitally map the world’s economic graph, we estimate that we will need to read and understand approximately 32 million websites of which our AI system says 52% contain English language content. But as we’ve seen, even before we dig into the content that has been read, there are interesting things to be discovered about the structure of the internet itself. It appears the useful internet is smaller than we think!
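The arithmetic behind these estimates can be reproduced as a rough sketch. It uses only the figures quoted in this article; the variable names are illustrative, and because the share of duplicate sites is not stated here, the "live" share computed below is an upper bound (consistent with "under 30%") rather than an exact figure.

```python
# Back-of-the-envelope check of the figures quoted in the article.
# Assumption: duplicates are NOT subtracted, because their share is not
# given, so live_share is an upper bound on the "useful" share.

REGISTERED_DOMAINS_M = 348.7  # millions of domains (Verisign, Q4 2018)

# Shares of registered domains quoted in the article
MISSING_SHARES = {
    "dead": 0.44,
    "parked": 0.22,
    "redirect": 0.06,
}

ORGANISATION_FRACTION = 1 / 3  # "about one-third" of live sites
ENGLISH_FRACTION = 0.52        # share of organisation sites in English

live_share = 1.0 - sum(MISSING_SHARES.values())
live_sites_m = REGISTERED_DOMAINS_M * live_share
org_sites_m = live_sites_m * ORGANISATION_FRACTION
english_org_sites_m = org_sites_m * ENGLISH_FRACTION

print(f"Live share (upper bound): {live_share:.0%}")
print(f"Live sites: about {live_sites_m:.1f} million")
print(f"Organisation sites: about {org_sites_m:.1f} million")
print(f"English-language organisation sites: about {english_org_sites_m:.1f} million")
```

Under these assumptions the numbers land close to the rounded figures in the text: roughly 100 million live sites and roughly 32 million organisation sites.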

Sergi Martorell