On the exploitation of Google's indexing of search results pages
Google's search results can be exploited to show "advertisements" (Ads) for illegal goods - typically drugs - and where to obtain them. This is achieved by manipulating how Google indexes websites, and in doing so it harms the reputation of the affected websites, which end up with these Ads shown against them.
This has the potential to affect any website, including yours. In this review we explain the nature of the exploit, its impact and what we have done to protect your website.
To ward off any sense of impending doom, we'll start off by emphasizing that your website has not been breached by this exploit.
That is to say, no "bad actor" (let's call them the Baddie) has gained access to your website or hosting infrastructure, nor to any of your protected data. Nor has the Baddie altered or stored any data on your website (we'll come on to how they show their Ads on your website shortly). This is not a vulnerability in the normal security sense; its damage comes from the impact it could have on your reputation. We know that your reputation is just as important to protect as your website security.
Because this exploit is not a vulnerability in the security sense, there was no announcement and release of a security patch to fix it. If there had been, we'd have been alerted to it via the security channels we monitor. Instead, this stems from a "product improvement" in Google's search indexing algorithm. As with many changes to a product, there can be unintended consequences, and this exploit is one such example.
We were alerted to the issue on the evening of Thursday 5 October by one of our clients, who in turn had been informed by a third party that their site was affected. Awareness of the issue was stimulated by an article on the Business Insider website on 28 September (their website was itself affected). Following an initial triage of the issue on Thursday evening we worked as a multidisciplinary incident team throughout Friday to identify the exposure to the exploit on our clients’ websites and implement mitigation strategies against it.
Before we take you through what we did, let's take a deep dive into how the exploit operates and its effects:
Google indexes webpages in a number of different ways. Typically, it will examine the content of a webpage and use all its secret sauce algorithms to index it so that it is shown when matched to searches people make on Google. It will also follow (crawl) any links it finds on a page and these might be to different websites. Even if one of those destination webpages tells Google not to crawl it (by using a "robots.txt disallow rule"), Google might still index it. It will respect the disallow rule, so won't index the content of the page; however, the Google search result for the page would still show its URL. This has long been the case and is the first essential element in the exploit.
The second essential element to make the exploit work is a relatively new change to how Google indexes pages that have a "query string" in their URL.
An aside on query strings if you need a refresher:
A query string can appear at the end of a URL. It's the part after a "?" and is made up of one or more parameter-value pairs.
For example, if you visit the blog index page on our website which has the URL torchbox.com/blog/ you'll be shown all our various blogs. The page allows you to filter the blogs to a category of interest by clicking a link on the page that goes to the same page but adds a query string to the URL. The "Wagtail" link, for example, has the URL torchbox.com/blog/?filter=wagtail-cms and clicking on it results in the page showing only blog posts related to Wagtail. Here, "filter" is the query parameter and "wagtail-cms" is the query value. When your browser sends the request for the page to the web server this parameter-value pair is sent as well. When our website generates the list of available blogs it recognises the "filter" parameter and uses its value to modify the content of the page it sends back to the browser in response. In this case, if no filter value is supplied then all blogs are listed, whereas if the filter value is "wagtail-cms" the webserver only lists Wagtail-related blogs (by looking up which blogs have been assigned to that category in the website's CMS).
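To make that concrete, here's a minimal Django-style sketch of how such a page might read the query string. The view and model names are hypothetical rather than our actual implementation; the point is that the webserver decides which parameters it recognises and quietly ignores everything else.

```python
from django.shortcuts import render

from blog.models import BlogPost  # hypothetical model


def blog_index(request):
    # Django parses the query string into request.GET, e.g.
    # /blog/?filter=wagtail-cms  ->  request.GET["filter"] == "wagtail-cms"
    category = request.GET.get("filter")  # any other parameter is simply never read

    posts = BlogPost.objects.all()
    if category:
        # An unknown category value matches nothing, so the list comes back empty
        posts = posts.filter(category__slug=category)

    return render(request, "blog/index.html", {"posts": posts})
```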
Query strings are used to dynamically generate the content of a page by directing the webserver as to how it should behave in response to the parameters and their values.
Query strings are not an inherent part of a web page. Any query string can be added to a URL by anybody. It is up to the webserver to determine whether it recognises a particular parameter-value pair and how to use it. In our example, the blog index page on our website generates its own query string links to each blog category that it knows about. However, in my browser I could change the query string to modify the parameter or the value (or both), for example: torchbox.com/blog/?category=wagtail-cms or torchbox.com/blog/?filter=google-search. The former would result in all blogs being listed because the webserver doesn't recognise "category" as a parameter so it's ignored, and the latter would result in no blogs being listed because there is no "google-search" category.
Using a query string with a recognised query parameter and a query value that anyone can choose gives the exploit its route to taking advantage of Google's change to its indexing: URLs with query strings are now much more likely to be added to the index as stand-alone search results. Previously, they were mostly ignored.
For this reason, when a Baddie puts a query string link on a website that they control pointing to a page on your website, and Google crawls the Baddie's page, it may also index the linked URL on your website - query string and all. As a result, Google's search results include your website - with its high reputation - showing an Ad for the Baddie's wares. This is certainly an unintended consequence of the change, but a consequence nonetheless.
A typical sort of link used in this exploit looks like this:
torchbox.com/search/?q=Buy+Illegal+things+Online+bad-website
The "+" stands in for a space character in a query string, so this query string has one parameter called "q" with a value "Buy Illegal things Online bad-website". I leave it to your imagination as to what "Illegal things" gets substituted for in a real example. The "bad-website" would be a real website address or a username (handle) on an encrypted messaging service. This is the message the Baddie wants shown on your website.
How does this happen? The Baddie targets a page on your website that recognises the "q" parameter. Most websites have their own search page and it is common for them to accept a search query as a query string, and use it to display results based on the value the user has entered (when you type a query into a search box on the website and hit Search, the browser converts it into a query string and sends it to the webserver). It's also common for that search query to be displayed back to the user along with the search results, as a usability aid. You're probably familiar with seeing search responses like "Search results for [the text you entered]:" or "Your search for [the text you entered] did not return any results".
And there's the rub - the text of the query has been displayed back to you. That's useful, but as with most useful things, it can be used in bad ways: if the Baddie can get their link seen and clicked on, people will see the Baddie's message displayed on the search page. Not good.
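To illustrate, here's a minimal Django-style sketch of the kind of search view in question, with hypothetical helper and template names. Template auto-escaping stops the echoed text running as HTML, but it is still displayed verbatim on the page - which is all the Baddie needs.

```python
from django.shortcuts import render

from search.utils import search_pages  # hypothetical search helper


def search(request):
    query = request.GET.get("q", "")
    results = search_pages(query)  # however the site finds matching pages

    # The user's own text goes straight back to the template, which renders
    # something like "Search results for '{{ query }}':" above the results.
    return render(request, "search/results.html", {"query": query, "results": results})
```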
The message may not make much contextual or grammatical sense on the page (the webserver determines how it's displayed), but the Baddie doesn't care about that, nor does someone actively using Google to search for those illegal things - the hook-up has been made, using the website as the intermediary.
There isn't anything special about using a website's search results page for this exploit; it's simply the most common type of page where websites show user-generated content without any authentication. Website operators take care whenever user-generated content is shown to avoid issues like this one; search results pages, though, are a case where echoing the user's input back is genuinely helpful when the use is non-nefarious.
Just as you want your pages to come up at the top of Google's results for matching searches, the Baddies want their Ads to appear there too. What's a good way for them to achieve that? By leaning on what Google uses to rate websites: Expertise, Authoritativeness and Trustworthiness (E-A-T).
A website (or more accurately, the organisation that owns the website and the content it publishes) rating highly in the subject matter of the Ad is a more prized target because it may rank highly for Google searches that match terms in the Ad's message (as stored in Google's index as part of the query string on the website's URL).
This is why the exploit can result in high-profile law enforcement agency and hospital websites displaying Ads for drugs. The Business Insider article gave examples of the search pages being exploited on the websites of the Food and Drug Administration, Interpol, the United Nations as well as many well known universities, news organizations and nonprofits; frequently impacting websites whose purpose is to counter the trade in those illegal goods.
We've looked at the nature of the exploit and its impact, now let's turn our attention to what we've been doing to mitigate it for our clients' websites. Following our initial triage to understand how the exploit worked and the ways in which website search pages could be affected, we formed a two-pronged approach:
- Our Support, Systems and Development teams systematically worked through our client websites to identify search pages that were vulnerable to the exploit, apply remedial changes to block the exploit and deploy them.
- Our SEO team similarly worked through the websites to identify any domains that had already been exploited and take action to have the pages removed from Google's search index.
This combination was designed to remove existing exploits and prevent new occurrences quickly, before reflecting on any longer-term impacts of the mitigation and any future refinements we, Google, or the community might advise.
The main remediation to block the exploit on website search results pages was to add the "noindex" HTML meta tag to the page. This tag directs Google (and other search engines) not to add the page to their index under any circumstances. Many of our client websites used the "robots.txt disallow rule" to tell Google not to crawl search results pages. As we've mentioned, this doesn't prevent Google from indexing search results pages in all cases. Counter-intuitive as it may sound at first, we also removed the disallow rule for search results pages from robots.txt - Google has to be able to crawl a page, otherwise it never sees the noindex tag.
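For illustration, the directive itself is a one-liner. We added it as a meta tag on the page; the Django-style sketch below (hypothetical view name) expresses the same instruction as an HTTP response header, which Google treats equivalently to the meta tag:

```python
from django.shortcuts import render


def search(request):
    query = request.GET.get("q", "")
    response = render(request, "search/results.html", {"query": query})
    # Equivalent to <meta name="robots" content="noindex"> in the page's <head>:
    # search engines that crawl the page are told not to add it to their index.
    response["X-Robots-Tag"] = "noindex"
    return response
```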
The consequence of these changes is that when Google next crawls the page it not only sees the noindex tag and doesn't index it, it also removes it from its index if it is already there. Without the page in the index, the link doesn't show up in Google's search results and the exploit is defused.
Google would only remove already exploited pages from its index the next time it tries to index them; however, in this kind of situation anything longer than "right away" is too long. Once our SEO team had identified exploited pages, they used Google's Removals and SafeSearch reports Tool via Google's Search Console to immediately request their removal from the index. We saw the removals happen almost instantly.
We carried out our investigation for our Hosting & Application Support clients (where we host the website), clients where we have developed the site but don't host it, and our Digital Marketing & SEO clients (where we don't host the website).
Of the 98 websites we investigated, six of those we host and eight of those we don't had been exploited.
Where we had the direct ability to apply the mitigations, we did so as swiftly as possible; where we didn't, we contacted the clients concerned to advise them of the actions to take. Most of this happened on the Friday following the notification the previous evening.
Since then we have been following up to get all of the vulnerable websites protected, consider any side effects of the mitigations we've made, and investigate what other measures we might put in place.
Some considerations we already have:
- The links that the Baddies craft still allow any message to be shown on a website's search page if the website is designed to echo back the text of a search query. Our mitigation prevents those links from showing up in a Google search, so they won't generally be seen any longer. The only way to completely prevent them from being shown is to ensure a website never echoes the text back; however, that would hurt usability for the vast majority of users, who are submitting valid queries.
- Using the noindex tag prevents some valid use cases a website can have for wanting its search page to be indexed, e.g. curated, filtered search results.
- A search page could use a "POST" request for searches, or only accept query strings when the referrer is the site itself (see the rough sketch after this list). This would help avoid this sort of exploit but brings its own complexities and limitations for SEO.
- Could a URL on the domain with a spam query string still appear in Google's search results even if the target page (any page) doesn't recognise the query parameter and is allowed to be indexed?
- Google is the biggest player in the search results arena, but are other search engines exploitable in the same way?
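On the referrer idea in the third point above, a rough sketch (Django-style, hypothetical, and not something we have deployed) might look like the following. Part of the complexity is that the Referer header is easily spoofed and often stripped by browsers, so it can only ever be a soft check.

```python
from django.shortcuts import redirect, render


def search(request):
    query = request.GET.get("q", "")
    referrer = request.META.get("HTTP_REFERER", "")

    # Only honour query-string searches that originated from our own pages;
    # a crafted external link gets bounced to an empty search form instead
    # of having its text echoed back.
    if query and not referrer.startswith("https://torchbox.com/"):
        return redirect("/search/")

    return render(request, "search/results.html", {"query": query})
```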
We'll worry about these things so that you don't have to (too much), and update you on our recommendations wherever relevant.