Search Engines: The Ignored Menace. (Translator's note: no, it's not a movie.)
Author: Paul Heely, February 5, 2001.
Translator: Val2, January 5, 2002, from English into French.
¤¤ If you are interested, I will finish the translation next weekend, or tonight if you mail me... ¤¤

.:: I. Translator's notes ::.

Reading this text will not only show you how to find a vulnerable server; it also covers how search engines work, something people often wonder about without looking any further... The preamble and the introduction are not required if you want to go straight to the practical part, but I still advise you to take your time reading this document. NB: I chose not to use any accents in this translation, to avoid any compatibility problems. Enjoy.

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

.:: II. Glossary ::.
.:: SE == Search Engine ::.
.:: DB == Database ::.

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

.:: III. Preamble ::.

Search engines make navigating the Internet easier. They allow simple queries against huge databases, letting users find any information on any subject. Yet this same technique poses a threat to many servers: this document explains the rather dangerous possibilities these search engines offer. We will start by examining how they operate, then see how their databases can be exploited to find vulnerable servers. Beyond directly vulnerable sites, we will also see how to obtain confidential, or even sensitive, information. We will finish with some guidance on how to protect parts of a site by limiting what the SE is allowed to access.

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

.:: IV. Introduction ::.

If you noticed that a user was continuously visiting, indexing and cataloging all the data available on your web site, would you be concerned? Many admins would be if the user came from dialup.1fournisseur.com. But most of those same admins would not be if the requests came from an SE... The user and the SE perform the same operations, yet the perceived risk is different. Search engines are an important part of the Internet; sites advertise themselves through them, submitting their URLs to the SEs. The problem is that the SEs will access every page they possibly can, without necessarily having been asked to... Unfortunately, many admins do not take the resulting risks seriously.

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

.:: V. Search Engines ::.

The term "search engine" is often used for every type of search technology. In this document, we will talk about Google-type SEs: Google really crawls the web, opening every link it finds, and so indexes by keyword every page it knows about, unlike Yahoo, which is updated by hand and sorts sites into categories. This difference is very important. The problem lies in the fact that the robot (such as Google) will index ALL the links it traverses, even those that would seem useless to a human: among them, CGI scripts (e.g., site-voting scripts, login scripts)...
Most of these scripts will return an error (the robot passes no values to them), and that error will serve as the page description stored in the SE's DB. Google supports a powerful query syntax: besides "X and Y but not Z", you can require that a given word appear in the URL, which makes it possible not only to find sites running a specific script, but sometimes also to find interesting information (e.g., asking for URLs that contain 'password' or 'userinfo'...).

A. Vulnerable scripts.

They are a huge problem: in February 2001, they were ranked as problem number 2 on the SANS Top Ten list. More than 750 BugTraq messages contain the word 'cgi'. The number of vulnerable scripts, combined with the capabilities of the SEs, is very dangerous. For this example, we picked two fairly well-known CGI scripts and used Google.

The first is Htgrep. On its official site, you can read: "Htgrep is a CGI script written in Perl that allows you to query any document accessible to your HTTP server on a paragraph-by-paragraph basis." The flaw reported on Bugtraq, present in versions <= 2.4, makes it possible to read any file on the server. Every page displayed by Htgrep contains the phrase "generated by htgrep", followed by the version. A Google search for that phrase returns 106 results. A search for +"generated by htgrep" -"v3.2" (the result must not contain "v3.2") returned 98 pages. None of those servers had the latest version, so they were vulnerable; I will let you imagine the consequences... Obviously, this works with other buggy CGIs too: here are the results of a search for the buggy versions of phf and PHP/FI:

Search string     Hits
cgi-bin/phf       229
cgi-bin/php       400

B. Sensitive information.
[next week]
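To make the robot behaviour described in section V concrete (the spider follows every link it finds and indexes whatever comes back, even a CGI error message), here is a minimal, hypothetical sketch in Python. The start URL, the keyword extraction and the in-memory index are illustrative placeholders, not a description of how any real engine works.

# Hypothetical toy spider: fetch a page, index its words, follow its links.
# It deliberately makes no distinction between "real" pages and CGI URLs.
import re
import urllib.request
from collections import deque
from urllib.parse import urljoin

def crawl(start_url, max_pages=50):
    index = {}                      # keyword -> set of URLs (the "DB")
    queue = deque([start_url])
    seen = set()
    while queue and len(seen) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        try:
            with urllib.request.urlopen(url) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue                # unreachable page: skip it
        # Index every word, even if the "page" is only a CGI error message.
        for word in re.findall(r"[a-zA-Z]{3,}", html.lower()):
            index.setdefault(word, set()).add(url)
        # Follow every href found, including links into cgi-bin/ or log dirs.
        for link in re.findall(r'href="([^"]+)"', html, re.IGNORECASE):
            queue.append(urljoin(url, link))
    return index

if __name__ == "__main__":
    db = crawl("http://www.example.com/")      # placeholder start page
    print(sorted(db.get("htgrep", set())))     # which pages mention 'htgrep'?

A real engine adds politeness delays, robots.txt handling and ranking, but the core behaviour is the same: follow everything, index everything.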
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

[ORIGINAL TEXT]

Abstract:

Search engines simplify the process of locating information in cyberspace. They allow simple queries against massive databases of information, allowing users to find information on any topic. The same mechanism that provides this great searchability also presents a threat to many organizations. This paper will explore how the innocuous search engine can, in fact, be quite dangerous. We begin with a look at the basic operation of a search engine. We then look at how the contents of a search engine's database can be used by a malicious user to locate web sites vulnerable to attack. In addition to sites vulnerable to direct attack, we also look at how sites may be leaking sensitive information to the world through an entry in a search engine's database. We conclude with a discussion of controls that can be deployed to help limit what parts of a site are searched and cataloged by a search engine robot.

Introduction:

If you knew that a user was systematically visiting, indexing and cataloging every page and resource you have available on the Internet, would you be concerned? Many organizations would be if the user were coming from dialup.someisp.com. Many of these same organizations do not give it a second thought when a search engine robot comes along and does the same thing. The user and robot exhibit the same behavior, but the perception of risk is different between the unknown user cataloging an entire web site and the well-known search engine doing the same thing.

Search engines are an ingrained part of the Internet. Organizations rely on search engines to make their products, services, and resources available to anybody looking for them. Organizations will submit their URLs to search engines asking them to come and catalog their organization's web site. The danger is that search engine robots will access every page they can, unless directed otherwise. Unfortunately, many organizations never consider putting these controls into place.

Search Engines:

The term "search engine" is often used to refer to many different searching technologies. In the context of this paper, search engines are differentiated from directories in the following way: search engines are autonomous entities that crawl the web, following the links that they find and creating an index of searchable terms; www.google.com is an example of a search engine. Directories are lists of sites maintained by a human being. The entries in directories, and in some cases the searchable descriptions, are not maintained automatically and require human action to update; www.yahoo.com is an example of a directory [1].

The distinction between search engines and directories is an important one. People maintain the data in a directory: someone submitted a particular URL for inclusion in the directory, and someone most likely wrote the description, i.e., the searchable keywords, for the submitted URL. With a search engine, the robot, also called a spider or crawler, reads each page it visits and then follows every link contained within that page. The data it reads is handed off to the indexer, which builds a searchable list of terms that describe the page.

The robot's systematic following of every link it encounters is where the problem lies. A human writing a description of a site for a directory probably wouldn't consider log files and CGI programs as being important to the description. A robot, on the other hand, will happily traverse into the log file directory if it is accessible from another page. A robot will happily try to follow a link that points to some CGI program or database interface program. These URLs and the "contents" of the URL, even if they are just error messages generated by a CGI program that was called with bad inputs, will be sent back to the indexer. The indexer builds the database of keywords that a user searches against.

Many search engines provide a powerful language for constructing search queries. Google, www.google.com, allows a user to search not only by the common "X and Y but not Z", but also to specify that a search term must be contained in the URL. It also allows searches to be narrowed by domain names; anything from abc.xyz.com to just .com is a valid parameter.

Unfortunately, many organizations do not realize that search engine robots and indexers behave in the above manner. More importantly, organizations do not realize the kinds of information they are making available and just how easy it is to find that information. This information can be broken into two categories: CGI vulnerabilities and information leakage.

CGI Vulnerabilities:

The term CGI is used here to include CGI programs, server side includes, applets, servlets, and database front ends. CGI vulnerabilities are a huge problem. In February 2001, CGI problems are listed as number 2 on the SANS top ten list [2]. There are over 750 messages in the Bugtraq archives that contain the word "cgi" [3]. The Common Vulnerabilities and Exposures dictionary contains 150 entries or candidates that contain the term "cgi" [4]. The wide range of CGI vulnerabilities, combined with the searching capabilities of a search engine, makes a lethal combination.
To illustrate the actual danger, two well-known CGI vulnerabilities were chosen from the above sources. Google was used in an attempt to locate sites that contained these vulnerable programs [5].

The first program chosen was htgrep. From the Htgrep website [6]: "Htgrep is a CGI script written in Perl that allows you to query any document accessible to your HTTP server on a paragraph-by-paragraph basis." Htgrep has been reported on Bugtraq and has CVE entry CAN-2000-0832. The problem with htgrep is that versions <= 2.4 allow an attacker to read any file on the remote system that is readable by the user that the web server runs as, typically nobody, or on poorly configured sites, root.

A little investigation revealed that pages generated using htgrep contain the text "generated by htgrep". A Google search for "generated by htgrep" returned 106 hits. Htgrep also puts the version number in the "generated by htgrep" message. The search was then refined as:

+"generated by htgrep" -"v3.2"

This resulted in 98 hits. With about 5 minutes of work, almost 100 sites running known vulnerable software were identified.

Using htgrep, an attacker could read /etc/inetd.conf, assuming Unix or Linux of course. Having the inetd.conf file, the attacker would now know some of the network services being offered by the host and could possibly tell if access mechanisms like TCP wrappers are being used. On a Solaris system, an attacker could not only find out if vulnerable RPC services like sadmind and ttdbservd are running, but by knowing what other files to read from the file system, e.g., /var/sadm/pkg//pkginfo, it is possible to determine if those services are the latest patched ones or older vulnerable ones. Using htgrep to retrieve files will be logged on the remote system if the site is logging HTTP requests. The attacker still has an advantage: an organization is unlikely to monitor its logs closely enough to catch this attack while at the same time having a vulnerable CGI program installed.

Htgrep is certainly not the only vulnerable CGI program that can be located with a search engine. Two more common CGI vulnerabilities, phf and PHP/FI, returned the following search results [2]:

Search String             Number of Hits
allinurl: cgi-bin/phf     229
allinurl: cgi-bin/php     400

No claim is being made that all hits resulting from these searches will result in exploitable programs being found. Instead, the claim is being made that an attacker can quickly, and with a low probability of detection, identify a group of hosts that have a much higher probability of being vulnerable than if the attacker performed a random brute force attack.
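The banner matching behind this search takes only a few lines to automate. As a minimal illustration, here is a hypothetical Python sketch of the kind of check an administrator could run against their own pages before a search engine (or an attacker) does it for them; the URL list and the "current" version string are placeholders.

# Toy fingerprint check: does one of our pages expose an htgrep version banner?
# The URL list and the assumed-current version string are placeholders.
import re
import urllib.request

PAGES = [
    "http://www.example.com/cgi-bin/htgrep/file=index.html",
    "http://www.example.com/docs/search.html",
]
CURRENT_VERSION = "v3.2"    # version treated as up to date in this illustration

banner = re.compile(r"generated by htgrep\s*(v[\d.]+)?", re.IGNORECASE)

for url in PAGES:
    try:
        with urllib.request.urlopen(url) as resp:
            html = resp.read().decode("utf-8", errors="replace")
    except Exception as err:
        print(f"{url}: not reachable ({err})")
        continue
    match = banner.search(html)
    if not match:
        continue
    version = match.group(1) or "unknown version"
    flag = "" if version == CURRENT_VERSION else "  <-- check this host"
    print(f"{url}: generated by htgrep {version}{flag}")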
The preceding examples illustrate how CGI programs that are known to be vulnerable can be used to attack a site. A site could stop a search engine robot from looking for known vulnerable CGI programs. This measure would not stop the threat, however: search engine robots have already found and indexed vulnerabilities that have not yet been discovered. On December 19, 2000, a message was posted to Bugtraq with a subject of "Oracle WebDb engine brain-damagse" [7]. The author of the message goes on to explain, in detail, a vulnerability found in Oracle databases that is remotely exploitable through the web. Also included in the message was how to identify vulnerable sites with a simple Google search. A search performed the same day the message was posted returned 67 hits.

It is frightening to think that an attacker seeing the same message was able to identify 67 sites with a known vulnerability simply by making a search query. Even more frightening is that at the time of this vulnerability's release there was no fix available for the problem. On February 9, 2000, a Bugtraq message titled "ASP Security Hole" discussed a similar problem with Active Server Pages [8]. The preceding examples show the danger search engines pose in making vulnerable CGI programs, known and as yet unknown, easy to locate. No one is going to notice that an attacker is searching Google for all dot-com sites running a particular CGI program.

Information Leakage:

Information leakage is the other major problem that search engines can cause for an organization. Organizations are continually making resources web accessible. In many cases, these organizations only intend for internal users to access these resources. Many organizations do not password protect these resources. Instead, they rely on security through obscurity to protect them. Organizations will create web pages that they think have no links pointing to them, or that are in some obscure directory that they think no one would look in. What they forget is that search engines will catalog and index log files and software just as readily as they catalog and index widget pages.

To help emphasize the risks that leakage poses, we will examine two scenarios. First we will look at the risks of relying on an obscure URL to protect a resource. Second, we will look at the dangers of making log files, and associated reports, publicly accessible and searchable.

Imagine a large organization that purchases a commercial copy of SSH. Not everybody needs to use SSH and there is no centralized way to distribute it to those who do need it. It is decided that the SSH package and license file will be given a URL of abc.com/ssh, and that this URL will be given to those who need to use the package. At the same time, it is decided that since this is a licensed commercial package it should not be publicly accessible. Because of this, it is decided that no link to the SSH package will be placed on any of the organization's web pages.

This scheme in and of itself will not cause the SSH page to be cataloged by a search engine. As long as the organization does not create a link to the SSH page, the search engines have no way to get there. The problem is that the organization cannot control who does create a link to the SSH page. Sally in Accounting could be worried that she will not be able to remember the URL, so she adds a link to the SSH page on her personal home page. Along comes the search engine robot: it finds Sally's page and follows all the links, including the one to her organization's licensed copy of SSH. Now that the page has been indexed and cataloged, the whole world can find it by searching for "abc.com SSH." Chances are that making a licensed commercial software package publicly available will irritate your software vendor at the very least.

In the above scenario, SSH could just as easily have been replaced with any other software: an organization's beta code for a new product, or internal documents. The point is that security through obscurity fails very quickly when applied to the web. A much better solution is, at a minimum, to use domain restriction mechanisms or password protected web pages.
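For example, on a web server such as Apache, both of those controls can be expressed in a few lines of configuration. The snippet below is only an illustrative sketch: the paths, domain and realm name are placeholders, and the directives assume the classic Apache 1.3-style mod_access and mod_auth modules.

# Hypothetical .htaccess for the /ssh directory (Apache 1.3-style syntax).
# Paths, domain and realm name are placeholders.

# 1. Domain restriction: only hosts inside abc.com may fetch the files.
Order deny,allow
Deny from all
Allow from .abc.com

# 2. Password protection: require a valid account as well.
# (With the default "Satisfy All", both checks must pass.)
AuthType Basic
AuthName "Licensed software - staff only"
AuthUserFile /usr/local/apache/conf/ssh.htpasswd
Require valid-user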
Log files are another form of information leakage that presents a large problem. A simple search for "access_log" showed about 1000 hits from one search engine. Different servers use different names for their default log files, and in many cases the user can configure the filename to be whatever they want. It is just a matter of being creative enough to search for the correct phrase. Many organizations use software to process raw log files into more usable reports. A common package used for this is WebTrends, www.webtrends.com. Searching for:

"this report was generated by webtrends"

generated over 5000 hits.

Information contained in both formal reports and raw log files can be detrimental to an organization. A report like the one produced by WebTrends contains a lot of useful information. A typical report will contain [9]:

- Top 10 requested pages.
- Top 10 countries requesting pages.
- Activity levels by day of week and time of day.
- Top 20 referring sites.
- Top 10 search engines that users used to find an organization's web site.
- Top 10 search keywords.

The web server administrator probably does not care that this information is available. In fact, they may knowingly make it widely available. The availability of this information, however, can represent a violation of confidentiality. Recall that the goal of confidentiality is to keep sensitive data from being disclosed. In the case of a business, sensitive data includes any data that may give a competitive advantage to a competitor [10]. Business and marketing people will agree that dissemination of this kind of information to a competitor could help them gain a competitive advantage.

Access to an organization's raw log files also poses a threat if that organization allows submission of web forms using the GET method. With the GET method, data in the form is submitted by appending it to the URL. Most people have seen URLs in their browsers that look like this:

http://www.abc.com/cgi-bin/program.cgi?var1=val1&var2=val2

This URL is executing the program called program.cgi on the abc.com website and is passing in two variables, var1 and var2, with values val1 and val2, respectively. This entire URL, including the parameters to program.cgi, will be written to the access log file. In many instances, the parameters and their values are of no consequence to anyone. In some poorly designed web pages, though, user authentication information will be transmitted in this way. Consider the following URL:

http://www.stocktradingcompany.com/cgi-bin/login?user=smith&pass=1234

If somebody were to find the access log with this entry, they could log in as smith and do all kinds of fun things. In this case, SSL will not help to protect the data submitted with the GET mechanism: the request is first decrypted, then written to the log file. A better solution in this case would be to use the POST method for form submission [11].
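One way to see whether this is already happening on a site you run is to scan the existing access log for query strings carrying credential-like parameter names. The sketch below is a minimal, hypothetical Python example: the log path, the assumed log format and the list of suspicious parameter names are placeholders to adapt.

# Toy scan of a web server access log for credential-like GET parameters.
# Log path, expected log format and keyword list are placeholders.
import re
from urllib.parse import urlparse, parse_qs

LOG_FILE = "/var/log/httpd/access_log"          # adjust for your server
SUSPICIOUS = {"user", "username", "pass", "passwd", "password", "login"}

# Matches the request part of a common-log-format line: "GET /path?query HTTP/1.0"
request_re = re.compile(r'"(?:GET|HEAD)\s+(\S+)\s+HTTP/[\d.]+"')

with open(LOG_FILE, errors="replace") as log:
    for lineno, line in enumerate(log, 1):
        match = request_re.search(line)
        if not match:
            continue
        query = urlparse(match.group(1)).query
        params = parse_qs(query)
        leaked = SUSPICIOUS & {name.lower() for name in params}
        if leaked:
            print(f"line {lineno}: possible credentials in URL "
                  f"(parameters: {', '.join(sorted(leaked))})")

Any hit means the form involved should be moved to the POST method, and the credentials already written to the log should be treated as exposed.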
Controlling Search Engine Robots:

No organization with a web presence can afford to block search engine robot access to its web resources. Instead, organizations should evaluate what should and should not be made available through the Internet. A mechanism exists to restrict what a robot accesses and what a robot will attempt to send back to the indexer. This mechanism is the robots.txt file. The robots.txt file is literally a file with a local URL of /robots.txt. This file allows a site to specify sets of rules that define what resources a robot is forbidden to access. The format of the robots.txt file is defined in [12].

NOTE: there is no guarantee that a robot will obey the contents of the robots.txt file, though most search engines claim that their robots will.

The following examples illustrate how the robots.txt file can be used to specify what a robot should not access.

The following robots.txt states that no robot should visit any URL starting with "/private/data", "/logs", or "/software":

# this is a comment
User-agent: *
Disallow: /private/data
Disallow: /logs
Disallow: /software

This robots.txt states that no robot should visit any URL starting with "/data", except the robot called "nicerobot":

# robots.txt
User-agent: *
Disallow: /data

# nicerobot can look in /data
User-agent: nicerobot
Disallow:

Finally, this robots.txt states that no robot should visit any URL at this site:

# disallow all access by all robots
User-agent: *
Disallow: /

Conclusion:

The Internet has become a vast expanse of usable resources. Attempting to find the resources that one needs requires some searching mechanism to be in place. Any company trying to have a web presence wants to be at the top of the search list when someone searches for its type of products and goods. Many companies do not realize what other data they are making available, or how that data may be used against them. Hopefully, this paper will serve to increase awareness of the risks posed by allowing search engine robots free rein of our web sites.