Team EaSE Article: .htaccess and robots.txt files
Last month, in our article SEO, Joomla! and your Template we explained that before visiting the pages on a web site, search engines (for instance Google and Yahoo) check for the presence of instructions about the web site on its server. These instructions can be given by two files that are distributed with Joomla! and are normally found in the top level of your Joomla! installation: htaccess.txt and robots.txt.
Both of these files can be used for security purposes: they can forbid some pages from being indexed by a search engine or prevent access to specified pages, files and folders.
What is the difference between these two files?
The .htaccess file "speaks" to the Apache server and gives it instructions, whilst the robots.txt file tells search engines which pages or folders may be indexed. The .htaccess file can be more restrictive than the robots.txt file, because it can forbid search engines access to some or even all of your pages.
.htaccess (hypertext access)
Why is there a dot in front of htaccess? On Unix-like systems, a leading dot marks a file as hidden, so it does not show up in ordinary directory listings. Apache looks for per-folder instructions in a file with exactly this name, and its default configuration typically refuses to serve files whose names begin with .ht to visitors. Until we rename the Joomla! htaccess.txt file to .htaccess, it is not enabled.
On every request, Apache looks for an .htaccess file in each folder along the path to the requested file. Because this lookup slows the server down, .htaccess files should be used sparingly. When an .htaccess file is enabled, we use it to make our sites more secure and for on-page SEO purposes. More generally, an .htaccess file lets a site override parts of the main server configuration for its own folders.
Restrict, Rewrite, Redirect
Using .htaccess you can control access, redirect to error pages and control URL behaviour by giving directives to the server. Remember, though, that we can only use directives that the server administrator allows.
We can create the .htaccess file by renaming the htaccess.txt file located in the top level of the Joomla! installation. To do this, first download htaccess.txt, edit it with a plain-text editor such as Notepad and save it as plain text. Using an FTP client, upload the file in ASCII mode and rename it on the server to .htaccess. Remember that the leading dot is part of the file name, and that the name must not end in .txt. Permissions for the file should be no more permissive than 644.
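To make these three uses concrete, here is a minimal sketch in the Apache 2.2-style syntax used throughout this article. The file names and the example.com address are hypothetical, and each directive only takes effect if the server administrator permits it in .htaccess files.

```apache
# Control access: refuse direct requests for one file (hypothetical name)
<Files "install.log">
Deny from all
</Files>

# Error pages: show a custom page when a URL is not found (hypothetical path)
ErrorDocument 404 /errors/404.html

# URL behaviour: permanently redirect an old address to a new one
Redirect 301 /old-page.html http://www.example.com/new-page.html
```

On Apache 2.4, "Require all denied" is the equivalent of "Deny from all".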
Now that we’ve explained how to edit this file, what are some things that can be done with it?
Search Engine Friendly URLs
By default, Joomla! URLs are long and are constructed from parameters such as the component name, item id, etc. Using Joomla!’s built-in search engine friendly URL system, we are able to construct URLs that contain keywords, are shorter, and make more sense to a site visitor. This may also impact our search engine ranking.
When planning to use search engine friendly URLs, we recommend reading the article Joomla! v 1.5 Search Engine Friendly URLs (SEF URLs) by Benjamin Hättasch, which explains how to edit .htaccess under several server environments. With most Linux hosting plans, there should not be any problems with the default Joomla! .htaccess file. Difficulties can be caused by customized server configurations, and Benjamin’s article explains how to troubleshoot those.
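For orientation, the rewriting behind such URLs usually follows the pattern sketched below: any request that does not match a real file or folder is handed to index.php, which then works out what to display. This is a simplified illustration written for this article, not a copy of the htaccess.txt distributed with Joomla!, so on a real site start from the distributed file. The RewriteBase path is an assumption.

```apache
# Requires mod_rewrite to be available on the server
RewriteEngine On

# Uncomment and adjust if the site lives in a subfolder (path is an assumption)
# RewriteBase /joomla

# Leave requests for files and folders that really exist untouched
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d

# Hand every other request to the front controller
RewriteRule .* index.php [L]
```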
Restricting Access to Files
Simply renaming htaccess.txt does not do a complete security job. For better security, we can add several directives to .htaccess (after the RewriteEngine On directive):
<Files ~ "\.xml$">
Deny from all
</Files>
In the code above, we have restricted access to any Extensible Markup Language (XML) file on our site. We suggest doing this because, in Joomla!, XML files often contain information we don’t want a malicious user to see and exploit. For instance, the installation information for the extensions we’ve used on our Joomla! site may contain their version numbers. To restrict access to one type of file we use the Files directive. If we want to forbid access to several types of files, we can use the FilesMatch directive:
Deny from all
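As a minimal sketch of a complete FilesMatch block: the regular expression below matches any file name ending in one of the listed extensions. The extensions chosen here (xml, ini, log) are illustrative assumptions, so adjust the list to the files you actually want to hide.

```apache
# Refuse requests for several file types with one regular expression.
# The list of extensions is an assumption made for illustration.
<FilesMatch "\.(xml|ini|log)$">
Deny from all
</FilesMatch>
```

Keep in mind that a pattern like this also blocks files you may want to stay public, such as an XML sitemap.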
Restrict Access to the Administrator Page
Using .htaccess, we can make it harder for someone to visit the administrator page of our site and attempt a brute-force attack to ‘crack’ the username and password. To do this, we restrict access to the administrator page with these directives:
Deny from all
AuthName "htaccess password prompt"
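A password prompt of this kind can be sketched as follows. We assume the directives sit in an .htaccess file inside the administrator folder; AuthType Basic and Require valid-user are standard Apache directives added here to make the sketch complete, and the AuthUserFile path is the custom password-file location used in this article.

```apache
# Assumed location: administrator/.htaccess
AuthType Basic
AuthName "htaccess password prompt"

# Password file kept above the public root
AuthUserFile /home/restrictions/.htpasswd

# Any user listed in the password file may log in
Require valid-user
```

With this in place, the browser asks for a user name and password before Joomla!'s own administrator login form is even displayed.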
Next, we need to create an .htpasswd file. We can use http://www.askapache.com/online-tools/htpasswd-generator/ to get the code we need and to save it in a file called .htpasswd. It is very important to protect that file by putting it in a folder that is above the public root (on most servers, the public root is called something like “public_html” or “www”). In our example code above, we’ve specified a custom location: /home/restrictions/.htpasswd
robots.txt
The robots.txt file is used to give spiders (automatic programs that ‘crawl’ the internet, adding pages to search engines) instructions about which folders and files should not be indexed. Some spiders ignore these directives and may even be trying to cause problems by consuming bandwidth, so use the .htaccess file to forbid those spiders from visiting the site at all.
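One common way to turn away such spiders is to match the User-Agent header they send, sketched below with mod_rewrite. The two bot names are hypothetical placeholders; replace them with names taken from your own server logs.

```apache
RewriteEngine On

# Match the User-Agent header, ignoring case (both names are hypothetical)
RewriteCond %{HTTP_USER_AGENT} (BadBot|EvilScraper) [NC]

# Answer matching requests with 403 Forbidden
RewriteRule .* - [F]
```

A spider can disguise its User-Agent, so this stops careless robots rather than determined ones.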
Last month we extensively covered what to add and remove in a robots.txt file, but we’d like to share a few tips:
- Use one directive per line
- This file is case-sensitive; that is, uppercase or lowercase characters in names do matter
- Do not use comments