Lee Kelleher’s Weblog

random posts on code, .NET, Umbraco and WordPress

Robots.txt for use with Umbraco

with 10 comments

I originally posted this on the Our Umbraco community wiki as “Robots.txt for use with Umbraco”. I am only posting it on my blog as a cross-reference. The Our Umbraco wiki version will evolve with the community’s experience and knowledge.

The Robots Exclusion Protocol has been around for many years, yet there are a lot of web-developers who are unaware of the reasons for having a robots.txt file in the root of their websites.

There have been many rumours about whether the bigger search engine crawlers (e.g. Googlebot) consider your website amateurish if it doesn’t have a robots.txt, and a badly handled robots.txt could lead to your site being invisible on SERPs.

If you are happy for a crawler to crawl/index all of your website’s content, then you can use the following:

User-agent: *
Disallow:
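As a quick sanity check, the allow-all rules above can be tested with Python’s standard `urllib.robotparser` module. This is only a minimal sketch: the helper name `is_allowed` and the sample paths are made up for illustration, and Python’s parser is not guaranteed to match every crawler’s behaviour.

```python
from urllib.robotparser import RobotFileParser

# The allow-all rules from above: an empty Disallow blocks nothing.
ALLOW_ALL = """\
User-agent: *
Disallow:
"""


def is_allowed(robots_txt: str, path: str, agent: str = "Googlebot") -> bool:
    """Return True if `agent` may fetch `path` under the given robots.txt text.

    `is_allowed` is a hypothetical helper for this sketch, not part of Umbraco.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    # Sample paths are assumptions, chosen only for illustration.
    for path in ("/", "/umbraco/umbraco.aspx", "/css/site.css"):
        print(path, is_allowed(ALLOW_ALL, path))
```

Every path should come back as allowed, including the back-end login page.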

However, when using Umbraco to power my websites, I prefer to define which folders are accessible to the crawler. Personally, I would not like the contents of my /umbraco/ folder to be returned in Google’s SERPs.

Here is an example of the robots.txt that I have used on several Umbraco-powered websites.

# robots.txt for Umbraco
User-agent: *
Disallow: /aspnet_client/
Disallow: /bin/
Disallow: /config/
Disallow: /css/
Disallow: /data/
Disallow: /scripts/
Disallow: /umbraco/
Disallow: /umbraco_client/
Disallow: /usercontrols/
Disallow: /xslt/

From my perspective, there is no reason for a search engine crawler to be crawling/indexing files from any of the above folders. You may have a different perspective, in which case you can amend your robots.txt accordingly.
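To see what these rules do to a given URL before deploying them, here is a small sketch using Python’s standard `urllib.robotparser`. The helper `blocked_paths` and the sample paths (`/about-us.aspx`, `/media/logo.png` and so on) are hypothetical, and the standard-library parser uses simple prefix matching, so treat the result as an approximation of how a well-behaved crawler would read the file.

```python
from urllib.robotparser import RobotFileParser

# The Umbraco rules from the example above, copied verbatim.
UMBRACO_ROBOTS = """\
# robots.txt for Umbraco
User-agent: *
Disallow: /aspnet_client/
Disallow: /bin/
Disallow: /config/
Disallow: /css/
Disallow: /data/
Disallow: /scripts/
Disallow: /umbraco/
Disallow: /umbraco_client/
Disallow: /usercontrols/
Disallow: /xslt/
"""


def blocked_paths(robots_txt: str, paths: list[str], agent: str = "Googlebot") -> list[str]:
    """Return the subset of `paths` that `agent` may NOT fetch.

    `blocked_paths` is a hypothetical helper for this sketch only.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [p for p in paths if not parser.can_fetch(agent, p)]


if __name__ == "__main__":
    # Sample paths are assumptions, not taken from a real site.
    sample = [
        "/umbraco/umbraco.aspx",
        "/xslt/menu.xslt",
        "/about-us.aspx",
        "/media/logo.png",
    ]
    for path in blocked_paths(UMBRACO_ROBOTS, sample):
        print("blocked:", path)
```

Only the paths under the listed folders are reported as blocked. Content pages and anything under a folder that is not listed remain crawlable.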

For more information about the robots.txt standard, please refer to the official website: http://www.robotstxt.org/robotstxt.html


Written by Lee Kelleher

July 7, 2009 at 4:10 pm

Posted in blog


10 Responses


  1. Hi,
    adding disallow sections to your robots.txt file can help hackers (and their robots) find where the admin section is and which CMS you are using.

    I use an empty file.

    Petr

    Petr Snobelt

    July 9, 2009 at 10:47 am

    • Hi Petr, thanks for your comment. It is appreciated, but I disagree with you – in a positive healthy way.

      IMHO, it doesn’t matter if a ‘hacker’ discovers which CMS you are using.

      I understand it’s a potential threat if there are known exploits for certain software (e.g. older versions of WordPress) – but mentioning it in your robots.txt won’t make any difference to a motivated ‘hacker’ – they’ll find out any way they can (e.g. by looking for ‘/wp-admin/’ or ‘/umbraco/umbraco.aspx’)

      … or even using a service such as builtwith.com

      What do you think about the “Umbraco powered sites” wiki page? Are the sites listed there at risk from ‘hackers’?
      http://our.umbraco.org/wiki/about/umbraco-powered-sites

      From my perspective, the robots.txt is all about what I want Google to index – or more specifically what I DON’T want them to index. For me (and my clients), the XSLT, User Controls, JavaScript and CSS (i.e. ‘code’) are all our IP, and we don’t want them to appear on publicly-accessible search engines.

      I am aware that there are many crawlers/search-bots that ignore the robots.txt protocol – but that is no reason to penalise the big boys (Google, Yahoo, et al)

      Cheers,
      - Lee

      Lee Kelleher

      July 9, 2009 at 11:03 am

    • Another example… see what happens if you Google this: “inurl:umbraco.aspx intitle:login”
      http://www.google.co.uk/search?q=inurl:umbraco.aspx+intitle:login

      You can exclude yourself from those results with robots.txt

      Lee Kelleher

      July 9, 2009 at 11:08 am

  2. My thoughts: in a site with umbraco, you most likely do not have a direct link to the backend. Google doesn’t know it then and doesn’t index it. No need to put it in the robots.txt.
    Same for xslt, usercontrols, data and a few others.
    If there are no links towards the files in those folders, Google will leave them alone.
    I do tend to agree with Petr here. I only list folders that do have links to them but that I do not want Google to index.

    PeterD

    PeterD

    July 9, 2009 at 11:30 am

    • One thing to remember: it’s not always about the links. Some browser-installed software, like the Google Toolbar (Alexa, et al), may add the URLs you visit to their crawl/index. So if you are in the Umbraco back-end, then that’s how it “could” be indexed.

      Lee Kelleher

      July 9, 2009 at 11:35 am

  3. There are quite a few sites out there that have a link to the backend. WordPress tends to do that as well. So, yes I think they are all linking somewhere to the backends.

    I’m pretty keen on my logfiles every now and then, and I tend to go through them thoroughly for a couple of hundred lines. I’ve never seen Google indexing any of my files in the xslt, data or config folders, simply because there is no link to them.

    PeterD

    July 9, 2009 at 11:54 am

  4. Hi Lee,
    none of my sites’ back-ends are in Google’s listings :-) So I’m happy with an empty robots.txt. Robots.txt is a well-known file; /umbraco/umbraco.aspx isn’t.

    Best practice for secure sites is not to give hackers more info than they need to know (OS, webserver version, CMS, exceptions etc.), if you can hide this information.

    Sometimes you use hosting and cannot restrict access to admin folders, and it is better if hackers can’t find them. I also think the umbraco folder name can be changed…

    Petr

    Petr Snobelt

    July 9, 2009 at 1:11 pm

  5. I’m a firm believer in doing whatever works best for you. It’s all about choice: you can decide what you want to show/expose in your robots.txt.

    It has been very interesting for me to hear your views about the use of robots.txt – I’d love to hear from other Umbraco devs about their experiences with it too!

    Lee Kelleher

    July 9, 2009 at 10:22 pm

  6. [...] a comment » Following up on my recent post of using Robots.txt with Umbraco, I decided that it would be nice to be able to edit the robots.txt directly from the Umbraco [...]

