What is a robots.txt?

callumacrae

not alex mac
Community Support
Messages
5,257
Reaction score
97
Points
48
I've heard that I need a robots.txt to get the whole of my site on Google and not just the first page, but what is a robots.txt and how does it work?
 

ttony21

New Member
Messages
7
Reaction score
0
Points
0
Here's an example of a robots.txt file: http://x10hosting.com/robots.txt

It's basically just a file that asks "robots" not to access certain pages on your site and add them to a search engine (assuming the robot complies with it).

A robots.txt file with no restrictions would just say this:
User-agent: *
Disallow:

That allows any robot to visit any file on your website (though I don't think you need a robots.txt for this; robots.txt is meant more for keeping robots away from pages that you don't want showing up in search results).

Edit: Just to clarify, a "robot" is a program that crawls around the internet, and robots are used for several different things. This is called web spidering, and search engines like Google use it to find new websites and add them. Technically you don't need to do any work to get a search engine to grab your website. That's why they came out with robots.txt: to create rules that web robots are SUPPOSED to obey.
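
If you want to check how a rule-following robot would read that allow-everything file, here's a rough sketch using Python's built-in urllib.robotparser. The bot name "ExampleBot" and the example.com URL are made up for illustration, and this only shows how one compliant parser reads the rules, not what any particular search engine does.

```python
from urllib.robotparser import RobotFileParser

# The "no restrictions" file from above: an empty Disallow blocks nothing.
ALLOW_ALL = """\
User-agent: *
Disallow:
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if a compliant robot may fetch url under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "ExampleBot" and the URL are hypothetical.
print(is_allowed(ALLOW_ALL, "ExampleBot", "http://example.com/any/page.html"))
```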
 

intenex

New Member
Messages
194
Reaction score
0
Points
0
Yeah, I don't think you need a robots.txt to let robots search your site... Google frankly doesn't care about your privacy =p. They crawled my site within minutes of me setting it up.

 

Sohail

Active Member
Messages
3,055
Reaction score
0
Points
36
Yeah it's simply a file that controls the way a "spider" crawls your website. But that's true, Google would probably ignore it anyway :p.
 

Smith6612

I ate all of the x10Pizza
Community Support
Messages
6,518
Reaction score
48
Points
48
Yeah, I don't think you need a robots.txt to let robots search your site... Google frankly doesn't care about your privacy =p. They crawled my site within minutes of me setting it up.

Oh, Google does care; their search engine just tends to be lazy/buggy sometimes. Besides using a robots file to tell bots what they can and can't look at and which bots can look around, there's also a way, if you know the syntax, to tell robots to wait a set number of seconds between requests (useful if your site is VERY busy and the bots are slowing you down).
 

ttony21

New Member
Messages
7
Reaction score
0
Points
0
Oh, lol, I realize the author of this post probably forgot that they posted this or something like that, but in case they do come back to look at it, I found something else interesting in the x10hosting FTP. The default robots.txt file for each user is this:

User-agent: *
Crawl-delay: 10

Notice the crawl-delay that Smith mentioned
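
For anyone curious how a bot actually picks that number up, here's a small Python sketch with the standard urllib.robotparser module (crawl_delay is available from Python 3.6). "ExampleBot" is just a made-up name, and whether a real crawler honors the value is entirely up to that crawler.

```python
from urllib.robotparser import RobotFileParser

# The default file quoted above.
DEFAULT_ROBOTS = """\
User-agent: *
Crawl-delay: 10
"""


def read_crawl_delay(robots_txt, user_agent):
    """Return the Crawl-delay that applies to user_agent, or None if unset."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent)


# "ExampleBot" is hypothetical; it falls under the "*" group.
print(read_crawl_delay(DEFAULT_ROBOTS, "ExampleBot"))
```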
 

Smith6612

I ate all of the x10Pizza
Community Support
Messages
6,518
Reaction score
48
Points
48
Yes! That's it. That's very useful if you have a very busy site with loads of bots popping in every second and you don't want them hogging resources. It's also a good idea to use the delay on free hosts with a massive number of accounts per server, as some search engines like Yahoo are known to crawl sites every second sometimes. Just last week Yahoo did that to my web server: every second for half an hour it was loading some page on one of the sites I host here. It wasn't a problem, as hardly anyone visits these sites, but if I hosted some busy sites, that'd be a pretty big problem.
 

masshuu

Head of the Geese
Community Support
Enemy of the State
Messages
2,293
Reaction score
50
Points
48
Also note, if you don't want a bot or anyone else accessing a directory,
like this one from the x10hosting file:
Code:
Disallow: /oldhidden
make sure that directory is not accessible to the general public either.
For example, if you actually go there you'll get an error:
http://x10hosting.com/oldhidden

I've seen some people add a Disallow to a robots file to keep robots from indexing critical directories, but you could still go to those directories and view them.
And as someone else said, some robots don't even look at the robots.txt, so they can still index the directory.
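
To show why a Disallow is only a request: the check happens inside the bot, not on your server. Here's a rough Python sketch of what a well-behaved bot does with that rule, using the standard urllib.robotparser module. "ExampleBot" and the example.com URLs are made up. Nothing here stops a browser or a badly behaved bot from requesting the same URLs, so the directory still needs real protection on the server side.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /oldhidden
"""


def polite_bot_may_fetch(url, user_agent="ExampleBot"):
    """Return True if a rule-following bot would request this URL."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


# The rule is a path prefix, so everything under /oldhidden is covered too.
for url in ("http://example.com/oldhidden",
            "http://example.com/oldhidden/notes.txt",
            "http://example.com/index.html"):
    print(url, "->", "fetch" if polite_bot_may_fetch(url) else "skip")
```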
 

AutoItKing

New Member
Messages
32
Reaction score
0
Points
0
I have actually found that Google follows the rules most of the time. But like everyone before me has said, robots are programs that crawl the web and find sites to add to their database of millions upon millions of already added sites.
 

callumacrae

not alex mac
Community Support
Messages
5,257
Reaction score
97
Points
48
Oh, lol, I realize the author of this post probably forgot that they posted this or something like that, but in case they do come back to look at it, I found something else interesting in the x10hosting FTP. The default robots.txt file for each user is this:

User-agent: *
Crawl-delay: 10

Notice the crawl-delay that Smith mentioned

Would the ten be seconds or milliseconds?

And I did remember, but I was trying to fix my site, which broke :)
 

ttony21

New Member
Messages
7
Reaction score
0
Points
0
It's meant to be 10 seconds (but that's really up to the bot to decide lol)
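
Here's roughly what "up to the bot" means in practice: a minimal Python sketch of a polite crawl loop that waits the Crawl-delay between requests. The crawl_politely function, its fetch and sleep parameters, and the "ExampleBot" name are all made up for this example; real crawlers do their own thing.

```python
import time
from urllib.robotparser import RobotFileParser


def crawl_politely(robots_txt, user_agent, urls, fetch, sleep=time.sleep):
    """Fetch each allowed URL, pausing Crawl-delay seconds between fetches.

    fetch and sleep are passed in so the loop can be tried without
    touching the network or actually waiting.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent) or 0  # None means no delay set
    fetched = []
    for url in urls:
        if not parser.can_fetch(user_agent, url):
            continue  # a compliant bot skips disallowed URLs
        if fetched:
            sleep(delay)  # wait before every request after the first
        fetch(url)
        fetched.append(url)
    return fetched


# Dry run: record the pauses instead of really sleeping.
pauses = []
crawl_politely("User-agent: *\nCrawl-delay: 10\n", "ExampleBot",
               ["http://example.com/a", "http://example.com/b"],
               fetch=lambda url: None, sleep=pauses.append)
print(pauses)
```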
 

Zdroyd

New Member
Messages
548
Reaction score
0
Points
0
Yeah, I don't think you need a robots.txt to let robots search your site... Google frankly doesn't care about your privacy =p. They crawled my site within minutes of me setting it up.


Wow
I checked Google and what do you know, my site was on there. But oddly it's not the first link under "zdroyd".

Is there any way to tell Google that when someone searches "Zdroyd" my site should be the first link?
 

ttony21

New Member
Messages
7
Reaction score
0
Points
0
Is there any way to tell Google that when someone searches "Zdroyd" my site should be the first link?

Well, robots.txt can't help you there, but here's a great article on how Google ranks pages (the page with the best rank that also contains the searched word is the one at the top of the Google search results):
http://www.switchit.com/news/improve-pagerank.asp
 

satheesh

New Member
Messages
883
Reaction score
0
Points
0
Nice thread.
I didn't know this.
Now I know.

Please give one example:
I want to stop some websites from spidering my site.
 

ttony21

New Member
Messages
7
Reaction score
0
Points
0
User-agent: *
Disallow: /

keeps all robots and spiders out (I don't know Tamil, but in case this helps):
அடை இயந்திர மனிதன கூட, மற்றும் சிலந்தி

Edit: lol, yeah, I know it's not really a correct translation, but I just wanted to try from looking in a Tamil dictionary.
 

medphoenix

New Member
Messages
354
Reaction score
0
Points
0
keeps all robots and spiders out (I don't know Tamil, but in case this helps):
அடை இயந்திர மனிதன கூட, மற்றும் சிலந்தி

Well, the Tamil translation is word-for-word, so its meaning isn't the same as what you wrote in English ;)

Here is an example robots.txt file which prevents some unwanted robots from crawling your website. By "unwanted" I mean spy-type bots that can eat up your bandwidth. I usually disallow them, so here is the list of those robots.

Code:
# robots.txt for http://www.yoursite.com/
# created by  CyBerPhOeniX

User-agent: *
Disallow: /cgi-bin/ # This is cgi bin no robots can access
Disallow: /secret/ # I don't allow my secrets to be viewed by robots
Disallow: /topsecret.htm


User-agent: aipbot
Disallow: /

User-agent: ia_archiver
Disallow: /

User-agent: Alexibot 
Disallow: /

User-agent: Aqua_Products 
Disallow: /

User-agent: asterias 
Disallow: /

User-agent: b2w/0.1 
Disallow: /

User-agent: BackDoorBot/1.0 
Disallow: /

User-agent: becomebot
Disallow: /

User-agent: BlowFish/1.0 
Disallow: /

User-agent: Bookmark search tool 
Disallow: /

User-agent: BotALot 
Disallow: /

User-agent: BotRightHere 
Disallow: /

User-agent: BuiltBotTough 
Disallow: /

User-agent: Bullseye/1.0 
Disallow: /

User-agent: BunnySlippers 
Disallow: /

User-agent: CheeseBot 
Disallow: /

User-agent: CherryPicker 
Disallow: /

User-agent: CherryPickerElite/1.0 
Disallow: /

User-agent: CherryPickerSE/1.0 
Disallow: /

User-agent: Copernic 
Disallow: /

User-agent: cosmos 
Disallow: /

User-agent: Crescent 
Disallow: /

User-agent: Crescent Internet ToolPak HTTP OLE Control v.1.0 
Disallow: /

User-agent: DittoSpyder 
Disallow: /

User-agent: EmailCollector 
Disallow: /

User-agent: EmailSiphon 
Disallow: /

User-agent: EmailWolf 
Disallow: /

User-agent: EroCrawler 
Disallow: /

User-agent: ExtractorPro 
Disallow: /

User-agent: FairAd Client 
Disallow: /

User-agent: Fasterfox
Disallow: /

User-agent: Flaming AttackBot 
Disallow: /

User-agent: Foobot 
Disallow: /

User-agent: Gaisbot 
Disallow: /

User-agent: GetRight/4.2 
Disallow: /

User-agent: Harvest/1.5 
Disallow: /

User-agent: hloader 
Disallow: /

User-agent: httplib 
Disallow: /

User-agent: HTTrack 3.0 
Disallow: /

User-agent: humanlinks 
Disallow: /

User-agent: IconSurf
Disallow: /
Disallow: /favicon.ico

User-agent: InfoNaviRobot 
Disallow: /

User-agent: Iron33/1.0.2 
Disallow: /

User-agent: JennyBot 
Disallow: /

User-agent: Kenjin Spider 
Disallow: /

User-agent: Keyword Density/0.9 
Disallow: /

User-agent: larbin 
Disallow: /

User-agent: LexiBot 
Disallow: /

User-agent: libWeb/clsHTTP 
Disallow: /

User-agent: LinkextractorPro 
Disallow: /

User-agent: LinkScan/8.1a Unix 
Disallow: /

User-agent: LinkWalker 
Disallow: /

User-agent: LNSpiderguy 
Disallow: /

User-agent: lwp-trivial 
Disallow: /

User-agent: lwp-trivial/1.34 
Disallow: /

User-agent: Mata Hari 
Disallow: /

User-agent: MIIxpc 
Disallow: /

User-agent: MIIxpc/4.2 
Disallow: /

User-agent: Mister PiX 
Disallow: /

User-agent: moget 
Disallow: /

User-agent: moget/2.1 
Disallow: /

User-agent: MSIECrawler 
Disallow: /

User-agent: NetAnts 
Disallow: /

User-agent: NICErsPRO 
Disallow: /

User-agent: Offline Explorer 
Disallow: /

User-agent: Openbot 
Disallow: /

User-agent: Openfind 
Disallow: /

User-agent: Openfind data gatherer 
Disallow: /

User-agent: Oracle Ultra Search 
Disallow: /

User-agent: PerMan 
Disallow: /

User-agent: ProPowerBot/2.14 
Disallow: /

User-agent: ProWebWalker 
Disallow: /

User-agent: psbot 
Disallow: /

User-agent: Python-urllib 
Disallow: /

User-agent: QueryN Metasearch 
Disallow: /

User-agent: Radiation Retriever 1.1 
Disallow: /

User-agent: RepoMonkey 
Disallow: /

User-agent: RepoMonkey Bait & Tackle/v1.01 
Disallow: /

User-agent: RMA 
Disallow: /

User-agent: searchpreview 
Disallow: /

User-agent: SiteSnagger 
Disallow: /

User-agent: SpankBot 
Disallow: /

User-agent: spanner 
Disallow: /

User-agent: SurveyBot
Disallow: /

User-agent: suzuran 
Disallow: /

User-agent: Szukacz/1.4 
Disallow: /

User-agent: Teleport 
Disallow: /

User-agent: TeleportPro 
Disallow: /

User-agent: Telesoft 
Disallow: /

User-agent: The Intraformant 
Disallow: /

User-agent: TheNomad 
Disallow: /

User-agent: TightTwatBot 
Disallow: /

User-agent: toCrawl/UrlDispatcher 
Disallow: /

User-agent: True_Robot 
Disallow: /

User-agent: True_Robot/1.0 
Disallow: /

User-agent: turingos 
Disallow: /

User-agent: TurnitinBot 
Disallow: /

User-agent: TurnitinBot/1.5 
Disallow: /

User-agent: URL Control 
Disallow: /

User-agent: URL_Spider_Pro 
Disallow: /

User-agent: URLy Warning 
Disallow: /

User-agent: VCI 
Disallow: /

User-agent: VCI WebViewer VCI WebViewer Win32 
Disallow: /

User-agent: Web Image Collector 
Disallow: /

User-agent: WebAuto 
Disallow: /

User-agent: WebBandit 
Disallow: /

User-agent: WebBandit/3.50 
Disallow: /

User-agent: WebCapture 2.0 
Disallow: /

User-agent: WebCopier 
Disallow: /

User-agent: WebCopier v.2.2 
Disallow: /

User-agent: WebCopier v3.2a 
Disallow: /

User-agent: WebEnhancer 
Disallow: /

User-agent: WebSauger 
Disallow: /

User-agent: Website Quester 
Disallow: /

User-agent: Webster Pro 
Disallow: /

User-agent: WebStripper 
Disallow: /

User-agent: WebZip 
Disallow: /

User-agent: WebZip/4.0 
Disallow: /

User-agent: WebZIP/4.21 
Disallow: /

User-agent: WebZIP/5.0 
Disallow: /

User-agent: Wget 
Disallow: /

User-agent: wget 
Disallow: /

User-agent: Wget/1.5.3 
Disallow: /

User-agent: Wget/1.6 
Disallow: /

User-agent: WWW-Collector-E 
Disallow: /

User-agent: Xenu's 
Disallow: /

User-agent: Xenu's Link Sleuth 1.1c 
Disallow: /

User-agent: Zeus 
Disallow: /

User-agent: Zeus 32297 Webster Pro V2.9 Win32 
Disallow: /

User-agent: Zeus Link Scout 
Disallow: /
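
If typing out a list that long by hand sounds painful, here's a rough Python sketch that generates the same kind of file from a list of bot names and then checks it with the standard urllib.robotparser module. The build_blocklist helper and "ExampleBot" are made up for this example; the two bot names and the paths are taken from the file above.

```python
from urllib.robotparser import RobotFileParser


def build_blocklist(bad_bots, shared_disallows=()):
    """Build robots.txt text: shared rules for everyone, full block for bad_bots."""
    lines = ["User-agent: *"]
    # An empty "Disallow:" means "nothing is off limits" for the "*" group.
    lines += [f"Disallow: {path}" for path in shared_disallows] or ["Disallow:"]
    for bot in bad_bots:
        lines += ["", f"User-agent: {bot}", "Disallow: /"]
    return "\n".join(lines) + "\n"


robots_txt = build_blocklist(["aipbot", "ia_archiver"], ["/cgi-bin/", "/secret/"])
print(robots_txt)

# Sanity check: how a compliant parser reads the generated file.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
print(parser.can_fetch("aipbot", "http://www.yoursite.com/index.html"))
print(parser.can_fetch("ExampleBot", "http://www.yoursite.com/index.html"))
```

Keep in mind this only affects bots that read and obey robots.txt; the worst offenders on a list like this often don't.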
 

coolv1994

Member
Messages
508
Reaction score
0
Points
16
robots.txt is how you tell search engines what not to crawl. Like if you have a folder called mystuff and you didn't want Google to list it, just put this in (Google's crawler goes by the name Googlebot):
Code:
User-agent: Googlebot
Disallow: /mystuff/
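
A quick way to see that a rule like this only applies to the bot it names: here's a small Python sketch using the standard urllib.robotparser module. It uses Googlebot, the name Google's crawler identifies itself with. "ExampleBot" and the example.com URLs are made up, and this shows how one compliant parser reads the file, not a guarantee of what Google itself does.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: Googlebot
Disallow: /mystuff/
"""


def can_crawl(user_agent, url, robots_txt=RULES):
    """Return True if user_agent may fetch url under robots_txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Googlebot is kept out of /mystuff/ only; bots not named are unaffected.
print(can_crawl("Googlebot", "http://example.com/mystuff/photo.jpg"))
print(can_crawl("Googlebot", "http://example.com/index.html"))
print(can_crawl("ExampleBot", "http://example.com/mystuff/photo.jpg"))
```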
 