Page 4 of 4 FirstFirst ... 234
Results 31 to 37 of 37

Thread: Crawler stuck on my front door

  1. #31
    Join Date
    Oct 2007
    Location
    Todd Mission Texas
    Beans
    427
    Distro
    Ubuntu Gnome 14.04 Trusty Tahr

    Re: Crawler stuck on my front door

    'www.texasflyfishers.org' has a home page and a bunch of educational pages written in HTML. The home page's navigation menu has a link to 'Forums', which runs on SMF (Simple Machines Forum) software.

    My question is: where do I put the 'robots.txt' file? In the same directory as the index.html file? I'll go to the SMF help site for information on their software.

    Thanks
    Dave
    "Let us run the risk of wearing out rather than rusting out." __Theodore Roosevelt

  2. #32
    Join Date
    Sep 2006
    Beans
    8,627
    Distro
    Ubuntu 14.04 Trusty Tahr

    /robots.txt

    robots.txt always goes at the top level of the site (the document root). There is one per site, and it governs the whole site. So if different sections have different owners, they will have to contact you for changes to /robots.txt

    It should be retrievable at the URL http://www.texasflyfishers.org/robots.txt
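
    If you want to sanity-check a set of rules before uploading them, here is a minimal Python sketch using the standard library's urllib.robotparser. The /forum/ path and the choice to block only Yandex are assumptions for illustration, not taken from the actual site, and the helper name is_allowed is made up for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: the /forum/ path and blocking only Yandex are
# assumptions for illustration, not the site's real configuration.
SAMPLE_ROBOTS_TXT = """\
User-agent: Yandex
Disallow: /forum/

User-agent: *
Disallow:
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the rules in robots_txt let user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    base = "http://www.texasflyfishers.org"
    for agent, path in [("YandexBot", "/forum/index.php"),
                        ("YandexBot", "/index.html"),
                        ("Googlebot", "/forum/index.php")]:
        print(agent, path, is_allowed(SAMPLE_ROBOTS_TXT, agent, base + path))
```

    Keep in mind this only tells you what a robot that honours robots.txt would do. It does nothing against one that ignores the file.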

  3. #33
    Join Date
    May 2009
    Location
    North West England
    Beans
    2,676
    Distro
    Ubuntu Development Release

    Re: Crawler stuck on my front door

    If it's an honest crawler, then something was missing from robots.txt. If it's dishonest, then Apache's Limit or RedirectMatch directives can be used to stop it, or even iptables.
    Assuming we're still talking about the Yandex robot, it is a well-behaved robot.

    If you just want robot control for your forum area then

    Code:
    <META NAME="ROBOTS" CONTENT="NONE">
    in html

    Code:
    <META NAME="ROBOTS" CONTENT="NONE" />
    in xhtml

    will stop all well-behaved robots.

    Googlebot has its own meta name, so you can allow only Googlebot with ...

    Code:
    <META NAME="GOOGLEBOT" CONTENT="INDEX, FOLLOW">
    Again, changing > to /> in xhtml.

    Put these tags in the headers for the forum area. As most forum areas use dynamic pages, you should have a module somewhere in your code that generates your headers for you.

    As an example, my meta tags look like the ones below and are emitted by each page as the FIRST thing it does (the same include also handles the MIME type, charset, etc. for me; ask if you need the full details).

    Code:
     ***php header***
    <!DOCTYPE html 
         PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
         "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
    <head>
    <title><?php echo $title ?></title>
    <meta http-equiv="Content-Type" content="<?php echo $mime ?>;charset=<?php echo $charset ?>" />
    <meta http-equiv="Content-Style-Type" content="text/css" />
    <link rel="stylesheet" media="screen" type="text/css" href="./css/mgj.css" />
    <link rel="stylesheet" media="print" type="text/css" href="./css/mgj_print.css" />
    <meta name="description" content="hire plant parts spares engineering services"/>
    <meta name="keywords" content="keys, locks, engine, parts, hire, plant, services, engineering, consumables"/>
    <meta name="copyright" content="M.G. Judd Ltd., 2009. All rights Reserved."/>
    <meta name="no-email-collection" content="http://www.unspam.com/noemailcollection" />
    <meta name="ROBOTS" content="ALL" /> 
    ***php footer***
    Phill.
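
    To confirm which robots meta tags a generated page really sends, something like the following Python sketch may help. It uses only the standard library's html.parser. The names RobotsMetaParser, robots_directives and blocks_indexing are made up for this example, and treating "none" and "noindex" as the blocking values is my reading of the usual convention rather than anything specific to SMF.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect CONTENT values from <meta name="robots"> / "googlebot" tags."""

    def __init__(self):
        super().__init__()
        # meta name (lowercased) -> set of lowercased directives
        self.directives = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, but not values.
        # XHTML-style <meta ... /> tags also end up here by default.
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        name = attr.get("name", "").lower()
        if name in ("robots", "googlebot"):
            values = {part.strip().lower()
                      for part in attr.get("content", "").split(",")
                      if part.strip()}
            self.directives.setdefault(name, set()).update(values)


def robots_directives(html: str) -> dict:
    """Return the robots-related meta directives found in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives


def blocks_indexing(directives: dict, name: str = "robots") -> bool:
    """True if the named meta tag asks robots not to index the page."""
    return bool(directives.get(name, set()) & {"none", "noindex"})


if __name__ == "__main__":
    page = '<html><head><META NAME="ROBOTS" CONTENT="NONE"></head></html>'
    found = robots_directives(page)
    print(found, blocks_indexing(found))
```

    Feed it the HTML source of a forum page (view source in the browser and paste it in) to see what a crawler would be told.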

  4. #34
    Join Date
    Oct 2007
    Location
    Todd Mission Texas
    Beans
    427
    Distro
    Ubuntu Gnome 14.04 Trusty Tahr

    Re: Crawler stuck on my front door

    Quote Originally Posted by phillw View Post
    Assuming we're still on about the yandex robot - it is a well behaved robot.
    Late yesterday I filed another trouble ticket with my host, 'The Planet'. They responded and said they would send it to the Unix guys for investigation. I noticed that at 10:26 AM the entries in the ban log stopped. About 8 this evening I got an email saying they had put it on the deny list. I checked the ban log and it had started back up at 7:06 PM.

    I sent off another email and got a response that they did not know why it was still a problem. They would upgrade the trouble ticket.

    A thought just occurred. Could it come in through my DSL line and not through my IP server?

    Let me study what information you have given me. It may be Monday before I get back to you. Tis faire season here in Southeast Texas.

    Thanks for your help
    Dave
    "Let us run the risk of wearing out rather than rusting out." __Theodore Roosevelt

  5. #35
    Join Date
    May 2009
    Location
    North West England
    Beans
    2,676
    Distro
    Ubuntu Development Release

    Re: Crawler stuck on my front door

    See next posting
    Last edited by phillw; October 23rd, 2009 at 11:37 PM. Reason: new info

  6. #36
    Join Date
    May 2009
    Location
    North West England
    Beans
    2,676
    Distro
    Ubuntu Development Release

    Re: Crawler stuck on my front door

    A thought just occurred. Could it come in through my DSL line and not through my IP server?
    That depends on whether you are hosting on your own machine or using The Planet.

    I'm bemused at your problems with yandex's spider - I have read a few forums that specialise in spiders and none report it as a bad one.

    The big boss of the forum I'm involved with is unavailable for a while (family illness). Instead of just the IP addresses, can you report back the spider name? It'll be something like ...

    spider22.yandex.ru

    **UPDATE**
    I have found a site that reports that certain yandex spiders do NOT obey robots.txt - This is somewhat puzzling, but I'll give you the link.

    http://crawlerinfo.com/check/cid,280...%29.html#Hosts

    I'll go see if it is reported by others.

    **UPDATE (Again)**

    The complaints about it date from around 2008, which was when the Yandex spiders left Russia and wandered out into the world. The early spiders were bandwidth-heavy, and various ways of stopping them were discussed. Below is one such thread (you'll note that Yandex isn't the only bandwidth-hungry spider).

    http://forums.jumba.com.au/showthread.php?t=7385


    Phill.
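
    If the ban log only gives IP addresses, a reverse DNS lookup will usually turn them into spider names. Here is a rough Python sketch using the standard socket module. The list of domain suffixes is an assumption on my part about how Yandex names its crawler hosts, so check it against Yandex's own webmaster help before relying on it. The function names and the 192.0.2.1 address are placeholders, not values from this thread.

```python
import socket
from typing import Optional

# Assumption: Yandex crawler hosts resolve to names under these domains.
YANDEX_SUFFIXES = (".yandex.ru", ".yandex.net", ".yandex.com")


def looks_like_yandex(hostname: str) -> bool:
    """True if the host name ends in one of the assumed Yandex domains."""
    return hostname.lower().rstrip(".").endswith(YANDEX_SUFFIXES)


def spider_name(ip: str) -> Optional[str]:
    """Reverse-resolve an IP address; return None if there is no PTR record."""
    try:
        hostname, _aliases, _addresses = socket.gethostbyaddr(ip)
    except OSError:  # covers socket.herror and socket.gaierror
        return None
    return hostname


if __name__ == "__main__":
    # 192.0.2.1 is a documentation-only placeholder address.
    name = spider_name("192.0.2.1")
    if name is None:
        print("no reverse DNS entry")
    else:
        print(name, "Yandex?", looks_like_yandex(name))
```

    A reverse lookup on its own can be faked by whoever controls the IP block, so for anything important also resolve the returned name forwards and check that it comes back to the same IP.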

  7. #37
    Join Date
    Oct 2007
    Location
    Todd Mission Texas
    Beans
    427
    Distro
    Ubuntu Gnome 14.04 Trusty Tahr

    Re: Crawler stuck on my front door

    Since I last posted, The Planet has notified me that they have blocked it using the .htaccess file. I haven't gone to look at it, but my ban log shows no activity from that IP for over 24 hours.

    I'm going to finish reading all the links all of you provided so that I will be better informed.

    I'm going to go out on a limb and ask this thread be marked solved.

    Again, let me give a great big thank you for everyone's help in this matter.

    Dave
    "Let us run the risk of wearing out rather than rusting out." __Theodore Roosevelt
