We closed this forum on 18 June 2010. It served us well from 2005, as the ALPHA forum did before it from 2002 to 2005. New discussions are ongoing at the new URL http://forum.processing.org. You'll need to sign up for a new user account; we're sorry about the inconvenience, but we think it's better in the long run. The content on this forum will remain online.
Help Needed: Google and robots.txt (Read 1745 times)
Help Needed: Google and robots.txt
Mar 15th, 2006, 3:45pm
 
hey all,

so in the interest of making the "Search processing.org" results more useful, we're trying to use a robots.txt file to exclude the viewcvs subdirectories on dev.processing.org. otherwise this causes a zillion different versions of the code to show up, rather than what are usually more relevant results elsewhere.

the problem is that google doesn't seem to want to honor this robots.txt file and so the crap won't go away. this is our current file:
http://dev.processing.org/robots.txt

i'm not sure if this is our fault or google's (our robots file is pretty basic), but it's fairly annoying. has anyone dealt with this stuff or have any ideas about solutions?

extra bonus if you can suggest an alternative robots.txt for dev.processing.org that will *include* /processing, which is a single copy of the source that would be searched. i know it's possible to do in the robots file, but i'm holding off on trying it until i can get the exclusion stuff worked out for /source.
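a sketch of what such a file might look like, assuming the viewcvs copies live under /source and the single searchable copy under /processing (note that "Allow" is a nonstandard extension that googlebot honors; under the original exclusion standard, anything not disallowed is crawlable anyway):

User-agent: *
Disallow: /source
Allow: /processing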

thanks for any help...
Re: Help Needed: Google and robots.txt
Reply #1 - Mar 15th, 2006, 5:29pm
 
You could try explicitly targeting google, and I dunno if it matters, but putting the / on the end seems fairly common. You could also explicitly deny any cgi-generated pages:

User-agent: Googlebot
Disallow: /source/
Disallow: /source/*?

It should be noted that even if this is working, it may take a month or so before the pages actually disappear from the results.
Re: Help Needed: Google and robots.txt
Reply #2 - Mar 15th, 2006, 5:48pm
 
yeah, i thought about explicitly stating googlebot, though since we want it to work for all engines i didn't want to introduce another variable (what happens if we include * and googlebot as agents? and in what order? who knows...). i've also been trying with and without the slash (i removed the slash this morning).
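for what it's worth, the agent-precedence question can be sanity-checked locally with python's standard-library robots.txt parser, without waiting weeks for a recrawl. a sketch (the file contents and bot names are made up; urllib.robotparser follows the original exclusion standard, so it ignores wildcard paths and nonstandard directives):

```python
import urllib.robotparser

# hypothetical file: a googlebot-specific group plus a catch-all group
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /source

User-agent: *
Disallow:
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# googlebot matches its own group, so it is blocked from /source;
# any other agent falls back to the * group, which allows everything
print(rp.can_fetch("Googlebot", "http://dev.processing.org/source/core/"))
print(rp.can_fetch("SomeOtherBot", "http://dev.processing.org/source/core/"))
```

this at least confirms how a given file *should* be interpreted; whether a particular crawler actually honors it is a separate question.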

i've been trying to make the robots thing work for a couple of months now, so timing isn't the issue on the updates. since it takes a while for google to pick up changes, it's a frustrating "tweak, wait several weeks, tweak again" cycle.

hrm..
Re: Help Needed: Google and robots.txt
Reply #3 - Mar 15th, 2006, 8:02pm
 
Hi Fry,

From the reading here:
http://www.google.com/webmasters/remove.html

The second line should look like this: Disallow: /source. That's what you now have, so it should begin to work.

Also, just to make sure: the robots.txt file is in the root of your server, correct?

To speed up the re-indexing process, you can submit your site to Google here:
http://www.google.com/addurl/?continue=/addurl

Do not submit to them more than once a month, or they will get mad at you.

When I do this Google usually indexes me within the week and it seems to come back every few days for about a week.

Hope it helps,

4dplane
 

Re: Help Needed: Google and robots.txt
Reply #4 - Mar 15th, 2006, 9:26pm
 
yeah, so that's what we've had for several weeks, and google is continually re-indexing the site, but the pages just won't fall out of the index, it seems.

the googlebot seems to be responding to the robots file just fine, but the pages just won't go away from google itself even after several weeks/months (i can't recall exactly when we put the stuff in). i guess i'll look into the "remove" page on their site to see if that'll do..
Re: Help Needed: Google and robots.txt
Reply #5 - Mar 23rd, 2006, 11:32am
 
Hi!

I just wanted to tell you that there is a feature called Google Sitemaps (it's in beta, as many of Google's features are), and from what I can see it may be able to do what you want.

Here's the URL in case you haven't seen it yet:
https://www.google.com/webmasters/sitemaps/login?hl=en
Re: Help Needed: Google and robots.txt
Reply #6 - Mar 23rd, 2006, 2:23pm
 
yeah, thanks. things seem to be improving, either because they've heeded the "remove" request or because the pages just happened to start falling out of the cache. now to get cvs.processing.org to stop showing up in the results (that address hasn't worked for several months, so i'm not sure why it's still in there) and we should have much, much better results.
Re: Help Needed: Google and robots.txt
Reply #7 - Apr 1st, 2006, 12:26am
 
get rid of google and use joomla, mambo, etc...

anything with a search function built into the content management system.
Re: Help Needed: Google and robots.txt
Reply #8 - Apr 1st, 2006, 5:26am
 
I know it won't help directly, but I once ran an experiment printing the day, hour, and minute at the end of the description meta tag, so I could see in google search results when a page was last indexed. It can help to track whether google keeps indexing certain folders.
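a minimal sketch of that trick, assuming the pages are generated by a script (the tag text and page description here are made up):

```python
from datetime import datetime

# append a generation timestamp to the description meta tag; since search
# result snippets are drawn from this tag, the stamp shown in the results
# reveals when the page was last crawled and indexed
stamp = datetime.now().strftime("%d %b %H:%M")
meta = ('<meta name="description" '
        f'content="Processing source browser (generated {stamp})">')
print(meta)
```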
Re: Help Needed: Google and robots.txt
Reply #9 - Apr 1st, 2006, 7:58pm
 
yeah, actually we're finally getting good results. submitting the robots.txt for reading and specifically asking them to remove the old cvs.processing.org address has improved the results significantly. so the "search processing.org" box is now much more useful than just the forum search or searching the bugs db, since it'll do both, along with things like release notes. so i think we're all set.