Skip robots.txt check?

chjohans
Donator

Can I somehow skip the robots.txt check, or at least suppress the warning message "!! -- WARNING : mysite.com doesn't allow epg grabbing !!"?

Yes, I know I technically shouldn't, but for now this is the only place to get my EPG data from.

mat8861
WG++ Team member, Donator

It's just a warning, don't worry.

chjohans
Donator

I know it's just a warning, and I'm not at all worried. It's just that there are so many of these messages that it's easy to miss anything relevant due to "information overload".

I'm just grabbing EPG data for 6 channels where robots.txt disallows "/". Is it really necessary to log this message 60 times?

!! -- WARNING : mewatch.sg doesn't allow epg grabbing !!
[Warning ] it is advised to disable this channel / site from your channel list

It totally overtakes my logfile, which becomes unreadable after a few runs because of this "spam".

So I would like to disable the robots.txt check, disable that message or at least reduce it to a minimum, I don't need it repeated 60 times so my logfile becomes unreadable.

chjohans
Donator

So, is there an option to skip checking for robots.txt?

If not, could we please have such an option?

Or, at the very least, could you please reduce the logging of this?

As I wrote above, with my current small grabbing need of only 6 channels, this message is logged 60 times each time I run WG++. The way this "pollutes" the logfile makes it very hard to see anything else that might be important. Since this is not really an error but just a warning, it should be sufficient to log it *ONCE* per server per run.

Blackbear199
WG++ Team member, Donator

In the same folder as your webgrab config.xml there should be a robots folder.
Inside it you should find a robots.txt file for the site giving you this message.
Edit the file and, for all the Disallow lines, leave the parameter blank (delete the /).
Save it and make it read-only.

Now when WebGrab runs it will still do the robots check, but it cannot overwrite the file because it's read-only.
This will remove the warnings.
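The manual edit above could be scripted roughly as follows. This is only a sketch in Python, not anything WebGrab provides: the function names are made up for illustration, and you have to pass in the path to the cached robots file yourself, since the exact file name inside the robots folder is not stated here.

```python
import re
import stat
from pathlib import Path


def blank_disallow_rules(text: str) -> str:
    """Return robots.txt text with the value of every Disallow line removed.

    An empty "Disallow:" means that nothing is disallowed.
    """
    return re.sub(r"(?im)^([ \t]*disallow[ \t]*:).*$", r"\1", text)


def relax_and_lock(robots_file: Path) -> None:
    """Blank the Disallow rules in a cached robots file, then mark it read-only.

    `robots_file` is an assumption: point it at the file you would
    otherwise edit by hand.
    """
    # Make sure the file is writable in case it was locked on an earlier run.
    robots_file.chmod(stat.S_IREAD | stat.S_IWRITE)
    robots_file.write_text(blank_disallow_rules(robots_file.read_text()))
    # Clear the write permission so the file cannot be overwritten.
    robots_file.chmod(stat.S_IREAD)
```

Calling `relax_and_lock(Path(...))` with the path to your own cached file does the edit and the read-only step in one go. Whether WebGrab accepts a read-only file may depend on your version, so check the log after the first run.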

chjohans
Donator

Thanks, blackbear, much appreciated.

I actually did try something like what you suggested, but at first it didn't work. It turns out that NTFS has a peculiarity with regard to caching of file attributes: it will cache certain file attributes for a few seconds even after a file has been deleted. So if a new file is created with the exact same path within a few seconds, it will inherit at least the "created date" from the deleted file.

It turned out that "created date" was used when checking the local robots file to determine if it should check for a robots.txt file on the server. So just updating the file did nothing. And even when I deleted and re-created the file (in a script) it would "inherit" the cached value of "created date". This peculiarity of NTFS is documented, but it was news to me and I would consider this a bug in the filesystem.

Setting the local robots file to read-only will only throw an error in WG++, so that's not a solution either.

I ended up with a simple sleep statement in my script, between deleting and re-creating the local robots file. 15 seconds was enough for the cached "created date" to be dropped. I could probably set this a bit shorter too, but since 15 works fine I'm just leaving it at that.

So now I just run a script automatically with WG++ that manipulates the local robots file on every fetch. Strictly speaking it would not be necessary to do this that often, since I believe WG++ does not check more often than every 30th day, but this works and is simple so I'm just leaving it this way.
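The delete, wait, re-create sequence described above might look something like this. It is a sketch in Python rather than the actual script: the function name, the `PERMISSIVE_RULES` content and the path you pass in are all assumptions. The NTFS behaviour involved is what Microsoft documents as file system tunneling.

```python
import time
from pathlib import Path

# Assumed replacement content: an empty Disallow value allows everything.
PERMISSIVE_RULES = "User-agent: *\nDisallow:\n"


def recreate_robots_file(robots_file: Path,
                         content: str = PERMISSIVE_RULES,
                         wait_seconds: float = 15.0) -> None:
    """Delete the local robots file, wait, then write it again.

    The pause is there so that the new file does not inherit the
    "created date" of the deleted one on NTFS.
    """
    robots_file.unlink(missing_ok=True)  # requires Python 3.8+
    time.sleep(wait_seconds)
    robots_file.write_text(content)
```

Run it before each grab with the path to your own local robots file. The 15 second default matches the delay that worked here; a shorter wait is untested.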

So finally no more "pollution" where 80% of my logfile is this warning.
