Hi, I'm trying to create an ini for swisscom.ch,
but I'm stuck on "error downloading page: Index was outside the bounds of the array."
Maybe someone can help?
ini:
**------------------------------------------------------------------------------------------------
* @header_start
* WebGrab+Plus ini for grabbing EPG data from TvGuide websites
* @Site: services.sg1.etvp01.sctv.ch
* @MinSWversion: V2.1.5
* @Revision 1 - [25/03/2019] DeBaschdi
* -Creation
* @Remarks:
* @header_end
**------------------------------------------------------------------------------------------------
site {url=swisscom.ch|timezone=UTC|maxdays=14.1|cultureinfo=de-DE|charset=UTF-8|titlematchfactor=50}
*
url_index{url(debug)|https://services.sg1.etvp01.sctv.ch/catalog/tv/channels/list/ids=|channel|;level=enorm;start=|urldate|}
urldate.format {datestring|yyyyMMddHHmm}
url_index.headers {customheader=Accept-Encoding: gzip, deflate, br}
url_index.headers {accept=text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8}
url_index.headers {accept=application/json; charset=utf-8}
*index_urlshow.modify {cleanup(style=jsondecode)}
*
index_showsplit.scrub {(debug)multi|"Content":|"Channel"}
*index_showsplit.modify {cleanup(style=jsondecode)}
index_start.scrub {single(pattern="yyyy-MM-dd-HH:mm:ss")|"Start":"||"|"}
index_stop.scrub {single(pattern="yyyy-MM-dd-HH:mm:ss")|"End":"||"|"}
log:
[ Info ] ( 1/1 ) SWISSCOM.CH -- chan. (xmltv_id=ard) -- mode Force
[ Debug ] debugging information siteini; urlindex builder
[ Debug ] siteini entry :
[ Debug ] urldate format type: datestring, value: |yyyyMMddHHmm
[ Debug ] https://services.sg1.etvp01.sctv.ch/catalog/tv/channels/list/ids=|channel|;level=enorm;start=|urldate
[ Debug ] url_index created:
[ Debug ] https://services.sg1.etvp01.sctv.ch/catalog/tv/channels/list/ids=25;leve...
[Error ] error downloading page: Index was outside the bounds of the array. (10sec)
[Error ] retry 1 of 4 times
[Error ] error downloading page: Index was outside the bounds of the array. (20sec)
[Error ] retry 2 of 4 times
[Error ] error downloading page: Index was outside the bounds of the array. (30sec)
[Error ] retry 3 of 4 times
[Error ] error downloading page: Index was outside the bounds of the array. (40sec)
[Error ] retry 4 of 4 times
[Error ] Unable to update channel ard
[Critical] Generic syntax exception:
[Critical] message:
[Error ] no index page data received from ard
[Error ] unable to update channel, try again later
[ Info ] Existing guide data restored!
[ Debug ]
[ Debug ] 0 shows in 1 channels
[ Debug ] 0 updated shows
[ Debug ] 0 new shows added
[ Info ]
[ Info ]
[ ] Job finished at 29/03/2019 06:11:31 done in 0s
I think I have a problem finding the right index headers?
After a little testing with these headers:
url_index.headers {host=services.sg1.etvp01.sctv.ch}
url_index.headers {accept=text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8}
url_index.headers {customheader=Accept-Encoding=gzip,deflate,br}
url_index.headers {customheader=Accept-Language=de,en-US;q=0.7,en;q=0.3}
url_index.headers {customheader=Upgrade-Insecure-Requests=1}
the log now says:
[ Debug ] url_index created:
[ Debug ] https://services.sg1.etvp01.sctv.ch/catalog/tv/channels/list/ids=25;leve...
[Warning ] error downloading page: Error: SecureChannelFailure (The authentication or decryption has failed.)
You need to read the manual; there are a lot of things wrong. As a first step, use the plain URL https://services.sg1.etvp01.sctv.ch/catalog/tv/channels/list/(end=201903300500;ids=401;level=normal;sa=true;start=201903290500)
This way you will understand whether you are grabbing the page or whether something else needs to be done.
Thanks for your response, but I always get "SecureChannelFailure".
The way via PHP works, though.
"SecureChannelFailure"
1. If you're on Windows you need at least WG V2.1.5 and/or the .NET Framework updated to the latest version for your Windows release. It could also be that TLS 1.2 isn't enabled (Windows 7, for example), so you may have to search for how to check this too.
2. If you're on Linux you need Mono > 5.0.0.
PHP... is that the current WebGrab fashion? Out of 100 sites, only 1 needs PHP,
and tvair.swisscom.ch is one of the 99.
So remove those crap url_index headers and start checking what is downloaded (first check what BB199 said).
Here's a start:
url_index{url|https://services.sg1.etvp01.sctv.ch/catalog/tv/channels/list/(end=201903300500;ids=401;level=normal;sa=true;start=201903290500)}
url_index.headers {customheader=Accept-Encoding=gzip,deflate}
*
urldate.format {datestring|yyyyMMdd}
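As an aside for anyone following along: the `|urldate|` placeholder just expands to the grab date in the declared format. A rough Python sketch of that substitution (the offsets and the `ids=401` channel are only illustrative, taken from the example URL above, not WebGrab internals):

```python
from datetime import datetime, timedelta

def urldate(day_offset=0, fmt="%Y%m%d"):
    """Rough equivalent of urldate.format {datestring|yyyyMMdd}:
    the grab date plus an offset in days, rendered as a compact string."""
    return (datetime.now() + timedelta(days=day_offset)).strftime(fmt)

# The starter URL above hard-codes start/end; with |urldate| in the ini,
# WebGrab would substitute something like this at run time:
start = urldate(0)   # e.g. "20190329" on that day
end   = urldate(1)   # the next day
url = ("https://services.sg1.etvp01.sctv.ch/catalog/tv/channels/list/"
       "(end=" + end + "0500;ids=401;level=normal;sa=true;start=" + start + "0500)")
```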
Thanks again for the reply, but still "secure channel".
I'm on Ubuntu 18.04, Mono = 4.6.2.7+dfsg-1ubuntu1.
Maybe that's the problem BB mentioned...
I'll try to upgrade my Mono version.
**edit
Yayyyy, thanks BB!
With Mono JIT compiler version 5.18.1.0 (tarball Fri Mar 15 20:41:32 UTC 2019)
the "secure channel" problem is gone.
Good, now next step ;)
OK, the secure channel problem went away with Mono > 5,
but I'm not able to receive the html.source.htm for debugging.
Maybe I should finish the PHP version and you pros can fix it up for "normal" usage :)
OK, I'm stuck on the next step, trying to scrape the start and stop times.
I've already separated my blocks, from programme to programme (see the attached log).
Am I doing something wrong with this rule to scrape the start time?
[{"AvailabilityStart":"2019-03-29T01:08:00Z","AvailabilityEnd":"2019-03-29T01:10:00Z"}]
index_start.scrub {single(debug)(pattern="yyyy-MM-ddHH:mm:ss")|"AvailabilityStart":"||"|"}
index_start.modify {remove|T}
index_start.modify {remove|Z}
Nothing is showing up in the log for debugging.
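To sanity-check the scrub/modify chain above by hand, here's a small Python sketch of what those three lines should produce (just an illustration of the extract-then-remove steps, not how WebGrab implements them):

```python
from datetime import datetime

raw = '"AvailabilityStart":"2019-03-29T01:08:00Z"'
# What the scrub rule extracts between "AvailabilityStart":" and the closing ":
value = raw.split('"AvailabilityStart":"')[1].split('"')[0]  # 2019-03-29T01:08:00Z
# What the two modify {remove|T} / {remove|Z} rules leave behind:
cleaned = value.replace("T", "").replace("Z", "")            # 2019-03-2901:08:00
# That string matches the declared pattern "yyyy-MM-ddHH:mm:ss":
parsed = datetime.strptime(cleaned, "%Y-%m-%d%H:%M:%S")
print(parsed)  # 2019-03-29 01:08:00
```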
What do you mean, PHP version? There is no PHP involved here. After the lines above, if you want to check whether the page is received, add to those 3 lines: index_showsplit.scrub {multi(debug)||||}. This command should show the downloaded page in webgrablog.txt and will also download the HTML page. Forget PHP... there is nothing on this site that involves PHP.
You have the wrong showsplit... this means your idea is confused. Read pages 60-61... then proceed with the time and the other stuff.
I know, but the site doesn't download...?!
Maybe one more bug in my WebGrab version...
With my PHP helper file:
<?php
$agent = 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:65.0) Gecko/20100101 Firefox/65.0';
$dir_path = dirname(__FILE__);
$type = $_GET['type'];
$start = $_GET['date'];
$time = $_GET['time'];
$stop = date("Ymd", strtotime("$start +$time days"));
if($type == '1') {
$channel = $_GET['channel'];
$url2 = 'https://services.sg1.etvp01.sctv.ch/catalog/tv/channels/list/(end=' . $stop . '2359;ids=' . $channel . ';level=normal;start=' . $start . '0000)';
$ch = curl_init ($url2);
// Each CURLOPT_HTTPHEADER call replaces the previous one, so pass all
// request headers in a single array, with the header name included:
curl_setopt ($ch, CURLOPT_HTTPHEADER, array('Accept: application/json; charset=utf-8'));
curl_setopt ($ch, CURLOPT_USERAGENT, $agent);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt ($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt ($ch, CURLOPT_COOKIEFILE, $dir_path . '/swisscom.ch.cookies.txt');
$output2 = curl_exec($ch);
curl_close($ch);
echo $output2;
} elseif($type == '2') {
$url3 = 'https://services.sg2.etvp01.sctv.ch/portfolio/tv/channels';
$ch = curl_init ($url3);
// Pass both headers in one array; separate CURLOPT_HTTPHEADER calls
// would overwrite each other:
curl_setopt ($ch, CURLOPT_HTTPHEADER, array(
    'Accept: application/json, text/javascript, */*; q=0.01',
    'Content-Type: application/x-www-form-urlencoded; charset=UTF-8'
));
curl_setopt ($ch, CURLOPT_USERAGENT, $agent);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt ($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt ($ch, CURLOPT_COOKIEFILE, $dir_path . '/swisscom.ch.cookies.txt');
$output3 = curl_exec($ch);
curl_close($ch);
echo $output3;
}
?>
and this in my ini:
url_index{url|http://127.0.0.1:92/wordpress/swisscom_ch.php?channel=|channel|&date=|urldate|&time=##time##&type=1}
url_index.headers {customheader=Accept-Encoding=gzip}
everything is fine...
All stuff that you don't need. WebGrab downloads the page; you don't need any of that. I already told you 3 times: NO PHP is needed.
Maybe it's a bug in my Mono version... Maybe I'll finish the PHP version and, if you want, you can modify it for "normal usage" and we test again :)
Anyway,
can you help me find the correct way to scrub the start time?
Use the ini above to start with... don't make things complex. Forget PHP, WebGrab does it for you.
OMG, I'm an idiot, you're absolutely right.
It works without the PHP file; it was just a typo in my ini :D
OK, back to the roots, still the problem of scraping the start time :)
Like I said above, try with one channel line (URL) so you understand. Actually, you're going too fast. You should start by getting the channel list, see what is needed in the URL (channel_id, start_time and end_time or whatever else will be needed), build your channel list, and after that work on the URL.
Aaaaah, because of this "inblock":
[{"AvailabilityStart":"2019-03-29T01:08:00Z","AvailabilityEnd":"2019-03-29T01:10:00Z"}]
the (includeblock="AvailabilityStart") needs to be defined, right?
Thank you very much, Matt, I learned a lot today :)
Sometimes you come across as pretty arrogant; does that have to be?
I don't know what you mean.
There is no copy-paste, and start/stop is defined and used. So what you say is absolutely wrong, and I know what I'm doing.
Defined:
$start = $_GET['date'];
$time = $_GET['time'];
$stop = date("Ymd", strtotime("$start +$time days"));
And used here:
$url2 = 'https://services.sg1.etvp01.sctv.ch/catalog/tv/channels/list/(end=' . $stop . '2359;ids=' . $channel . ';level=normal;start=' . $start . '0000)';
Also... I don't understand everything in WebGrab+ yet. But I'm ready to learn.
It is certainly surprising how you can set up something like that without knowledge of the basic stuff... a bit strange.
What does cURL/PHP have to do with WebGrab?
I'm new to WebGrab, but I have experience in other areas,
and when I see a command in an ini I understand it after a little trial and error too.
But Matt is right, I should read the instructions completely first and then experiment.
Nevertheless, I have managed to create a working ini for Swisscom, with Matt's help of course; it is not that difficult once you get into it.
Hi guys,
the ini for Swisscom is almost done.
I have a problem with the channel creation:
the channelshapes block also contains "Identifier":,
and after hours of tinkering I have not found a usable solution.
Can you please look over it?
This is my first ini, so certainly much could be solved more professionally. Maybe you can comment on the individual points and give me some tips.
Thank you very much in advance.
That's where time and experience come into play; there are a few ways you can do it once you get used to the tools WebGrab provides.
index_site_id.scrub {multi(exclude="-")|{"Identifier":"||",|",}
or
index_site_id.scrub {regex||\{"Identifier":"(\d+)",||}
The first, using the separator-string method, only keeps the identifiers that don't have a "-" in them; this should keep only the ones that are really channel IDs.
The second does the same thing using regex, but says the value between the two quotes must be all digits, so any identifier containing a letter, a "-" or anything else that's not a number will be excluded.
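To see why the regex version filters correctly, here's a quick Python check against a made-up sample of that JSON (the titles and the shape-style identifier are invented; only the Identifier formats matter):

```python
import re

# Hypothetical fragment of the channel-list JSON described above:
sample = ('{"Identifier":"401","Title":"SRF 1"},'
          '{"Identifier":"ch-srf-1","Title":"some channelshape"},'
          '{"Identifier":"402","Title":"SRF zwei"}')

# The same pattern used in the regex scrub rule: only all-digit identifiers match.
ids = re.findall(r'\{"Identifier":"(\d+)",', sample)
print(ids)  # ['401', '402']
```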
Thanks blackbear, the regex solution works :)
Hi, I'm interested in the ini file. Would you share it once it's completed?
Sure, feel free to use it.
Almost bug-free :)
Thanks :-)