April 26, 2014

Crawling a site with Goutte (PHP)

Crawling with redirects: the NCBI website for aligning proteins. This is an alternative to installing the heavyweight BLAST package and its datasets, which take up a lot of disk space: all the work is offloaded onto NCBI's web interface for protein sequence alignment.

The particular thing about this crawl is that it has to follow redirects. They are slightly annoying when you use the site by hand, but they become a bit more explicit once JavaScript is disabled in the browser. The crawler does not run JavaScript (Goutte does not execute scripts), so it gets the correct results regardless of the redirects. The chosen input is the amino acid sequence of a hemoglobin, and the results are a list of possible alignments.
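The query sequence in the script is pasted as a multi-line string, so it may be worth cleaning it up before submitting it. Here is a minimal sketch of that step. The helper name normalizeSequence is hypothetical (it is not part of Goutte or of the original script), and it assumes the input is a bare one-letter amino acid sequence with no FASTA header line.

```php
<?php
// Hypothetical helper, not part of Goutte or of the original script.
function normalizeSequence($raw)
{
    // Drop all whitespace, including the line breaks of a pasted multi-line string,
    // and upper-case the one-letter codes.
    $seq = strtoupper(preg_replace('/\s+/', '', $raw));

    // Assumption: only plain letters are expected (no FASTA header, no gap symbols).
    if ($seq === '' || !preg_match('/^[A-Z]+$/', $seq)) {
        throw new InvalidArgumentException('Not a plain one-letter amino acid sequence');
    }
    return $seq;
}
```

With this in place, the value passed to the form would be normalizeSequence($aaTest) instead of the raw string.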

--------------------------------------------------------------------
<?php
// load the Goutte archive (goutte.phar must sit next to this script)
require_once 'goutte.phar';
use Goutte\Client;

$aaTest = "MHSSIVLATVLFVAIASASKTRELCMKSLEHAKVGTSKEAKQDGIDLYKHMFEHYPAMKKYFKHRENYTP
ADVQKDPFFIKQGQNILLACHVLCATYDDRETFDAYVGELMARHERDHVKVPNDVWNHFWEHFIEFLGSK
TTLDEPTKHAWQEIGKEFSHEISHHGRHSVRDHCMNSLEYIAIGDKEHQKQNGIDLYKHMFEHYPHMRKA
FKGRENFTKEDVQKDAFFVNKDTRFCWPFVCCDSSYDDEPTFDYFVDALMDRHIKDDIHLPQEQWHEFWK
LFAEYLNEKSHQHLTEAEKHAWSTIGEDFAHEADKHAKAEKDHHEGEHKEEHH";
$link = 'http://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&BLAST_PROGRAMS=blastp&PAGE_TYPE=BlastSearch';

//create a new client instance
$client = new Client();

// connect to main page
$crawler = $client->request('GET', $link);
$form = $crawler->selectButton('b1')->form();
$crawler = $client->submit($form, array('QUERY' => $aaTest));

// the second page also carries the request ID (RID)
$secondForm = $crawler->filter('form[name=RequestFormat]')->form();
$rid = $secondForm['RID'];
$crawler = $client->submit($secondForm);

//echo "The RID is " . $rid->getValue();

// from the third page on, keep posting to the Blast.cgi script
$redirects = 0;
while ($redirects < 50) { // set a max number of acceptable redirects
    $nextForm = $crawler->filter('form[id=results]')->form();
    sleep(3); // wait a little before asking again
    $crawler = $client->submit($nextForm);

    // stop when there is a <div id=descrInfo/> in the page, meaning final results
    $cnt = $crawler->filter('div[id=descrInfo]')->count();
    if ($cnt > 0) {
        break;
    }
    $redirects += 1;
}

echo "Redirects ---->>> " . $redirects;
echo $crawler->html(); // the final page
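The wait-and-resubmit loop above is a general "poll until done" pattern, and it can be pulled out into a small reusable function. What follows is only a sketch under assumptions: pollUntil and its parameters are hypothetical names, not something Goutte provides, and the function knows nothing about BLAST itself.

```php
<?php
// Hypothetical helper, not part of Goutte.
// Calls $step repeatedly, feeding it the previous result, until $isDone
// accepts the result or the attempt budget is used up.
// Returns array(lastResult, attemptsUsed, finished).
function pollUntil($step, $isDone, $maxAttempts = 50, $delaySeconds = 3)
{
    $result = null;
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $result = call_user_func($step, $result);
        if (call_user_func($isDone, $result)) {
            return array($result, $attempt, true);
        }
        if ($delaySeconds > 0) {
            sleep($delaySeconds); // be polite to the server between attempts
        }
    }
    return array($result, $maxAttempts, false);
}
```

For the script above, $step would be a closure that submits the form[id=results] form and returns the new crawler, and $isDone would check whether div[id=descrInfo] is present. The third returned value tells you whether the final page was actually reached or the loop simply ran out of attempts, which the original loop does not distinguish.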
