Basic web scraping using Goutte and Symfony DomCrawler

Vignesh Thandapani

In this post I explain how to perform basic web scraping with Goutte and the Symfony DomCrawler component, that is, how to
extract machine-readable information from web pages. Today, most API documentation is not written by hand; it is
generated from code by tools built for this purpose, such as PHPDocumentor or Sami, two of the more popular and
reliable options.

Interestingly, web scraping reverses this process: instead of creating documentation from code, we generate code,
or at least machine-readable data, from documents!

Required Installation
Before using DomCrawler you need to install Goutte (https://github.com/FriendsOfPHP/Goutte) with Composer:

composer require fabpot/goutte

Goutte is built on top of Symfony components, including DomCrawler, so installing Goutte also installs DomCrawler as a
dependency. Once the installation succeeds, both are available to the application.

Now, let us write a simple crawler that finds the links available on a web page.

Add the following lines above the class declaration in src/AppBundle/Controller/DefaultController.php:
use Goutte\Client;
use Symfony\Component\DomCrawler\Crawler;

Add the following method at the end of the class, after the existing methods, in src/AppBundle/Controller/DefaultController.php:

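    /**
     * @Route("/links", name="crawler")
     */
    public function crawlerAction()
    {
        $url = "http://www.agiratech.com";

        // Fetch the page; request() returns a Crawler for the response body.
        $client = new Client();
        $crawler = $client->request('GET', $url);

        // Count the <a> elements on the page.
        $links_count = $crawler->filter('a')->count();
        $all_links = [];
        if ($links_count > 0) {
            // links() wraps each <a> in a Link object; getURI() gives its absolute URI.
            $links = $crawler->filter('a')->links();
            foreach ($links as $link) {
                $all_links[] = $link->getURI();
            }
            // Drop duplicate URLs.
            $all_links = array_unique($all_links);
            echo "All available links from the page $url<pre>";
            print_r($all_links);
            echo "</pre>";
        } else {
            echo "No links found";
        }
        // Quick demo only: stop here instead of returning a Response object.
        die;
    }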

Here, I have created a new route, http://localhost/links, for my application (http://localhost is my local domain name),
and created an instance of the Client class named $client. Calling the request() method on this object fetches the page
and returns a crawler for its content, as in the following line:

$crawler = $client->request('GET', $url);

The line $crawler->filter('a')->count() gives the number of HTML <a> tags on the page
(http://www.agiratech.com).

Similarly, $crawler->filter('a')->links() returns all the links from the page, one object per <a> tag.

Finally, $link->getURI() returns the absolute URI of each of those links.
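
To try these three calls without a controller or a network request, here is a minimal sketch that feeds DomCrawler an HTML string directly. It assumes the script sits in the directory where composer require fabpot/goutte was run, so that vendor/autoload.php exists; the sample HTML and the example.com address are placeholders of mine, not part of the application above.

```php
<?php
// Assumption: Composer's autoloader is present in this directory.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical sample page; nothing is fetched over the network.
$html = <<<'HTML'
<html><body>
  <a href="/about">About</a>
  <a href="https://example.org/blog">Blog</a>
  <a href="/about">About again</a>
</body></html>
HTML;

// The second argument is the page's own URI, used to resolve relative hrefs.
$crawler = new Crawler($html, 'http://example.com/index.html');

// Select every <a> element and count them.
$anchors = $crawler->filter('a');
$count = $anchors->count();

// Turn each <a> into a Link object and read its absolute URI.
$uris = [];
foreach ($anchors->links() as $link) {
    $uris[] = $link->getUri();
}

// Remove duplicates and reindex the array from 0.
$uris = array_values(array_unique($uris));

echo "Found $count anchors\n";
print_r($uris);
```

Because the crawler is given the page URI, the relative href /about comes back as a full URL on example.com, which is also why the controller above can collect usable links straight from getURI().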

Conclusion
The example above shows how to extract all the links from an HTML document and store them in the $all_links array.
Other kinds of data can be extracted from a web page in the same way.
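
Once the links are in an array, plain PHP is enough to tidy them up. The sketch below separates links on the site's own host from links that point elsewhere. The helper name splitLinksByHost is hypothetical, and it assumes the input is a list of absolute URLs like the ones getURI() returns. Note that array_unique() keeps the original keys, so array_values() is used to reindex the result.

```php
<?php
/**
 * Hypothetical helper: split absolute URLs into those on $host and the rest.
 * Entries without a host part (mailto: links, malformed URLs) are skipped.
 */
function splitLinksByHost(array $links, string $host): array
{
    $internal = [];
    $external = [];

    // array_unique() preserves keys, so reindex before iterating.
    foreach (array_values(array_unique($links)) as $link) {
        $linkHost = parse_url($link, PHP_URL_HOST);
        if (!is_string($linkHost)) {
            continue; // no host component found
        }
        if (strcasecmp($linkHost, $host) === 0) {
            $internal[] = $link;
        } else {
            $external[] = $link;
        }
    }

    return ['internal' => $internal, 'external' => $external];
}

// Usage with a small hand-written list standing in for $all_links.
$grouped = splitLinksByHost([
    'http://www.agiratech.com/blog',
    'http://www.agiratech.com/blog',
    'https://github.com/FriendsOfPHP/Goutte',
    'mailto:info@agiratech.com',
], 'www.agiratech.com');

print_r($grouped);
```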

In fact, much more can be done and far more data can be extracted. For instance, building on the example above, we
could follow each of the extracted links and gather further information from those pages as required. I will cover more
such extraction techniques, with different examples, in future blog posts. Try it out for yourself!
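
As a rough idea of what following the links could look like, here is a sketch of a breadth-first walk that visits each page once and stops after a fixed number of pages. It is an illustration under assumptions, not the approach of a future post: crawlLinks, $fetchLinks and $maxPages are names I made up, and the link-fetching step is passed in as a callable so the walk can be tried on an in-memory fake site before pointing it at a real one.

```php
<?php
/**
 * Breadth-first walk over pages, starting from $startUrl.
 *
 * $fetchLinks is a hypothetical callable: given a URL, it returns the
 * absolute links found on that page. $maxPages caps the number of pages
 * visited so the walk always terminates.
 */
function crawlLinks(string $startUrl, callable $fetchLinks, int $maxPages = 20): array
{
    $visited = [];          // URL => true, kept in visit order
    $queue = [$startUrl];   // pages waiting to be visited

    while ($queue !== [] && count($visited) < $maxPages) {
        $url = array_shift($queue);
        if (isset($visited[$url])) {
            continue; // already handled via another page
        }
        $visited[$url] = true;

        foreach ($fetchLinks($url) as $link) {
            if (!isset($visited[$link])) {
                $queue[] = $link;
            }
        }
    }

    return array_keys($visited);
}

// Usage with a fake site held in memory (example.com is a placeholder).
$site = [
    'http://example.com/'  => ['http://example.com/a', 'http://example.com/b'],
    'http://example.com/a' => ['http://example.com/', 'http://example.com/c'],
    'http://example.com/b' => [],
    'http://example.com/c' => ['http://example.com/a'],
];
$fetchFromFake = function (string $url) use ($site): array {
    return $site[$url] ?? [];
};

$pages = crawlLinks('http://example.com/', $fetchFromFake);
print_r($pages);
```

With Goutte, the callable would wrap the same request(), filter('a'), links() and getURI() calls used in crawlerAction. Keeping the page limit small, and restricting the walk to a single host, is a sensible precaution before running it against a live site.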

Posted by Vignesh Thandapani
