
IPv6 Crawler Project Report: Part 1

IPv6 Matrix / August 2010



Author: Dr. Olivier MJ Crépin-Leblond ocl@gih.com Date: 2010_08_27
Contents

1 Abstract
2 Introduction
3 Technical Details
3.1 Hardware
3.2 Network
3.3 Crawler (Back End) Functionality
    3.3.1 Tests and Modular Approach
    3.3.2 Virtual Switches
    3.3.3 Results
    3.3.4 Inputs, Outputs, Errors, Performance and Limitations
3.4 Web Server (Front End) Functionality
    3.4.1 System Software
    3.4.2 Web Page Structure
    3.4.3 Some Data Look-up Examples
    3.4.4 Machine Searchable Output
3.5 Programming Team Feedback (Egypt)
    3.5.1 General Feedback
    3.5.2 Experience Gained
    3.5.3 Mistakes Made
    3.5.4 Future Work
4 Partners


1 Abstract
Following an application to the ISOC Community Projects funding programme, a grant was allocated to purchase equipment and write software to track Internet IPv6 connectivity worldwide. The detailed project application is available elsewhere and is not repeated here, since it would distract from the Team's achievements. The project team has designed and implemented an IPv6 Crawler: a computer and its software that runs through the DNS at preset intervals in order to detect, for example, IPv6 DNS servers and IPv6-compliant Web servers, SMTP mailers, and NTP servers. The project aims to catalyse the rate of IPv6 adoption by creating and making available a set of tables and graphs showing the spread of adoption, per domain name. Results are provided on automatically generated Web pages and can be displayed, for example, per country code top level domain (ccTLD), per gTLD, per type of organization, or per business field; the classification of results is user-configurable. By archiving the data, the service will also be useful for historical purposes, as it will track how a radically new technology spreads on the Internet, and this information might be useful to future strategic network planners working on the future of the Internet.

2 Introduction
The Internet is running out of IPv4 addresses. Whilst many individuals and organizations currently track both the exhaustion of IPv4 addresses and the spread and use of IPv6 addresses, no current effort follows a structured tracking method that could be expanded for future use. This project aims to introduce such a system. It is based around a set of computers dedicated to the task of tracking IPv6 connectivity and archiving the results for immediate or future use. It was designed with a view to using well-known, ubiquitous and sustainable formats, in order to keep the resulting data useful for future generations.

3 Technical Details
The equipment installed in this project is made up of two servers: a back end Crawler, and a front end Web server. Although both are connected to the same part of the backbone and through the same router, the two servers function entirely independently of each other. The Crawler works through connectivity tests and generates huge quantities of data, which are stored as text-based data files. The Web server integrates this data into an SQL database, which can then be interrogated by Web pages to make the results available worldwide. All data generated is archived for historical purposes. The most exciting part of this project is that the data generated by the back end can be analysed completely independently of the crawls taking place. An analogy would be a vacuum cleaner used to suck up anything in its path, with the ability to independently design a system to open the dust bag and explore what is inside it at a later stage. As such, the front end Web server is currently a conceptual design showing some of the analysis capabilities which can be applied to the data.


It is worth noting that the multi-processor, multi-core hardware used in this project functions exclusively for the primary purpose of IPv6 crawling. At present, the limit on crawling speed is imposed by the bandwidth used in the process of connectivity testing: since a large proportion of the traffic generated is UDP traffic, it was decided to throttle the number of parallel processes so as not to cause ripples with our upstream providers.

3.1 Hardware
The hardware used for the project was purchased specifically for this task. There is a front end Web server and a back end Crawler. The two computers are independent, with the back end only performing crawling, and the front end performing web server, archiving, and email functions. The front end synchronises its data and downloads it from the back end at regular intervals. It also acts as a storage database.

Figure 1: The Crawler (left picture), and the Web server (right picture), Router (below)


Specifications for the servers are shown below:

Crawler (back end)
  Model:                        HP DL360p
  Name (eth0):                  turtle.ipv6matrix.org ; crawler.ipv6matrix.org
                                turtle.ipv6matrix.com ; crawler.ipv6matrix.com
                                turtle.ipv6matrix.net ; crawler.ipv6matrix.net
  IPv4 address (eth0) / speed:  212.124.204.162 / 100 Mb/s
  IPv6 address (eth0) / speed:  2a00:19e8:20:1::a2 / 100 Mb/s
  Name (eth1):                  shell.ipv6matrix.org
  IPv4 address (eth1) / speed:  194.33.63.250 / 1 Gb/s (GIH private address space)
  CPU:                          2 x Dual Core Intel(R) Xeon(TM) CPU 3.60GHz
  RAM:                          4 Gb DDR2 SDRAM
  HD Storage:                   146 Gb hardware SATA 2-disk RAID (hot swappable)
  PSU:                          2 x hot-swappable redundant 535W
  Operating System:             CentOS 5 Linux / updated

Web Server (front end)
  Model:                        HP DL140
  Name (eth0):                  elephant.ipv6matrix.org ; www.ipv6matrix.org
                                elephant.ipv6matrix.com ; www.ipv6matrix.com
                                elephant.ipv6matrix.net ; www.ipv6matrix.net
  IPv4 address (eth0) / speed:  212.124.204.170 / 100 Mb/s
  IPv6 address (eth0) / speed:  2a00:19e8:20:1::aa / 100 Mb/s
  Name (eth1):                  tusk.ipv6matrix.org
  IPv4 address (eth1) / speed:  194.33.63.251 / 1 Gb/s (GIH private address space)
  CPU:                          2 x Dual Core Intel(R) Xeon(TM) CPU 3.40GHz
  RAM:                          4 Gb DDR2 SDRAM
  HD Storage:                   2 x 1 Tb fast SATA
  PSU:                          Single 500W
  Operating System:             Ubuntu 4.4 Linux / updated

Telecommunications Router
  Model:                        Cisco 2811
  Operating System:             Advanced IP Services IOS
  DRAM:                         64 Mb
  Ethernet Ports / speed:       2 / 100 Mb/s
  Interface card / speed:       MN-16ESW 16 port / 100 Mb/s


3.2 Network

Figure 2: Network set-up

The transfer of data between the Crawler and the Web server takes place over a pair of private (non-routable) IPv4 addresses and a cross-over 1 Gb/s CAT5e link. This allows for fast synchronisation of data between the servers. The following non-routable IPv4 addresses are used on the 1 Gb cross-over Ethernet cable:

shell.ipv6matrix.org. 0 IN A 194.33.63.250 (on turtle.ipv6matrix.org, the Crawler)
tusk.ipv6matrix.org.  0 IN A 194.33.63.251 (on elephant.ipv6matrix.org, the Web server)

Further peering agreements (peer 2 and peer 3) are in progress and being undertaken by 2020Media Ltd.


3.3 Crawler (Back End) Functionality


The structure of the crawler was designed so that its functionalities are divided into several groups, making the software suite entirely modular.

3.3.1 Tests and modular approach


The core features of the testing procedure are tagged as being part of Procedure Test 1, hereafter referred to as T1, in the first group of tests, with the rest of the features and functionalities belonging to the second group of tests. This is best shown in the following table:
Procedure T1 (core tests, applied to every host):

  DNS           Find IPv4 address(es) from DNS
  DNS           Find IPv6 address(es) from DNS
  ASN           Find ASN by lookup in an internal database (requires regular updating)
  DNS           Find primary, secondary, etc. DNS servers & whether they are dual stack (IPv6 glue)
  SOA           Check SOA record for DNS server & test contact
  Address Type  For each address, determine the type of address from its prefix: 6to4 prefix = 2002::/16; Teredo prefix = 2001::/32; 6bone etc., using a manually compiled prefix database
  Trace/ping    Ping & traceroute IPv4 address(es)
  Trace/ping    Ping & traceroute IPv6 address(es)
  Trace/ping    Calculate the difference in latency between IPv4/IPv6, including mean and variation
  DNS           Identify broken AAAA records from the above DNS results
  Trace/ping    Identify AAAA records with no actual connectivity (through ping & traceroute)
  Trace/ping    Identify MTU differences between IPv4 and IPv6 by using several packet sizes in probing
  Trace/ping    Record hop counts from traceroutes and compare IPv4/IPv6 hop counts
  DNS           Detect if proper reverse DNS is defined (matching of forward/reverse)
  Geo           Use geo-localisation to match the geographical coordinates of the node, using a local geo-localisation database (can be updated)

DNS feature:

  DNS           Check for DNS server
  DNS           Obtain name servers from the DNS records of the domain itself
  DNS / T1      Test name servers according to testing procedure T1

MX feature:

  MX            Check MX records
  MX            Obtain MX details from the DNS records of the domain itself
  MX / T1       Test MX servers according to testing procedure T1
  MX            Follow the procedure for: check primary MX; check secondary MX (if any); check further MX (if any)
  MX / Connect  For each MX record, connect to the remote machine on the SMTP port: detect the remote mailer type (if possible); detect whether connected by IPv6 or IPv4; detect whether an IPv6 record is present but unreachable, thus falling back to IPv4; detect whether TLS is implemented at the remote machine; test mailer/version

WWW feature:

  WWW              Check Web server
  WWW / Automatic  Establish which domain works: check the domain prefixes www, ipv6, www.ipv6, www6, six, or indeed any other prefix (configurable in the config. file)
  WWW / T1         Test Web addresses according to testing procedure T1
  WWW / Connect    Test ports 80 (http) and 443 (https)

NTP feature:

  NTP            Check NTP server
  NTP            Use the address prefix time (i.e. time.example.com) or ntp
  NTP / T1       Test DNS and connectivity of NTP servers according to testing procedure T1
  NTP / Connect  Use NTPDATE to check for an NTP server, using the -d option together with either -4 (IPv4) or -6 (IPv6) to detect it. Record the -4 and -6 results separately for future use (finding out when IPv4 NTP servers start decreasing)

Other:

  Other          Keep the option open for testing any other type of server (modular approach)
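To make the T1 procedure concrete, below is a minimal illustrative sketch in Python 2.6 (the language of the software suite, though this is not the project's actual code) of two of the tests above: finding the IPv4/IPv6 addresses of a host, and classifying an IPv6 address by its prefix. The prefix test is deliberately simplified; the real crawler uses a manually compiled prefix database.

import socket

def find_addresses(host):
    """Return ([IPv4 addresses], [IPv6 addresses]) for a host name."""
    v4, v6 = [], []
    for family, bucket in ((socket.AF_INET, v4), (socket.AF_INET6, v6)):
        try:
            for info in socket.getaddrinfo(host, None, family):
                address = info[4][0]
                if address not in bucket:
                    bucket.append(address)
        except socket.gaierror:
            pass  # no record of this family, or a DNS failure
    return v4, v6

def ipv6_type(address):
    """Very rough classification of an IPv6 address by well-known prefix."""
    address = address.lower()
    if address.startswith('2002:'):
        return '6to4'    # 2002::/16
    if address.startswith('2001:0:') or address.startswith('2001::'):
        return 'Teredo'  # 2001::/32
    return 'other'       # the real crawler consults its prefix database here

if __name__ == '__main__':
    v4, v6 = find_addresses('www.example.com')
    print v4, v6, [ipv6_type(a) for a in v6]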

Using an example domain, example.com, the tests of procedure T1 can be performed on the following extensions: www.example.com and its variants (www6, www.ipv6, etc.), the example.com SMTP mail exchangers (MX), ntp.example.com, and the DNS servers for example.com; hence their grouping as a common module. The software suite is written in Python 2.6.4, and it is possible to enable or disable each type of test for successive runs. This is seen next.


3.3.2 Virtual Switches


The selection of tests is as simple as toggling virtual switches in the configuration file named config.props, a sample of which is shown as follows:

###########################
# Features: if a feature is enabled, this means that for each input *domain*, we will gather a bunch of hosts to analyze.
# e.g. for NS, we get NS hosts; for MX, we get MX hosts; for WWW and NTP, we generate hosts by combining with prefixes
###########################
[features]
NS = on
MX = on
WWW = on
NTP = on

###########################
# Metric: a metric defines a crawling procedure that is done on a single *host*; each metric is applied to all
# hosts of a certain feature
###########################
[metrics]
soa = on
reverse = on
geoip = on
ping = on
tcp80 = on
tcp443 = on
tcp25 = on
tls = on
path = on
ip6Type = on

To toggle a switch off, replace "on" with "off". This is a particularly useful feature because time-consuming tests (trace routing, which includes long timeouts, for example) can be performed at a lesser frequency than less time-consuming tests. Furthermore, it will be possible to add new tests to the crawler as and when they are developed.
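As an illustration of how such switches might be consumed, here is a minimal sketch in Python 2.6 using the standard ConfigParser module; it is an assumption about the loader, not the crawler's actual code.

import ConfigParser

config = ConfigParser.ConfigParser()
config.read('config.props')

# Features decide which hosts are gathered for each input domain.
# Note: ConfigParser lower-cases option names (NS becomes ns).
features = [name for name, value in config.items('features')
            if value.lower() == 'on']

# Metrics decide which tests are applied to each gathered host.
metrics = [name for name, value in config.items('metrics')
           if value.lower() == 'on']

print 'enabled features:', features
print 'enabled metrics:', metrics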


3.3.3 Results
Test results are recorded in files which are saved in a time-coded set of sub-directories. An example of the directory structure is as follows:

crawls
|-- 2010-07-18__12-24-48_summary.db
|-- crawl_2010-06-16__11-40-49.log
|-- crawl_2010-07-18__12-24-48.log
|-- net
|   |-- 2010-06-16__11-40-49
|   |   |-- NS_net.csv
|   |   |-- WWW_net.csv
|   |   |-- NTP_net.csv
|   |   |-- MX_net.csv
|   |   |-- geoip_NS_net.csv
|   |   |-- ....
|   |   `-- net.db
|   `-- 2010-07-18__12-24-48
|       |-- NS_net.csv
|       |-- WWW_net.csv
|       |-- NTP_net.csv
|       |-- MX_net.csv
|       |-- geoip_NS_net.csv
|       |-- ....
|       `-- net.db
|-- ....
`-- com
    `-- 2010-07-18__12-24-48
        |-- NS_com.csv
        |-- ....
        `-- com.db

The .LOG files in the top directory (crawls) provide a record of the crawler's status. The filename format is crawl_yyyy-mm-dd__hh-mm-ss.log, giving the time stamp of the run's start, where yyyy = year; mm = month in number format; dd = day; hh = hour in 24h format; mm = minute; ss = second. The sub-directories of the crawls directory can be named after:

- A top level domain, when the run is generated from an input file listing domains all under one top level domain; the name is derived from the input example.csv file, where example is the Top Level Domain in question. This is the preferred way of running the Crawler; or


- An arbitrary name, derived from an input example.csv file where example could be any word describing the selection criterion, for example britishuniversities, governmentsites, etc. This is seldom used, because the same results can be achieved by interrogating the SQL database generated later on. Nonetheless, it is possible to run the Crawler in this way. (A sketch of the timestamped naming scheme follows below.)
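The timestamped naming scheme is simple to reproduce. The following sketch is illustrative only, with a hypothetical helper name; it builds a log filename and an output directory in the format just described.

import os
import time

def crawl_paths(subdir, root='crawls'):
    """Build a crawl log name and an output directory, both timestamped."""
    stamp = time.strftime('%Y-%m-%d__%H-%M-%S')
    log_file = os.path.join(root, 'crawl_%s.log' % stamp)
    out_dir = os.path.join(root, subdir, stamp)
    return log_file, out_dir

# e.g. crawl_paths('net') -> ('crawls/crawl_2010-07-18__12-24-48.log',
#                             'crawls/net/2010-07-18__12-24-48')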

Each sub-directory then contains another set of sub-directories which include a time stamp in their name, in the format yyyy-mm-dd__hh-mm-ss, where yyyy = year; mm = month in number format; dd = day; hh = hour in 24h format; mm = minute; ss = second. These sub-directories contain the test results, in multiple comma separated value (.CSV) text files. This format is expected to be supported on digital media for the foreseeable future. The .DB files are generated later, during the analysis stage, from the .CSV files, in order to create an SQL database which can be interrogated remotely and/or by a machine. Both the .CSV source files and the .DB database files are archived in their entirety, since the directory structure is time-stamped and therefore not over-written at subsequent runs. The following comma separated value (CSV) text listings are created for each TLD; for example, for the .UK TLD:

File Name(s)                      Main Data Type                    Content Format

MX_uk                             E-mail eXchanger                  [type, domain, host, ipv4, ipv6, rank]
NS_uk                             Name Server                       [type, domain, host, ipv4, ipv6, rank]
WWW_uk                            WWW                               [type, domain, host, ipv4, ipv6]
NTP_uk                            Network Time Protocol             [type, domain, host, ipv4, ipv6]
soa_NS_uk                         Start of Authority (NameServers)  [type, domain, soa, primary_by_rank, primary_inhouse, secondary, total, contact, serial, refresh, retry, expire, minimum]
geoip_MX_uk, geoip_NS_uk,         GeoIP (where is the server?)      [type, domain, host, ipv4, ipv6, asn, city, region_name, country_code, longitude, latitude]
geoip_NTP_uk, geoip_WWW_uk
reverse_MX_uk, reverse_NS_uk,     Reverse IP                        [type, domain, host, ipv4, ipv6, name4, name6]
reverse_NTP_uk, reverse_WWW_uk
ping_MX_uk, ping_NS_uk,           Ping                              [type, domain, host, ipv4, ipv6, count, min, avg, max, std, min6, avg6, max6, std6]
ping_NTP_uk, ping_WWW_uk
tcp25_MX_uk                       TCP on port 25 (SMTP)             [type, domain, host, port, ipv4, ipv6, tcp, tcp6]
tcp80_WWW_uk                      TCP on port 80 (HTTP)             [type, domain, host, port, ipv4, ipv6, tcp, tcp6]
tcp443_WWW_uk                     TCP on port 443 (HTTPS)           [type, domain, host, port, ipv4, ipv6, tcp, tcp6]
tls_MX_uk                         Transport Layer Security          [type, domain, host, ipv4, reachable, tls]
path_MX_uk, path_NS_uk,           Tracing the path                  [type, domain, host, ipv4, ipv6, mtu4, hops4, back4, path4, mtu6, hops6, back6, path6]
path_NTP_uk, path_WWW_uk
ip6Type_MX_uk, ip6Type_NS_uk,     IPv6 Type                         [type, domain, host, ipv6, valid, prefixid, ipv6type]
ip6Type_NTP_uk, ip6Type_WWW_uk
domainPenetration_uk              Domain IPv6 penetration           [domain, ns, mx, www, ntp]

In addition to the above tables, a first analytical step produces summaries generated for all crawled TLDs. These are saved as follows:

geoip_summary              [tld, type, country, hosts, ipv6hosts]
domainPenetration_summary  [tld, total_num_domains, ipv6_enabled_domains_count]
IpDuality_summary          [tld, type, domains, hosts, ipv4, ipv6, ipv4_6, no_ip]
ping_summary               [tld, type, hosts, faster, delay6, delay4]
path_summary               [tld, type, hosts, lesshops, hops6, hops4]

These are used as a basis for the Web server to draw the maps and display statistics. Once transferred over, they can be processed by the Web server and formatted into SQL databases. This is described in further detail in Section 3.4.
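As a sketch of this analysis step, the following Python 2.6 snippet loads one of the per-TLD .CSV listings into an SQLite table (via the sqlite3/pySQLite interface mentioned in Section 3.4.1). The table layout follows the WWW_uk listing above, but the actual import code may differ.

import csv
import sqlite3

connection = sqlite3.connect('uk.db')
connection.execute('CREATE TABLE IF NOT EXISTS WWW_uk '
                   '(type TEXT, domain TEXT, host TEXT, ipv4 TEXT, ipv6 TEXT)')
for row in csv.reader(open('WWW_uk.csv', 'rb')):
    if len(row) == 5:  # skip malformed lines instead of aborting the import
        connection.execute('INSERT INTO WWW_uk VALUES (?, ?, ?, ?, ?)', row)
connection.commit()
connection.close()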



3.3.4 Inputs, Outputs, Errors, Performance and Limitations


The list of domain names to be tested is downloaded from Alexa's list of the one million most popular Web sites worldwide. This list was cleaned of erroneous domains and split into text files on a per-top-level-domain basis.

3.3.4.1 Small number of domains under a Top Level Domain


Some Top Level Domains contain only a small number of domains, largely because the weight of their Web sites in the context of the global Internet is negligible. This leads to erroneous results, usually abnormally high IPv6 penetration rates, because a handful of IPv6-compatible domains has a significant impact on the percentage of IPv6 connectivity. It is hoped that the total number of domains tested in these Top Level Domains can be increased by collaborating with local organisations or the local Regional Internet Registry. In general, results are felt to gain in accuracy when the total number of domains tested under a Top Level Domain approaches or exceeds 1,000 domains. That said, it is important to note that the domains tested are those of the most significant Web sites in a country, and that those Web sites generally account for the majority of traffic in that country. For example, google.com generates much more traffic than unknowndomain.com. Whilst it is important that the domain google.com be tested for IPv6 connectivity, unknowndomain.com would be missing from the crawler's input list of testing domains, but the impact of IPv6 traffic from unknowndomain.com would be negligible. In testing for IPv6 connectivity, with no agreed standards or methodology, we have found that the results we have gathered are somewhat subjective, and that more accurate trends are likely to appear as more results are gathered over time.

3.3.4.2 Performing suspicious activity


At present, a full run of the Crawler, covering approximately 985,000 domains worldwide, takes over a month to complete, with the largest chunk of domains being in the .COM Top Level Domain (approximately 550,000 entries). This causes 5.6 million hosts to be tested with each of the tests described above. It is possible to speed up a run by allowing more simultaneous parallel threads in the software; this is user-configurable. It is worth noting that the speed ceiling on crawling and connectivity testing is set either by a limitation of connectivity to the outside world, or by a hardware limit on the number of parallel processes the machine can run, usually caused by a memory limit. In our particular case, the limit is network-based: the Crawler is capable of at least four times as many simultaneous crawling processes as it is currently set up to run (currently 100 concurrent processes). Although the amount of traffic generated by each test is minimal, it is atypical, since it consists mostly of UDP traffic. Running 400 concurrent processes would, however, have a detrimental loading effect on upstream Internet Service Providers, because the activity might be mistaken for a denial of service attack. This would risk causing:

- blacklisting of our IP addresses;
- upstream traffic shaping of our trace routes;
- connection refusals from SMTP mailers;
- filtering of traffic by firewalls.



In each of those cases, the results recorded by the Crawler would be affected negatively, with a high probability of erroneous results being generated. With the variety of firewall products, and the large amount of malicious activity on the Internet requiring ever tighter firewall rules, it is impossible to avoid such false positives except by behaving sensibly. A proposed solution for scanning a larger database of sites would be to mirror this set-up at various points around the Internet, each mirror scanning the domain database out of synchronisation with the other scanners. This suggestion is described later, in the Future Work section of this report.
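A bounded worker pool is one way to implement such throttling. The sketch below uses Python 2.6 threading and is illustrative rather than the project's code; probe() is a hypothetical stand-in for a single connectivity test. It caps the number of simultaneous probes at the 100 mentioned above.

import threading
import Queue

MAX_WORKERS = 100  # raising this towards 400 risks triggering DoS defences

def probe(host):
    pass  # hypothetical stand-in for one test (ping, traceroute, TCP, ...)

def worker(queue):
    while True:
        host = queue.get()
        if host is None:
            break  # sentinel value: no more work for this thread
        probe(host)
        queue.task_done()

work = Queue.Queue()
threads = [threading.Thread(target=worker, args=(work,))
           for _ in range(MAX_WORKERS)]
for thread in threads:
    thread.start()
for host in ['www.example.com', 'ntp.example.com']:  # in reality ~5.6M hosts
    work.put(host)
for _ in threads:
    work.put(None)  # one sentinel per worker
for thread in threads:
    thread.join()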

3.4 Web Server (Front End) Functionality


The front end Web server, in its present version, provides an example of the type of analysis which can be performed on the large quantities of data archived.

3.4.1 System Software


Parameter                   Feature

Web Server                  Apache 2.2.14 (Ubuntu)
Database Server             CherryPy 3.1.2
Database Server Port        4444
Content Management System   Drupal 6.17
PHP                         5.3.2-1ubuntu4.2
Database                    MySQL 5.1.41
Database extension          pySQLite 2.6.0 / Python 2.6.4
DNS                         Python DNS module
Charting                    Python Google Chart module
GeoIP                       Python GeoIP module
Table Handling / Display    jQuery Grid (jqGrid 3.5.3)
Graph / Maps                Google API (Application Programming Interface)

It is worth noting that, at present, part of the information embedded in a Web page is served directly by the CherryPy system on port 4444; this is due to instability of the system (a memory leak), probably caused by the immaturity of the technologies used here. This is likely to be resolved as developers release updates.
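For illustration, a minimal CherryPy 3.1 application of the kind used here might look as follows. The table, query and output text are invented for the example, not taken from the live system.

import sqlite3
import cherrypy

class Stats(object):
    def index(self):
        # count .uk web hosts that advertise an IPv6 address (example query)
        connection = sqlite3.connect('uk.db')
        count = connection.execute(
            "SELECT COUNT(*) FROM WWW_uk WHERE ipv6 != 'n/a'").fetchone()[0]
        connection.close()
        return 'IPv6-enabled .uk web hosts: %d' % count
    index.exposed = True  # CherryPy 3.1 idiom for publishing a handler

cherrypy.config.update({'server.socket_port': 4444})
cherrypy.quickstart(Stats())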



3.4.2 Web Page Structure


The Web site structure is shown here.

Figure 3: Web Page Structure

The detailed results tables are generated from the mySQL database back end and can therefore be interrogated using filters and search functions. It is important to note that the Web site is a prototype of the type of analytical results which could be displayed online. The Crawler's database contains such a wealth of information that the possibilities for plotting data, chronological analysis, trends, maps, etc. far exceed the concept presented in this report.



3.4.3 Some Data Look-up examples


In order to guide a potential analyst through the Web site, we present here a few suggestions of the types of search which can be performed. These descriptions are by no means exhaustive: a very large number of results are available and can be analysed using the filters in the Latest Data and Data Archives sections, but they are too numerous to be described in detail here.

3.4.3.1 Display of Map of IPv6 connectivity per host


Select IPv6 Host Penetration. Results are displayed as a shaded world map, followed by a table of IPv6 host connectivity. The results shown on this page make use of GeoIP positioning of the actual hosts connected to the Internet: many hosts under a given Country Code Top Level Domain are not, in fact, located in the country of that Country Code. Furthermore, since a large proportion of the world's most popular Web sites are hosted under Generic Top Level Domains, it becomes important to use GeoIP positioning to track connectivity for those, too. It is possible to filter results. The international nature of the Internet is such that the NameServers for a domain are distributed worldwide and may abnormally raise a domain's IPv6 connectivity status, because the NameServer function is often outsourced to ISPs, who have frequently started their network upgrades by upgrading their NameServers. The use of an external IPv6-reachable NameServer does not necessarily mean that a domain has IPv6 connectivity. Using the Filter and checking the WWW box only therefore provides a better idea of Web site IPv6 hosting status worldwide. The same applies to Mail eXchange, which is usually a service handled in-house. The Filter also allows a single top level domain, or a set of top level domains, to be analysed; for example, it is interesting to see the geographic location of the Web server hosts for a particular Top Level Domain.

3.4.3.2 Display a Map of IPv6 connectivity per top level domain


Select IPv6 Domain Penetration. This shows more classical results of IPv6 connectivity per top level domain, irrespective of whether the hosts are located in the country or not. Generic Top Level Domains are therefore not taken into account in this case, although they appear in the table below the map.

3.4.3.3 What is the percentage of dual IPv4-IPv6 stack in the world?


This can be found by selecting the Got a dual IP? tab. A filter allows selection of the Top Level Domain, as well as the type of service tested, bearing in mind the discussion in Section 3.4.3.1 about some external NameServers being dual stack without making a domain truly dual stack.

3.4.3.4 Find a listing of IPv6 dual stack Web Sites in the UK


Select Latest Data. Set the search filter to TLD equal UK. Click on WWW_uk. Then search with ipv6 begins with 2 and ipv4 not equal n/a (this makes sure we are looking at true dual stack nodes).

3.4.3.5 Find a listing of IPv6 only Web sites in the UK


As above, but select ipv4 equal n/a instead.



3.4.3.6 Find if services are really running using IPv6 on a host


This question is particularly interesting: it is akin to asking whether there actually is a service running over IPv6, or whether the published IPv6 address leads nowhere. Select Data Archives, then select a Top Level Domain, and the sub-selection tcp25 (for SMTP e-mail connection), tcp443 (for secure Web, https), or tcp80 (for Web, http). In the resulting table, use the search filter ipv6 not equal n/a and tcp6 equal False; this shows which hosts do not respond to a call on their IPv6 address. Selecting tcp6 equal True instead shows which services actually respond, thus giving a true count of the hosts running and responding on IPv6.
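The same look-up can be expressed directly against the underlying database. This sketch assumes the tcp25 listing for .uk has been imported as a table named tcp25_MX_uk with the columns shown in Section 3.3.3; the real table and column names may differ.

import sqlite3

connection = sqlite3.connect('uk.db')
rows = connection.execute(
    "SELECT host, ipv6 FROM tcp25_MX_uk "
    "WHERE ipv6 != 'n/a' AND tcp6 = 'False'").fetchall()
for host, ipv6 in rows:
    # each host advertises an IPv6 address but refuses SMTP over IPv6
    print host, ipv6
connection.close()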

3.4.3.7 Fake IPv6 addresses?


Select Latest Data. For any of the Top Level Domains, select any sub-selection which contains an IPv6 column. Use the search filter ipv6 starts with ::ffff; this will display fake IPv6 addresses (IPv4-mapped addresses) which are in fact IPv4 addresses.

We recommend spending time using the filters to detect more information, including more DNS anomalies, errors and curiosities.

3.4.4 Machine Searchable Output


The filters used in Section 3.4.3 serve the purpose of formulating a complex HTTP request to the mySQL database. With the format of the SQL databases described earlier in this report, it is possible to formulate more complex requests without using the supplied Graphical User Interface (GUI) on the Web pages, by issuing SQL commands embedded in HTTP requests. Better still, it is possible to design a completely different set of Web pages, or to run analytical software which obtains its data from the back end mySQL database server. This is currently being studied by another team at Nile University, as part of the second part of the project. A more complete explanation of the format of such requests, and of the resulting analytical possibilities, will be discussed at further length in Part 2 of this report.
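Purely as an illustration of the idea, not the actual interface, such a machine request might resemble the sketch below. The endpoint name and parameter are hypothetical, and the real request format will be documented in Part 2 of this report.

import urllib
import urllib2

# hypothetical endpoint and parameter name, for illustration only
query = urllib.urlencode(
    {'q': "SELECT domain, ipv6 FROM WWW_uk WHERE ipv6 != 'n/a'"})
response = urllib2.urlopen('http://www.ipv6matrix.org:4444/query?' + query)
print response.read()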



3.5 Programming Team Feedback (Egypt)


This section reproduces an extract of the Nile University programming team's report. It is included here because it shows a real benefit to the team members in enhancing their hands-on knowledge and experience of the Internet, the DNS and IPv6.

3.5.1 General feedback


As an implementation team we found the project very interesting in nature. The hardware and bandwidth resources dedicated to the project were on the level needed to deliver the results. The project leader was very cooperative and provided the implementation team with all the necessary information and support to complete the task.

3.5.2 Experience gained


If there were something that could have been improved in this project, it would be the balance between crawling and analysis. The project methodology implied by the specifications could be labelled "crawl a lot of things and then analyze", which is actually a reasonable approach to discovering the territory of possible rewarding investigations. However, after the experience gained during this project, we would recommend a more iterative methodology that could be labelled "crawl one thing, come up with an interesting analysis, crawl the next thing". If there is one thing we learned in this project, it is that the data itself guides you on what to look for. That is why a more iterative approach is our recommendation for future work on the project.

3.5.3 Mistakes made


The initial mistake made by the implementation team was to underestimate the complexity of the data. A superficial look at the crawling process could view it as a simple loop over a tiny shell script that dumps its output in a text file. In reality, the process is far from that, because for each crawling feature, the number of special cases encountered is beyond any anticipation. No matter how hard one tries to enumerate the possible outputs of a simple DNS A record lookup, for instance, one always finds, in reality, cases that were not accounted for. [Ed: particularly in cases of broken/erroneous DNS, which are more common than we think!] That is why, after having built an initial prototype, we realised that the task was more complicated than we had anticipated, and we re-engineered the crawler to be extremely robust and agile in the face of case variation. We also realised that the programming team needs not only skilled programmers but also patient and pedantic data observers, who have the interest and will to stare at one log file of crawled data after another and check whether the crawling process made sense. This unanticipated complexity, re-engineering and re-assignment of team members resulted in some delay to project delivery.

3.5.4 Future work


In its current state, the project has laid the foundation for wide-scale crawling, and we are sitting on top of a mine of valuable data. The project's time and resources were sufficient for a primary analysis of this data. The logical next step is to start a sequel project to curate, filter and analyze it.



4 Partners
The Internet Society has provided funding for the project as part of its Community Projects Funding.

The English Chapter of the Internet Society acts as the local home to the project.

Global Information Highway Ltd., through Dr. Olivier MJ Crépin-Leblond, has designed and coordinated the project and sponsored many of its logistics.

Nile University (NU), Egypt, through Dr. Sameh ElAnsary, Dr. Moustafa Ghanem, and Dr. Mohamed Abouelhoda, has partnered to write the software. The talented NU team is composed of Mr. Mahmoud Ismail, Ms. Poussy Amr and Mr. Islam Ismail.

TwentytwentyMedia Ltd., through Mr. Rex Wickham and Mr. Alan Barnett, has provided connectivity and rack space at Telehouse East, Britain's largest data centre and backbone Internet Exchange Point.

CTM International Ltd., through Mr. Omer Hamid, has supplied and configured all the hardware required for the project, including all servers and telecommunication equipment required to connect to the Internet.

Team leaders: Dr. M. Ghanem, Dr. O. Crépin-Leblond, Dr. S. Al Ansary, Mr. O. Hamid and Mr. R. Wickham.

