
Chapter 1: Introduction


The automated process of procuring data from websites has become an essential
component of today’s business marketplace. The skill of writing programs or functions
that can pull data from websites in a structured and scalable manner is in high demand.
The best part of this new trend is that web scraping is not overly complicated. While an
experienced programmer can learn web scraping in no time at all, even individuals with a
very limited amount of programming experience can become super-star scrapers if they have
the right tools, information and, of course, a very eager attitude.
Web scraping, also referred to as web data extracting, harvesting, crawling, etc.,
has received a bad rap of late. In the business world, it has become synonymous with
stealing data from competitors’ websites. While this certainly can be the case, web
scraping can serve a myriad of other purposes. For example, let’s pretend your company is
working with a client whose website has data necessary for your purposes. It’s possible
that your client does not have a convenient means of producing the data you need to carry
on with day-to-day business. This can leave your company spending hours manually
extracting data from their website. In this situation, web scraping is a viable, and ethically
acceptable, option for saving your company time and manpower.
The skill of web scraping can even be valuable outside of the workplace. I’ve
personally used scrapes to procure Major League Baseball statistics and data. The data is
then in Excel for me to analyze, or automatically calculate, the best fantasy team for the
day. I’m not a big gambler so I’ve yet to strike it rich in fantasy baseball, but it’s a perfect
example of how one can use scrapes for personal interests or hobbies. One can see how
easily this skill can be used for sports, news, stocks and so forth. The internet is a
never-ending supply of new information, and scraping offers an efficient means of obtaining the
data you may want for a myriad of reasons.

Legal Issues
The legality of scraping certain sites can be a fuzzy issue. Websites
frequently have disclaimers explicitly prohibiting the use of data extraction tools. It is
your responsibility to make sure you are not breaking any legal rules in performing your
scrapes. If you work for a company that intends to use scraped data, you should definitely
talk to your supervisors and legal department to ensure that you are not getting your
company or yourself in any trouble. This subject can actually get fairly complicated at
times from a legal standpoint and I am not a lawyer so I always recommend consulting
with your company’s legal representatives before performing any scrapes for business
purposes. In general, if a site states that you are not allowed to scrape their data, then you
probably shouldn’t.

Ethical Issues
Now that I’ve made it clear that this book does not promote the breaking of any
laws, let’s address the ethics of scraping data. While I’m not aware of any company that
I’ve worked for using the techniques in this book for nefarious reasons, the truth of the
matter is that many, if not most, large companies are performing scrapes of their
competitors’ sites regardless of disclaimers. Scrapers also justify using these techniques by
pointing out that scrapes are only procuring data that is readily available to the public.
There is no hidden or confidential information being obtained. In theory, you could
accomplish the same tasks by simply copying and pasting data from the site into a
spreadsheet. The only difference that the scrape makes is that it saves time and energy.
This leads into the next advantage of learning the skill of scraping data. By
knowing the tools and techniques used by what you might consider to be nefarious
scrapers, you can stay one step ahead in protecting data on your site. As I alluded to
before, adding a legal disclaimer to your site will only do so much in terms of preventing
scrapers. Let me just say that if I wanted to prevent individuals from scraping data from a
site I had, I would not rely on a legal disclaimer for accomplishing this goal. Luckily, there
are other ways to prevent, or at least inhibit, the scraper from taking data you don’t want
them to have. While the prevention of scraping is not the focus of this book, learning the
process is the first step in prevention.

For This Book
This book will contain many examples of code in both VBA and HTML. For the sake
of showing examples in these languages, brackets will be used to specify the part of the
code that would need to be added or changed. For instance, when you’re looking at an
example and a line in the code reads: [Type your code here], you do not literally write
“[Type your code here]” in the program. The brackets and their content are there to
illustrate or describe what piece of information would go at that location. For instance,
let’s look at the sentence:
My name is [Your name here].
For me, this would translate to:
My name is David.
Before we can get into the details pertaining to web scraping, we must consider and
review the key components involved in the process. Once this is established, we can get
into the basics of writing web scrapes for Excel and then we can address some of the most
common issues that can frustrate a new scraper. Ideally, by the end of this book you
should have a good idea of what it takes to write simple, yet valuable, web scrapes using
Microsoft Excel.


Chapter 2: VBA Basics


Microsoft Excel is ubiquitous in the business world. Every day, millions of
Americans get in their car, drive to work, and promptly open up Excel. For most of this
population, spreadsheets are just a compilation of rectangles which hold letters and
numbers in a very structured and, at times, aesthetically pleasing manner. Their day
typically consists of moving or transforming data by typing, clicking, copying and pasting
it to where it needs to go while occasionally utilizing a formula or two to save time. While
Excel is certainly useful for this manual work, the truth is that most Excel users are
completely unaware of the potential that the program has to turn repetitive and boring
tasks into a thing of the past.
This is where VBA comes into play. Visual Basic for Applications (VBA) is the
scripting language used to automate tasks in Excel. Much like any programming language,
the coding for VBA can be as complicated or as simple as it needs to be. If you’ve ever
recorded or played a macro in Excel, then VBA is being utilized. When the recording
function is used, Excel is essentially writing a VBA program for you. While the language
is useful for recording and playing basic macros, this is really just the tip of the iceberg in
terms of what it can accomplish. In addition to automating almost any task in Excel, VBA
can be used to interact with other programs, which will be especially useful when we get
to the actual scraping portion of the book.
To go over every task that is possible with VBA would take at least another book in and
of itself and would not be practical for our purposes. Since this book is specifically
focusing on the use of VBA in scraping and manipulating data from the web, we will only
be covering the essential terms and functions that VBA provides for these tasks. However,
we will also be covering the more basic elements of programming that are also critical for
scraping the web.

VBA Programming
This book is intended for the use of beginners as well as experienced
programmers. With that being said, this section is probably not necessary for advanced
VBA developers or programmers in general, so it may behoove you to skim or completely
skip this basic review. For you beginners though, if you are completely new to
programming then please note that this is a very quick review of the most basic
components. Again, this book does not intend to give an all-encompassing view of
programming with VBA let alone programming in general. If you’re an absolute beginner
who intends to become a master programmer of various languages, then there is plenty of
information readily available on the internet for every language you can imagine. The
following material is only meant to give a brief overview of the basic concepts of
programming with VBA and how they will be utilized for our purposes.

Variables
There’s a good chance that variables are exactly what you think they are. In
programming, just like in algebra, variables represent, or hold, a value. In VBA, assigning
a value to a variable is especially easy. One simply has to type the variable name, add an
equals sign (=) and, finally, add the value that they want the variable to hold.
Examples:
1. X = 3
2. X = 3.0
3. X = “Three”
One might notice that not all variables contain the same type of data. The values
assigned in Examples 1, 2, and 3 are 3, 3.0, and “Three” respectively. The three in Example
1 is an integer. The three in Example 2 is a number, but the decimal shows that it is not an
integer like the one in Example 1. The X in Example 3 is, of course, not a number at all; it is a
collection of characters, in this case letters, that spells the word “three”. Why am I stating
the obvious? Because the example demonstrates how even though the pieces of
information, in a sense, all mean the same thing to us humans, they are in different forms,
or data types, to computers. Typically, in most languages the programmer must specify the
type of data that the variable will be holding. In the previous examples, Example 1 would
be specified as an integer, Example 2 would be a double, and Example 3 would be a
string.
One of the great things about learning programming through VBA is that it is not
as crucial to specify the variable type as it is in other languages. However, this is generally
only true for writing basic macros. While it is generally good practice to always specify
the variables that you will be using, the beginner can take solace in the fact that their
program probably won’t break if they don’t specify the data type for every variable. The
reason for this is that when a data type is not specified for a variable in VBA, it then
defaults to a variant data type which is extremely versatile and can be used as an integer,
double, string, etc. This can be good or bad for the beginning VBA programmer. On one
hand it saves time and is a luxury to not worry about what data type a variable is,
especially for basic programs or functions. On the other hand, the variant uses much more
memory than the other commonly used data types. When you get into more advanced
concepts and coding it will be crucial to correctly specify what type of data a variable is
housing. There is a time and a place to take advantage of the variable defaulting to variant,
but for our purposes, that being writing macros that scrape data from the web, we’ll
operate under the assumption that defining variables correctly is necessary.
Some of the most common data types that we will be using:
• Boolean- Binary. Examples: True or False
• Integer- Whole numbers (no decimals). Examples: 1, 2, 12, etc.
• Double- Whole numbers or fractions (decimals). Examples: 1.2, 1.33333, 50.12, etc.
• String- Text. Examples: “Cat”, “Dog”, “VBA is awesome”, etc.
• Object- Objects will be discussed in more detail later on as the concept is not
as straightforward as the previous data types. For now, just think of an object as a
separate entity or thing such as an occurrence of an outside program (i.e. internet
explorer). While that’s not technically the entire definition, it’s the most relevant
feature of the data type for our purposes.
• Variant- Variants were briefly discussed earlier. They are the data type that
VBA defaults a variable to when you don’t specify the type. The variant is very
flexible and can serve the purpose of almost any data type. However, they use up
a large amount of memory, which can affect performance. Relying on defaulting
to the variant will also limit you when you get into more advanced programming in
VBA. I only rely on defaulting to variants on small projects in need of a quick fix
which is definitely not the case when it comes to scraping web sites.
Once you’ve decided what data type your variable is, specifying a variable’s data type
is as simple as typing, “Dim [variable] as [data type]”. For example, let’s assume that I
want the variable x to represent 3, an integer. To specify that x will be an integer, I would
type:
Dim x as integer
I could then assign x my integer value of 3, which would be:
x = 3
To specify multiple variables of the same data type on one line, note that VBA requires
the data type to be repeated after each variable; in “Dim x, y, z as integer”, only z is
actually an integer, while x and y silently default to variants. So if I wanted to use
variables x, y and z all as integers I could write:
Dim x As Integer, y As Integer, z As Integer
You can specify a variable’s data type at any point in the program before giving the
variable a value. I believe it helps keep the program organized and adds a sense of
professionalism to specify all variables at the beginning, or technically before the macro is
run, but it can really be done at any point. We’ll illustrate examples of both options after
we’ve covered more basics and can write a program. Again, this is almost completely an
aesthetic choice and will not affect how the program runs.
There is an unlimited number of ways that programs can use variables. However,
there are some aspects of programming that are almost universally used across languages.
This section will give a brief review of the most relevant ones for our purposes. Before we
do this however, if you are a complete newb, let’s get writing your first macro, or program,
out of the way.
First, open a blank Excel workbook. To get to the environment you will be writing
your code in, right-click the tab at the bottom of the spreadsheet (typically labeled
Sheet1 in a new workbook) and select View Code. You can also access the
environment by pressing Alt-F11. A new window labeled Microsoft Visual Basic for
Applications - Book1 should appear. At the top of this window, select Insert and click
Module. A blank white window labeled Book1 - Module1 should appear. This is where
you will be typing your code.
All macros in this environment will begin and end the same way. At the beginning,
type “Sub” followed by whatever you want to name your macro, followed by “()”. To end
your macro, type “End Sub”. So, if you wanted to name your macro “FirstMacro” it
would look something like this:

Sub FirstMacro()
[Type your code here]
End Sub

Note that unlike many other languages, VBA is not case sensitive, meaning that you
need not concern yourself with deciphering uppercase from lowercase. To complete your
first macro, we’ll be utilizing VBA’s MsgBox function, which will display a dialog box
with a message when it is run. To use this feature, simply type MsgBox followed by the
message that you want displayed, in quotation marks.
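Using the message “Hello World!”, the macro might look something like this:

```vba
Sub FirstMacro()
    ' Display a simple dialog box when the macro runs
    MsgBox "Hello World!"
End Sub
```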
This is your complete first macro. There are several ways to run code once you have it
written. For example, you can click the green Run Sub button (button with the green
triangle) or you can go back to your spreadsheet, click the View tab, click the Macros
section and a dialog should appear with the name of your macro displayed. Select your
macro and click the Run button. If you’ve correctly entered the code displayed above,
when you run your macro, a dialog displaying “Hello World!” should appear.

Conditional Statements
Conditional statements are often referred to as “if statements” for pretty obvious
reasons. They check for the existence of a certain condition. If the condition is true it
performs a specific action. If it is false, then it performs a different action or no action at
all. Here’s how a conditional statement might look in VBA.

If [condition is met] then
[action]
else
[different action]
end if

To illustrate this principle, we’ll slightly modify our current code and add a conditional
statement. To do this, we’ll utilize a variable which we will call Var1
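The modified macro might look something like this (the macro name is illustrative; the messages match the behavior described below):

```vba
Sub ConditionalExample()
    Dim Var1 As Integer

    Var1 = 5

    ' Check whether the variable is five and display the appropriate message
    If Var1 = 5 Then
        MsgBox "The variable is five."
    Else
        MsgBox "The variable is not five."
    End If
End Sub
```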


If you’ve entered the code correctly, then a dialog should appear with the message “The
variable is five.” To further test the code, change the variable to any integer other than
five. Run your macro and a dialog will display showing “The variable is not five.”

Loops
Loops might be looked at as a certain type of conditional statement. The difference
between loops and the previously discussed conditional statements is that loops repeat a
section of code multiple times until the condition is met.

For Loops
There are several different types of loops. For our purposes we will primarily be
working with For Loops. For Loops are perhaps the simplest type of loop, as they repeat
for a predetermined number of times (generally speaking). Let’s look at an example
to illustrate the concept. If I had a section of code that I wanted to repeat 5 times I would
write:

For x = 1 to 5
[Action to be repeated]
Next x

We’ll again use the MsgBox function in an example.
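Such a macro might look something like this (the macro name is illustrative):

```vba
Sub ForLoopExample()
    Dim x As Integer

    ' Repeat the same action 5 times
    For x = 1 To 5
        MsgBox "This is my loop example"
    Next x
End Sub
```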


When you run this code, the dialog displaying the message “This is my loop
example” should be displayed 5 consecutive times (you must click the OK button each
time).
In this example, and in most circumstances that we’ll be using For Loops, each iteration of
the loop adds 1 to the variable x. To demonstrate this, we’ll modify our loop example by
including our variable.
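The modified loop might look something like this:

```vba
Sub ForLoopExample()
    Dim x As Integer

    ' Display the current iteration number each time through the loop
    For x = 1 To 5
        MsgBox "This is loop number " & x
    Next x
End Sub
```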


When this code is run it should show 5 consecutive dialog boxes displaying the iteration
of the loop.

Do While Loops
Another type of loop that we’ll be using regularly is the do loop. The do loop is
similar to the For Loop in that it repeats a section of code. There are two types of do
loops: Do While and Do Until. A Do While Loop repeats the section of code while a
specified condition is true. A Do Until Loop repeats the section while a condition is not
met and stops when it becomes true. Clearly, the two options are very similar. I find it
easier to pick one of the two options and stick with it. Most of the time, if a task can be
accomplished by one, then it can easily be accomplished by the other by slightly altering
the inner portion of code. For this reason, I generally stick with Do While Loops, but the
choice is up to you. The general structure of the Do While Loop will look something like this:

Do While [condition]
[Repeat this action]
Loop

Let’s look at an example to illustrate how the Do While Loop works for our
purposes. We’ll also introduce the Application.wait function in this example as we’ll
commonly use this action in while loops. We will be using this function to pause our
macro for a specific amount of time, usually for one or a few seconds. This function was
created to pause a macro until a specified date and time. Naturally, we do not know the
exact date and time of day that we’ll need to use this function, so to get around this we’ll
take the current date and time and add the number of seconds that we want our macro to
pause for. If this is confusing for you, don’t sweat it. Just understand that when you see
Application.Wait (Now + #12:00:01 AM#) it will pause the macro for 1 second. When you
see Application.Wait (Now + #12:00:02 AM#) the macro will pause for 2 seconds, and so
on. When the following macro is initiated it will pause for 5 seconds and then display a
dialog indicating that the macro has finished.
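Such a macro might look something like this (the macro name is illustrative):

```vba
Sub DoWhileExample()
    Dim y As Integer

    y = 0

    ' Repeat while y is less than 5, pausing one second per iteration
    Do While y < 5
        Application.Wait (Now + #12:00:01 AM#)
        y = y + 1
    Loop

    MsgBox "The macro has finished"
End Sub
```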


This concept tends to trip some people up so we’ll walk through it. At the
beginning of the macro, the variable y is 0. The condition in the loop stipulates that the
action in the loop will repeat while y is less than 5. Each time the loop runs it will use the
the Application.Wait function to pause Excel for one second. After this pause, it will add 1 to
our variable y before starting the next iteration of the loop. Therefore, the loop will run 5
times, pausing for one second each time, before y is no longer less than 5 which means the
condition is no longer met and the loop is done. As the macro continues, it will then use
the MsgBox function to display a dialog indicating that the macro has finished. Therefore,
when you run this macro it will pause for 5 seconds before displaying the dialog.

Basic Spreadsheet Interaction
VBA offers various ways of inputting data on a spreadsheet. For inputting data into
individual cells, I personally use and recommend the Cells method. This method simply
looks at the spreadsheet as a plane of coordinates, with cells(1,1) being the first cell in the
spreadsheet (top left corner), with the first number controlling the vertical placement and
the second representing the horizontal placement. It might be best to think of this
method as cells([row],[column]). For example, as previously mentioned, cells(1,1) is
associated with the top left corner of the spreadsheet. To reference the cell on the far left
but in the second row, you would use cells(2,1) and so on.
To place the value you want in the cell, simply type cells([row],[column]).value =
[value you want to enter]. If you want to put in a static text value, make sure you use
quotation marks around your text. For example, if I wanted the cell in the second row of the
second column to say “Dog” I would put:

Cells(2, 2).Value = “Dog”

You may have noticed that when using this method in this manner, the sheet in your
workbook that you want to use is not specified. When using this method as we did above,
VBA will assume that you want this action performed on whatever sheet is currently
active. If you want to specify what sheet the cell is on that you are going to be using then
put Sheets([Name of sheet in quotations]) followed by your cells statement. For example,
if you wanted to put “Cat” in the third row of the first column on a sheet called “Pets” in
your workbook, you could put:

Sheets(“Pets”).Cells(3, 1).Value = “Cat”

Clearly, putting a static value in a cell is a pretty simple process. What we will be doing
frequently is only slightly more complicated, in that we will be using a loop variable to
dictate the cells being used. This is more frequently done in the row coordinate. Here’s a
simple example:
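Such a macro might look something like this (the macro name is illustrative):

```vba
Sub BirdExample()
    Dim x As Integer

    ' Put "Bird" in rows 1 through 10 of column 1 on the "Pets" sheet
    For x = 1 To 10
        Sheets("Pets").Cells(x, 1).Value = "Bird"
    Next x
End Sub
```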


This code should put the word “Bird” in the first 10 rows of the first column of the
worksheet titled “Pets”. Of course, it would be very unusual to have to enter a static value
in such a manner. Typically, the “Bird” portion of the above loop will be a variable that is
changing with each iteration of the loop. To better understand this concept, here is an
example illustrating this principle. If each cell of the first ten rows of the first column had
a unique value and we wanted to use this method to input each of these values into the
same row that they are currently in but in column two we could write:
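One way to write this (the macro name is illustrative; the sheet name “Pets” follows the earlier examples):

```vba
Sub CopyColumnExample()
    Dim x As Integer

    ' Copy the value in column 1 of each row into column 2 of the same row
    For x = 1 To 10
        Sheets("Pets").Cells(x, 2).Value = Sheets("Pets").Cells(x, 1).Value
    Next x
End Sub
```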


This leads into the next important concept for our purposes. Unlike the above
example, it would be very rare for us to be using a static value for the number of rows to
loop through. We typically won’t know if it’ll be ten rows (as in the example above), 100
rows, 8,675,309 rows, etc. Let’s say in the far left column we have a list of URLs that we
will ultimately have to use for our scrape and we need to loop through them one at a time.
It would be highly inconvenient to manually determine the number of rows and then enter
the number into our loop. To get around this inconvenience, we need to automatically
count how many rows are populated with data in the first column and then use that
number in our loop. There are multiple ways to do this in VBA. One easily understandable
method is:
Sheets([Sheet name]).cells(1,1).CurrentRegion.Rows.Count
Clearly there are multiple things happening here to get the number of rows used in
Column 1. In short, the code looks at Cells(1,1) and the “block” of populated cells that it
is part of. It then counts the number of rows in this block, thus producing the number of
rows that we want to loop through. We can then use this number as the upper limit in our
loop. Here’s the same loop that we just used but now with a variable dictating how many
rows to loop through:
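With the row count stored in a variable named z, the loop might look something like this:

```vba
Sub CopyColumnExample()
    Dim x As Integer
    Dim z As Integer

    ' Count how many rows in column 1 are populated with data
    z = Sheets("Pets").Cells(1, 1).CurrentRegion.Rows.Count

    ' Loop through exactly that many rows
    For x = 1 To z
        Sheets("Pets").Cells(x, 2).Value = Sheets("Pets").Cells(x, 1).Value
    Next x
End Sub
```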


Formatting Code
As mentioned previously, what you name your variables is entirely up to you. I
personally like to use a single letter for variables representing the iterations of loops and
use more descriptive names for other variables. In the example above, the variable named
z could just as easily be named loopvariable1, q, or RonaldReagan. However, regardless
of your preference I recommend not naming all variables randomly. While it can be funny
at first, eventually you’ll probably have to look at your code again months or years after it
was originally written and having random names for variables will make it more difficult
to follow the logic of your program. Another aspect of the style of your code that is a
matter of preference is the use of indentation and spaces between lines. For example, you
might find the last code that we used more organized if it was formatted as follows:
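For example, a formatted version of the row-counting loop just described might look something like this (indentation and blank lines added; behavior is unchanged):

```vba
Sub CopyColumnExample()

    Dim x As Integer
    Dim z As Integer

    z = Sheets("Pets").Cells(1, 1).CurrentRegion.Rows.Count

    For x = 1 To z
        Sheets("Pets").Cells(x, 2).Value = Sheets("Pets").Cells(x, 1).Value
    Next x

End Sub
```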


This formatting is completely optional and is a matter of preference. I sometimes
won’t format very short codes (like the ones we’ve been using thus far) but always use my
preference in formatting for longer programs. My only recommendation is that if you do
use formatting to keep your code organized, be consistent with how you use it. Notice in
the example above that I indented the beginning and end of the loop so that they are
visually in line with one another and put an additional indentation for the loop’s contents.
This may add little value in codes as short as these examples, but when you start writing
long programs that may include loops inside of loops inside of loops, this type of
formatting makes troubleshooting whatever problems arise much easier.



Chapter 3: Scraping with VBA



Navigating the Web
Before we can get into great detail regarding pulling pieces of data out of websites, we
first must become familiar with using VBA to navigate the web. This skill can serve a
myriad of purposes even if you don’t intend to scrape data from sites. For example, it’s not
uncommon to use VBA to create macros that navigate to and through various sites simply
for display purposes. More importantly, these techniques can be used in website testing.
VBA can be used to navigate through all pages of a website, testing it for reliability and
accuracy.
Just like you and me, VBA requires a web-browser to navigate throughout the web.
However, there’s a good chance that you, as a human, are quite a bit more versatile when
it comes to operating different web browsers. If you’re like me, you might prefer Firefox
on a daily basis when it comes to manual internet browsing. At the same time, you
probably can easily switch to Chrome or Internet Explorer if you ever felt like it.
However, when it comes to programming with VBA, Internet Explorer is the most
common web browser to use as it is by far the easiest. The reason for this is pretty
obvious. VBA was developed by Microsoft to be used with Microsoft programs so
naturally Internet Explorer is the easiest one to utilize in this context. This is not to say
that it’s impossible to use Firefox or Chrome with VBA, but it generally requires much
more effort. Even if Internet Explorer isn’t your personal preference when it comes to
browsing the web, it should be more than sufficient for your scraping purposes.
To start your web-browsing with VBA, we must create our Internet Explorer
object. We’ll start by defining our variable, which we’ll call IE, as an object. Object
variables are not established in the exact same way as other variables. The only difference
you need to be aware of is that when declaring Object variables, you must add the word
“Set” before the name of the variable. So when we create the Internet Explorer variable it
will look something like this:

Set IE = CreateObject(“InternetExplorer.Application”)

There are other ways of creating Internet Explorer objects in VBA. However, this often
involves adding outside references to your VBA project. This task is not difficult and any
seasoned VBA developer will know how to do it, but I find that using the aforementioned
method of creating this object to be just as effective as any other means so it is generally
the method that I stick with. After you’re done using your browser for each macro, it’s
important to clear the object, which essentially means you’re changing the object to
nothing which can be done with this code:

Set IE = Nothing

This step is often overlooked. People often skip it because they assume that once the
browser is closed then it is no longer using the computer’s resources; however, this is often
not the case. This situation often leads to what’s referred to as a memory leak. In larger
scrapes, there will be times when it’s necessary to open and close a browser object many
times. Each browser object you create takes up a certain amount of the computer’s
available resources. If the objects are never cleared, even if the browser has been closed,
the amount of memory dedicated to the objects you create continues to accumulate, which
can ultimately lead to your scrape running very slowly.
This leads into the next point. Creating an object, such as one that is used for web
browsing, does not mean the object will automatically become visible. It is very possible
to create and use a browser that remains completely invisible to the computer’s user. While
there may be an appropriate time to use Internet Explorer in this manner, I generally never
do. Being able to see the navigation that the browser performs is of vital importance when
testing and monitoring a new scrape. Keep in mind that even if you write your code
flawlessly, there are certain aspects when it comes to navigating the internet that will
remain out of your control. Websites will be updated and changed over time which can
lead to your scrape not working properly, so it’s ideal to be able to see the progress of your
navigation when you can. Naturally, you’re not going to want to monitor your scrapes at
all times, but it’s important to be able to see the browser when you want to. To make this
object visible, we’ll use the code:

IE.Visible = True

Now that your browser is created and visible, it’s time to navigate to whatever site you’ll
be displaying or scraping. This code is also short and simple:

IE.Navigate [website URL]

For example, if you wanted to navigate to Google.com, your code would look like this:

IE.Navigate “www.google.com”

Putting these three pieces together will create your first web-browsing macro:
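It might look something like this (the macro name is illustrative):

```vba
Sub BrowseExample()
    Dim IE As Object

    ' Create the Internet Explorer object, make it visible, and navigate
    Set IE = CreateObject("InternetExplorer.Application")
    IE.Visible = True
    IE.Navigate "www.google.com"
End Sub
```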


Timing
When using VBA to interact with Internet Explorer, time is of critical importance. This
may seem like one of the more trivial matters affecting your scrapes, but timing issues can
actually be some of the most troublesome and frustrating obstacles you encounter when
navigating and scraping sites on the web.
Allow me to elaborate: websites typically require at least a few seconds to load.
While we know that it would be a futile effort to operate a website before it is done
loading, your VBA program does not know any better unless it is specifically told to wait
for the page to load. As far as VBA is concerned, once it processes IE.navigate
“google.com” it is ready to continue on to the next line of your code whether or not
Internet Explorer has finished loading the site. When this happens, your program will most
likely be halted by an error message.
To get around this problem, you must tell the program to wait for the website to
finish loading. There are multiple ways to do this. While at times it can be useful to use
the application.wait function that was previously mentioned, this is not the ideal choice
when it comes to waiting for web-pages to load. The reason for this is the unpredictability
in the amount of time that it takes for a page to load. At any given time, this process may
take one, ten or twenty seconds. Having a static value in the Application.Wait function
leaves room for error in this context as it may leave too much or not enough time for a
page to load. While the idea of leaving an excessive amount of time for a page to load
makes sense in theory, if your scrape is navigating through many pages, as will probably
be the case, this wasted time can be very valuable. Even an extra second or two to each
page will end up adding up to a significant amount of lost time when you’re navigating
through thousands of pages.
There are various ways to get around the page-loading timing issue. One of the
most common ways to pause the macro while the page loads utilizes a Do While
loop:
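A minimal sketch of this loop, which simply spins while the browser reports that it is busy:

```vba
Do While IE.Busy
    DoEvents   ' yield control so Excel stays responsive while waiting
Loop
```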


Another option for accomplishing the same goal is:
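This second approach checks the browser's ready state instead. With late binding (no reference to Microsoft Internet Controls), the constant READYSTATE_COMPLETE is not defined, so its value of 4 is used directly here:

```vba
Do While IE.ReadyState <> 4   ' 4 = READYSTATE_COMPLETE
    DoEvents
Loop
```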


In theory, either of these methods should accomplish the same goal of pausing the
macro until the site has finished loading, but it should be noted that both of these methods
have been criticized for being inconsistent. However, many have found that utilizing both
of these conditions at the same time adequately accomplishes the task. The combination of
the two codes might look something like this:
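One way to combine the two checks into a single loop (again using 4 in place of READYSTATE_COMPLETE under late binding):

```vba
Do While IE.Busy Or IE.ReadyState <> 4   ' 4 = READYSTATE_COMPLETE
    DoEvents
Loop
```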

As previously stated, these methods have been criticized for lacking consistency. In my
experience, it is not quite accurate to call these methods inconsistent per se,
although they do leave something to be desired. It's been my experience that when these
methods work for a certain site, they will typically always work for that site, but if they
don't work for a site, they will never work for that site. There are times when the loop fails to pause
the macro at all, and there are other times when the macro will get stuck in the loop
indefinitely. In general, I believe the above methods work most of the time, so they are a
good starting point when trying to pause a macro while a page loads. Just keep in mind
that they are not perfect and it will sometimes be necessary to utilize more creative
programming to accomplish this task.
There are occasions when these methods will work most of the time but occasionally
get stuck in an infinite or seemingly infinite loop. Fortunately, there are ways around
this. One of my favorites is to add a variable that counts every iteration of the loop up
to a maximum limit. In theory, this can be done without a pause, but the limit would
typically have to be a very large number for the loop to serve a useful purpose. Therefore,
it’s ideal to have a pause in each iteration of the loop. For example:
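A sketch of such a loop, using a one-second pause and a counter variable q capped at five iterations (variable names are my own):

```vba
Dim q As Integer
q = 0
Do While IE.Busy Or IE.ReadyState <> 4            ' 4 = READYSTATE_COMPLETE
    Application.Wait Now + TimeValue("0:00:01")   ' pause one second
    q = q + 1                                     ' count this iteration
    If Not q < 5 Then Exit Do                     ' give up after five tries
Loop
```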


The application.wait function is necessary in our loop because, without a
pause, the variable q would meet the condition (counting up to five) in less than a second,
which would hardly be useful to us. Since the variable q increases by one on every
repetition of the loop, in the fifth iteration the condition q < 5 will no longer be true,
thus activating the Exit Do portion of the code and ending the loop. So if we look at the entire
code, we can see that the loop will continue until both IE.Busy and IE.ReadyState <>
READYSTATE_COMPLETE are false. However, due to the q < 5 portion of the code, the
loop can go through at most 5 iterations before it exits.
The final method I use to pause macros while a web page loads is less commonly
utilized, but I've found it to be invaluable at times. As previously mentioned, there will be
sites where the IE.Busy and IE.ReadyState <> READYSTATE_COMPLETE checks simply
don't work at all. This method utilizes the LocationURL property. When we add this
property to our browsing object, we can get whatever the browser’s current URL is. How
is this useful? When we direct the browser to a specific URL it does not navigate there
instantly, it takes time for the new URL to load. We can use this to our advantage with the
following logic. Use IE.LocationURL to determine the browser’s current URL. We’ll now
refer to this URL as URL1. At this time, we tell the browser to navigate to the desired
URL which we’ll call URL2. While this page is loading, the IE.LocationURL operation
will still display URL1 until URL2 has loaded. Therefore, once you’ve instructed the
browser to go to URL2, you can start a loop that repeatedly checks IE.LocationURL until
it reads URL2 at which time the page as loaded and you can resume your macro. Here is
an example of such a loop.
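A sketch of the idea, where URL1 and URL2 are string variables holding the current and destination addresses (the destination URL is imaginary; note also that some sites redirect or append to the address, so an exact comparison will not work everywhere):

```vba
Dim URL1 As String, URL2 As String
URL1 = IE.LocationURL             ' the browser's current URL
URL2 = "http://www.FakeURL.com"   ' hypothetical destination
IE.Navigate URL2
Do While IE.LocationURL <> URL2   ' still showing the old page?
    DoEvents
Loop
```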


It's going to take practice and repetition to know when to use which method to stall your
program. As previously mentioned, there are an infinite number of websites that come in
an endless combination of styles and operations, so there is not one method that will work
for all of them.

Interacting with Websites
Now that we’ve covered navigating to websites, we can begin covering how to
interact with the site. While there will be scrapes that simply load a page, scrape the data
and then carry on to the next page, there are times when it will be of paramount
importance to understand how to interact with the site through VBA in the same way that
you would manually interact with it. For instance, there will be times when you need
to click a button or buttons for the site to produce the information that you want. There
will also be times when you need to insert text into a field on the page for it to operate the
way you want it to. Furthermore, at times it will be easier to click an element on a page to
take you to another page, as opposed to relying on entering a new URL. Each separate
component of the site that is of interest to us, such as buttons, text boxes, icons, filters,
etc., will be referred to as an element.

Identifying Elements
Regardless of how you intend to utilize an element within a website, whether that
means clicking on it, inserting data into it, taking data from it, etc., you must determine
how to identify the element within the coding of the website and how you can reference it
in your VBA program. In order to do this, you need only a very basic level of knowledge
about the structure of websites and how they are organized. As you're probably aware,
the basic structure of websites is created with HTML (Hypertext Markup Language).
However, as technology has evolved, HTML has become insufficient for providing the most
advanced features used in modern websites. It has gotten to the point where HTML
really only provides the "backbone" of most modern sites. In fact, it's likely that every site
you regularly use incorporates multiple other languages such as CSS, JavaScript, PHP and
so on.
Lucky for us, you don’t need a wealth of knowledge regarding all of these languages to
be a proficient web scraper. The reason for this is that much of the value that these
additional languages provide focuses on aesthetics or graphics, which is to say that a large
portion of the actual content or information that you want from a site will still reside in the
HTML portion of the code.
There are a few ways to go about identifying the element of a website that you want to
use. The most basic way of doing this is to manually scan through the HTML code
looking for the element you want. To do this, open a browser, navigate to a desired
site and right-click anywhere on the page. A dialog should appear with a few options depending on the browser you're using.
Typically, the option you're looking for will say "View Source", "View Source Code",
"View Code" or something similar. By clicking on this, a window should appear which
displays all of the available code behind the site. At this point, you can scan through the
code, and try to follow it to the identifying feature of the field that you want. One effective
strategy is to copy whatever text is in your element, or an element near the
element that you want, and then search for the copied text in the "View Source"
window. I could go into more detail about how to do this, but I won't. The truth is that
while it’s important to know how to do this in the event that you need to review the entire
page’s code, it is by far the most tedious and inefficient way of finding the identifying
feature of the element you want.
Many web browsers offer a more efficient means of accomplishing this goal. This is
one of the reasons that I generally prefer doing my manual browsing with Mozilla Firefox
as it provides an easy way of finding the code for the exact element you want. To do so,
simply move your cursor over the element that you want to use, right-click and select
“Inspect Element.” A new section at the bottom of your browsing window will appear
with the section of code responsible for the element highlighted.
A final option for identifying the elements that you want to use in your scrapes is
to run a macro that procures the innertext, value, or other identifying feature of each
element on the page and then to search through the results for the element you want. I
have included code below that accomplishes this goal: it takes the innertext and
identifying features of each element that has innertext and places them in the workbook.
It's important to note that there are many more elements on each page that won't appear
on the worksheet, as they don't have innertext. This code could certainly be improved;
however, that would call for more complicated programming. Ideally this macro will be
simple enough for individuals that do not have much experience with VBA to understand.
If you are brand new to VBA, it may look intimidating, but don’t harp on every detail.
Hopefully by the end of this book you’ll have a better understanding of how this code
works and you can then manipulate it to your liking. If you combine the material in this
book with plenty of practice then this code will begin to look simple, if not elementary.
Furthermore, it is easy to find similar macros online that accomplish similar tasks and
have extra bells and whistles. Keep in mind that even though these macros can be useful
tools in identifying the element you want to use; they are still only tools which are
intended to help the process. It is still ultimately up to you to determine the elements you
want to use and the best way to reference them in your scrape.
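One simple version of such a macro, assuming the browser object IE has already been created and navigated to the target page. It loops through every element on the page and writes the tag name, id, class name and innertext of each element that has innertext into the first four columns of the active worksheet:

```vba
Sub ListElements()
    Dim Elem As Object
    Dim r As Long
    r = 1
    ' "*" matches every tag on the page
    For Each Elem In IE.Document.GetElementsByTagName("*")
        If Len(Elem.innerText) > 0 Then
            Cells(r, 1).Value = Elem.tagName
            Cells(r, 2).Value = Elem.ID
            Cells(r, 3).Value = Elem.className
            Cells(r, 4).Value = Elem.innerText
            r = r + 1
        End If
    Next Elem
End Sub
```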



While it is usually easy to find the HTML code for the element that you want to
use, it is not always easy to properly reference the element. There are several factors that
come into play here, but it typically depends on what information the HTML code
provides for the desired element. To explore this issue, we’ll observe the various methods
that VBA provides for locating an HTML element. For several of the forthcoming
examples, we’ll use the following HTML code:
<tagname class="ghi" name="def" id="abc">This is my example HTML code
</tagname>
We will first discuss the methods used to properly reference an HTML element
through VBA and then we will explore how to use these references to accomplish our
goals.

GetElementByID
The GetElementByID method is ideal for scraping data when the option is
available. The primary reason for this is that an ID will typically only refer to a single
element on a site, meaning that there will not be multiple elements that have the same ID.
When it comes to scraping, finding an element you want that provides an ID is like taking
candy from a baby. If we continue using IE as our browser object variable, the code for
taking data for these elements is simply:

IE.Document.GetElementByID("[Element ID]").InnerText

So, let’s apply this procedure to our example HTML code. Notice that the ID portion of
the code has been bolded for the sake of demonstration.

<tagname class="ghi" name="def" id="abc">This is my example HTML code
</tagname>

We can see that the ID for the element we are using is "abc". Therefore, to reference this
element we could use the code: IE.Document.GetElementByID("abc").

GetElementsByName
Similar to the GetElementByID procedure is the GetElementsByName procedure. Beyond
"Name" replacing "ID", there is one major difference between the two.
Notice that the "Element" portion of the GetElementsByName procedure has an "s" at the
end, which is not true of the GetElementByID procedure. Unlike IDs, there can be multiple
elements with the same name, so it's important to further specify which one you are
referring to. When multiple elements share the same defining feature, such as a name,
each of these same-named elements is assigned a hidden index number, starting at 0,
which you can add at the end of your GetElementsByName reference in parentheses.
This might sound confusing but it’s actually very simple. If we have several
elements that have the same name, then the first one has an index of 0, the second has an
index of 1, the third has an index of 2, and so on. All you have to do is add this index
number between two parentheses to the end of your name reference. Let’s look at our
example code to further elaborate.

<tagname class="ghi" name="def" id="abc">This is my example HTML code
</tagname>

Notice that the name portion is bolded for the sake of demonstration. If we wanted to
reference the first element with the name of "def" we could write:
IE.Document.GetElementsByName("def")(0). Even if there were only one element on the
page named "def", it would still be necessary to put the 0 as a reference.
The example above is technically accurate but not very realistic. If an element has an
ID, it would be simpler to use that in your reference since it does not require an index
number. Let’s look at this section of code to get a better understanding of how using
GetElementsByName can be useful.

<tagname class="ghi" name="def">This is my first HTML code </tagname>
<tagname class="ghi" name="def">This is my second HTML code </tagname>
<tagname class="ghi" name="def">This is my third HTML code </tagname>
<tagname class="ghi" name="def">This is my fourth HTML code </tagname>

This chunk of HTML code provides a much better illustration of when
GetElementsByName can be used. Notice that there are no IDs in these four elements and
they all have the same name. However, each of the four is displaying a different phrase in
the text portion of the elements.
To reference the first one, we would write:

IE.Document.GetElementsByName("def")(0)

To reference the second one, we would write:

IE.Document.GetElementsByName("def")(1)

To reference the third one, we would write:

IE.Document.GetElementsByName("def")(2)

etc.

GetElementsByClassName
The GetElementsByClassName procedure is almost exactly the same as the
GetElementsByName procedure. The only difference is that the procedure looks for the
element's class instead of its name. Everything else, including the use of index numbers, is
exactly the same.

<tagname class="ghi" name="def">This is my first HTML code </tagname>

Assuming this element is the first or only one with a class name of “ghi”, we would
reference it by writing:

IE.Document.GetElementsByClassName("ghi")(0)

GetElementsByTagName
At this point, you can probably guess how to use the GetElementsByTagName
procedure. This procedure is essentially the same as the previous two procedures,
except that it utilizes the name of the tag. Everything else, including the use of index
numbers, is exactly the same. For this example, we'll use "td" as our tag name.

<td class="ghi" name="def">This is my first HTML code </td>

Notice that the “td” portion of the code has been bolded for demonstration. If this is the
first or only element on the page with the tagname, “td”, we could reference it by writing:

IE.Document.GetElementsByTagName("td")(0)

Working with Elements
Now that you have a basic idea of how to use some of the most common
procedures for referencing HTML elements, we’ll explore how these features are useful.
Simply adding these procedures to your VBA program will accomplish nothing in and of
itself; the code would simply find the element and do nothing with it.
Therefore, you must instruct the program what to do with each element that it
references. To explore this topic, we'll look at some of the most common actions that we
typically perform while browsing the web. To do so, we will explore the functionality of
the two most commonly used devices that communicate instructions to our computer: the
mouse and the keyboard.

The Mouse
We'll start with the basic functionality that the mouse provides. When browsing
the internet, one usually moves the cursor to the element one wants to use and, most of
the time, clicks the left mouse button to activate whatever function that element performs.
For our purposes, the actual navigation of the mouse is irrelevant, because you don't need
to literally move the mouse to an element to click on it. We can, instead, reference the
element that we want to use with the previously discussed procedures and then instruct
VBA to click on it. In a sense, our reference to the element does the navigating for us, so
all we have to do is instruct VBA to click on it. To do so, we can simply append ".Click"
to the end of the code that refers to the element. For example, if there was a button on the
website with the following HTML code:

<div id="button1" name="button"></div>

To click this button in VBA, we could write:

IE.Document.GetElementByID("button1").Click

The Keyboard
In everyday life, the keyboard is used to enter text into fields. When scraping, the
difference is that if you already know how to reference the field that you want to add text
to, there is no need to click on it or press "Tab" in order to activate it. Instead, simply state
your reference to the element, add ".innertext =", then add the text that you want to enter
in quotation marks. For example, we'll pretend that you want to enter text into this HTML element:

<div id="abc" name="def"></div>

To enter the string “This is my Text” you could write:
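Using the GetElementByID reference from earlier, the line might be:

```vba
IE.Document.GetElementByID("abc").innertext = "This is my Text"
```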


Pulling Data
While entering text will be necessary at times when performing a scrape, it will
typically be much more common to do just the opposite by taking text from an element
and storing it somewhere, such as a cell on your spreadsheet or in a variable. To do so,
take the code that we used to enter text, but reverse the two sides of the equal sign.
Let's look at our last example with this adjustment:
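Reversed, the line would read as follows. Note that this exact line is not valid VBA, since a literal string cannot appear on the left of an assignment; it is shown only to illustrate the reversal, which the next paragraph corrects:

```vba
"This is my text" = IE.Document.GetElementByID("abc").innertext
```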


Of course, the string "This is my text" can't be used to store the contents of the
referenced element, so we must change this portion of the code to a variable that is able to
store the element's contents. If we use Var1 as our variable, our code would look
something like this:

Var1 = IE.Document.GetElementByID("abc").innertext

If, instead of storing the element's contents in a variable, we wanted to store it in our
spreadsheet, we can replace the variable in our code with a reference to the cell that we
want to use. We'll use cell A1 for this example.
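Using cell A1, the line might look like this:

```vba
Range("A1").Value = IE.Document.GetElementByID("abc").innertext
```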

This technique can be used with any of the aforementioned techniques for referencing an
element. Similar lines of code are going to be the "bread and
butter" of your scrapes:
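For instance, lines like these (the ids, names, classes and tag names are carried over from the earlier examples):

```vba
Range("A1").Value = IE.Document.GetElementByID("abc").innertext
Range("A2").Value = IE.Document.GetElementsByName("def")(0).innertext
Range("A3").Value = IE.Document.GetElementsByClassName("ghi")(0).innertext
Range("A4").Value = IE.Document.GetElementsByTagName("td")(0).innertext
```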


Putting the pieces together
Before moving on to the more complicated concepts that may be required, let's put
together the pieces that we've covered so far to make a complete macro. In theory, you
now have all of the tools required to create a web scrape, though it would have to be a
very simple one. By looking at the code for an entire, albeit simple, web scrape, it should
be easier to understand the value of the more advanced topics that follow, in addition
to providing a recap of the material covered. For this imaginary scrape we'll navigate to
FakeURL, pull text from the element with an id of "abc" and then pull text from the first
element with a name of "def". I've added comments to walk you through every step of the
process.
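A sketch of the complete macro. The URL is imaginary, and the wait loop uses the combined check from the Timing section (with 4 in place of READYSTATE_COMPLETE under late binding):

```vba
Sub SimpleScrape()
    ' Create the browser object and make it visible
    Dim IE As Object
    Set IE = CreateObject("InternetExplorer.Application")
    IE.Visible = True

    ' Navigate to the imaginary target page
    IE.Navigate "http://www.FakeURL.com"

    ' Wait for the page to finish loading (4 = READYSTATE_COMPLETE)
    Do While IE.Busy Or IE.ReadyState <> 4
        DoEvents
    Loop

    ' Pull text from the element with an id of "abc"
    Range("A1").Value = IE.Document.GetElementByID("abc").innertext

    ' Pull text from the first element with a name of "def"
    Range("A2").Value = IE.Document.GetElementsByName("def")(0).innertext

    ' Close the browser when finished
    IE.Quit
End Sub
```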


Chapter 4: Creative Scraping


As previously mentioned, the tools and techniques covered in the previous section
will generally make up most of your scrapes. However, it is rare for a website to be as
straightforward as the imaginary one that we created. There are an infinite number of ways
to create and structure the code behind a website, therefore you must be flexible in your
ability to apply the aforementioned techniques to create a successful scrape. This section
will cover some of the more common problems and solutions that I’ve frequently come
across in my scraping career. Even though many of the tools have been provided for you,
it will ultimately be up to you to piece them together in the right way to accomplish your
goals. The following solutions are just a few of the common ones I’ve had to implement.
I’ll illustrate one of the more common challenges that you’ll come across with an
anecdote. Say you’re a fantasy baseball guru and you have a list of pitchers that you want
to get statistics for. You already know the site that you want to use, but you have to go to a
separate page for each pitcher. For each pitcher you want to get their earned run average
(ERA), innings pitched (IP), strikeouts (K) and wins (W). If, on each page, each stat
had its own element with an ID, this would be a pretty easy scrape. The HTML code
might look something like this:

<div class="stats">
<div id="era">ERA: 3.12</div>
<div id="ip">IP: 200</div>
<div id="k">K: 100</div>
<div id="w">W: 10</div>
</div>

This would be the ideal situation to get the desired stats. You could simply use the
GetElementByID method for each stat. We’ll make a variable for each one:
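Assuming the page is already loaded in IE, the four variables might be assigned like this:

```vba
Dim ERA As String, IP As String, K As String, W As String
ERA = IE.Document.GetElementByID("era").innertext
IP = IE.Document.GetElementByID("ip").innertext
K = IE.Document.GetElementByID("k").innertext
W = IE.Document.GetElementByID("w").innertext
```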

If there is a unique id for each stat on one page of the site, there’s a good chance that the
next page will utilize the same id for the same information for the next pitcher. However,
there is a good chance that there will not be a unique id for each stat you want. For
instance, the HTML code might look something like this:

<div class="stats">
<div class="stats1">ERA: 3.12</div>
<div class="stats1">IP: 200</div>
<div class="stats1">K: 100</div>
<div class="stats1">W: 10</div>
</div>

This would also be pretty easy to scrape, assuming each pitcher's page displays the
same number of stats. You could just use GetElementsByClassName("stats1") with
indexes of 0, 1, 2 and 3.

Using this method, however, will be problematic if each pitcher’s page does not
have a consistent number of stats. To illustrate this, let’s look at this section of code on
two separate pages:

Pitcher 1 (First Page):
<div class="stats">
<div class="stats1">ERA: 3.12</div>
<div class="stats1">IP: 200</div>
<div class="stats1">K: 100</div>
<div class="stats1">W: 10</div>
</div>

Pitcher 2 (Second Page):
<div class="stats">
<div class="stats1">ERA: 2.51</div>
<div class="stats1">WHIP: .8</div>
<div class="stats1">IP: 250</div>
<div class="stats1">K: 200</div>
<div class="stats1">W: 18</div>
</div>
Notice that on the second pitcher's page there's an additional stat after ERA:
WHIP. If we were to use the VBA code that we discussed in the last section, the procedure
that pulled the IP stat from the first pitcher's page,
IE.Document.GetElementsByClassName("stats1")(1).innertext, would now be pulling the
WHIP instead. This is because it is looking for the second element with a class name of
"stats1", regardless of what information is in the element. By adding the additional stat, the
index after the GetElementsByClassName procedure is now off by one for all of the
remaining GetElementsByClassName procedures on the page. In other words,
IE.Document.GetElementsByClassName("stats1")(0).innertext would correctly pull the
ERA stat, but IE.Document.GetElementsByClassName("stats1")(1).innertext would pull
the WHIP instead of IP, IE.Document.GetElementsByClassName("stats1")(2).innertext
would incorrectly pull IP instead of K, and so on.
This is where creative problem solving is of enormous importance. With this
limited amount of information, my suggestion would be to loop through all of the
elements in the class, looking through the text of each element to determine whether or not
it contains the data that we’re after. In this situation, we can utilize the first few characters
of each field to determine whether or not it is one of the elements that we want.
For instance, only the ERA field will contain the text “ERA:”. As you can see in
our previous example with Pitcher 1, the ERA field contains “ERA: 3.12”. If we don’t
know what the index number will be for the desired ERA statistic, we can loop through
each element checking for the string “ERA:” and once we find a field that does contain the
string, we can infer that it is our ERA statistic and take the data we want from it. To set up
this loop, we'll count how many elements have a class name of "stats1" and then loop
through each one, checking for the text "ERA:". To do this we'll utilize the InStr
procedure. This procedure searches for a string inside of a larger string and is performed
as follows: InStr([starting position], [large string], [small string]). For the [starting
position] portion of the code, I almost always use the number 1 so that the search starts at
the beginning of the larger string. The rest is pretty self-explanatory. The [large string] is
the string that we expect will, at some point, contain the small string that we are
searching for. For example, if we use "ABCDEFG" as our large string and "E" as our
small string, the procedure would look like this: InStr(1, "ABCDEFG", "E"). The result
of this procedure should be the number 5, as "E" is in the fifth position of our large string.
However, if we run this procedure with a small string that is not in the large
string, for example InStr(1, "ABCDEFG", "Q"), the result will be 0. Thus, if we apply this
strategy to determine whether or not each element that we are looping through contains
the text "ERA:", then a result greater than 0 implies that "ERA:" exists somewhere in the
string being tested and, therefore, that this
must be the element that we are looking for.
To make the code easier to write and follow, I created an object variable to hold
my collection of element references (or multiple
IE.Document.GetElementsByClassName("stats1") references, if you want to think of it
that way). In other words, by putting:
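A sketch of that assignment:

```vba
Dim PitchStats As Object
Set PitchStats = IE.Document.GetElementsByClassName("stats1")
```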


for the rest of the code I can write PitchStats instead of the entire reference,
IE.Document.GetElementsByClassName("stats1"), every time that I want to use the
reference. You might also notice my For Loop runs from 0 to (PitchStats.Length - 1). This
is done because, as previously mentioned, the index for elements starts at 0 instead of 1, so
we must start the loop from 0 instead of 1. However, if the loop ran from 0 to
PitchStats.Length (the number of elements that we’re checking), it would run through one
too many iterations, causing an error. It must, therefore, only run to PitchStats.Length - 1
to account for the index starting at 0 instead of 1. You might also notice the Exit For
procedure after the variable is assigned in the if statement. This was added because once
we’ve found the element that we are looking for and the variable is assigned, there’s no
need to run through the remaining iterations of the loop.
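Putting the pieces together, the loop described above might be sketched as follows (it assumes PitchStats has been set as described and the page is already loaded):

```vba
Dim q As Long, ERA As String
For q = 0 To PitchStats.Length - 1
    ' Does this element's text contain "ERA:"?
    If InStr(1, PitchStats(q).innertext, "ERA:") > 0 Then
        ERA = PitchStats(q).innertext   ' store the matching element's text
        Exit For                        ' no need to check the rest
    End If
Next q
```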

Parsing
The skill of parsing data can be imperative when it comes to giving structure to scraped
data. Being proficient at parsing creates more options for the scraper in terms of what
elements they scrape and how to use this data. Here’s a common scenario that you’ll come
across while scraping. Let's pretend that I'm going to scrape from our HTML code again,
but this time I will pull the parent element, which has a class name of "stats", instead of
the child elements that have a class name of "stats1". In doing so, I will be pulling data
from all of the child elements at one time instead of individually from each child element.
To do so I could write:
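A sketch of that line, assuming the parent element is the first (or only) one on the page with the class name "stats":

```vba
Dim AllStats As String
AllStats = IE.Document.GetElementsByClassName("stats")(0).innertext
```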

While this method seems much simpler than the previous one, there’s a major issue in that
it will pull all of the text from all of the child elements into a single string. For this
example our HTML code will be:

<div class="stats">
<div class="stats1">ERA: 3.12</div>
<div class="stats1">IP: 200</div>
<div class="stats1">K: 100</div>
<div class="stats1">W: 10</div>
</div>

If we pulled the innertext from the "stats" element, it would look something like this:
"ERA: 3.12IP: 200K: 100W: 10". While this is clearly an easy way to get hold of a large
chunk of data, without parsing this string into coherent pieces it is
essentially useless.
As you can see, in this situation it would be much easier to just pull each of the
desired pieces of data from the child elements ("stats1"). However, there will be times
when this will not be a practical option and it will be necessary to parse your misshapen
wad of data into something useful. There are an infinite number of methods and tools that
can be used for parsing. However, for our purposes we'll stick with some of the basics,
which can be very versatile. There's a good chance you're familiar with these concepts
already, but for the sake of being thorough we'll review them anyway.
Left() - This function pulls a specified number of characters from the string, starting
from the first character on the far left. Here is the structure for using these
functions: Left([String], [Number]). For example, let's look at the string "ABCDE". If we
were to write Left("ABCDE", 1), it would pull the letter "A", because "A" is the single
character at the far left of the string. If we were to write Left("ABCDE", 2), it would pull
"AB". Left("ABCDE", 3) would pull "ABC", and so on.
Right() - This function is exactly the same as the Left() function except that it starts from
the opposite side of the string. Right("ABCDE", 1) would pull the letter
"E", Right("ABCDE", 2) would pull the letters "DE", etc.
Mid() - The Mid() function is quite possibly the most valuable when it comes to
parsing data. It is structured as Mid([String], [Number1], [Number2]). The string is,
obviously, the string that you want to pull data from. Number1 is the index number of the
character that you want to start pulling from. This sounds confusing, but it isn't. Just think
of each character of your initial string being assigned a number, starting at 1 with the first
character, 2 for the second, 3 for the third and so on. So, if your initial string was
"ABCDEFG" and you wanted to pull the characters starting with the letter "B",
Number1 in your Mid() function would be 2 because "B" is the second character in the
string. If you wanted to start from "C", then Number1 would be 3, etc. Number2 in your
function is just the number of characters that you want to pull. Assume you want to pull 2
characters starting with the letter "C"; you could write Mid("ABCDEFG", 3, 2). The
result of this function would be "CD" because "C" is the third character in the string, so
if we start from "C" and pull two characters, they would be "CD". If we use the exact
same function but change Number2 to 3 instead of 2, Mid("ABCDEFG", 3, 3), the
result would be "CDE", etc.
Len() - This function is pretty simple. It gives the number of characters in
the string and is structured as Len([String]). For example, Len("ABC") would result in
the number 3, Len("ABCD") would be 4, etc.
InStr() - We've already reviewed this function, so to briefly recap, the InStr
function finds the location of a small string within a larger string. The result of this
function is the position at which the smaller string begins in the larger string. For example,
InStr(1, "ABCDEFG", "DEFG") would return 4, since the smaller string begins at the
fourth character of the larger string.
We’ll now take these basic functions and use them to parse our example string;
“ERA: 3.12IP: 200K: 100W: 10”. The first stat that we want to procure from the string is
ERA. For the sake of our first demonstration, we’ll pretend that the ERA stat can only be a
single digit followed by a decimal followed by two more digits (i.e. X.XX). If this were
the case, then this would be a very easy parse with the mid() function. As you recall, this
function requires 3 inputs or variables: The string which we will take our substring from,
the starting position of our sub string and the length of our substring. We, of course,
already know our string variable, “ERA: 3.12IP: 200K: 100W: 10”. The next input we
need is the starting position for our substring. If we are confident that this string will
always begin with “ERA: ”, which, including the trailing space, is five characters long,
then we can infer that our ERA stat will always begin at the sixth position of our string,
therefore our [Number1] would be 6. Now we just have to determine our [Number2] input,
which is the length of our substring. Since for this example we established that the stat
will be in the format of “X.XX”, this substring will always be 4 characters long. By
putting all of the pieces together our procedure looks something like this:

Mid(“ERA: 3.12IP: 200K: 100W: 10”,6,4)

Now let’s use the same example but with a much more realistic circumstance: that
ERA will not always be in the “X.XX” format. Let’s say that some of the pitchers that we’ll
be collecting data for haven’t been doing so well and have ERAs that require an
additional digit and must be in the format “XX.XX”. In this circumstance, if we continue
using the Mid() function then our first two inputs would be the same, but we would have to
adjust our variable for the length of our substring since it will sometimes be 4 and
sometimes 5. For this example, we’ll pretend that the stat IP will always follow the
ERA stat. Since the substring “IP” always comes after the ERA stat that we are after, we can
find the location of “IP” to help determine how long our ERA stat is. If we use the InStr
function we can determine where “IP” is in our string and, thus, determine where the
ERA stat ends. Because “IP” begins immediately after the last ERA digit, the length of the
ERA stat is simply the position of “IP” minus the starting position of the stat, which we
determined is 6. We can then use that result as [Number2] in our Mid() function.

InStr(1,“ERA: 3.12IP: 200K: 100W: 10”,“IP”) = 10
10 - 6 = 4

Therefore, our ERA stat is 4 characters long. If we were to do the same procedure with a
larger ERA:

InStr(1,“ERA: 10.12IP: 200K: 100W: 10”,“IP”) = 11
11-6=5

Our ERA stat is 5 characters long.
We can now use this logic to finish our Mid() function. I’ll create a variable called
Num2 which will be our [Number2] input. I’ll also create the variable StatString for the
string that we are parsing.
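A minimal sketch of such a macro might look like the following (the procedure name ParseERA is an assumption; Num2 and StatString are the variables named above):

```vba
Sub ParseERA()
    Dim StatString As String
    Dim Num2 As Integer

    StatString = "ERA: 3.12IP: 200K: 100W: 10"

    'The ERA stat starts at position 6 (after the five-character prefix
    '"ERA: "), and "IP" begins right after the stat's last digit, so the
    'stat's length is the position of "IP" minus 6.
    Num2 = InStr(1, StatString, "IP") - 6

    'Display the parsed ERA stat in a dialog box.
    MsgBox Mid(StatString, 6, Num2)
End Sub
```

Running the same macro against the larger stat, “ERA: 10.12IP: 200K: 100W: 10”, requires no changes at all; Num2 simply works out to 5 instead of 4.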

Upon running this macro, a dialog should appear displaying the ERA stat in the correct
format. Because of the logic that we’ve used, we can now run the exact same macro with
the exact same string, except for a larger ERA stat, and it will still display the correctly
formatted ERA:

Let’s now attempt this same parse under a more complicated scenario. For this
example, we’ll assume that we don’t know what stat is following our ERA stat.
Consequently, one page’s ERA might be followed by wins or walks or any other stat.
Regardless of the specifics, the take-away is that we can’t rely on whatever stat is
following ERA since it may not be consistent, so we cannot utilize it as we did in our
previous section.
There is still, however, one feature of the structure of our ERA stat that works in our favor.
Even if you are not familiar with baseball statistics, it would not take long to notice that
regardless of the value of one’s ERA, it is almost always displayed with two digits after
the decimal. We can use this characteristic to our advantage by determining the location of
the decimal. This can be done multiple ways. You could, for instance, use the InStr function
again. However, for this example I will use an additional method solely for the sake of
demonstrating its use. With this technique you create a loop that checks each character in a
string for the character that you’re looking for.
Observe the code below. The For Loop is set up to run from 1 to the number of characters
in our StatString.
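A sketch of that loop might look like this (the counter variable x and StatString come from the surrounding text; using Exit For to stop the loop once the character is found is one common approach):

```vba
Dim x As Integer

For x = 1 To Len(StatString)
    'Look at the single character at position x.
    If Mid(StatString, x, 1) = "." Then
        Exit For    'Stop looping; x now holds the decimal's position.
    End If
Next x
```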

With the Mid(StatString, x, 1) section of code, each iteration of the loop is looking
at a substring of StatString starting at x (the iteration of the loop) that is one character
long. In a sense, this technique looks at each character of the string as a substring and
determines whether or not it is equal to “.”. When the loop finds a character that equals
“.”, the loop ends with x still holding the position of the character in the string that
contains “.”. With this example, x would be the number 7. Since we know that the ERA
stat will end two digits beyond the decimal, we add 2 to get the position of the last
character for ERA, which means that it is in the 9th position. Of course, this is a position
within the larger string, not the length of our substring. We must, therefore, subtract the
number of characters in our larger string that come before our substring. Since we already
know that our starting position will always be 6, there will always be 5 characters before
our substring, so we will always be subtracting 5 from the end position number to get the
number of characters for our substring to use as [Number2] in our Mid() function. For the
sake of keeping our code as simple as possible and omitting unnecessary steps, instead of
adding 2 and then subtracting five ([Number2] = x + 2 - 5), we can skip the unnecessary
arithmetic and simply subtract three from x to get the length of our substring
([Number2] = x - 3).
As previously mentioned, there are an infinite number of ways to go about parsing
strings. These are just a few of the methods that I often employ to get the data that I need.
As you do more parsing, you will develop your own techniques and styles that suit your
needs best and make the most sense to you. There are times when parsing a particular
string to suit your needs might feel like a Rubik’s Cube, where each move has the
potential to destroy previously made progress. As such, it will behoove you to keep the big
picture, or overall goal, in sight as you work on the smaller pieces of code. Eventually
even the most complex of parses will become second nature to you.

















Chapter 5: Finishing Touches


The material in this chapter is not necessary when it comes to performing scrapes.
However, these tips and tricks can make the regular use of your scrapes easier. As an
added bonus, they also do a good job of impressing managers and colleagues.

Count and Summary
One of the easiest “extras” that I typically add to scrapes is one or more counters which
keep track of records as they are scraped. The purpose of this is to provide a summary at
the conclusion of the macro. Counters can be used to keep track of almost any aspect of
the data you are pulling. To do so, create an integer variable and add 1 to it every time the
event you are counting occurs. This can just as easily be done when counting multiple
types of events by creating an integer variable for each count. At the end of the macro, add
a message box that displays the count and whatever other information you want to include
in the summary. For this example, we’ll again pretend that I’m scraping a baseball website
for pitcher data, with each pitcher’s statistics displayed on its own page. I want to keep
two separate counts, one for right-handed pitchers and one for left-handed pitchers, and
display a summary of the results at the end of the scrape. We’ll also pretend that the
element that displays the pitcher’s handedness has an ID of “Hand” and that the
NoOfPages variable holds the number of pages that will be scraped.
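A sketch of such a macro might look like this (the IE browser object, the loop structure, and the “R” value of the Hand element are assumptions; only the “Hand” ID and the NoOfPages variable come from the text above):

```vba
Sub CountPitchers()
    Dim RightCount As Integer, LeftCount As Integer
    Dim i As Integer

    'Set the counters to 0 before the main loop begins.
    RightCount = 0
    LeftCount = 0

    For i = 1 To NoOfPages
        '[Navigate to pitcher page i and wait for it to load]
        If IE.document.getElementById("Hand").innerText = "R" Then
            RightCount = RightCount + 1
        Else
            LeftCount = LeftCount + 1
        End If
        '[Pull the rest of the pitcher's stats]
    Next i

    'Display the summary at the conclusion of the macro.
    MsgBox "Scrape complete." & vbNewLine & _
           "Right-handed pitchers: " & RightCount & vbNewLine & _
           "Left-handed pitchers: " & LeftCount
End Sub
```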

As this example demonstrates, it’s important to set the variables to 0 outside of the main
loop. If this part is put inside the loop, then each iteration of the loop will reset the
variables to 0 which would make them pretty useless at counting.

Error Handling
Error handling is a crucial component to advanced VBA development. While the use of
error handling might not be necessary for the simplest of VBA scrapes, when it comes to
larger or more complicated scrapes, knowing how to properly dictate procedure when
errors occur can make your life much easier. Since there are so many things that can go
wrong when working with VBA that can cause an error to occur, it would be impossible to
address every potential situation, so we’ll just cover the basic idea of error handling and
some of the functionality that I’ve found to be most useful.
The basic idea behind error handling is simple: telling the macro how to proceed
when an error occurs in the course of your script. The most commonly used method for
handling errors is often placed at the beginning of a simple script: “On Error Resume
Next”. This statement simply tells the program to proceed to the next line when
an error occurs. This technique can be useful for simple scripts, but will likely have to be
avoided for the most complicated macros. This is because when an error occurs in larger,
more complex code, it’s unlikely that simply skipping a line will really fix the problem, so
your program is likely going to be derailed even if you skip the line where the error is
first noticed. The technique that I find more useful is to create a section of code, usually
placed at the very end of the macro, dedicated to the handling of errors.

Sub ExampleMacro()
On Error GoTo Errhandler
[Main portion of macro]
Exit Sub
Errhandler:
'Execute this section of code when an error occurs.
End Sub

After your errors are addressed in this section of code, you’ll likely want to either end
your macro, or specify a location in your script for it to resume. If your error handling
section is at the end of your script, the macro will end after the error procedure is
performed. If you want your macro to resume after the error code is performed, you must
have a specified location in your code for this point. To establish this, simply add the
name you’d like to give this point, followed by a colon, at the location where you’d like your
script to resume. To continue running your macro at this point, put “Resume [point name]”
at the end of your error code. In this demonstration, I’ll call this point Restart.

Sub ExampleMacro()
Restart:
On Error GoTo Errhandler
[Main portion of macro]
Exit Sub
Errhandler:
[Execute this section of code when an error occurs]
Resume Restart
End Sub

Naturally you won’t be able to create an error handler for every possible problem that
can trigger an error in your code. This is especially true of scrapes compared to other
macros as you have no control over the content of the websites you are scraping and they
can change at any time. There’s no conceivable way that you can write a script that can
adjust to every potential change that a website can make, therefore your error handlers will
probably have limited value in terms of ameliorating new problems that occur when
pulling data from a website. What I’ve found to be a useful technique for long scrapes that
are impractical to constantly monitor is to place code in the error handling section that
sends me an email notifying me that an error has occurred in my macro, so that I know to
address the problem when I have the chance. In these situations, my error handling code
would look something like this:
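An error handler along these lines might look like the following (the recipient address is a placeholder, and the exact wording of the subject and body is an assumption; CreateObject("Outlook.Application") and CreateItem(0), where 0 is a mail item, are the standard late-bound Outlook calls):

```vba
Errhandler:
    Dim OutApp As Object
    Dim OutMail As Object

    'Create a blank Outlook email (0 = olMailItem).
    Set OutApp = CreateObject("Outlook.Application")
    Set OutMail = OutApp.CreateItem(0)

    With OutMail
        .To = "you@example.com"
        'Subject line includes the date and time the error occurred.
        .Subject = "Macro error - " & Now()
        'Body includes the specific error number and description.
        .Body = "Error " & Err.Number & ": " & Err.Description
        .Send
    End With

    Set OutMail = Nothing
    Set OutApp = Nothing
```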

This is a simple, yet effective code to use for addressing macro errors. It uses Microsoft
Outlook to create a blank email, adds the relevant information that might be useful when
addressing the error, and sends the email. Notice that the subject line has the date and
time that the error occurred and the body of the email has the specific error code. Whether
or not you want the macro to try to continue running after sending this email will depend
on the overall structure of the code and the tasks it is performing. Keep in mind that if you
instruct your macro to resume after you run your error code, if errors continue then you
will continue getting emails. In these situations, it may be beneficial to put a counter in
your error code which stops the macro when it gets to a certain number of errors so that
you don’t end up with an inbox with thousands of emails informing you of macro errors.


Scrolling
In a typical scenario, your macro will be pulling data and inputting it into your
spreadsheet one line at a time. Depending on your zoom, after about 20-30 rows of data
are added you will no longer be able to observe the data being added unless you manually
scroll down. This can be a pain when you’re trying to keep track of how many records
your scrape has pulled during the course of its run. To get around this nuisance, I’ve found
it beneficial to add a piece of code that automatically scrolls down as each row of data is
added. This way, the row where your data is being added is always visible when looking at
your spreadsheet. This small piece of code will scroll the view of your spreadsheet one
row at a time: ActiveWindow.SmallScroll Down:=1.
Since you probably want several rows to be populated before this scrolling begins (if it
started immediately you’d only see the top row of data being populated), it can be helpful
to add a counter which counts each row as it is populated and to instruct this section of
code to run only when the counter reaches a certain number. When this counter reaches
the predetermined number, then the scrolling function will occur one row at a time for
each iteration of the scrape.
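Inside the scrape’s main loop, that might look something like this (the variable name RowCount and the threshold of 25 are assumptions):

```vba
'Inside the main loop, after each row of data is written:
RowCount = RowCount + 1

'Once enough rows are on screen, scroll down one row per iteration
'so the row currently being populated stays visible.
If RowCount > 25 Then
    ActiveWindow.SmallScroll Down:=1
End If
```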


Chapter 6: Conclusion


Review
This book has covered various aspects of web scraping with VBA and some of the
more common challenges that you may come across as you learn the art. To briefly recap,
we started with exploring what web scraping is and how it can be used both professionally
and personally as well as legal issues that may accompany the use of this skill in certain
scenarios. We also explored the utility of VBA and how it can be used to manipulate data
in Excel as well as interact with other programs. From there, we reviewed the basics of
programming with VBA while focusing on the features that will most benefit the web
scraper, after which we explored the essential and more advanced components and tools
that can be utilized in VBA to interact with Internet Explorer to navigate the web and
extract data from websites. From there, we briefly explored the art of parsing data and
how it is essential to the web scraper.

Final Thoughts
You may have noticed that I’ve repeatedly referred to web scraping as an art. This
is because, despite the degree of technical know-how required, the more advanced scrapes
demand a large amount of individual creativity, which means that each program you write
will be unique to your own style of programming, your way of thinking about data, and
the best way to mold the data available to you into the structure that best suits you.
Web scraping has become more common in the marketplace and is now considered
by many to be an essential component of a successful business. This is a skill that is in
high demand and can set you apart from others in the workplace. I know this because I have
personally experienced it. Sharing this experience and how the material in this book can
benefit others is one of the primary reasons that I made the decision to write this book.
It’s my hope that the information in this book will help VBA developers, or aspiring
VBA developers, acquire the necessary skills in the art of web scraping to further their
careers and ultimately achieve their aspirations. This book isn’t supposed to instantly turn
the reader into an all-star VBA developer or data scraper. Achieving these goals requires
work and practice which can’t be learned simply by reading a book. However, the
information contained in this book should provide you with some of the most valuable
tools of the trade and prepare you for some of the more common challenges you’ll face in
your scraping endeavors.
