
Institute of Technology Management & Research

WEB SCRAPING AND ITS PROJECTION

SRI VENKATESWARA COLLEGE OF ENGINEERING

TEAM-10

MENTOR: N.BALAMURUGAN

MEMBERS: SRUTHI VANDHANA.T
         SRIVIBHUSHANAA.S
         SIDDHARTH.K

GOAL:
Write a Python program that scrapes tabular data from a
web page and stores the results in an Excel file. The
program should perform basic Excel operations and update
a dashboard in the workbook.
Example: student results published in a web table, or
financial data available on Yahoo Finance. The program
logs in to the site without human intervention, scrapes
the table, and pastes the data into an Excel file. For
financial data, the program computes basic statistics
(mean, median, mode, and standard deviation) and
identifies the trend with a projection for the next two
months. Together these figures form the dashboard.
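The dashboard figures named in the goal can be illustrated with a minimal sketch that uses only Python's standard library. The function names `summarize`, `linear_trend`, and `project`, as well as the sample prices, are illustrative assumptions and not part of the project code. The trend here is a straight-line least-squares fit, which is only one possible reading of "trend with a projection".

```python
import statistics

def summarize(prices):
    """Return the basic dashboard statistics for a list of prices."""
    return {
        "mean": statistics.mean(prices),
        "median": statistics.median(prices),
        "mode": statistics.mode(prices),
        "std_dev": statistics.stdev(prices),  # sample standard deviation
    }

def linear_trend(prices):
    """Least-squares slope and intercept of price against day index 0, 1, 2, ..."""
    n = len(prices)
    x_mean = (n - 1) / 2
    y_mean = statistics.mean(prices)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(prices))
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean

def project(prices, days_ahead):
    """Extend the fitted line `days_ahead` steps past the last observation."""
    slope, intercept = linear_trend(prices)
    last = len(prices) - 1
    return [intercept + slope * (last + k) for k in range(1, days_ahead + 1)]

closes = [10.0, 12.0, 12.0, 14.0, 16.0]  # made-up sample values
stats = summarize(closes)
forecast = project(closes, 2)
print(stats)
print(forecast)
```

The resulting dictionary and forecast list could then be written to a worksheet, for example through pandas' `DataFrame.to_excel`.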

PLATFORM: Python

WEB SCRAPING:

import csv
import urllib.request
from bs4 import BeautifulSoup

# Historical daily prices for INFY on Yahoo Finance.
URL = ("https://in.finance.yahoo.com/quote/INFY/history"
       "?period1=1536258600&period2=1541961000"
       "&interval=1d&filter=history&frequency=1d")

# Download the page and parse it with the lxml parser.
html = urllib.request.urlopen(URL).read()
soup = BeautifulSoup(html, 'lxml')

# The price table carries the classes "W(100%) M(0)"; take its rows.
rows = soup('table', {"class": "W(100%) M(0)"})[0].find_all('tr')

# Write one CSV line per table row; the with-block closes the file.
with open('dataoutput.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for row in rows:
        cols = row.findChildren(recursive=False)
        cols = [ele.text.strip() for ele in cols]
        writer.writerow(cols)
        print(cols)
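The row-by-row extraction can be tried offline with the sketch below. It deliberately swaps BeautifulSoup for the standard library's `html.parser` so that it runs without network access or third-party packages. The `TableRows` class, the sample HTML, and the prices in it are made up for illustration and do not come from the Yahoo Finance page.

```python
import csv
import io
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # row currently being read
        self._cell = None  # text pieces of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Strip surrounding whitespace, like ele.text.strip() in the listing.
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None

# Made-up sample table standing in for a downloaded page.
SAMPLE_HTML = """
<table>
  <tr><th>Date</th><th>Close</th></tr>
  <tr><td>Nov 09, 2018</td><td> 660.10 </td></tr>
  <tr><td>Nov 08, 2018</td><td>668.95</td></tr>
</table>
"""

parser = TableRows()
parser.feed(SAMPLE_HTML)

# Write the rows as CSV into an in-memory buffer instead of a file.
buffer = io.StringIO()
csv.writer(buffer).writerows(parser.rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Cells that contain a comma, such as the dates here, are quoted by `csv.writer`, so the file stays readable by Excel.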

PROJECTION:
import quandl, math
import numpy as np
import pandas as pd
from sklearn import preprocessing, svm
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
from matplotlib import style
import datetime

style.use('ggplot')

df = quandl.get("WIKI/GOOGL")
print(df)
# Keep only the adjusted price and volume columns.
df = df[['Adj. Open', 'Adj. High', 'Adj. Low', 'Adj. Close', 'Adj. Volume']]
# Daily high-low spread and open-to-close change, both in percent.
df['HL_PCT'] = (df['Adj. High'] - df['Adj. Low']) / df['Adj. Close'] * 100.0
df['PCT_change'] = (df['Adj. Close'] - df['Adj. Open']) / df['Adj. Open'] * 100.0

df = df[['Adj. Close', 'HL_PCT', 'PCT_change', 'Adj. Volume']]
forecast_col = 'Adj. Close'
df.fillna(value=-99999, inplace=True)
forecast_out = int(math.ceil(0.01 * len(df)))
df['label'] = df[forecast_col].shift(-forecast_out)

X = np.array(df.drop(columns=['label']))
X = preprocessing.scale(X)
X_lately = X[-forecast_out:]
X = X[:-forecast_out]

df.dropna(inplace=True)

y = np.array(df['label'])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf = LinearRegression(n_jobs=-1)
clf.fit(X_train, y_train)
confidence = clf.score(X_test,y_test)
print(confidence)

forecast_set = clf.predict(X_lately)
df['Forecast'] = np.nan

last_date = df.iloc[-1].name
last_unix = last_date.timestamp()
one_day = 86400
next_unix = last_unix + one_day

for i in forecast_set:
    next_date = datetime.datetime.fromtimestamp(next_unix)
    next_unix += one_day
    df.loc[next_date] = [np.nan for _ in range(len(df.columns) - 1)] + [i]

df['Adj. Close'].plot()
df['Forecast'].plot()
plt.legend(loc=4)
plt.xlabel('Date')
plt.ylabel('Price')
plt.show()
print(df['Forecast'])
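The key step in the projection listing is `shift(-forecast_out)`, which pairs each row with the closing price `forecast_out` rows later. The sketch below reproduces that pairing with plain lists so the effect is easy to inspect. The helper `make_labels` and the sample prices are illustrative assumptions, not part of the project code.

```python
import math

def make_labels(values, forecast_out):
    """Pair each value with the value `forecast_out` steps later (None if unknown)."""
    n = len(values)
    return [values[i + forecast_out] if i + forecast_out < n else None
            for i in range(n)]

closes = [100.0 + i for i in range(150)]          # made-up prices 100.0 .. 249.0
forecast_out = int(math.ceil(0.01 * len(closes)))  # 1 % of 150 rows, rounded up

labels = make_labels(closes, forecast_out)

# Rows with a known label are available for training and testing ...
train_rows = [(x, y) for x, y in zip(closes, labels) if y is not None]
# ... and the trailing rows, which have no label yet, are the ones to predict.
rows_to_predict = closes[-forecast_out:]

print(forecast_out, len(train_rows), rows_to_predict)
```

This mirrors why the listing sets aside `X_lately` before calling `dropna`: the last `forecast_out` rows have features but no label, so they can only be used for prediction.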

PYTHON PACKAGES:

BEAUTIFULSOUP:
Beautiful Soup is a library that makes it easy to
scrape information from web pages. It sits atop an
HTML or XML parser, providing Pythonic idioms for
iterating, searching, and modifying the parse tree.

QUANDL:
Quandl is a marketplace for financial, economic, and
alternative data, delivered in formats that analysts can
use from Python, Excel, MATLAB, R, and other tools.

PANDAS:
pandas is an open source, BSD-licensed library
providing high-performance, easy-to-use data
structures and data analysis tools for
the Python programming language.

NUMPY:
NumPy is a package for numerical computing in Python.
Besides its scientific uses, NumPy can also serve as an
efficient multi-dimensional container of generic data,
and arbitrary data types can be defined.

SKLEARN:
scikit-learn is a machine learning library for Python.
It offers simple and efficient tools for data mining and
data analysis that are accessible to everybody and
reusable in various contexts.
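The projection code prints `clf.score(X_test, y_test)` as its "confidence". For scikit-learn's `LinearRegression`, `score` returns the coefficient of determination R^2. The sketch below computes the same quantity by hand. The function name `r_squared` and the sample numbers are illustrative assumptions.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total variation
    return 1.0 - ss_res / ss_tot

actual = [3.0, 5.0, 7.0, 9.0]      # made-up test labels
predicted = [2.5, 5.5, 7.0, 9.0]   # made-up model outputs
score = r_squared(actual, predicted)
print(score)
```

A value of 1.0 means the predictions match the test labels exactly. A value near 0 means the model does no better than always predicting the mean.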

OUTPUT:
Date        Open     High     Low      Close     Volume      Ex-Dividend  Split Ratio  Adj. Open    Adj. High    Adj. Low     Adj. Close   Adj. Volume
2004-08-19  100.01   104.06   95.96    100.335   44659000.0  0.0          1.0          50.159839    52.191109    48.128568    50.322842    44659000.0
2004-08-20  101.01   109.08   100.50   108.310   22834300.0  0.0          1.0          50.661387    54.708881    50.405597    54.322689    22834300.0
2004-08-23  110.76   113.48   109.05   109.400   18256100.0  0.0          1.0          55.551482    56.915693    54.693835    54.869377    18256100.0
2004-08-24  111.24   111.60   103.57   104.870   15247300.0  0.0          1.0          55.792225    55.972783    51.945350    52.597363    15247300.0
2004-08-25  104.76   108.00   103.88   106.000   9188600.0   0.0          1.0          52.542193    54.167209    52.100830    53.164113    9188600.0
...         ...      ...      ...      ...       ...         ...          ...          ...          ...          ...          ...          ...
2018-03-21  1092.57  1108.70  1087.21  1094.000  1990515.0   0.0          1.0          1092.570000  1108.700000  1087.210000  1094.000000  1990515.0
2018-03-22  1080.01  1083.92  1049.64  1053.150  3418154.0   0.0          1.0          1080.010000  1083.920000  1049.640000  1053.150000  3418154.0
2018-03-23  1051.37  1066.78  1024.87  1026.550  2413517.0   0.0          1.0          1051.370000  1066.780000  1024.870000  1026.550000  2413517.0
2018-03-26  1050.60  1059.27  1010.58  1054.090  3272409.0   0.0          1.0          1050.600000  1059.270000  1010.580000  1054.090000  3272409.0
2018-03-27  1063.90  1064.54  997.62   1006.940  2940957.0   0.0          1.0          1063.900000  1064.540000  997.620000   1006.940000  2940957.0
[3424 rows x 12 columns]
0.9777413402684438
Date
2004-08-19 00:00:00 NaN
2004-08-20 00:00:00 NaN
2004-08-23 00:00:00 NaN
2004-08-24 00:00:00 NaN
2004-08-25 00:00:00 NaN
...
2018-03-08 05:30:00 1115.042961
2018-03-09 05:30:00 1071.495918
2018-03-10 05:30:00 1043.968692
2018-03-11 05:30:00 1071.446397
2018-03-12 05:30:00 1020.706227
Name: Forecast, Length: 3424, dtype: float64
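The forecast listing includes 2018-03-10 and 2018-03-11, a Saturday and a Sunday, because the loop advances by a fixed 86400 seconds per step. One possible refinement is sketched below, assuming that only Monday to Friday are trading days and ignoring market holidays. The helper name `next_weekdays` is hypothetical, and the start date 2018-03-07 is inferred from the first forecast date shown.

```python
import datetime

def next_weekdays(last_date, count):
    """Return the next `count` dates after `last_date` that fall on Monday-Friday."""
    dates = []
    current = last_date
    while len(dates) < count:
        current += datetime.timedelta(days=1)
        if current.weekday() < 5:  # 0 = Monday ... 4 = Friday
            dates.append(current)
    return dates

last_date = datetime.date(2018, 3, 7)  # assumed last labelled trading day
forecast_dates = next_weekdays(last_date, 5)
for d in forecast_dates:
    print(d.isoformat())
```

Working with `datetime.date` values also sidesteps the time-of-day component (05:30:00 in the listing), which most likely comes from `fromtimestamp` converting to the local time zone.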

PREDICTED CHART:
[Chart: 'Adj. Close' and 'Forecast' series plotted against Date (x-axis) and Price (y-axis).]