Department of Computer Science & Information Systems, BITS Pilani
Text Book: Hector Garcia-Molina, Jeffrey D. Ullman & Jennifer Widom. Database Systems: The Complete Book, Pearson Education, 2002.
Home Page: http://www-db.stanford.edu/~ullman/dscb.html
Data
- So what is Data?
- Why are we interested in Data?
- Sources of Data?
- Management of Data
- Big Data!
- What's the biggest asset of companies like Google, Yahoo, Amazon, FB, Walmart, etc.?
- We are living in a data-driven world!
A word about DATA
- If data had mass, the Earth would be a black hole!!
- Data is the new Oil!!
- Expected to reach 40 ZB by 2020!!
- In 2012, we had about 2.8 ZB*
- Only 1/4th of this data could produce useful information
- Only 3% of it was tagged
- Only 0.5% of it was actually used for some kind of analysis
(*Report by John Gantz & David Reinsel, sponsored by EMC)

Some Interesting Facts
- During the first day of a baby's life, the amount of data generated by humanity is equivalent to 70 times the information contained in the Library of Congress (www.BabyCenter.com)
- How do babies learn to speak? (Prof. Deb Roy, MIT Media Asia Lab)
  - Human Speechome project
  - 11 video cameras, 14 microphones, ...
  - 200 GB of data each day!!!!
- Google
  - 50% of all internet users use Google every day
  - 7.2 bn page views per day
  - 20 PB of data processed daily
- YouTube
  - 48 hours of video uploaded every minute
  - 4 bn views per day
  - Most viewed video: Bieber's Baby - 763,684,702

Tsunami of Data
- Telecom data (≈ 4.6 bn mobile subscribers)
  - There are 3 billion telephone calls in the US each day, 30 billion emails daily, 1 billion SMS, IMs.
  - IP network traffic: up to 1 billion packets per hour per router. Each ISP has many (hundreds of) routers!
- WWW
- Weblog data (160 mn websites)
- Email data
- Satellite imaging data
- Social networking sites data
- Genome data
- CERN's LHC (15 petabytes/year)
- In 2005, mankind created 150 exabytes of data
- In 2010, it created 1200 exabytes
- How much are we creating now???
- No. of pics on Facebook
  - 15 bn unique photos
Databases Everywhere!!!
- DBMS contains information about a particular enterprise
  - Collection of interrelated data
  - Set of programs to access the data
  - An environment that is both convenient and efficient to use
- Database Applications:
  - Banking: all transactions
  - Airlines: reservations, schedules
  - Universities: registration, grades
  - Sales: customers, products, purchases
  - Online retailers: order tracking, customized recommendations
  - Manufacturing: production, inventory, orders, supply chain
  - Human resources: employee records, salaries, tax deductions
  - Social media: Facebook & Twitter use graph databases
- Databases touch all aspects of our lives
Biggest OLTP System
- SABRE
  - Sabre is a computer reservations system/global distribution system (GDS) used by airlines, railways, hotels, travel agents and other travel companies
  - Used by more than 200 airlines
DBMS – Is it a Dry Area?
- The area of DBMS is a microcosm of computer science in general
- The issues addressed and the techniques used span a wide spectrum including
  - Languages
  - Object-orientation & other programming paradigms
  - Compilation
  - Operating systems
  - Concurrent programming
  - Data structures
  - Algorithms
  - Parallel & distributed computing
  - User interfaces
  - Expert systems & AI
  - Statistical techniques & dynamic programming
Reference: Database Management Systems by Raghu Ramakrishnan & Johannes Gehrke, 3e
Basic Definitions
- Database: A collection of related data.
- Data: Known facts that can be recorded and have an implicit meaning.
- Mini-world: Some part of the real world about which data is stored in a database. For example, student grades and transcripts at a university.
- Database Management System (DBMS): A software package/system to facilitate the creation and maintenance of a computerized database.
- Database System: The DBMS software together with the data itself. Sometimes, the applications are also included.
DBMS Functionalities
- Define a database: in terms of data types, structures and constraints
- Construct or load the database on a secondary storage medium
- Manipulate the database: querying, generating reports, insertions, deletions and modifications to its content
- Concurrent processing and sharing by a set of users and programs, while keeping all data valid and consistent
- Crash recovery
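The define / construct / manipulate steps above can be sketched with Python's built-in sqlite3 module. This is only an illustration: the `student` table, its columns and the sample rows are assumptions made for this example and do not come from the slides.

```python
import sqlite3

# In-memory database so the sketch leaves no files behind.
conn = sqlite3.connect(":memory:")

# Define: data types, structure and constraints (hypothetical schema).
conn.execute(
    """
    CREATE TABLE student (
        id    INTEGER PRIMARY KEY,
        name  TEXT NOT NULL,
        grade REAL CHECK (grade BETWEEN 0 AND 10)
    )
    """
)

# Construct / load: populate the database with initial data.
conn.executemany(
    "INSERT INTO student (id, name, grade) VALUES (?, ?, ?)",
    [(1, "Asha", 8.5), (2, "Ravi", 6.0), (3, "Meera", 9.1)],
)

# Manipulate: modify, delete and query the content.
conn.execute("UPDATE student SET grade = 7.0 WHERE id = 2")
conn.execute("DELETE FROM student WHERE id = 3")
rows = conn.execute("SELECT name, grade FROM student ORDER BY id").fetchall()
print(rows)
```

Concurrent sharing and crash recovery are not exercised here; they need more than a single in-memory connection to demonstrate.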
File System vs. DBMS
- A company has 500 GB of data on employees, departments, products, sales, and so on
- Data is accessed concurrently by several employees
- Questions about the data must be answered quickly
- Changes made to the data by different users must be applied consistently
- Access to certain parts of the data must be restricted
- The drawbacks of a plain file system in meeting these requirements have prompted the development of database systems
- Do database systems offer solutions to all the above problems?
Advantages of a DBMS
- Program-Data Independence
  - Insulation between programs and data: allows changing data storage structures and operations without having to change the DBMS access programs
- Efficient Data Access
  - DBMS uses a variety of techniques to store & retrieve data efficiently
- Data Integrity & Security
  - Before inserting the salary of an employee, the DBMS can check that the dept. budget is not exceeded
  - Enforces access controls that govern what data is visible to different classes of users
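The salary-versus-budget integrity check mentioned above could be expressed, for instance, as a trigger. The sketch below uses Python's sqlite3 module; the `dept` and `employee` tables, the trigger name and the amounts are all hypothetical, and other DBMSs offer different mechanisms (assertions, stored procedures) for the same purpose.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical schema for illustration only.
    CREATE TABLE dept (name TEXT PRIMARY KEY, budget REAL NOT NULL);
    CREATE TABLE employee (
        id     INTEGER PRIMARY KEY,
        dept   TEXT NOT NULL REFERENCES dept(name),
        salary REAL NOT NULL CHECK (salary >= 0)
    );
    -- Reject an insert if total salaries would exceed the department budget.
    CREATE TRIGGER salary_within_budget
    BEFORE INSERT ON employee
    WHEN (SELECT COALESCE(SUM(salary), 0) FROM employee WHERE dept = NEW.dept)
         + NEW.salary
         > (SELECT budget FROM dept WHERE name = NEW.dept)
    BEGIN
        SELECT RAISE(ABORT, 'department budget exceeded');
    END;
    INSERT INTO dept VALUES ('CS', 100000);
    """
)


def add_employee(emp_id, dept, salary):
    """Return True if the DBMS accepted the row, False if it rejected it."""
    try:
        conn.execute("INSERT INTO employee VALUES (?, ?, ?)", (emp_id, dept, salary))
        return True
    except sqlite3.DatabaseError:
        return False


first_ok = add_employee(1, "CS", 60000)   # fits within the budget
second_ok = add_employee(2, "CS", 50000)  # would push the total past the budget
employee_count = conn.execute("SELECT COUNT(*) FROM employee").fetchone()[0]
print(first_ok, second_ok, employee_count)
```

The point of the design is that the rule lives in the database, so every application inserting into `employee` is subject to it without repeating the check in its own code.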
Advantages of a DBMS
- Data Administration
  - When several users share data, centralizing the administration offers significant improvement
- Concurrent Access & Crash Recovery
  - DBMS schedules concurrent access to the data in such a manner that users think of the data as being accessed by only one user at a time
  - DBMS protects users from the ill-effects of system failures
- Reduced Application Development Time
  - Many important tasks are handled by the DBMS
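One way to picture the protection from failures is an all-or-nothing transaction. The sketch below, using Python's sqlite3 module, only simulates a failure by raising an exception between two updates; a real DBMS recovers from actual crashes using logging, which is not shown. The `account` table and the `transfer` helper are assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 500.0), (2, 100.0)])
conn.commit()


def transfer(src, dst, amount, fail_midway=False):
    """Move money between accounts; either both updates happen or neither."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?", (amount, src)
            )
            if fail_midway:
                raise RuntimeError("simulated failure between the two updates")
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?", (amount, dst)
            )
        return True
    except (RuntimeError, sqlite3.DatabaseError):
        return False


def balances():
    return conn.execute("SELECT balance FROM account ORDER BY id").fetchall()


failed = transfer(1, 2, 200.0, fail_midway=True)
after_failure = balances()   # the debit must have been undone
succeeded = transfer(1, 2, 200.0)
after_success = balances()
print(failed, after_failure, succeeded, after_success)
```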
Benchmarking DBs
- The term transaction is often applied to a wide variety of business and computer functions. Looked at as a computer function, a transaction could refer to a set of operations including disk read/writes, operating system calls, or some form of data transfer from one subsystem to another.
- While TPC benchmarks certainly involve the measurement and evaluation of computer functions and operations, the TPC regards a transaction as it is commonly understood in the business world: a commercial exchange of goods, services, or money. A typical transaction, as defined by the TPC, would include the updating to a database system for such things as inventory control (goods), airline reservations (services), or banking (money).
- In these environments, a number of customers or service representatives input and manage their transactions via a terminal or desktop computer connected to a database. Typically, the TPC produces benchmarks that measure transaction processing (TP) and database (DB) performance in terms of how many transactions a given system and database can perform per unit of time, e.g., transactions per second (tpsC) or transactions per minute (tpmC).
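To make "transactions per unit of time" concrete, here is a toy throughput measurement in Python with sqlite3. It is not a TPC benchmark: real TPC workloads prescribe a specific schema, transaction mix and reporting rules. The `inventory` table and the `run_benchmark` function are hypothetical names introduced for this sketch.

```python
import sqlite3
import time


def run_benchmark(num_transactions=1000):
    """Time simple 'sell one unit' transactions and report throughput."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE inventory (item INTEGER PRIMARY KEY, stock INTEGER)")
    conn.execute("INSERT INTO inventory VALUES (1, ?)", (num_transactions,))
    conn.commit()

    start = time.perf_counter()
    for _ in range(num_transactions):
        with conn:  # each iteration is one committed transaction
            conn.execute("UPDATE inventory SET stock = stock - 1 WHERE item = 1")
    elapsed = time.perf_counter() - start

    remaining = conn.execute("SELECT stock FROM inventory").fetchone()[0]
    per_second = num_transactions / elapsed
    per_minute = per_second * 60
    return per_second, per_minute, remaining


per_second, per_minute, remaining = run_benchmark(200)
print(f"{per_second:.0f} transactions/s, {per_minute:.0f} transactions/min")
```

The numbers printed depend entirely on the machine and are only meaningful relative to each other under the same workload.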
The SQL Query Language

DMLs
- Whenever DML statements are embedded in a programming language (PL), that language is called the host language and the DML is called the data sublanguage
- In object DBs, the host language & data sublanguage form one integrated language, e.g., C++ with some extensions to support database functionality
- Some RDBMSs also provide integrated languages, e.g., Oracle's PL/SQL
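As an illustration of the host language / data sublanguage split, the sketch below uses Python as the host language and SQL as the data sublanguage, via the standard sqlite3 module. The `account` table, its rows and the `accounts_below` helper are assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.executemany(
    "INSERT INTO account VALUES (?, ?)", [(1, 500), (2, 4000), (3, 900)]
)


def accounts_below(connection, limit):
    """Host-language function wrapping a data-sublanguage (SQL) statement."""
    # The SQL text is the data sublanguage; `limit` is a host-language
    # variable passed in through a placeholder.
    cursor = connection.execute(
        "SELECT id, balance FROM account WHERE balance < ? ORDER BY id", (limit,)
    )
    # Results come back as ordinary host-language values (tuples in a list).
    return list(cursor)


low_accounts = accounts_below(conn, 2500)
print(low_accounts)
```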
Steps in Query Processing
- Parsing and translation
  - Translate the query into its internal form
  - Translation is similar to the work performed by the parser of a compiler
  - Parser checks syntax, verifies relations
  - Parse tree representation
  - This is then translated into a relational algebra (RA) expression
- Example
    select balance from account where balance < 2500
- Equivalent relational algebra expressions (RAEs)
  - σ_{balance<2500}(∏_{balance}(account))
  - ∏_{balance}(σ_{balance<2500}(account))
- Each expression can in turn be evaluated in different ways, e.g., we can use an index on balance to find accounts with balance < 2500, or we can perform a complete relation scan and discard accounts with balance ≥ 2500
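The equivalence of the two relational algebra expressions can be checked on a small example. The sketch below models a relation as a list of Python dictionaries; the `select` and `project` helpers and the sample `account` rows are hypothetical, written only for this illustration, and projection removes duplicates to follow set semantics.

```python
# Sample relation; the rows are made up for this example.
account = [
    {"account_number": "A-101", "branch": "Downtown", "balance": 500},
    {"account_number": "A-102", "branch": "Perryridge", "balance": 4000},
    {"account_number": "A-201", "branch": "Brighton", "balance": 900},
    {"account_number": "A-215", "branch": "Mianus", "balance": 2500},
]


def select(predicate, relation):
    """Selection (sigma): keep the tuples that satisfy the predicate."""
    return [t for t in relation if predicate(t)]


def project(attrs, relation):
    """Projection (pi): keep only the listed attributes, dropping duplicates."""
    seen, out = set(), []
    for t in relation:
        key = tuple(t[a] for a in attrs)
        if key not in seen:
            seen.add(key)
            out.append(dict(zip(attrs, key)))
    return out


def low_balance(t):
    return t["balance"] < 2500


# sigma_{balance<2500}(pi_{balance}(account)): project first, then select.
plan_a = select(low_balance, project(["balance"], account))
# pi_{balance}(sigma_{balance<2500}(account)): select first, then project.
plan_b = project(["balance"], select(low_balance, account))

print(plan_a == plan_b, plan_b)
```

Both orders give the same answer here because the selection predicate only uses `balance`, which survives the projection; the two plans can still differ in how much work they do.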
Query Optimization
- For optimizing a query, the Query Optimizer must know the cost of each operation
- Cost is hard to compute
  - Depends on many parameters such as actual memory available to the operation
- Systems work with rough estimates
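Most systems let you inspect which plan the optimizer picked from its estimates. As a small illustration, the sketch below asks SQLite (through Python's sqlite3 module) for its plan for the same query before and after an index on balance exists. The table contents and the index name `idx_account_balance` are assumptions for this example, and the exact wording of the plan text varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.executemany(
    "INSERT INTO account VALUES (?, ?)", [(i, i * 10) for i in range(1000)]
)

QUERY = "SELECT balance FROM account WHERE balance < 2500"


def plan(sql):
    """Return SQLite's description of the plan it would use for `sql`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The human-readable description is the fourth column of each row.
    return " | ".join(row[3] for row in rows)


plan_without_index = plan(QUERY)  # no index yet: expect a full scan
conn.execute("CREATE INDEX idx_account_balance ON account (balance)")
plan_with_index = plan(QUERY)     # the new index becomes an option

print(plan_without_index)
print(plan_with_index)
```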