
Database History

• File System Storage
• Hierarchical Databases
• Network Databases
• Relational Databases

Dra. Loreto Bravo
Facultad de Ingeniería
Universidad del Desarrollo
1725 - 1975: Punch Cards

• Early data storage was very inefficient
• Processing of data was very sequential
• Data was stored as flat files (read from beginning to end)
http://royal.pingdom.com/2008/04/08/the-history-of-computer-data-storage-in-pictures/
1932 - Disk Drum - 10kB

• Drum memory formed both central memory and secondary storage; access was still primarily sequential, so flat files remained the best-performing form of data storage.
1956 - Hard Disk - 5MB

• IBM ships the first hard disk, the size of two refrigerators.
• This allowed the advent of data storage beyond flat files.
File-Based Storage - 1968

• Predecessor of databases: file-based processing

File Systems
• Enroll “Mary Johnson” in “CSE444”:
  • Write a C program to do the following:
    • Read ‘students.txt’
    • Read ‘courses.txt’
    • Find & update the record “Mary Johnson”
    • Find & update the record “CSE444”
    • Write ‘students.txt’
    • Write ‘courses.txt’
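The update sequence above can be sketched as follows. This is a minimal illustration (in Python rather than C): the file names follow the slide, but the pipe-separated record format and the `enroll()` helper are invented.

```python
# A minimal sketch of the file-based approach the slide describes: every
# update means reading whole flat files, updating records in memory, and
# rewriting the files from scratch.

def enroll(student, course, students_path="students.txt", courses_path="courses.txt"):
    # Read 'students.txt' and 'courses.txt' from beginning to end (flat files)
    with open(students_path) as f:
        students = f.read().splitlines()
    with open(courses_path) as f:
        courses = f.read().splitlines()

    # Find & update the matching records with a linear scan -- no indexes
    students = [s + "|" + course if s.startswith(student) else s for s in students]
    courses = [c + "|" + student if c.startswith(course) else c for c in courses]

    # Rewrite both files in full; a crash between these two writes
    # leaves the two files mutually inconsistent
    with open(students_path, "w") as f:
        f.write("\n".join(students))
    with open(courses_path, "w") as f:
        f.write("\n".join(courses))
```

Note that nothing here protects against a crash between the two final writes, which is exactly the failure mode discussed next.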
File Systems

• System crashes:
  • Read ‘students.txt’
  • Read ‘courses.txt’
  • Find & update the record “Mary Johnson”
  • CRASH!
  • Find & update the record “CSE444”
  • Write ‘students.txt’
  • Write ‘courses.txt’
• What is the problem?
• Simultaneous access by many users:
  • Need locks
File-Based Storage - 1968
• File system problems:
  • Requires extensive programming
  • Data redundancy:
    • The same data is held by different programs, wasting space and resources
• Data is separated:
• Each program maintains its own set of data
• They cannot be easily combined
• High cost of propagation of updates
• update anomalies and inconsistencies
• No abstract data model
• requires knowledge of storage details
• No standard query language
• Weak security:
• High cost to enforce security policies in which different users have permission to access
different subsets of the data
• Sharing granularity is very coarse
Database Approach
• Arose because:
• Definition of data was embedded in application programs, rather than
being stored separately and independently.
• No control over access and manipulation of data beyond that imposed by
application programs.
Data + Base

• 1960: SAGE anti-aircraft command and control network
• Cold War-era technology to track and coordinate many separate military installations.

http://ed-thelen.org/SageIntro.html
• Far more complex than any other computer project of the 1950s
• The first major system to run in “real time”, responding immediately to requests from its users and to reports from its sensors.
• SAGE had to present an up-to-date and consistent representation of the various bombers, fighters and bases to all its users.
• Popularized the term "data base" to refer to the data underlying the many different views allowed.
Hierarchical Database - 1960

• Files are related in a parent/child manner, with each child file having at most one
parent file.
• Developed by North American Rockwell and IBM as the IMS (Information
Management System)
• IMS formed the basis for hierarchical data model
• Still available: http://www-01.ibm.com/software/data/ims/
• American Airlines and IBM jointly developed SABRE for making airline reservations
• SABRE is used today to populate Web-based travel services such as Travelocity
Hierarchical Database - 1960

Advantages:
• Efficient searching
• Less redundant data
• Data independence
• Database security and integrity

Limitations:
• Complex implementation
• Difficult to manage; lack of standards; can’t easily handle many-to-many relationships
• Lacks structural independence
• Application programming and use complexity
• Changes in data structure require changes in the application programs that access that structure
Network Data Model - 1969

• Integrated Data Store (IDS), the first general-purpose DBMS, designed by Charles Bachman at GE
• Formed the basis for the network data model
• Bachman received the Turing Award in 1973 for his work in the database area
• Extension of the hierarchical data model, based on an acyclic digraph
• Standardized (1971) by the CODASYL group (Conference on Data Systems Languages)
• The network data model identified the following three database components:
  • Network schema: database organization
  • Sub-schema: views of the database per user
  • Data management language: low-level, procedural
Network Data Model - 1969

Advantages:
• Conceptual simplicity
• Ability to handle more relationship types (many-to-many)
• Ease of data access
• Data integrity
• Data independence

Limitations:
• System complexity; difficult to design and maintain
• Lack of structural independence, as the data access method is navigational
• “Navigation” is even harder
Problems with the First DBMSs
• Access to database was through low level pointer operations
• Storage details depended on the type of data to be stored
• Adding a field to the DB required rewriting the underlying
access/modification scheme
• Emphasis on records to be processed, not overall structure
• User had to know physical structure of the DB in order to query
for information
• Overall, the first DBMSs were very complex and inflexible, which made life difficult when it came to adding new applications or reorganizing the data
Relational Databases - 1970

• Instance – a table with rows and columns.
• Schema – specifies the structure (name of relation, name and type of each column).
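The schema/instance distinction can be made concrete with Python's built-in sqlite3 module. The Employee relation and its columns below are invented for the example.

```python
# Schema vs. instance in a relational DBMS, using an in-memory SQLite database.
import sqlite3

conn = sqlite3.connect(":memory:")

# Schema: the name of the relation plus the name and type of each column
conn.execute("CREATE TABLE Employee (id INTEGER, name TEXT, dept TEXT)")

# Instance: the rows currently stored in the table
conn.executemany("INSERT INTO Employee VALUES (?, ?, ?)",
                 [(1, "Mary Johnson", "Sales"), (2, "John Smith", "IT")])

rows = conn.execute("SELECT name FROM Employee ORDER BY id").fetchall()
```

The schema stays fixed while the instance changes with every insert, update, or delete.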
Relational Databases - 1970
• Relational Databases
• Edgar Codd, at IBM, proposed relational data model.
• Codd's paper “A Relational Model of Data for Large Shared Data Banks.”
• “It provides a means of describing data with its natural structure only--that is,
without superimposing any additional structure for machine representation
purposes. Accordingly, it provides a basis for a high level data language which will
yield maximal independence between programs on the one hand and machine
representation on the other.”(Codd 1970)
• In other words, the relational model consisted of:
• Data independence from hardware and storage implementation
• High level, nonprocedural language for accessing data. Instead of processing one
record at a time, a programmer could use the language to specify single operations
that would be performed across the entire data.
• Codd won the 1981 Turing Award.
Codd vs. IBM
• Codd’s model had an immediate impact on research; however, to gain legitimacy within the field, it had to survive at least two battles:
• One in the technical community at large
• One within IBM
• Within IBM
• Conflict with the existing product IMS, in which IBM had invested heavily
• New technology had to prove itself before replacing existing revenue
producing product
• Codd published his paper in open literature because no one at IBM
(himself included) recognized its eventual impact
• Outside technical community showed that the idea had great potential
Codd vs. IBM
• Within IBM
• IBM declared IMS its sole strategic product, setting up Codd and his ideas
as counter to company goals
• Codd speaks out in spite of IBM’s dissatisfaction and promotes the relational model to computer scientists. He arranges a public debate between himself and Charles Bachman, who at the time was a key proponent of the CODASYL standard.
• The debate produced further criticism from IBM for undermining its goals, but also established his relational model as a cornerstone in the technical community.
• Finally, two main relational prototypes emerge in the 70’s:
• System R from IBM
• Ingres from UC-Berkeley
System R
• Prototype intended to provide a high-level, nonnavigational, data-
independent interface to many users simultaneously, with high
integrity and robustness.
• Led to a query language called SEQUEL (Structured English Query Language), later renamed Structured Query Language (SQL) for legal reasons. Now the standard for database access.
• Project finished with the conclusion that relational databases
were a feasible commercial product
• Eventually evolved into SQL/DS which later became DB2
Ingres
• Two scientists, Michael Stonebraker and Eugene Wong, at UC Berkeley became interested in relational databases
• Used QUEL as its query language
• Similar to System R, but based on different hardware and
operating system
• Developers eventually branched off to form Ingres Corp, Sybase,
MS SQL Server, Britton-Lee.
• System R and Ingres inspire the development of virtually all
commercial relational databases, including those from Sybase,
Informix, Tandem, and even Microsoft’s SQL Server
Where’s Oracle!?
• Larry Ellison learned of IBM’s work and founded Relational Software Inc. in 1977 in California
• Their first product was a relational database based on IBM’s System R model and SQL technology
• Released in 1979, it was the first commercial RDBMS, beating IBM to the market by two years.
• In the 1980s the company was renamed Oracle Corporation. Throughout the 80s, new features were added and performance improved as the price of hardware came down, and Oracle became the largest independent RDBMS vendor.
Database Management System (DBMS)
Relational Databases - 1970

Advantages:
• Control of data redundancy, consistency, abstraction, and sharing
• Improved data integrity, security, enforcement of standards, and economy of scale
• Improved data accessibility and maintenance

Limitations:
• Complexity, size, and cost of the DBMS
• Higher impact of failure
1980's
• Birth of IBM PC. RDBMS market begins to boom.
• SQL becomes standardized through ANSI (American National
Standards Institute) and ISO (International Organization for
Standardization)
• By the mid-80s it had become apparent that there were some fields (medicine, multimedia, physics) where relational databases were not practical, due to the types of data involved.
  • More flexibility was needed in how their data was represented and accessed.
• This led to research into object-oriented databases, in which users could define their own methods of access to data and how to represent and manipulate it. This coincided with the introduction of object-oriented programming languages such as C++.
1990’s
• Considerable research into more powerful query language and richer
data model, with emphasis on supporting complex analysis of data
from all parts of an enterprise
• The first OODBMSs start to appear from companies like Objectivity. Object-relational DBMS hybrids also begin to appear.
• Several vendors, e.g., IBM’s DB2, Oracle 8, Informix UDS, extended
their systems with the ability to store new data types such as images
and text, and to ask more complex queries
• New application areas: data warehousing and OLAP (Online Analytical Processing, a category of software tools that provides analysis of data stored in a database), the internet, multimedia, etc.
• Development of personal/small business productivity tools such as
Excel and Access from Microsoft.
Late 90’s-2000’s
• XML
• Starts incorporation (as middleware or enabled DBMS) in 1997
• Data Junction, ADO, Delphi
• Oracle 8i, 9i, MS Access 2002, SQL Server 2000, DB2, Informix
• Native XML DBMS, 2000
• TigerLogic XDMS, Raining Data, Tamino, Software AG, Birdstep
• Large investment in internet companies fuels tools-market boom for
Web/Internet/DB connectors:
  • Active Server Pages, FrontPage, Java Servlets, JDBC, Java Beans, ColdFusion, Dreamweaver, Oracle Developer 2000, etc.
• Open-source projects come online with widespread use of gcc, CGI, Apache, MySQL
• Three main companies dominate in the large DB market: IBM,
Microsoft, and Oracle
2010’s….
• Big Data:
• Google processes 20 PB a day (2008)
• Wayback Machine has 3 PB + 100 TB/month (3/2009)
• eBay has 6.5 PB of user data + 50 TB/day (5/2009)
• Facebook has 36 PB of user data + 80-90 TB/day (6/2010)
• New ways for efficient query answering are needed. For example:
  • INSERT only, no UPDATEs/DELETEs
  • No JOINs, thereby reducing query time
    • This involves de-normalizing data

“640K ought to be enough for anybody.”
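The de-normalization trade-off can be sketched in a few lines. All names and fields here are invented for illustration: instead of joining a users table with an orders table at query time, each order row redundantly embeds the user's name, so reads need no JOIN.

```python
# Normalized: answering "orders with user names" requires a join
users = {1: "Mary Johnson"}
orders = [{"user_id": 1, "item": "book"}]
joined = [{"user": users[o["user_id"]], "item": o["item"]} for o in orders]

# De-normalized: the name is copied into every order row at INSERT time,
# trading storage space and update cost for faster, join-free reads
orders_denormalized = [{"user": "Mary Johnson", "item": "book"}]
```

The two answer the same query; the de-normalized layout pays in redundancy (and in update anomalies if the user's name ever changes) to avoid the join.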
2010’s
• NoSQL
• Stands for Not Only SQL
• Class of non-relational data storage systems
• Usually do not require a fixed table schema nor do they use the concept
of joins
• NoSQL movement started from:
• BigTable (Google)
• Dynamo (Amazon)
• Gossip protocol (discovery and error detection)
• Distributed key-value data store
• Eventual consistency
NoSQL solutions
• NoSQL solutions fall into two major areas:
• Key/Value or ‘the big hash table’.
• Schema-less which comes in multiple flavors, column-based, document-
based or graph-based.
• In NoSQL solutions we are giving up:
• joins
• group by
• order by
• ACID transactions
• SQL as a sometimes frustrating but still powerful query language
• Easy integration with other applications that support SQL
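The "big hash table" style of key/value store described above can be sketched as follows. The `put`/`get` API is invented for illustration: opaque values stored under keys, with no schema, no joins, and no SQL.

```python
# A minimal in-memory key/value store in the NoSQL spirit.
class KeyValueStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # The store does not interpret the value: any document/blob is fine
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

store = KeyValueStore()
store.put("user:42", {"name": "Mary Johnson", "courses": ["CSE444"]})
```

Everything the relational model gives up here (joins, GROUP BY, constraints, transactions) is exactly the list in the bullets above.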
Other Types of Databases

https://db-engines.com/en/ranking/relational+dbms
Relational Database System

Course: Databases
Professor: Loreto Bravo
Relational DBMS

(Diagram: users working through a mobile app, a webpage, or other software, together with expert users and the database administrator, all interact with the Relational Database Management System (DBMS), which manages the relational database.)
DBMS

(Diagram: the Database Management System (DBMS) mediates all access to the database; in the relational case, a Relational DBMS manages a relational database.)
Describing Data in a DBMS

• Semantic model (e.g., ER)
• Conceptual/logical model
• Physical database design / relational DBMS
DBMS Advantages

• Data independence
• Efficient data access
• Concurrent access and crash recovery
• Data integrity and security
• Data administration
Data Independence

• Data independence:
  • Applications that use the database are insulated from changes in the way the data is structured and stored.
• Two types:
  • Logical data independence
  • Physical data independence
• Independence is achieved through the use of three levels of abstraction:
Levels of Abstraction in a DBMS

• Views (View 1, View 2, View 3)
• Conceptual/logical model
• Physical database design / relational DBMS
• Disk
Conceptual/Logical Model

• Describes the stored data in terms of the data model of the DBMS.
• In a relational DBMS, this model describes all the relations that are stored:
Physical Database Design / Relational DBMS

• The Physical schema specifies additional storage details.


• Describes how the relations of the conceptual schema are stored in disks
• It influences how fast we can access certain data.
• It uses indexes to organize how the records are stored.
• For example:
• We can store the employees in any order
• We can create an index on the first column and
store them organized by it
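The index example above can be sketched with sqlite3. The logical Employee relation is unchanged; only how the DBMS locates rows changes. Table, column, and index names are invented.

```python
# Physical design: an index changes the access path, not the relation.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO Employee VALUES (?, ?)",
                 [(i, "emp%d" % i) for i in range(1000)])

# Without an index this lookup scans every row; with the index the DBMS
# can go straight to the matching record
conn.execute("CREATE INDEX idx_employee_id ON Employee(id)")
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM Employee WHERE id = 500").fetchall()
```

The reported plan mentions the index, confirming that the physical schema (not the query) determined how the data is reached.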
Views

• Several views can be created to authorize access to specific parts of the data.
• For example, we could have a view EmployeeD over the Employee table.
• Data Independence:
• If we modify table Employee and add an extra attribute Salary, the View
EmployeeD would still be the same.
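The logical-independence point above can be demonstrated with sqlite3: the view EmployeeD exposes only some columns of Employee, and adding a Salary column to the base table afterwards leaves the view's result unchanged. The column names other than Salary are invented.

```python
# A view insulates its users from changes to the base table's schema.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (id INTEGER, name TEXT, dept TEXT)")
conn.execute("INSERT INTO Employee VALUES (1, 'Mary Johnson', 'Sales')")
conn.execute("CREATE VIEW EmployeeD AS SELECT name, dept FROM Employee")

before = conn.execute("SELECT * FROM EmployeeD").fetchall()
# Logical data independence: extend the base table...
conn.execute("ALTER TABLE Employee ADD COLUMN Salary INTEGER")
# ...and the view's result is unaffected
after = conn.execute("SELECT * FROM EmployeeD").fetchall()
```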
Efficient Data Access
• Efficient Data Access:
• DBMS utilizes sophisticated
techniques to store and retrieve
data effectively
• Query optimization:
  • The part of query processing in which the database system compares different query strategies and chooses the one with the least expected cost.
• The query optimizer is a key part of
the relational database and
determines the most efficient way
to access data.
• It makes it possible for the user to
request the data without specifying
how these data should be retrieved.
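The optimizer's role can be seen directly with sqlite3: the query only states *what* data is wanted, and EXPLAIN QUERY PLAN reports *how* the system decided to fetch it. The Employee table is invented for the example.

```python
# The user never specifies an access path; the optimizer picks one.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO Employee VALUES (1, 'Mary Johnson')")

# The optimizer chooses a keyed search rather than a full table scan
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT name FROM Employee WHERE id = 1").fetchall()
detail = plan[0][3]
```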
Data Integrity and Security
• Data integrity:
  • The DBMS allows integrity constraints to be defined in order to maintain data integrity.
  • There are several types of integrity constraints:
    • Domain
    • Primary key
    • Foreign key
    • Assertions
    • …
• Data security:
  • DBMSs have access control mechanisms that ensure that only authorized people can view specific parts of a DB, and that also limit how they can modify it.
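The constraint types listed above can be sketched with sqlite3: a primary key, a foreign key, and a CHECK constraint playing the role of a domain constraint. All table and column names are invented for the example.

```python
# The DBMS, not the application, rejects rows that violate constraints.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default

conn.execute("CREATE TABLE Dept (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""CREATE TABLE Employee (
                    id INTEGER PRIMARY KEY,
                    dept_id INTEGER REFERENCES Dept(id),
                    age INTEGER CHECK (age >= 0))""")

conn.execute("INSERT INTO Dept VALUES (1, 'Sales')")
conn.execute("INSERT INTO Employee VALUES (1, 1, 30)")  # satisfies all constraints

# A row pointing at a department that does not exist is rejected
try:
    conn.execute("INSERT INTO Employee VALUES (2, 99, 30)")
    fk_violation_caught = False
except sqlite3.IntegrityError:
    fk_violation_caught = True
```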
Data Administration
• The data representation is determined by experienced
professionals, called Database Administrators
• They take into consideration:
• the nature of the data being managed,
• how different groups of users use it
• Responsible for organizing the data to:
  • Minimize redundancy
  • Fine-tune storage so that retrieval is efficient
Roles in the Database Environment
• Database Administrator (DBA)
• Design of the conceptual and physical schema
• Security and authorization
• Data availability and recovery from failures:
  • The DBMS provides software support to achieve this
  • But the DBA should also back up the data and keep logs
• Database tuning
• Database implementors
  • Build DBMS software
  • Work at Oracle, IBM, etc.
• Application programmers
  • Develop applications that interact with or use databases
  • Ideally should interact with views
• End Users (naive and sophisticated)
Concurrent Access and Crash Recovery
• Concurrent Access:
• DBMS allow for several users to interact with a database at the same
time.
• When there are several requests to modify, the DBMS must order
their requests carefully to avoid conflicts.
• A locking protocol is used:
  • A set of rules that allows several transactions to be performed at the same time, while ensuring that the result is the same as if they had been executed in some sequence.

T1 → T3 → T2
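The need for locks can be sketched with Python threads. This is a toy, not a DBMS locking protocol: several threads doing read-modify-write on a shared balance stay consistent only because each update runs under a lock, serializing the conflicting operations. The account structure is invented.

```python
# Concurrent updates stay correct only when serialized by a lock.
import threading

account = {"balance": 0}
lock = threading.Lock()

def deposit_many(n):
    for _ in range(n):
        with lock:                   # acquire before touching shared data
            account["balance"] += 1  # the read-modify-write is now atomic

threads = [threading.Thread(target=deposit_many, args=(10000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Without the lock, two threads can read the same old balance and one update is lost; with it, the final balance is exactly the sum of all deposits.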
Concurrent Access and Crash Recovery
• Crash Recovery:
• If the system fails, the DBMS should bring the database back to a
consistent state.
• A DBMS must ensure that the changes made by incomplete
transactions are removed from the database.
• For example, if the DBMS is in the middle of transferring money from account A
to account B, and has debited the first account but not yet credited the second
when the crash occurs, the money debited from account A must be restored
when the system comes back up after the crash.
• To achieve this, the DBMS uses logs, in which write operations are recorded before they are performed.
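The logging idea can be sketched as a toy undo log: before each write, the old value is appended to a log, so an incomplete transfer can be rolled back after a crash. Accounts, amounts, and function names are all invented for illustration.

```python
# Write-ahead-style undo logging, in miniature.
accounts = {"A": 100, "B": 0}
log = []

def write(acct, new_value):
    log.append((acct, accounts[acct]))  # record the OLD value before writing
    accounts[acct] = new_value

def recover():
    # Undo the incomplete transaction by restoring old values in reverse order
    for acct, old_value in reversed(log):
        accounts[acct] = old_value
    log.clear()

write("A", accounts["A"] - 50)   # debit account A...
# CRASH here: account B was never credited
recover()                        # ...so recovery restores A's money
```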
