
Usman Akram

Mc130201353

Keep remember in your prayers


CS614 Current FinalTerm Paper 20 August 2016
All questions were from past papers.
37 MCQs were from past papers and quizzes.
The subjective part was entirely from past papers.
The MCQs were:
The goal of ___________ is to look at as few blocks as possible to find the matching record(s).
Indexing (Right Answer)
Partitioning
Joining
None of above
The automated, prospective analyses offered by data mining move beyond the analyses of past
events provided by retrospective tools typical of ___________.
OLTP
OLAP
Decision Support systems
None of these
Pre-computed _______ can solve performance problems
Aggregates
Facts
Dimensions
Data mining uses _________ algorithms to discover patterns and regularities in data.
Mathematical
Computational
Statistical
None of these
To identify the __________________ required we need to perform data profiling
Degree of Transformation (Right Answer)
Complexity
Cost



Time
Execution can be completed successfully or it may be stopped due to some error. If some error occurs,
execution will be terminated abnormally and all transactions will be ___________
Committed to the database
Rolled back
All data is ______________ of something real.
I. An Abstraction
II. A Representation
Which of the following options is true?
I Only (Right Answer)
II Only
Both I & II
None of I & II
For a DWH project, the key requirements are ________ and product experience.
Tools
Industry (Right Answer)
Software
None of these
_________________ contributes to an under-utilization of valuable and expensive historical data, and
inevitably results in a limited capability to provide decision support and analysis.
The lack of data integration and standardization (Right Answer)
Missing Data
Data Stored in Heterogeneous Sources
DTS allows us to connect through any data source or destination that is supported by ____________
OLE DB (Right Answer)
OLAP
OLTP
Data Warehouse
If some error occurs, execution will be terminated abnormally and all transactions will be rolled back. In
this case, when we access the database, we will find it in the state it was in before the ____________
Execution of package (Right Answer)



Creation of package
Connection of package
To judge effectiveness we perform data profiling twice.
One before Extraction and the other after Extraction
One before Transformation and the other after Transformation (Right Answer)
One before Loading and the other after Loading
Pre-computed _______ can solve performance problems
Aggregates (Right Answer)
Facts
Dimensions

De-normalization normally speeds up ____________
Data Retrieval (Page 51)
Data Modification
Development Cycle
Data Replication
For a given data set, to get a global view in unsupervised learning we use ____________
One-way Clustering (Page 271)
Bi-clustering
Pearson correlation
Euclidean distance

_____________ is a process which involves gathering information about columns through the execution of certain queries, with the intention of identifying erroneous records.
Data profiling (Page 439)
Data Anomaly Detection
Record Duplicate Detection
None of these



The automated, prospective analyses offered by data mining move beyond the analyses
of past events provided by retrospective tools typical of ______________ .
OLTP
OLAP
Decision Support Systems (Right Answer)
None of these
It is called a _____________ violation if we have null values for attributes where a NOT NULL constraint exists.
Load
Transform
Constraint (Page 161)
Extraction
UAT stands for
User acceptance testing (Page 193)


The application development quality assurance activities cannot be completed until the data is
Stabilized page 308
Identified
Finalized
Computerized
In the Kimball lifecycle, the product selection phase falls in the ____________
Lifecycle Technology Track (Page 290)


Lifecycle Data Track
Lifecycle Analytic Applications Track
None of the given
Which of the following is NOT an issue of clickstream data?
Identifying the Visitor Origin
Identifying the Session
Identifying the Visitor
Identifying the server
Which statement about HTTP is true?
Is stateless (Page 364)


Non world wide web protocol
Used to maintain session
Message routing protocol



The i-th bit is set to 1 if the i-th row of the base table has the value for the indexed column. This statement
refers to

Inverted
Bitmap (Page 233)
Dense
Sparse index
In the context of a web data warehouse, which is NOT one of the ways to identify a session?
Using asynchronous session tracking protocol


Using Time-contiguous Log Entries
Using Transient Cookies
Using HTTP's secure sockets layer (SSL)
Using session ID Ping-pong
Using Persistent Cookies

The other MCQs were from the Midterm; 3 or 4 were from the Handouts but very easy.
Subjective:
2 Marks Questions
There are four categories of data quality improvement. Write any two. (2 marks)
Answer:
The four categories of Data Quality Improvement
Process
System
Policy & Procedure
Data Design
Name two types of unsupervised learning. (2 marks)
Answer:
One way clustering
Two way clustering



Explain the statement "Be a diplomat, not a technologist." (2 marks)
Answer: The biggest problem you will face during a warehouse implementation will be people, not the
technology or the development.
1. Management: You're going to have senior management complaining about completion dates and
unclear objectives.
2. Development Team: You're going to have development people protesting that everything takes too
long and asking why they can't do it the old way.
3. Users: You're going to have users with outrageously unrealistic expectations, who are used to systems
that require mouse-clicking but not much intellectual investment on their part.
4. And you're going to grow exhausted, separating out Needs from Wants at all levels. Commit from the
outset to work very hard at communicating the realities, encouraging investment, and cultivating the
development of new skills in your team and your users (and even your bosses).
Most of all, keep smiling. When all is said and done, you'll have a resource in place that will do magic, and
your grief will be long past. Eventually, your smile will be effortless and real.
Define clickstream. (2 marks)
Answer:
Clickstream is every page event recorded by each of the company's Web servers
Web-intensive businesses
Although most exciting, at the same time it can be the most difficult and most frustrating.
Not JUST another data source.

3 Marks Questions
As the number of processes increases, the speedup should also increase. Thus
theoretically there should be a linear speedup; however, this is not the case in
reality. List at least 2 barriers to linear speedup. (3 marks)
Answer:
Amdahl's Law
Startup
Interference
Skew
Common dimensions in the context of a Web data warehouse. (3 marks)
Answer: ------
Name three DWH development methodologies. (3 marks)
Answer:



Development methodologies
Waterfall model
Spiral model
RAD Model
Structured Methodology

Data Driven
Goal Driven
User Driven
One question asked whether a given statement was correct or not (it was from the Midterm). (3 marks)
5 Marks Question:
Before sitting down with the business community to gather information, it is
suggested to set yourself up for a productive session. Write three activities of the
requirements preplanning phase. (5 marks)
Answer:
Requirements preplanning: This phase consists of activities like choosing the forum, identifying and
preparing the requirements team and finally selecting, scheduling and preparing the business
representatives.
The query SELECT * FROM R WHERE A = 5 was given, and we had to say which
technique is appropriate among dense, sparse, B-tree and hash indexing. (5 marks)
Answer: Hash indexing is appropriate for the given query, because hash indexing is good for exact-match
queries.
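To see why a hash index suits this query, here is a minimal Python sketch. It is an illustration only, not how any particular DBMS implements hashing. The contents of R and the helper names are assumptions made up for the example.

```python
from collections import defaultdict

def build_hash_index(rows, column):
    """Map each value of `column` to the positions of the rows holding it."""
    index = defaultdict(list)
    for position, row in enumerate(rows):
        index[row[column]].append(position)
    return index

def lookup_equal(rows, index, value):
    """Answer an exact-match query by probing the index instead of scanning."""
    return [rows[position] for position in index.get(value, [])]

# Hypothetical contents of relation R (not from the exam paper)
R = [{"A": 5, "B": "x"}, {"A": 7, "B": "y"}, {"A": 5, "B": "z"}]
index_on_A = build_hash_index(R, "A")

# Equivalent of: SELECT * FROM R WHERE A = 5
matches = lookup_equal(R, index_on_A, 5)
```

The lookup goes straight to the bucket for the value 5. The same structure gives no help for a range condition such as A > 5, which is where an ordered index like a B-tree fits better.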
According to Amdahl's Law, prove that the speedup does not remain the same if
the sequential fraction of the problem and the number of processors are both doubled. Please note
that zero overhead and perfect parallelism are assumed. Use the following examples. (5 marks)
a) Fraction of the problem that must be computed sequentially is 5% and number of processors is 100.
b) Fraction of the problem that must be computed sequentially is 10% and number of processors is 200.
Ans: REF (Handouts Page # 204,205)
Amdahl's Law: S <= 1 / (f + (1 - f) / N), where f is the sequential fraction and N is the number of processors.
a) S = 1 / (0.05 + (1 - 0.05) / 100) = 1 / 0.0595 = 16.81
b) S = 1 / (0.10 + (1 - 0.10) / 200) = 1 / 0.1045 = 9.57
Hence it is evident that the speedup does not remain the same if we double both the sequential fraction and the number of
processors.
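The two results can be checked with a few lines of Python. This is a minimal sketch of the Amdahl's Law bound under the zero-overhead assumption stated in the question. The function name is chosen here for illustration.

```python
def amdahl_speedup(sequential_fraction, processors):
    """Upper bound on speedup: 1 / (f + (1 - f) / N), with zero overhead."""
    return 1.0 / (sequential_fraction + (1.0 - sequential_fraction) / processors)

speedup_a = amdahl_speedup(0.05, 100)  # case a) f = 5%,  N = 100
speedup_b = amdahl_speedup(0.10, 200)  # case b) f = 10%, N = 200

print(round(speedup_a, 2), round(speedup_b, 2))  # prints: 16.81 9.57
```

As N grows, the bound approaches 1 / f, so doubling the sequential fraction roughly halves the best achievable speedup no matter how many processors are added.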



Attributes of Page Dimension: 5 marks
Answer: Page no : 362 Ch# 40
CS614 Current FinalTerm Paper 24 August 2016
Question # (01): (2 Marks)
With respect to data retrieval, queries can be categorized in three sets. One is Full Table
Scan queries (i.e. the queries that scan the complete table). Name the other two sets.
Question # (02): (2 Marks)
What is meant by the statement "Be a diplomat, NOT a technologist" in the context of a data
warehousing development project?
The biggest problem you will face during a warehouse implementation will be people, not
the technology or the development. You're going to have senior management
complaining about completion dates and unclear objectives. You're going to have
development people protesting that everything takes too long and asking why they can't do it
the old way. You're going to have users with outrageously unrealistic expectations, who
are used to systems that require mouse-clicking but not much intellectual investment on
their part. And you're going to grow exhausted, separating out Needs from Wants at all
levels. Commit from the outset to work very hard at communicating the realities,
encouraging investment, and cultivating the development of new skills in your team and
your users (and even your bosses). Most of all, keep smiling. When all is said and done,
you'll have a resource in place that will do magic, and your grief will be long past.
Eventually, your smile will be effortless and real.
Question # (03): (2 Marks)
Is there any fixed strategy to standardize a column?
There are no fixed strategies to standardize the columns. Again it depends on the project
designer what methodology he/she devises. We can devise a simple methodology that
can later be used for other columns as well.
Question # (04): (2 Marks)
List down any two parallel Hardware Architectures?
Parallel Hardware Architectures
Symmetric Multi Processing (SMP)
Distributed Memory or Massively Parallel Processing (MPP)
Non-uniform Memory Access (NUMA)
Question # (05): (3 Marks)
Identify the given statement as correct or incorrect and justify your answer in either case.
Bayesian Modeling is an example of unsupervised learning.
Incorrect. Bayesian modeling is an example of supervised learning because the type and number of classes are
known in advance.
Question # (06): (3 Marks)
Name any three data warehouse development methodologies.
Development methodologies
Waterfall model
Spiral model
RAD Model
Structured Methodology
Data Driven
Goal Driven
User Driven
Question # (07): (3 Marks)
List any three ways of handling missing data during the data cleansing process.
In the DWH, major data cleansing issues arose because the data was processed and handled
at four levels by different groups of people, i.e.
(i) hand recordings by the scouts at the field level,
(ii) typing the hand recordings into data sheets at the DPWQCP office,
(iii) photocopying of the scouting sheets by DPWQCP personnel, and finally
(iv) data entry or digitization by hired data entry operators.
Question # (08): (3 Marks)
HTTP's secure sockets layer (SSL) is used to identify the session on the World Wide Web;
however, there are some limitations of this technique. Briefly explain two limitations.
SSL offers an opportunity to track a visitor session because it may include a login action
by the visitor and the exchange of encryption keys.
Limitations
To track the session, the entire information exchange needs to be in high-overhead SSL.
Each host server must have its own unique security certificate.
Visitors may be put off by security advisories (pop-up certificate boxes) that can appear when certain browsers are used.
Question # (09): (5 Marks)
Briefly explain any two types of precedence constraints that we can use in DTS?
Unconditional: If you want Task 2 to wait until Task 1 completes, regardless of the
outcome, link Task 1 to Task 2 with an unconditional precedence constraint.
On Success: If you want Task 2 to wait until Task 1 has successfully completed, link Task
1 to Task 2 with an On Success precedence constraint.
On Failure: If you want Task 2 to begin execution only if Task 1 fails to execute
successfully, link Task 1 to Task 2 with an On Failure precedence constraint. If you want
to run an alternative branch of the workflow when an error is encountered, use this
constraint.
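The three constraint types can be modelled in a short Python sketch. This is not the DTS API. The function names and the constraint strings are assumptions invented here to show the decision logic only.

```python
def should_run(constraint, predecessor_succeeded):
    """Decide whether Task 2 may start, given how Task 1 ended."""
    if constraint == "unconditional":
        return True                       # runs regardless of the outcome
    if constraint == "on_success":
        return predecessor_succeeded      # runs only after a successful Task 1
    if constraint == "on_failure":
        return not predecessor_succeeded  # runs only when Task 1 failed
    raise ValueError(f"unknown constraint: {constraint}")

def run_pair(task1, task2, constraint):
    """Run task1, then task2 only if the precedence constraint allows it."""
    try:
        task1()
        succeeded = True
    except Exception:
        succeeded = False
    ran_task2 = should_run(constraint, succeeded)
    if ran_task2:
        task2()
    return succeeded, ran_task2
```

With a Task 1 that raises an error, `run_pair(task1, task2, "on_failure")` starts Task 2 as the alternative branch, while `"on_success"` leaves it unexecuted.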
Question # (10): (5 Marks)


Briefly explain merge/purge problem while applying data cleansing in data warehousing field?
Within the data warehousing field, data cleansing is applied especially when several
databases are merged. Records referring to the same entity are represented in different
formats in the different data sets or are represented erroneously. Thus, duplicate records
will appear in the merged database. The issue is to identify and eliminate these
duplicates.
The problem is known as the merge/purge problem. Instances of this problem appearing
in literature are called record linkage, semantic integration, instance identification, or
object identity problem.
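A very small Python sketch of the idea follows. It assumes that duplicates differ only in letter case, punctuation and spacing, which is far simpler than real record linkage. The sample records and the matching rule are made up for illustration.

```python
import re

def match_key(record):
    """Build a crude comparison key: lowercase, drop punctuation and extra spaces."""
    name = re.sub(r"[^a-z0-9 ]", "", record["name"].lower())
    name = " ".join(name.split())
    return (name, record["city"].strip().lower())

def merge_purge(*sources):
    """Merge several record lists, keeping the first record seen for each key."""
    merged = {}
    for source in sources:
        for record in source:
            merged.setdefault(match_key(record), record)
    return list(merged.values())

# Hypothetical databases that describe the same person in different formats
db1 = [{"name": "Ali Khan", "city": "Lahore"}]
db2 = [{"name": "ALI  KHAN.", "city": "lahore "},
       {"name": "Sara Malik", "city": "Karachi"}]

unique = merge_purge(db1, db2)  # two records remain, the duplicate is purged
```

Erroneous values such as misspelled names would slip past this exact-key rule, which is why the general problem needs approximate matching.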
Question # (11): (5 Marks)
Give two reasons why Rapid Application Development (RAD) is more suitable for data
warehousing development as compared to other traditional development methodologies.
Rapid Application Development (RAD) is an iterative model consisting of stages
like scope, analyze, design, construct, test, implement, and review. It is much better
suited to the development of a data warehouse because of its iterative nature and fast
iterations.
User requirements are sometimes difficult to establish because business analysts are too
close to the existing infra-structure to easily envision the larger empowerment that data
warehousing can offer. Development and delivery of early prototypes will drive future
requirements as business users are given direct access to information and the ability to
manipulate it. Management of expectations requires that the content of the data
warehouse be clearly communicated for each iteration.
CS614 Current FinalTerm Paper 25 August 2016
Q.What may be possible implications if the developing organization never freezes the
requirements throughout the DWH development i.e. it always behaves like an accommodating
person. 5

Write down any two drawbacks if Date is stored in text format rather than using proper date
format like dd-MMM-yy etc. 5
There are different data mining techniques e.g. clustering, description etc. Each of the
following statement corresponds to some data mining technique. For each statement name
the technique the statement corresponds to. 5
a) Assigning customers to predefined customer segments (i.e. good vs. bad)
b) Assigning credit applicants to predefined classes (i.e. low, medium, or high risk)
c) Guessing how much customers will spend during next 6 months


d) Building a model and assigning a value from 0 to 1 to each member of the set. Then
classifying the members into categories based on a threshold value.
e) Guessing how many students will score more than 65% grades in midterm.
Answer:
a) Assigning customers to predefined customer segments (i.e. good vs. bad): Classification
b) Assigning credit applicants to predefined classes (i.e. low, medium, or high risk): Classification
c) Guessing how much customers will spend during next 6 months: Prediction
d) Building a model and assigning a value from 0 to 1 to each member of the set, then classifying the members into categories based on a threshold value: Estimation
e) Guessing how many students will score more than 65% grades in midterm: Prediction
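Statement d) can be pictured with a short Python sketch: each member first receives an estimated value between 0 and 1, and a threshold then splits the members into categories. The scores, the threshold and the category names below are hypothetical. The model that would produce the scores is not shown.

```python
def classify_by_threshold(scores, threshold=0.5):
    """Turn estimated scores in [0, 1] into two categories using a cut-off."""
    return {
        member: ("above" if score >= threshold else "below")
        for member, score in scores.items()
    }

# Hypothetical scores, as if produced by some previously built model
estimated = {"c1": 0.82, "c2": 0.35, "c3": 0.50}
categories = classify_by_threshold(estimated)
```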

There are two justifications for a task to be performed in parallel, either it manipulates
significant amount of data (i.e. size) or it can be solved by divide and conquer (D&C) strategy.
From the given list, provide the justification for each of the task to perform it in parallel. 5
a) Large table scans and joins: Size
b) Creation of large indexes: Size
c) Partitioned index scans: D&C
d) Bulk inserts, updates, and deletes: Size
e) Aggregations and copying: D&C

What are the tasks performed through the Import/Export Data Wizard to load data? Write any three. (3 marks)
Import and Export Data Wizard provides the easiest method of loading data.
The wizard creates package which is a collection of tasks
Tasks can be as follows:
Establish connection through source / destination systems
Creates similar table in SQL Server
Extracts data from text files
Apply very limited basic transformations if required
Loads data into SQL Server table
In context of Four Cell Quadrant Technique, which business process (from the diagram below)
will have highest priority? Justify with reason. [Marks 3]
So the process having higher feasibility and impact is given higher priority over the
process having lower feasibility and impact. In example of Figure 33.2, process A has
highest priority while the process D has lowest priority PG# 298


Consider the following two statements. Specify which activity of a data quality analysis project each statement corresponds to. (3 marks)
a) Identify functional user data quality requirements and establish data quality metrics.
Define
b) Measure conformance to current business rules and develop exception reports.
Measure
Identify the given statement as correct or incorrect and justify your answer in either case.
Bayesian Modeling is an example of Unsupervised Learning. 3
The problems associated with the extracted data can correspond to non-primary keys. List
any four problems associated with non-primary keys. (5 marks)
Non primary key problems
1. Different encoding in different sources.
2. Multiple ways to represent the same information.
3. Sources might contain invalid data.
4. Two fields with different data but same name.
1. Data may be encoded differently in different sources. The domain of a gender field
in some database may be {F, M} or as {Female, Male} or even as {1, 0}.
2. There are often multiple ways to represent the same piece of information. FAST,
National University, FAST NU and Nat. Univ. of Computers can all be found
in the literature as representing the same institution.
3. Sources might contain invalid data. A point of sale terminal may require that the sales
clerk enters a customer's telephone number. If the customer does not wish to give it,
clerks may enter 999-999-9999.
4. Two fields may contain different data but have the same name. There are a couple of
ways in which this can happen. Total Sales probably means fiscal year sales to one part
of an enterprise and calendar year sales to another. The second instance can be much
more dangerous. If an application is used by multiple divisions, it is likely that a field that
is necessary for one business unit is irrelevant to another and may be left blank by the
second unit or, worse, used for otherwise undocumented purposes.
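The first problem, different encodings of the same field, is commonly handled with a lookup table during transformation. The Python sketch below illustrates this for the gender example. Which of 1 and 0 stands for which gender is an assumption here, since the text does not say, and the function name is invented.

```python
# Assumed mapping: the text lists the domains {F, M}, {Female, Male} and {1, 0}
# but does not say which digit means which gender, so 1 -> F is a guess.
GENDER_CODES = {
    "f": "F", "female": "F", "1": "F",
    "m": "M", "male": "M", "0": "M",
}

def standardize_gender(raw):
    """Map a source-specific gender code to one standard code, or None if unknown."""
    if raw is None:
        return None
    return GENDER_CODES.get(str(raw).strip().lower())

standardized = [standardize_gender(v) for v in ["Female", "M", 1, " male ", "n/a"]]
```

Values that do not appear in the table come back as None, so they can be routed to an exception report instead of being loaded silently.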
What is a Reverse Proxy? (2 marks)
Another type of proxy server, called a reverse proxy, can be placed in front of our enterprise's Web servers to help them offload requests for frequently accessed content. This kind of proxy is entirely within our control and usually presents no impediment to Web warehouse data collection. It should be able to supply the same kind of log information as that produced by a Web server.
Why is the analytics track called "the fun part" while designing a data warehouse? (2 marks)



The final set of parallel activities following the business requirements definition is the
analytic application track, where we design and develop the applications that
address a portion of the users' analytic requirements. As a well-respected application
developer once told us, "Remember, this is the fun part!" We're finally using the investment
in technology and data to help users make better decisions.
List two main types of unsupervised learning. (2 marks)
One-way Clustering
Two-way Clustering
1) Issues of Click stream
Issues of Click stream Data: (Page#341)
Click stream data has many issues:
Identifying the Visitor Origin
Identifying the Session
Identifying the Visitor
Proxy Servers
Browser Caches
2) Lexical error
Lexical errors: For example, assume the data to be stored in table form with each row
representing a tuple and each column an attribute. If we expect the table to have five
columns because each tuple has five attributes but some or all of the rows contain only
four columns then the actual structure of the data does not conform to the specified
format.
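A check for this kind of error can be sketched in a few lines of Python. The table contents are hypothetical and the function name is chosen here for illustration.

```python
def find_lexical_errors(rows, expected_columns):
    """Return the positions of rows whose column count differs from the specified format."""
    return [i for i, row in enumerate(rows) if len(row) != expected_columns]

# Hypothetical five-attribute table in which the second row has only four columns
table = [
    ["1", "Ali", "M", "Lahore", "2016"],
    ["2", "Sara", "F", "2016"],
    ["3", "Omar", "M", "Karachi", "2016"],
]
bad_rows = find_lexical_errors(table, expected_columns=5)
```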
3) Limitation of HTTP
Using HTTP's secure sockets layer (SSL)
This offers an opportunity to track a visitor session because it may include a login action
by the visitor and the exchange of encryption keys.
Limitations
To track the session, the entire information exchange needs to be in high-overhead SSL.
Each host server must have its own unique security certificate.
Visitors may be put off by security advisories (pop-up certificate boxes) that can appear when certain browsers are used.
4) Name of Activities in DWH lifecycle
DWH Lifecycle: Key steps
1. Project Planning


2. Business Requirements Definition


3. Parallel Tracks
3.1 Lifecycle Technology Track
3.1.1 Technical Architecture
3.1.2 Product Selection
3.2 Lifecycle Data Track
3.2.1 Dimensional Modeling
3.2.2 Physical Design
3.2.3 Data Staging design and development
3.3 Lifecycle Analytic Applications Track
3.3.1 Analytic application specification
3.3.2 Analytic application development
4. Deployment
5. Maintenance
5) Issues of data acquisition and cleansing in agriculture
The pest scouting sheets are larger than A4 size (8.5 x 11), hence the right end
was cropped when scanned on a flat-bed A4 size scanner.
The right part of the scouting sheet is also the most troublesome, because of
pesticide names for a single record typed on multiple lines i.e. for multiple
farmers.
As a first step, OCR (Optical Character Reader) based image to text
transformation of the pest scouting sheets was attempted. But it did not work even
for relatively clean sheets with very high scanning resolutions.
Subsequently DEOs (Data Entry Operators) were employed to digitize the
scouting sheets by typing.
6) Table of Bitmap index
Pg# 233
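Since the handout table is not reproduced here, the following Python sketch shows how such a table is built: one bit vector per distinct value, with the i-th bit set to 1 when the i-th row holds that value. The sample column is hypothetical.

```python
def build_bitmap_index(column_values):
    """Build one bit vector per distinct value of the indexed column."""
    n = len(column_values)
    bitmaps = {}
    for i, value in enumerate(column_values):
        if value not in bitmaps:
            bitmaps[value] = [0] * n  # start with all bits cleared
        bitmaps[value][i] = 1         # the i-th row holds this value
    return bitmaps

# Hypothetical gender column of a four-row base table
gender_column = ["M", "F", "F", "M"]
bitmap_index = build_bitmap_index(gender_column)
# bitmap_index == {"M": [1, 0, 0, 1], "F": [0, 1, 1, 0]}
```

Each row sets exactly one bit across all vectors, so the vectors for a column always sum to 1 in every position.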
7) Gender guidelines
A mechanism can be formulated to correct gender:
Use a standard gender guide.
Create another table, Gender guide, with columns Name and Gender.
Copy distinct first names to Gender guide.
Manually put the gender of all names in Gender guide.
Transform St_Name in Exception such that the first name gets separated and stored in another column.
Make a join of the Exception table and Gender guide to fill in the missing gender.
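The steps above can be sketched in plain Python, with dictionaries standing in for the Exception and Gender guide tables. The sample names, the First_Name column and the helper functions are assumptions for illustration. In practice the same join would be written against the real tables.

```python
def first_name(st_name):
    """Separate the first name from a full student name."""
    parts = st_name.split()
    return parts[0].title() if parts else ""

# Hypothetical rows of the Exception table, two of them with missing gender
exception_rows = [
    {"St_Name": "ahmed raza", "Gender": None},
    {"St_Name": "Fatima Noor", "Gender": None},
    {"St_Name": "Ahmed Ali", "Gender": "M"},
]

# Copy the distinct first names into the guide
guide_names = sorted({first_name(r["St_Name"]) for r in exception_rows})

# Manual step: a person fills in the gender for every name in the guide
gender_guide = {"Ahmed": "M", "Fatima": "F"}

def fill_missing_gender(rows, guide):
    """Join each row with the guide on first name and fill gender where missing."""
    filled = []
    for row in rows:
        name = first_name(row["St_Name"])
        gender = row["Gender"] or guide.get(name)
        filled.append({**row, "First_Name": name, "Gender": gender})
    return filled

corrected = fill_missing_gender(exception_rows, gender_guide)
```

Rows that already have a gender keep it. The guide is consulted only when the value is missing.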
8) Choose the data mining metrics
9) Agree with the statistical algorithm
Data Mining uses statistical algorithms to discover patterns and regularities (or
knowledge) in data. For example: classification and regression trees (CART, CHAID),
rule induction (AQ, CN2), nearest neighbors, clustering methods, association rules,
feature extraction, data visualization, etc.
Data mining is, in some ways, an extension of statistics, with a few artificial intelligence
and machine learning twists thrown in. Like statistics, data mining is not a business
solution, it is just a technology.
10) Statement is correct/incorrect about Orr's Laws
Orr's Laws of Data Quality
Law #1: Data that is not used cannot be correct!
Law #2: Data quality is a function of its use, not its collection!
Law #3: Data will be no better than its most stringent use!
Law #4: Data quality problems increase with the age of the system!
Law #5: The less likely something is to occur, the more traumatic it will be
when it happens!