Reusable Component Inventory

Component: Data Validation
Big Data Functional Area: Big Data Quality / Big Data Governance
Open Source Availability: The component is likely to be a series of MapReduce jobs.
Service Features:
Validates data against a predefined schema. Checks the data for:
1. Overall validity: the data is complete. This comprises checks on the size of the data and ensures that the data received for processing is complete.
2. Well-formedness: the data is formed as per configurable or provided metadata. For example, if the configuration metadata file specifies n columns, the data should conform to that.
3. Business validity: fields defined as mandatory are not missing.
4. Format validity: the data is formed as per the metadata. If a comma is defined as the delimiting character, the data should conform to that.
5. Field validity: the data follows the acceptable character and data type policy defined in the metadata.
6. Malicious data: the data conforms to the data quality rules. For example, a sensor gets hacked and generates data in the same form as other valid sensors, but this data may contain atypical content, which can be detected by rules.
The component should have configurable actions:
1. Discard the data if invalid and move ahead
2. Fail-fast mode
3. Log the exception, retain the data and move ahead
These actions should be configurable at each level of the validation step (see the sketch below).
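A minimal sketch of one such configurable validation step, assuming a simple dictionary-based metadata format (column count, delimiter, mandatory field positions) and a per-check action setting; the metadata layout and all names are illustrative, not taken from any existing library.

import logging

# Illustrative metadata: expected column count, delimiter and mandatory field positions.
METADATA = {"columns": 3, "delimiter": ",", "mandatory": [0, 2]}

# Per-check action: "discard", "fail_fast" or "log_and_keep" (configurable per validation level).
ACTIONS = {"well_formed": "fail_fast", "business": "log_and_keep"}

class ValidationError(Exception):
    pass

def handle(check, record, message, kept):
    """Apply the configured action for a failed check."""
    action = ACTIONS.get(check, "log_and_keep")
    if action == "fail_fast":
        raise ValidationError(f"{check}: {message}")
    if action == "log_and_keep":
        logging.warning("%s: %s (record kept)", check, message)
        kept.append(record)
    # "discard": drop the record silently and move ahead.

def validate(lines):
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split(METADATA["delimiter"])
        if len(fields) != METADATA["columns"]:                  # well-formedness check
            handle("well_formed", line, f"expected {METADATA['columns']} columns", kept)
            continue
        if any(not fields[i] for i in METADATA["mandatory"]):   # business validity check
            handle("business", line, "mandatory field missing", kept)
            continue
        kept.append(line)
    return kept

print(validate(["a,b,c", ",b,c"]))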

Component: REST based client
Big Data Functional Area: Data Ingestion
Service Features:
The component connects to the REST endpoint and downloads the data to a specified location. It must have the following configuration:
1. REST service URL
2. Exception message/code configuration
3. Security configuration (unsecured / token / HTTPS / username-password authentication)
4. Storage directory (single or multiple)
5. Naming policy
6. Action/error/time log: a readable log of when calls were made, how they ended and how long they took
The component can reconnect after a specified time if a connection attempt fails (see the sketch below).
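A minimal sketch of such a client using only the Python standard library, assuming a bearer-token option for the security configuration and a timestamp-based naming policy; the CONFIG fields and the URL are illustrative.

import time
import urllib.request
from pathlib import Path

# Illustrative configuration; the field names are assumptions, not a fixed schema.
CONFIG = {
    "url": "https://example.com/api/readings",
    "token": None,                       # None = unsecured; otherwise a bearer token
    "storage_dir": "downloads",
    "naming_policy": "reading-{ts}.json",
    "retry_seconds": 30,
    "max_attempts": 3,
}

def download():
    Path(CONFIG["storage_dir"]).mkdir(exist_ok=True)
    headers = {"Authorization": f"Bearer {CONFIG['token']}"} if CONFIG["token"] else {}
    request = urllib.request.Request(CONFIG["url"], headers=headers)
    for attempt in range(1, CONFIG["max_attempts"] + 1):
        started = time.time()
        try:
            with urllib.request.urlopen(request, timeout=30) as response:
                target = Path(CONFIG["storage_dir"]) / CONFIG["naming_policy"].format(ts=int(started))
                target.write_bytes(response.read())
            print(f"attempt {attempt}: OK in {time.time() - started:.1f}s -> {target}")
            return target
        except OSError as error:          # covers URLError, timeouts and socket errors
            print(f"attempt {attempt}: failed ({error}); retrying in {CONFIG['retry_seconds']}s")
            time.sleep(CONFIG["retry_seconds"])
    raise RuntimeError("all connection attempts failed")

if __name__ == "__main__":
    download()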

Component: Configuration Bootstrapping Service
Big Data Functional Area: Information integration / Infrastructure and Platform management
Service Features:
This component loads data that is essential for the smooth running and maintenance of the system. This includes various configurations and static data; examples are training data sets in the case of a machine learning application, as well as algorithm settings, parameters and data lookups that may need to be loaded at startup and used in all layers of the application.
1. The configuration setup has to work for distributed systems.
2. Any new processing node joining the cluster should get access to all the static data and configuration in a fast and efficient manner.
3. A single point of storage and dependency is avoided, to provide high availability.
4. The data load configuration should be easy to understand, self-explanatory and operating-system agnostic.
5. Should contain reusable parsers for popular data formats such as CSV, TLV, properties and XML (see the sketch below).
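A minimal sketch of such a bootstrap loader with pluggable parsers, assuming the static data lives under one shared directory (in a real deployment this would be replicated storage such as HDFS or an object store so that any new node can bootstrap the same way); the parser registry and file layout are illustrative.

import csv
import json
import xml.etree.ElementTree as ET
from pathlib import Path

# Reusable parsers for popular formats, keyed by file extension (illustrative set).
PARSERS = {
    ".csv":  lambda p: list(csv.DictReader(p.open())),
    ".json": lambda p: json.loads(p.read_text()),
    ".xml":  lambda p: ET.parse(p).getroot(),
    ".prop": lambda p: dict(line.split("=", 1) for line in p.read_text().splitlines() if "=" in line),
}

def bootstrap(shared_dir):
    """Load every static data/configuration file found under the shared directory."""
    loaded = {}
    for path in Path(shared_dir).rglob("*"):
        parser = PARSERS.get(path.suffix.lower())
        if parser:
            loaded[path.stem] = parser(path)
    return loaded

# Example usage at node startup:
# configs = bootstrap("/shared/bootstrap")
# lookups = configs.get("country_codes")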

Component: Big Data Testing web application
Big Data Functional Area: Auditing and Monitoring / Quality of service
Service Features:
There should be a pluggable web application which lets the user explore various large datasets in an efficient manner. This is a basic web application which takes the following configuration parameters:
1. Location of the big data source (HDFS files and HBase supported to start with)
2. The data size to be explored in one go
This component/web service is useful in testing, as it should provide a search-like interface over the data. It is primarily for analytics and business users to
1. Check the nature of the data
2. Perform operations similar to head, tail, and basic index slicing and searching (see the sketch below)
The user has to be notified about the estimated time required for browsing the data.
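A minimal sketch of the head/tail/slice-style browsing behind such an interface, using a plain local file as a stand-in for an HDFS or HBase source; the source path, chunk size and function names are illustrative.

from itertools import islice

SOURCE = "large_dataset.csv"   # stand-in for an HDFS path or HBase scan
CHUNK_SIZE = 1000              # data size to be explored in one go

def head(path, n=10):
    """Return the first n rows."""
    with open(path) as f:
        return list(islice(f, n))

def slice_rows(path, start, stop):
    """Return rows [start, stop), capped at CHUNK_SIZE rows per call."""
    stop = min(stop, start + CHUNK_SIZE)
    with open(path) as f:
        return list(islice(f, start, stop))

def search(path, term, limit=20):
    """Scan for rows containing the term, stopping after `limit` hits."""
    hits = []
    with open(path) as f:
        for line in f:
            if term in line:
                hits.append(line)
                if len(hits) >= limit:
                    break
    return hits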

Component: System Notifications
Big Data Functional Area: Infrastructure and Platform management / Security
Service Features:
The component should be designed to generate alerts and notifications at appropriate stages. It should have the following configurations:
1. Creating events
2. Listening to interesting events in the system
3. Managing events
4. Creating customized notification templates for publishing the events that have occurred
The event publication should have predefined templates for sending events out to the intended recipients.
The event generation should take the distributed system into account. For example, if an email notification is to be sent only after a specific job and data update have completed, the component should be able to hold the notification until then (see the sketch below).
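A minimal sketch of such an event mechanism with predefined templates and a hold-until condition for the distributed case; the event names, template text and callback registry are illustrative.

from string import Template

# Predefined notification templates, keyed by event type (illustrative).
TEMPLATES = {
    "job.finished": Template("Job $job finished with status $status"),
}

listeners = {}          # event type -> list of callbacks
pending = []            # events held until their precondition is met

def subscribe(event_type, callback):
    listeners.setdefault(event_type, []).append(callback)

def publish(event_type, payload, hold_until=None):
    """Publish an event; if hold_until is given, defer delivery until it returns True."""
    if hold_until and not hold_until():
        pending.append((event_type, payload, hold_until))
        return
    message = TEMPLATES[event_type].substitute(payload)
    for callback in listeners.get(event_type, []):
        callback(message)

def flush_pending():
    """Re-check held events, e.g. after a job or data update completes."""
    still_waiting = []
    for event_type, payload, condition in pending:
        if condition():
            publish(event_type, payload)
        else:
            still_waiting.append((event_type, payload, condition))
    pending[:] = still_waiting

subscribe("job.finished", print)   # stand-in for an email/SMS sender
publish("job.finished", {"job": "daily-load", "status": "OK"})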

Component: Data Movement
Big Data Functional Area: Infrastructure and Platform management / Security / Big data governance
Service Features:
In big data applications there are situations where data is moved within the system without changing its format and without further processing. One such instance is data archiving. The component should have
1. Data source configuration
2. Data target configuration
3. Naming policy, including a conflict resolution strategy
4. Configurable compression strategy for saving space
5. Configurable tracking of the target folder's disk space level (see the sketch below)
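A minimal sketch of such an archiving move covering the naming/conflict, compression and disk-space points; the directory names, free-space threshold and choice of gzip are illustrative.

import gzip
import shutil
from pathlib import Path

def archive(source, target_dir, compress=True, min_free_bytes=1 << 30):
    """Move one file into an archive folder without changing its content."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)

    # Configurable tracking of target folder disk space.
    if shutil.disk_usage(target_dir).free < min_free_bytes:
        raise RuntimeError("target folder below configured free-space threshold")

    source = Path(source)
    # Naming policy with a simple conflict-resolution strategy: append a counter.
    name = source.name + (".gz" if compress else "")
    target = target_dir / name
    counter = 1
    while target.exists():
        target = target_dir / f"{name}.{counter}"
        counter += 1

    if compress:
        with source.open("rb") as src, gzip.open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        source.unlink()
    else:
        shutil.move(str(source), target)
    return target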

Component: Data Purging
Big Data Functional Area: Platform management / Big data governance
Service Features:
The component should purge data and have the following capabilities:
1. Two-level purging, so that an accidental deletion can be reversed.
2. The log of deleted data should contain the timestamps of both the soft and the hard deletion, the user who requested the purge and the size of the purge. This record is created only after a successful purge operation (see the sketch below).
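A minimal sketch of two-level purging with an audit record written only after the hard delete has succeeded; the trash directory, log file name and record fields are illustrative.

import json
import shutil
import time
from pathlib import Path

TRASH = Path("trash")                # soft-deleted data lives here until hard deletion
PURGE_LOG = Path("purge_log.jsonl")  # one audit record per completed purge

def soft_delete(path):
    """Level 1: move data aside so an accidental deletion can be reversed."""
    TRASH.mkdir(exist_ok=True)
    target = TRASH / Path(path).name
    shutil.move(str(path), target)
    return target, time.time()

def hard_delete(trashed_path, soft_ts, user):
    """Level 2: remove for good and log the purge only after it succeeded."""
    trashed_path = Path(trashed_path)
    size = trashed_path.stat().st_size
    trashed_path.unlink()
    record = {
        "user": user,
        "size_bytes": size,
        "soft_deleted_at": soft_ts,
        "hard_deleted_at": time.time(),
    }
    with PURGE_LOG.open("a") as log:
        log.write(json.dumps(record) + "\n")
    return record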

Component: Job Scheduler
Big Data Functional Area: Platform management
Service Features:
Various scripts and jobs need to be scheduled for successful operation of the system. The scheduler component should have the following features:
1. Ability to schedule heterogeneous jobs using the underlying operating system.
2. Ability to schedule over the calendar.
3. Ability to schedule and prioritize multiple jobs, handling defined interdependencies (see the sketch below).
4. Simple logic-based scheduling.
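A minimal sketch of dependency-aware, prioritized scheduling that hands the actual jobs to the operating system; the job table, priorities and commands are illustrative stand-ins.

import subprocess
from graphlib import TopologicalSorter

# Illustrative job table: OS command, priority (lower runs first) and dependencies.
JOBS = {
    "ingest":   {"cmd": ["echo", "ingest"],   "priority": 1, "after": []},
    "validate": {"cmd": ["echo", "validate"], "priority": 1, "after": ["ingest"]},
    "report":   {"cmd": ["echo", "report"],   "priority": 2, "after": ["validate"]},
}

def run_all():
    sorter = TopologicalSorter({name: job["after"] for name, job in JOBS.items()})
    sorter.prepare()
    while sorter.is_active():
        # Among the jobs whose dependencies are met, run the highest-priority ones first.
        ready = sorted(sorter.get_ready(), key=lambda name: JOBS[name]["priority"])
        for name in ready:
            subprocess.run(JOBS[name]["cmd"], check=True)   # heterogeneous jobs via the OS
            sorter.done(name)

if __name__ == "__main__":
    run_all()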

Layers of Big data Application


Layers addressing Big Data functional concerns
Data Ingestion
Distributed Storage and Search
Data Access
Data Discovery and Analysis
Visualization

Cross-cutting layers addressing engineering concerns
Information integration
Infrastructure and Platform management
Security

Cross-cutting layers addressing quality and governance
Big data governance
Auditing and Monitoring
Quality of service

Big Data Pipeline (Reference)
