Reusable Component Inventory

Component: Data Validation
Big Data Functional Area: Big Data Quality / Big Data Governance
Open Source Availability: The component is likely to be a series of MapReduce jobs.
Service Features:
Validates data against a predefined schema. Checks the data for:
1. Overall validity: the data is complete. This comprises checks on the size of the data and ensures that the data received for processing is complete.
2. Well-formedness: the data is formed as per configurable or provided metadata. For example, if the configuration metadata file specifies n columns, the data should conform to that.
3. Business validity: fields defined as mandatory are not missing.
4. Format validity: the data is formed as per the metadata. If a comma is defined as the delimiting character, the data should conform to that.
5. Field validity: the data follows the acceptable character and data type policy defined in the metadata.
6. Malicious data: the data conforms to the data quality rules. For example, a sensor gets hacked and generates data in the same form as other valid sensors, but this data may contain atypical content, which can be detected by rules.
The component should have configurable actions:
1. Discard the data if invalid and move ahead
2. Fail-fast mode
3. Log the exception, retain the data and move ahead
These actions should be configurable at each level of the validation step (see the sketch below).
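A minimal sketch of one such configurable validation step, assuming a simple dictionary-based metadata format (column count, delimiter, mandatory field positions) and a per-check action setting; the metadata layout and all names are illustrative, not taken from any existing library.

import logging

# Illustrative metadata: expected column count, delimiter and mandatory field positions.
METADATA = {"columns": 3, "delimiter": ",", "mandatory": [0, 2]}

# Per-check action: "discard", "fail_fast" or "log_and_keep" (configurable per validation level).
ACTIONS = {"well_formed": "fail_fast", "business": "log_and_keep"}

class ValidationError(Exception):
    pass

def handle(check, record, message, kept):
    """Apply the configured action for a failed check."""
    action = ACTIONS.get(check, "log_and_keep")
    if action == "fail_fast":
        raise ValidationError(f"{check}: {message}")
    if action == "log_and_keep":
        logging.warning("%s: %s (record kept)", check, message)
        kept.append(record)
    # "discard": drop the record silently and move ahead.

def validate(lines):
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split(METADATA["delimiter"])
        if len(fields) != METADATA["columns"]:                  # well-formedness check
            handle("well_formed", line, f"expected {METADATA['columns']} columns", kept)
            continue
        if any(not fields[i] for i in METADATA["mandatory"]):   # business validity check
            handle("business", line, "mandatory field missing", kept)
            continue
        kept.append(line)
    return kept

print(validate(["a,b,c", ",b,c"]))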

Component: REST based client
Big Data Functional Area: Data Ingestion
Service Features:
The component connects to the REST endpoint and downloads the data to a specified location. It must have the following configuration:
1. REST service URL
2. Exception message/code configuration
3. Security configuration (unsecured / token / HTTPS / username-password authentication)
4. Storage directory (single or multiple)
5. Naming policy
6. Action/error/time log: a readable log of when calls were made, how they ended and how long they took
The component can reconnect after a specified time if a connection attempt fails (see the sketch below).
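A minimal sketch of such a client using only the Python standard library, assuming a bearer-token option for the security configuration and a timestamp-based naming policy; the CONFIG fields and the URL are illustrative.

import time
import urllib.request
from pathlib import Path

# Illustrative configuration; the field names are assumptions, not a fixed schema.
CONFIG = {
    "url": "https://example.com/api/readings",
    "token": None,                       # None = unsecured; otherwise a bearer token
    "storage_dir": "downloads",
    "naming_policy": "reading-{ts}.json",
    "retry_seconds": 30,
    "max_attempts": 3,
}

def download():
    Path(CONFIG["storage_dir"]).mkdir(exist_ok=True)
    headers = {"Authorization": f"Bearer {CONFIG['token']}"} if CONFIG["token"] else {}
    request = urllib.request.Request(CONFIG["url"], headers=headers)
    for attempt in range(1, CONFIG["max_attempts"] + 1):
        started = time.time()
        try:
            with urllib.request.urlopen(request, timeout=30) as response:
                target = Path(CONFIG["storage_dir"]) / CONFIG["naming_policy"].format(ts=int(started))
                target.write_bytes(response.read())
            print(f"attempt {attempt}: OK in {time.time() - started:.1f}s -> {target}")
            return target
        except OSError as error:          # covers URLError, timeouts and socket errors
            print(f"attempt {attempt}: failed ({error}); retrying in {CONFIG['retry_seconds']}s")
            time.sleep(CONFIG["retry_seconds"])
    raise RuntimeError("all connection attempts failed")

if __name__ == "__main__":
    download()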

Component: Configuration Bootstrapping Service
Big Data Functional Area: Information integration / Infrastructure and Platform management
Service Features:
This component loads data that is essential for the smooth running and maintenance of the system. This includes various configurations and static data; examples are training data sets in the case of a machine learning application, as well as algorithm settings, parameters and data lookups that may need to be loaded at startup and used in all layers of the application.
1. The configuration setup has to work for distributed systems.
2. Any new processing node joining the cluster should get access to all the static data and configuration in a fast and efficient manner.
3. A single point of storage and dependency is avoided, to provide high availability.
4. The data load configuration should be easy to understand, self-explanatory and operating-system agnostic.
5. Should contain reusable parsers for popular data formats such as CSV, TLV, properties and XML (see the sketch below).
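A minimal sketch of such a bootstrap loader with pluggable parsers, assuming the static data lives under one shared directory (in a real deployment this would be replicated storage such as HDFS or an object store so that any new node can bootstrap the same way); the parser registry and file layout are illustrative.

import csv
import json
import xml.etree.ElementTree as ET
from pathlib import Path

# Reusable parsers for popular formats, keyed by file extension (illustrative set).
PARSERS = {
    ".csv":  lambda p: list(csv.DictReader(p.open())),
    ".json": lambda p: json.loads(p.read_text()),
    ".xml":  lambda p: ET.parse(p).getroot(),
    ".prop": lambda p: dict(line.split("=", 1) for line in p.read_text().splitlines() if "=" in line),
}

def bootstrap(shared_dir):
    """Load every static data/configuration file found under the shared directory."""
    loaded = {}
    for path in Path(shared_dir).rglob("*"):
        parser = PARSERS.get(path.suffix.lower())
        if parser:
            loaded[path.stem] = parser(path)
    return loaded

# Example usage at node startup:
# configs = bootstrap("/shared/bootstrap")
# lookups = configs.get("country_codes")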

Component: Big Data Testing web application
Big Data Functional Area: Auditing and Monitoring / Quality of service
Service Features:
There should be a pluggable web application which lets the user explore various large datasets in an efficient manner. This is a basic web application which takes the following configuration parameters:
1. Location of the big data source (HDFS files and HBase supported to start with)
2. The data size to be explored in one go
This component/web service is useful in testing, as it should provide a search-like interface over the data. It is primarily for analytics and business users to
1. Check the nature of the data
2. Perform operations similar to head, tail, and basic index slicing and searching (see the sketch below)
The user has to be notified about the estimated time required for browsing the data.
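A minimal sketch of the head/tail/slice-style browsing behind such an interface, using a plain local file as a stand-in for an HDFS or HBase source; the source path, chunk size and function names are illustrative.

from itertools import islice

SOURCE = "large_dataset.csv"   # stand-in for an HDFS path or HBase scan
CHUNK_SIZE = 1000              # data size to be explored in one go

def head(path, n=10):
    """Return the first n rows."""
    with open(path) as f:
        return list(islice(f, n))

def slice_rows(path, start, stop):
    """Return rows [start, stop), capped at CHUNK_SIZE rows per call."""
    stop = min(stop, start + CHUNK_SIZE)
    with open(path) as f:
        return list(islice(f, start, stop))

def search(path, term, limit=20):
    """Scan for rows containing the term, stopping after `limit` hits."""
    hits = []
    with open(path) as f:
        for line in f:
            if term in line:
                hits.append(line)
                if len(hits) >= limit:
                    break
    return hits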

Component: System Notifications
Big Data Functional Area: Infrastructure and Platform management / Security
Service Features:
The component should be designed to generate alerts and notifications at appropriate stages. It should have the following configurations:
1. Creating events
2. Listening to interesting events in the system
3. Managing events
4. Creating customized notification templates for publishing the events that have occurred
The event publication should have predefined templates for sending events out to the intended recipients.
The event generation should take the distributed system into account. For example, if an email notification is to be sent only after a specific job and data update have completed, the component should be able to hold the notification until then (see the sketch below).
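A minimal sketch of such an event mechanism with predefined templates and a hold-until condition for the distributed case; the event names, template text and callback registry are illustrative.

from string import Template

# Predefined notification templates, keyed by event type (illustrative).
TEMPLATES = {
    "job.finished": Template("Job $job finished with status $status"),
}

listeners = {}          # event type -> list of callbacks
pending = []            # events held until their precondition is met

def subscribe(event_type, callback):
    listeners.setdefault(event_type, []).append(callback)

def publish(event_type, payload, hold_until=None):
    """Publish an event; if hold_until is given, defer delivery until it returns True."""
    if hold_until and not hold_until():
        pending.append((event_type, payload, hold_until))
        return
    message = TEMPLATES[event_type].substitute(payload)
    for callback in listeners.get(event_type, []):
        callback(message)

def flush_pending():
    """Re-check held events, e.g. after a job or data update completes."""
    still_waiting = []
    for event_type, payload, condition in pending:
        if condition():
            publish(event_type, payload)
        else:
            still_waiting.append((event_type, payload, condition))
    pending[:] = still_waiting

subscribe("job.finished", print)   # stand-in for an email/SMS sender
publish("job.finished", {"job": "daily-load", "status": "OK"})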

Component: Data Movement
Big Data Functional Area: Infrastructure and Platform management / Security / Big data governance
Service Features:
In big data applications there are situations where data is moved within the system without changing its format and without further processing. One such instance is data archiving. The component should have
1. Data source configuration
2. Data target configuration
3. Naming policy, including a conflict resolution strategy
4. Configurable compression strategy for saving space
5. Configurable tracking of the target folder's disk space level (see the sketch below)
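A minimal sketch of such an archiving move covering the naming/conflict, compression and disk-space points; the directory names, free-space threshold and choice of gzip are illustrative.

import gzip
import shutil
from pathlib import Path

def archive(source, target_dir, compress=True, min_free_bytes=1 << 30):
    """Move one file into an archive folder without changing its content."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)

    # Configurable tracking of target folder disk space.
    if shutil.disk_usage(target_dir).free < min_free_bytes:
        raise RuntimeError("target folder below configured free-space threshold")

    source = Path(source)
    # Naming policy with a simple conflict-resolution strategy: append a counter.
    name = source.name + (".gz" if compress else "")
    target = target_dir / name
    counter = 1
    while target.exists():
        target = target_dir / f"{name}.{counter}"
        counter += 1

    if compress:
        with source.open("rb") as src, gzip.open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        source.unlink()
    else:
        shutil.move(str(source), target)
    return target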

Component: Data Purging
Big Data Functional Area: Platform management / Big data governance
Service Features:
The component should purge data and have the following capabilities:
1. Two-level purging, so that an accidental deletion can be reversed.
2. The log of deleted data should contain the timestamps of both the soft and the hard deletion, the user who requested the purge and the size of the purge. This record is created only after a successful purge operation (see the sketch below).
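A minimal sketch of two-level purging with an audit record written only after the hard delete has succeeded; the trash directory, log file name and record fields are illustrative.

import json
import shutil
import time
from pathlib import Path

TRASH = Path("trash")                # soft-deleted data lives here until hard deletion
PURGE_LOG = Path("purge_log.jsonl")  # one audit record per completed purge

def soft_delete(path):
    """Level 1: move data aside so an accidental deletion can be reversed."""
    TRASH.mkdir(exist_ok=True)
    target = TRASH / Path(path).name
    shutil.move(str(path), target)
    return target, time.time()

def hard_delete(trashed_path, soft_ts, user):
    """Level 2: remove for good and log the purge only after it succeeded."""
    trashed_path = Path(trashed_path)
    size = trashed_path.stat().st_size
    trashed_path.unlink()
    record = {
        "user": user,
        "size_bytes": size,
        "soft_deleted_at": soft_ts,
        "hard_deleted_at": time.time(),
    }
    with PURGE_LOG.open("a") as log:
        log.write(json.dumps(record) + "\n")
    return record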

Component: Job Scheduler
Big Data Functional Area: Platform management
Service Features:
Various scripts and jobs need to be scheduled for successful operation of the system. The scheduler component should have the following features:
1. Ability to schedule heterogeneous jobs using the underlying operating system.
2. Ability to schedule over the calendar.
3. Ability to schedule and prioritize multiple jobs, handling defined interdependencies (see the sketch below).
4. Simple logic-based scheduling.
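A minimal sketch of dependency-aware, prioritized scheduling that hands the actual jobs to the operating system; the job table, priorities and commands are illustrative stand-ins.

import subprocess
from graphlib import TopologicalSorter

# Illustrative job table: OS command, priority (lower runs first) and dependencies.
JOBS = {
    "ingest":   {"cmd": ["echo", "ingest"],   "priority": 1, "after": []},
    "validate": {"cmd": ["echo", "validate"], "priority": 1, "after": ["ingest"]},
    "report":   {"cmd": ["echo", "report"],   "priority": 2, "after": ["validate"]},
}

def run_all():
    sorter = TopologicalSorter({name: job["after"] for name, job in JOBS.items()})
    sorter.prepare()
    while sorter.is_active():
        # Among the jobs whose dependencies are met, run the highest-priority ones first.
        ready = sorted(sorter.get_ready(), key=lambda name: JOBS[name]["priority"])
        for name in ready:
            subprocess.run(JOBS[name]["cmd"], check=True)   # heterogeneous jobs via the OS
            sorter.done(name)

if __name__ == "__main__":
    run_all()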

Layers of Big data Application


Layers addressing Big Data functional concerns
Data Ingestion
Distributed Storage and Search
Data Access
Data Discovery and Analysis
Visualization

Cross-cutting layers addressing engineering concerns
Information integration
Infrastructure and Platform management
Security

Cross-cutting layers addressing quality and governance
Big data governance
Auditing and Monitoring
Quality of service

Big Data Pipeline (Reference)
