
Hadoop-based Data Integration

Unstructured File Parsing


Natural Language Processing
Real Time Connectivity
Web Services
Monitoring and Alerting
•Operational uptime
•Enforce development best practices
Data Validation
•Unit and system testing
•Check validity of operational data
Metadata Manager and Data Lineage
Business Glossary
Additional Performance and Scalability
•Workflow on Grid
•Session on Grid
•High Availability
•Partitioning
•Push Down Optimization
Team-based Development and Versioning
Basic XML Processing
Connectors to Data Sources
Administration
Basic Profiling
Rapid Prototyping with Business/IT Collaboration
Batch Data Integration
Data Analyst
Enterprise Discovery
Discovery Search
Data Profiling
Rule Builder
Data Quality Transformations
Exception Management
Reference Table Management
Data Quality Workflow
Identity Match Option
Data Domain and Enterprise Discovery
Grid Computing and Partitioning HADOOP SERVER
Data Quality Web Services
Universal Record ID
Business Glossary
Metadata Manager
Proactive Monitoring for Data Quality
Data Quality Accelerators
Functionality Cat
Data replication and synchronization
Service-oriented architecture (SOA)
Dependency Analysis Where we have used the object Designer
Statistical Analysis Designer
Metadata Management Management Console
user access layers Management Console
Wizard driven data dictionary
automated mapping Partially achieved in 4.0 onwards Designer
Auto save time setter Not implemented Designer
Symmetric multiprocessing Symmetric Multiprocessing (SMP) - Some Hard Architecture
MPP Parallel Processing (MPP) - Known as shared Architecture
Grid Architecture
Partitioning Metadata table level can be implemented Architecture
Common Warehouse Meta model Architecture
Packaged application read (PeopleSoft, SAP R/3)
load balancing Server Group Management Console
fail-over SMTP Designer
Job distribution
Data pipelining Sequential Designer
End-to-end BI infrastructure exchange infrastructure Designer
Version control system Central repo Central Repo
Splitting data streams/multiple targets No.of Loaders Designer
Conditional splitting Designer
Key lookups in memory cache lookup Designer
Read non-structured data Designer
Slowly changing dimensions Designer
Scheduler Management Console
Error handling within job Designer
Impact analysis Management Console
Data lineage Management Console
Automatic documentation Management Console
Support for data mining models Management Console
Support for analytical functions Management Console
forecasting Management Console
basket analysis Designer
regression
Task compatibility ETL / EAI
Reusability of components Designer
Decomposition Designer
User-defined functions Designer
Comments on selection of objects Designer
Step-by-step running Designer
Row-by-row running Designer
Breakpoints Designer
Watch Points
Compiler/validator Designer
Integration batch – real-time Management Console
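message queuing Message Queuing (also known as MSMQ) enables applications that are running at different times to communicate across heterogeneous networks and systems that might be temporarily offline. Applications send messages to queues and read messages from queues.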
database logs
database triggers
On-demand data integration
Native connections Designer
Real-time connections
Support for joined tables as source Designer
Changed data capture Designer
Built-in functions for data quality Designer
Built-in functions for data validation Designer
Data profiling Designer
Platforms
Version
Engine-based / code generator

Multiplatform
Localization
Technical support
Training
Audit trail management
Security management
Central administration CMC
Analysis and reporting
Dashboarding
Mobile BI capabilities
Advanced data search
BI software gadgets (e.g., Oracle BI beans)
Advanced reporting
Microsoft Office integration
Geospatial capabilities
Standard analytics
Advanced analytics
In-memory analytics
Web and social analytics
OLAP services
Data warehousing
Enterprise data integration
Data quality
Metadata management
Data mining
Operational BI capabilities
Metrics and KPIs
Scorecards
Planning
Strategy management
Budgeting
Financial consolidation
Risk and performance management
Forecasting
Portal integration
Master data management
Data governance
Industry vertical functionality
Third-party application integration
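
The message queuing row in the checklist above describes applications that run at different times exchanging messages through a queue. A minimal Python sketch of that idea follows. It uses the standard-library queue.Queue as an in-process stand-in, which is an assumption made for illustration only; a real deployment would use a message broker, and the names orders, send, and drain are hypothetical.

```python
import queue

# Hypothetical in-process stand-in for a message queue. A real system
# would use a broker (for example MSMQ), not queue.Queue.
orders = queue.Queue()


def send(message: dict) -> None:
    """Sender: enqueue the message and return, whether or not a reader is running."""
    orders.put(message)


def drain() -> list:
    """Receiver: read every message that accumulated while it was offline."""
    received = []
    while True:
        try:
            received.append(orders.get_nowait())
        except queue.Empty:
            return received


# The sender runs first, while the receiver is "offline".
send({"order_id": 1, "status": "new"})
send({"order_id": 2, "status": "new"})

# Later the receiver comes online and picks the messages up in send order.
messages = drain()
print([m["order_id"] for m in messages])
```

The point of the sketch is the decoupling: send never waits for a reader, and drain only sees what was queued before it ran.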

1. Availability of Drools Engine support (execute rules from Excel/CSV/XML files)
2. Command line execution support
3. Search engine support (to search any logs or jobs, like the Elasticsearch UI in Talend)
4. MongoDB support
5. Hadoop support
6. Hive and Pig job integration support
7. Reporting tool support for Jasper (to create a Jasper report directly from the job context if required)
8. Routing tool support (such as a Camel route if required)
9. Notification over SMS
10. FTP support
ADBMS
In-memory option
In-database analytics
Analytic library
Extensibility via UDF
On-demand platform interaction
On-demand data integration
On-demand access to streaming data (2.5M records/second)
Massive scalability (100s of TB)
Fast parallel load (40 TB/hour)
Columnar database
Compression (10+ times)
MPP
Real-time analytics
High availability
SQL access
SQL optimization
Extreme SQL optimization
Execution optimization
Hadoop analytics (interactive, bi-directional, data, and processes)
High-performance analytics for data in NoSQL archives
Discovery analytics
Ad hoc queries
On demand queries
No indexing required
No projections required
Enterprise licensing
Offerings: appliance, cloud, SaaS, commodity hardware
SSD support
Distributed analytic processing
Workload management
Interactive workload management
Big memory
SAN integration
ETL STRATEGY
ETL Strategy
ETL Teams

Product Differentiators
Product Liabilities
Upcoming Enhancements
Industry Focus
Packaged Applications Focus
PRODUCT
Product

Description
Clients
Servers
Engine
Sources
(Interfaces and Adapters)

Targets
(Interfaces and Adapters)
Required Add-on Products that You Sell
Required 3rd Party Products
Optional Products You Sell
Optional Products from 3rd Parties
DESIGN AND TRANSFORMATION
Development Environment

Graphical Interface

Coding

Object Characteristics

Object Library

Transformation Objects
Rejected Records
External Objects

Reuse

Version Control

Wizards and Assists

Debugging

META DATA MANAGEMENT


Self-documentation

Meta Data Repository

Extensibility
Distributed Management
Reverse Engineering
Meta Data Interchange

Impact Analysis Reports

Data Lineage Reports


Other Reports
EXTRACT, CAPTURE, LOAD, UPDATE, TRANSPORT
Extract Adapters

Adapter Development
Extract Processing
Parallel Processing and Scalability

Rule Based Extracts

Loading Targets

Target Transformations

Transport

ADMINISTRATION
Console

Job Monitoring

Scheduler

Job Validation
Error Recovery and Restart

Reporting
Security

PRICING
Client Prices
Server Prices
Adapter Prices

Required Products
Optional Products
Maintenance Fees

Pricing Example

Support

Contact Info
a) Three goals for the ETL product in next 12-18 months?
a) # of dedicated ETL salespeople?
b) # of dedicated ETL marketers?
c) # of dedicated ETL developers?
a) Three things that differentiate the ETL product.
a) Three reasons why some organizations may not find the product appropriate (limitations)
a) Three major enhancements to the next version of the product
a) The top 3 industries that buy the ETL tool, with % of total sales in each industry.
a) List the suites and applications you sell that also embed your ETL tool.

a) Name and version/release number


b) Date this version shipped
c) Date product first shipped (1.0 version)
a) One sentence description
a) List client modules. For each, define the primary purpose and available platforms.
a) List server modules. For each, define the primary purpose and available platforms.
a) What is the underlying engine that executes ETL rules? (SQL? Compiled code?)
a) What source system interfaces does the product bundle for free?
b) What source system interfaces does it charge extra for?
c) Describe the SDK to create custom adapters?
a) What target interfaces does it bundle for free?
b) What target interfaces does it charge extra for?
a) Specify additional products you sell that customers MUST buy to deploy the product
a) Specify 3rd party products that customers MUST buy to deploy the product (e.g. IIS Web Server)
a) Specify separately priced optional ETL products or add-ins that customers can purchase from your company
a) Specify separately priced optional ETL products or add-ins that customers can purchase from 3rd parties.

a) Consistent interface across all modules?


b) Graphical environment?
c) Command-line interface?
a) What functions can be defined visually?
b) What functions must be coded?
c) Can the GUI represent a complex workflow as a single icon? (container)
a) What code does the product generate internally?
b) Can you access and modify the internal code?
c) What scripting languages does it support for custom coding?
d) Can the tool generate SQL?
a) Are your objects blocks of procedural code or OO components with inheritance?
b) Can you depict objects as icons?
c) How do you create custom objects?
a) How many objects are in the library?
b) Can you store custom objects in library?
a) What types of transformation objects does the tool provide out of the box?
a) Built-in logic for defining and managing rejected records?
a) Can you create exits to external objects?
b) What languages can external objects be written in?
c) Can you document external objects from within the tool?
d) What third party objects does the tool support?
a) Can you copy and paste objects and sessions into one or more workflows?
b) Can you nest sessions within other sessions?
c) Are sessions context independent? (Do they automatically configure themselves to work in other workflows with other sources?)
d) Can you automatically update copied objects or sessions by reconfiguring a base template?
a) Is there a development repository?
b) Check-in, check-out functions?
c) Rollback to previous version?
d) Granular access control?
e) Audit trails of all activity and changes?
f) Integration with 3rd party version control software? Which ones?
a) Does tool support wizards? For what?
b) Does tool provide predefined mappings for common tasks and functions?
a) Is there a visual debugger?
b) Can you set checkpoints?
c) Can you check variables at each point?
d) Does tool provide sample test data?

a) What design rules does the tool automatically capture?


b) What operational data does the tool automatically capture?
c) What business meta data does the tool automatically capture?
a) Repository engine (e.g. RDBMS or proprietary)
b) Repository APIs
c) Repository clients supported
d) Standards supported
a) Is there an SDK to extend the repository data model?
a) Can the repository manage and synchronize meta data in multiple distributed ETL tools without coding or customization?
a) What sources can the tool reverse engineer?
a) Can it import/export meta data with third party tools? Which ones?
b) Can it manually synchronize meta data with third party tools? Which ones?
c) Can it automatically synchronize meta data with third party tools without coding? Which ones?
a) Can reports visually depict dependencies among components in multiple ETL workflows?
b) Can reports visually depict dependencies across third party tools? Which ones?
a) Can it generate visual reports that enable business users to view the origins of a components
a) What other reports can the repository generate out of the box?

a) List source adapters that come bundled with the product


b) List source adapters that can be purchased separately
a) What type of SDK exists for building custom adapters?
a) Process records in batch?
b) Process records in near real time?
c) Process records sequentially? (e.g. database cursors)
d) Process records from diverse sources simultaneously?
a) Execute multiple jobs concurrently?
b) Execute single job in parallel using system threads and pipelining?
c) Scale linearly across multiple CPUs?
d) Support load balancing and failover across clustered servers?
e) In-memory cache to avoid creating temporary files
a) Selectively extract data based on complex rules or filters? Which sources?
b) Capture changes that occurred since the last load or update? Which sources?
a) What target systems are supported?
b) What bulk load utilities are supported?
c) Can you turn off referential integrity and indexes?
d) Snapshot data after loading?
e) Load target partitions?
f) Automatically generate DDL?
What type of complex transformations does the tool support out of the box?
a) In-memory surrogate key generation?
b) Incremental dimensional aggregates?
c) Recursive processing or loops?
d) Temporary file generation?
e) Merge records from multiple files
f) Data cleansing?
g) Other
a) What network protocols are supported out of the box?
b) Can it initiate file transfer programs on remote servers?
c) Does it compress and encrypt data?

a) Visual console to run and manage jobs?


b) Command line interface?
c) Can it manage multiple ETL systems?
d) Integrated with 3rd party systems management tools?
a) Real time monitoring?
b) Support SMTP-based alerts?
c) What events are monitored? (e.g. job status, rows processed, time elapsed by task, time elapsed total, CPU and memory consumption, thro
a) Graphical scheduler to define schedules and job dependencies?
b) Schedule by time, event, interval, or condition (e.g. Boolean rules)
c) Integrate with 3rd party schedulers? Which ones?
a) Can the tool validate jobs before running them?
a) Can tool recover to last checkpoint or point of failure without manual work?
b) Can tool restart from point of failure?
c) Can tool restart entire session?
Does tool generate the following reports?
a) Job log, statistics, and diagnostics?
b) Error reports that tell what happened, why, and what to do?
c) Reconciliation reports
a) Access control and authentication
b) Integration with LDAP

a) Client modules by platform


a) Server modules by platform
a) Source adapter prices
b) Target adapter prices
a) Prices of other required products
a) Prices of optional products from you or 3rd parties.
a) Annual maintenance fees
b) Support and other services that come with annual maintenance
a) Typical cost to install a 4 CPU Windows system with 2 developers and 2 source and 1 target adapter
b) Typical cost to install a 4 CPU UNIX system with 2 developers and 2 source and 1 target adapter
a) Help desk locations and hours
b) Standard support services
c) Premium support services
a) Provide your contact information
Data Mining BO Predictive
Planning and Simulation BO PC SAP BI IP
Data Analysis BO Pioneer
Ad-hoc reporting BO Web Intelligence
Standard reporting BO Web Intelligence
Dashboard/Scorecard SAP Strategy Management SAP Visual Composer
Legal consolidation BO Financial Consolidation

All Organisation process web application

1.1.1.1. Real time reports access operational data stores


1.1.1.2. Generates reports automatically
1.1.1.3. Web query
1.1.1.4. Ad-hoc queries
1.1.1.5. Column-based indexing for faster data retrieval
1.1.1.6. Navigates all connected relational databases
1.1.1.7. Interactive data exploration with analytics
1.1.1.8. Desktop interface to advanced analytics
1.1.1.9. Multiple data sources and platforms
1.1.1.10. VLDB drivers
1.1.1.11. Multiple table inserts Multiple table inserts provide the ability to insert into more than one table with a single SQL statement.
1.1.1.12. Table functions Table functions eliminate the need to stage data into physical objects during complex data transformation.
1.1.1.13. Accesses delimited ASCII files
1.1.1.14. Accesses fixed format ASCII
1.1.1.15. Schedules information alerts based on specified conditions
1.1.1.16. 24x7 server availability
1.1.1.17. Real time access to operational data
1.1.1.18. Report drill through from OLAP cubes
1.1.1.19. Report drill through from zero data cubes
1.1.1.20. Tree-style structure that logically organizes database columns into folders
1.1.1.21. Embedded production report writers
1.1.1.22. Batch production and distribution of reports
1.1.1.23. Combines multiple reports into a single "report dashboard"
1.1.1.24. Exports reports to other formats
1.1.1.25. Multiple format support including HTML, Excel, and PDF
1.1.1.26. Integrates with Microsoft FrontPage and other HTML development tools
1.1.1.27. Interface with Microsoft Analysis Services and Hyperion Essbase
1.1.1.28. SAP BW
1.1.1.29. Java-based reporting tool
1.1.1.30. Custom Java tag library reduces the amount of coding required to integrate reports into JSP pages
1.1.1.31. Object-oriented 4GL
1.1.1.32. Graphical interface avoids complex languages, such as C or Java
1.1.2. REPORTING CAPABILITIES
1.1.2.1. Step-by-step report creator
1.1.2.2. Point-and-click graphics tool Point-and-click graphics tools give users a complete look at the data from every possible angle.
1.1.2.3. Hides or drills down to detailed information
1.1.2.4. Pre-set templates
1.1.2.5. Saves, schedules, and publishes reports
1.1.2.6. Multiblock reports
1.1.2.7. Report catalog and report serving
1.1.2.8. P&L reporting
1.1.2.9. Data-manipulation capabilities for row and cumulative totals
1.1.2.10. Cross-tabs break a report across an unlimited number of categories
1.1.2.11. Desktop and web reporting interfaces
1.1.2.12. Identifies exceptions in report
1.1.2.13. Automates the generation of reports
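
Item 1.1.1.12 above says that table functions remove the need to stage data into physical objects during a complex transformation. The Python sketch below is only an analogy for that idea, not SQL for any particular database: each step is a generator that hands rows to the next step one at a time, so no intermediate table is written. The function names and sample rows are hypothetical.

```python
from typing import Iterable, Iterator


def read_source(rows: Iterable[dict]) -> Iterator[dict]:
    """Yield source rows one at a time, like a table function returning rows."""
    for row in rows:
        yield row


def cleanse(rows: Iterable[dict]) -> Iterator[dict]:
    """Trim and title-case the name column without storing an intermediate table."""
    for row in rows:
        yield {**row, "name": row["name"].strip().title()}


def only_active(rows: Iterable[dict]) -> Iterator[dict]:
    """Pass through only the rows flagged as active."""
    for row in rows:
        if row["active"]:
            yield row


# Hypothetical sample data standing in for a source table.
source = [
    {"name": "  alice ", "active": True},
    {"name": "BOB", "active": False},
    {"name": "carol  ", "active": True},
]

# The three steps are chained directly; rows stream through without staging.
result = list(only_active(cleanse(read_source(source))))
print([r["name"] for r in result])
```

Only the final list is materialized. In a database, the equivalent benefit is that the output of one transformation can be queried directly by the next.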


4.2 Version 4.1 Version
Operational statistics enhancements Adaptive processing server from IP services
Object promotion management RFC Server
User application rights Administrator Service (used for log cleanup)
Manage datastore and substitution param configurations View Data Service
Data Services Workbench Metadata Browsing Service
Replication Job editor Data Services Workbench
Data Flow Editor Design-Time Data Viewer
Quick Replication Wizard Improved XML support through the XML_Map transform
Using data cleansing solutions from Data Cleansing Advisor DSN-less and TNS-less connections
Enhanced SAP HANA support
Big data loading SAP HANA repository support
Spatial data support in Data Services SAP HANA performance improvements
Enhancements to variables and parameters Data Services now uses a transparent staging table for bulk loading to targets
Enhancements to global variables Support for stored procedures
XML_Map transform- nesting and unnesting of hierarchical data Enhanced extraction capabilities for SAP Business Suite
Multithreaded TDP Entity Extraction transform Data streaming in ABAP data flows-RFC
SAP table reader in regular data flows -Execute in background
Parallel reading from business content extractors-multithreading
New ABAP functions for improved security-more granular level assign authorizations
SNC authentication and load-balancing support in SAP datastores
Parallel SAP connections for Open Hub tables
Continuous execution type workflow and Single execution type workflow
More pushdown of functions
Filtering updated rows in the Map_CDC_Operation transform
Filtering comparison rows in the Table_Comparison transform
Text Data Processing
Expanded language coverage
Aligned entity type names for TDP and DQ
Added Processing Timeout option
Added process logging for auditing and support resolution-TextID
Social Media entity type support
Entity Extraction pushdown to Hadoop
Processing of binary source documents-2003,7,10
Emoticon extraction support
Profanity extraction support
4.0 Version
SAP integration
Reading business content extractors for ERP and CRM/SRM
Support for SAP NetWeaver BW 7.3 staging BAPI
SAP NetWeaver BW 7.3 integration
SAP System Landscape Directory (SLD)
Single version local repository
Enhanced Validation
Improved joins in the transform
Query transform-cache can now be set
he XML_Map transform directly in the FROM tab
Enhanced Hierarchy Flattening transform-handle circular dependencies
Navigation commands on the menu in the main window
Command equivalents of mouse operations in context menus
TAB key navigation in dialog windows
Source and target support
Database synonyms are now supported in Oracle and DB2 datastores
For quicker data extraction from Teradata, the Teradata fast export functionality can now be used
File reader enhancements-Blank trimming option
Web service datastores-WSDL URLs
