ETL is a process for performing data extraction, transformation, and loading.

Data Pipeline

The process extracts data from a variety of sources and formats, transforms it into a standard structure, and loads it into a database, file, web service, or other system for analysis, visualization, machine learning, and similar uses. ETL tools come in a wide variety of shapes. Some run on your desktop or on-premises servers, while others run as SaaS in the cloud.
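The three stages described above can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions: the class `MiniEtl`, its method names, the `name,amount` CSV layout, and the in-memory list standing in for a database are all hypothetical and do not come from any of the tools discussed here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical, minimal ETL flow: CSV lines in, normalized records out.
class MiniEtl {

    // Extract: split "name,amount" CSV lines into raw fields, skipping the header row.
    static List<String[]> extract(List<String> csvLines) {
        List<String[]> rows = new ArrayList<>();
        for (int i = 1; i < csvLines.size(); i++) {
            rows.add(csvLines.get(i).split(","));
        }
        return rows;
    }

    // Transform: bring each row into a standard structure
    // (trimmed upper-case name, numeric amount).
    static List<Map<String, Object>> transform(List<String[]> rows) {
        List<Map<String, Object>> records = new ArrayList<>();
        for (String[] row : rows) {
            Map<String, Object> record = new LinkedHashMap<>();
            record.put("name", row[0].trim().toUpperCase(Locale.ROOT));
            record.put("amount", Double.parseDouble(row[1].trim()));
            records.add(record);
        }
        return records;
    }

    // Load: append to an in-memory target that stands in for a database or file.
    static void load(List<Map<String, Object>> records, List<Map<String, Object>> target) {
        target.addAll(records);
    }

    public static void main(String[] args) {
        List<String> source = List.of("name,amount", " alice , 10.5", "bob,3");
        List<Map<String, Object>> warehouse = new ArrayList<>();
        load(transform(extract(source)), warehouse);
        System.out.println(warehouse);
    }
}
```

Keeping each stage as a separate method means a source, a transformation rule, or a target can be swapped without touching the other two, which is the same separation that dedicated ETL tools provide at a larger scale.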

Some are code-based, built on standard programming languages that many developers already know. Others are built on a custom DSL (domain-specific language) in an attempt to be more intentional and require less code.

Others still are completely graphical, only offering programming interfaces for complex transformations.

Data Pipeline

Data Pipeline is our own tool.

This approach also allows it to process both batch and streaming data through the same pipelines. Data Pipeline comes in a range of versions including a free Express edition. The tool allows for a combination of relational and non-relational data sources.

It also includes a business modeler for a non-technical view of the information workflow and a job designer for displaying and editing ETL steps.

A debugger also exists for real-time debugging. The tool has been designed with simplicity in mind. It is a stand-alone package and has no third-party dependencies and notification tools. Its feature set includes single-interface project integration, a visual job designer for non-developers, bidirectional integration, platform independence, and the ability to work with a wide range of applications and data sources such as Oracle, MS SQL, and JDBC.

These features not only make it a rival to competing commercial solutions but also make the ETL tool highly extensible.

Apache Crunch runs on top of Hadoop MapReduce and speeds up tasks that would otherwise be very slow on plain MapReduce. Examples of these tasks include data joining and integration. Apache Crunch is especially well suited to data that does not fit into a relational model, such as time series, serialized object formats, Avro records, and HBase rows and columns.
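Data joining, named above as a typical task, can be illustrated with a small in-memory hash join. This sketch is plain Java and deliberately does not use the Apache Crunch API; the class `HashJoinSketch`, the method `innerJoin`, and the two-column row layout are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory inner join of two {key, value} datasets.
class HashJoinSketch {

    // Joins rows on the key at index 0 and returns rows of {key, leftValue, rightValue}.
    static List<String[]> innerJoin(List<String[]> left, List<String[]> right) {
        // Build phase: index the right side by key.
        Map<String, List<String>> index = new HashMap<>();
        for (String[] r : right) {
            index.computeIfAbsent(r[0], k -> new ArrayList<>()).add(r[1]);
        }
        // Probe phase: look up each left row and emit one output row per match.
        List<String[]> joined = new ArrayList<>();
        for (String[] l : left) {
            for (String rightValue : index.getOrDefault(l[0], List.of())) {
                joined.add(new String[] {l[0], l[1], rightValue});
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String[]> users = List.of(
                new String[] {"u1", "Alice"},
                new String[] {"u2", "Bob"});
        List<String[]> orders = List.of(
                new String[] {"u1", "book"},
                new String[] {"u1", "pen"},
                new String[] {"u3", "lamp"});
        for (String[] row : innerJoin(users, orders)) {
            System.out.println(String.join(",", row));
        }
    }
}
```

On a single machine this is a few lines, but once the inputs no longer fit in memory the build and probe phases have to be distributed across a cluster, which is the kind of work a pipeline library layered on MapReduce takes over.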

Cascading

Cascading is an open source Java library used for data processing. Its operations include sorting, averaging, filtering, and merging. Cascading supports reading from and writing to a wide range of external sources.

Apache Oozie

Apache Oozie is a reliable and scalable tool that forms a single logical unit by sequentially combining multiple jobs.
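The idea of sequentially combining multiple jobs into a single logical unit can be sketched in plain Java. This is a simplified illustration, not the API or workflow format of any real scheduler; the class `SequentialWorkflow` and its `add` and `run` methods are hypothetical names chosen for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BooleanSupplier;

// Hypothetical workflow that chains jobs and treats them as one unit.
class SequentialWorkflow {

    // Named jobs in insertion order; each job reports true on success.
    private final Map<String, BooleanSupplier> jobs = new LinkedHashMap<>();

    SequentialWorkflow add(String name, BooleanSupplier job) {
        jobs.put(name, job);
        return this;
    }

    // Runs the jobs one after another and stops at the first failure.
    // Returns the names of the jobs that completed successfully.
    List<String> run() {
        List<String> completed = new ArrayList<>();
        for (Map.Entry<String, BooleanSupplier> entry : jobs.entrySet()) {
            if (!entry.getValue().getAsBoolean()) {
                break;
            }
            completed.add(entry.getKey());
        }
        return completed;
    }

    public static void main(String[] args) {
        List<String> done = new SequentialWorkflow()
                .add("extract", () -> true)
                .add("transform", () -> true)
                .add("load", () -> true)
                .run();
        System.out.println(done);
    }
}
```

Stopping at the first failed job is one possible design choice; it keeps later steps from running on incomplete input, so the caller can compare the returned list with the full job list to see whether the unit as a whole succeeded.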

Apache Oozie also supports job scheduling for specific systems such as shell scripts and Java programs.

Datasift

Datasift is a powerful data validation and transformation framework. It has been built with the purpose of targeting enterprise software development by providing developers with an extensible architecture.

Talend Open Studio for Data Integration

Talend is an open source tool that offers a wide range of data integration solutions. Its graphical user interface provides a drag-and-drop feature set that lets non-programmers execute complex integration tasks. Other features include the ability to manipulate strings, automatic lookup handling, and management of changing dimensions.

The extract-transform-load engine comes with executables for major platforms and supports integration into other applications. These libraries are used for unpacking, transforming and deploying data into programs written in Java, Groovy and other similar languages based on Java classes and objects.


