Tuesday, May 30, 2017

What is ETL?



ETL (Extract, Transform and Load) is the process in data warehousing of extracting data from source systems, transforming it, and loading it into a data warehouse.



  • Extract is the process of reading data from one or more source systems.

  • Transform is the process of converting the extracted data from its previous form into the form it needs to be in so that it can be placed into another system. 
Transforming data can involve any of the following tasks:
    •  Applying business rules (e.g., calculating new measures and dimensions)
    •  Cleaning (e.g., mapping NULL to 0, or "Male" to "M" and "Female" to "F")
    •  Filtering (e.g., selecting only certain rows/columns to load)
    •  Splitting a column into multiple columns, and vice versa
    •  Joining together data from multiple sources (e.g., lookup, merge)
    •  Changing date formats (e.g., from YYYY-MM-DD to MM-DD-YYYY)
    •  Transposing rows and columns
    •  Applying any kind of simple or complex data validation (e.g., if the first 2 columns in a row are empty, reject the row from processing)

  • Load is the process of writing the data into the target system (generally the data warehouse).
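The transform tasks listed above can be sketched in a few lines of Python. This is only a minimal illustration of the concepts, not Informatica code; the field names and rules (id, name, sales, gender, order_date, price) are hypothetical.

```python
from datetime import datetime

def transform(rows):
    """Apply a few of the transform tasks listed above to an
    iterable of dicts: validation, cleaning, date reformatting,
    and a derived business-rule measure."""
    gender_map = {"Male": "M", "Female": "F"}
    for row in rows:
        # Validation: reject the row if the first two columns are empty
        if not row.get("id") and not row.get("name"):
            continue
        # Cleaning: map NULL (None) to 0, "Male"/"Female" to "M"/"F"
        row["sales"] = row.get("sales") or 0
        row["gender"] = gender_map.get(row.get("gender"), row.get("gender"))
        # Changing date format: YYYY-MM-DD -> MM-DD-YYYY
        if row.get("order_date"):
            d = datetime.strptime(row["order_date"], "%Y-%m-%d")
            row["order_date"] = d.strftime("%m-%d-%Y")
        # Business rule: derive a new measure
        row["revenue"] = row["sales"] * row.get("price", 1)
        yield row

source = [
    {"id": 1, "name": "a", "sales": None, "gender": "Male",
     "order_date": "2017-05-30", "price": 10},
    {"id": None, "name": None},  # rejected by the validation rule
]
loaded = list(transform(source))
```

In a real ETL tool each of these steps would be a separate, reusable transformation; chaining plain functions is just the simplest way to show the idea.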





Sunday, November 1, 2015

Informatica 9 Architecture Continued...

Integration Service Process

As mentioned earlier, whenever the Integration Service receives a request to execute a workflow, it spawns an Integration Service process. In the admin community this is typically called the pmserver process. Please note that the Integration Service can start one or more Integration Service processes to run and monitor workflows.
Whenever a workflow is scheduled or changed, the Workflow Manager interacts with the Integration Service process. This interaction happens over TCP/IP. The Integration Service process locks the workflow and then asks the Repository Service for metadata about the workflow. The Repository Service in turn connects to the repository database, fetches the desired information, and returns it to the Integration Service process.
Once the Integration Service process receives the metadata back from the Repository Service, it creates connections to the source and target databases using ODBC or native drivers. It also creates connections to any lookup or stored procedure databases used by the workflow. Once the data loading process is done, the Integration Service process releases and closes all open database connections.
The Integration Service process is also responsible for starting the DTM process that actually executes the workflow sessions. It passes the parameter file and session information to the DTM process, which the DTM uses to retrieve the required metadata from the repository. Please refer to the article Data Transformation Manager for details on the DTM process.
See the figure below for details of this interaction.


To summarize, the Integration Service process is responsible for the following tasks:
  • Workflow Schedule Management
  • Locking Workflow Before Execution
  • Reading Parameter File
  • Creating Workflow Logs
  • Running any Tasks in Workflow (e.g., Email)
  • Creating DTM processes to run Workflow Sessions
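The sequence described above can be pictured with a toy model: lock the workflow, fetch its metadata through the Repository Service, then hand each session to a DTM worker. All class and method names below are invented for illustration; this is not Informatica internals.

```python
class RepositoryService:
    def fetch_workflow_metadata(self, name):
        # Stand-in for the Repository Service querying the repository DB.
        return {"name": name, "sessions": ["s_load_orders", "s_load_items"]}

class IntegrationServiceProcess:
    def __init__(self, repo):
        self.repo = repo
        self.locks = set()

    def run_workflow(self, name, param_file=None):
        if name in self.locks:
            raise RuntimeError(f"workflow {name} is already locked")
        self.locks.add(name)                      # lock before execution
        meta = self.repo.fetch_workflow_metadata(name)
        logs = []
        for session in meta["sessions"]:
            # Each session would be handed to a DTM process together
            # with the parameter file; here we just record the hand-off.
            logs.append(f"DTM started for {session} (params={param_file})")
        self.locks.discard(name)                  # release the lock
        return logs

isp = IntegrationServiceProcess(RepositoryService())
log = isp.run_workflow("wf_daily_load", param_file="daily.prm")
```

The lock-before-execute step is the important detail: it is what prevents two Integration Service processes from running the same workflow concurrently.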

Thursday, October 22, 2015

Informatica 9 Architecture


Informatica introduced a Service Oriented Architecture with the release of Informatica 8.x. With 9.x there are additions and enhancements to the existing 8.x architecture. The diagram below details the various components used in 9.0.1/9.1.0 and how they interact with each other. There are a few points worth noting:
  • Service Oriented Architecture means that various components within Informatica encapsulate core functionalities and these components are composed together to build various Informatica products.
  • Informatica follows a Model Driven Architecture where all the domain objects are modeled. Note that "model" refers to the structure of the objects, their attributes/types, and so on. Also, the core meta-models are shared across all Informatica products, which helps the various tools interact with each other. For example, with the help of such a model the Informatica Analyst tool can talk to Informatica Developer.
Informatica 9 Architecture block diagram

Administrator Console:
This is a web-based platform where an administrator can set up and control how Informatica is configured. This involves setting up and maintaining the domain, nodes, and various services such as the Repository Service. The tool also provides security controls such as adding groups and users and managing their access to the repository.

Let's understand briefly what the various building blocks in the above architecture diagram represent.
  • Informatica Services Platform (ISP): The ISP is the administrative framework for Informatica services. It allows monitoring, starting/stopping, and failover of services, and it is extensible through plug-ins for additional services. ISP supports a single point of administration and allows integration with LDAP/Active Directory. It provides a set of core services used internally, such as the Authentication Service, Name Service, and so on.
  • Informatica Modeling Framework (IMF): Supports serialization and cross-language model exchange.
  • Model Repository Service (MRS): The MRS provides automated persistence of models. It looks at the model to determine how to persist objects, and the default persistence scheme can be customized. Repository capabilities are model agnostic. MRS provides metadata search and import/export, and it allows you to add model definitions dynamically, such as adapter meta-models.
  • Data Integration Service (DIS): DIS is the container for all data integration functionality. DIS plug-ins provide the different data integration functionalities:
    •  The Profiling Service plug-in translates profiles into mappings
    •  The SQL Service plug-in translates SQL into mappings
    •  The Mapping Service executes data quality plans
It provides common services to its plug-ins: request dispatch, thread pooling, and so on. And, it also provides mapping execution using embedded Data Transformation Manager (DTM).
  • Data Transformation Manager (DTM): The DTM interprets and executes mappings. It has been re-engineered to be lightweight and embeddable. It allows repository-less execution: in previous versions of PowerCenter, the DTM connected to the repository to fetch the plan, but now a mapping can be run without reading from the repository, such as a mapping stored to disk. The DTM enables concurrent execution semantics: multiple DTM instances can coexist in the same process. It supports logical transforms, which are translated to executable transforms before execution. It allows different execution modes: single-threaded and pipelined. The DTM also comes up with an optimal execution plan for mappings.
  • Web Services: Informatica 9.1 has native Web Services capability in Informatica Data Services. You can publish logical data objects and transformations as Web Service endpoints. You can also consume externally or internally provided web services as part of a SQL DS, a Web Service, or a Data Quality mapping.
This is just an overview of various components used in revamped architecture. If you need more details on these components or have any questions, feel free to leave comments.
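The single-threaded versus pipelined execution modes mentioned for the DTM can be illustrated with a Python generator pipeline: each stage pulls rows from the previous stage one at a time, so stages overlap instead of each stage materializing its full output before the next begins. This is only an analogy under invented transformation names, not DTM internals.

```python
def source():
    # Stand-in source qualifier: emit rows one at a time.
    for i in range(5):
        yield {"id": i, "amount": i * 10}

def expression(rows):
    # Stand-in expression transformation: derive a new column.
    for row in rows:
        row["amount_with_tax"] = row["amount"] * 1.1
        yield row

def keep_positive(rows):
    # Stand-in filter transformation: keep rows meeting a condition.
    for row in rows:
        if row["amount"] > 0:
            yield row

# Pipelined: each row flows through all stages before the next is read
# from the source, rather than each stage finishing before the next starts.
target = list(keep_positive(expression(source())))
```

A single-threaded, stage-at-a-time mode would instead be the equivalent of wrapping each stage's output in `list(...)` before passing it on.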

POWERCENTER CLIENT TOOLS

POWERCENTER CLIENT


The Power Center Client consists of the following applications that we use to manage the repository, design mappings and mapplets, and create sessions to load the data:

  1. Designer
  2. Repository Manager
  3. Workflow Manager
  4. Workflow Monitor
1. Designer:
Use the Designer to create mappings that contain transformation instructions for the Integration Service.
The Designer has the following tools that you use to analyze sources, design target Schemas, and build source-to-target mappings:
  •  Source Analyzer: Import or create source definitions.
  •  Target Designer: Import or create target definitions.
  •  Transformation Developer: Develop transformations to use in mappings. You can also develop user-defined functions to use in expressions.
  •  Mapplet Designer: Create sets of transformations to use in mappings.
  •  Mapping Designer: Create mappings that the Integration Service uses to Extract, transform, and load data.


2. Repository Manager:
Use the Repository Manager to administer repositories. You can navigate through multiple folders and repositories, and complete the following tasks:
  • Manage users and groups: Create, edit, and delete repository users and user groups. We can assign and revoke repository privileges and folder permissions.
  • Perform folder functions: Create, edit, copy, and delete folders. Work we perform in the Designer and Workflow Manager is stored in folders. If we want to share metadata, we can configure a folder to be shared.
  • View metadata: Analyze sources, targets, mappings, and shortcut dependencies, search by keyword, and view the properties of repository objects. We create repository objects using the Designer and Workflow Manager client tools.
We can view the following objects in the Navigator window of the Repository Manager:
  • Source definitions: Definitions of database objects (tables, views, synonyms) or files that provide source data.
  • Target definitions: Definitions of database objects or files that contain the target data.
  • Mappings: A set of source and target definitions along with transformations containing business logic that you build into the transformation. These are the instructions that the Integration Service uses to transform and move data.
  • Reusable transformations: Transformations that we use in multiple mappings.
  • Mapplets: A set of transformations that you use in multiple mappings.
  • Sessions and workflows: Sessions and workflows store information about how and when the Integration Service moves data. A workflow is a set of instructions that describes how and when to run tasks related to extracting, transforming, and loading data. A session is a type of task that you can put in a workflow. Each session corresponds to a single mapping.

3. Workflow Manager:
Use the Workflow Manager to create, schedule, and run workflows. A workflow is a set of instructions that describes how and when to run tasks related to extracting, transforming, and loading data.
The Workflow Manager has the following tools to help us develop a workflow:
  • Task Developer: Create tasks we want to accomplish in the workflow.
  • Worklet Designer: Create a worklet in the Worklet Designer. A worklet is an object that groups a set of tasks. A worklet is similar to a workflow, but without scheduling information. We can nest worklets inside a workflow.
  • Workflow Designer: Create a workflow by connecting tasks with links in the Workflow Designer. You can also create tasks in the Workflow Designer as you develop the workflow.
When we create a workflow in the Workflow Designer, we add tasks to the workflow. The Workflow Manager includes tasks, such as the Session task, the Command task, and the Email task so you can design a workflow. The Session task is based on a mapping we build in the Designer.
We then connect tasks with links to specify the order of execution for the tasks we created. Use conditional links and workflow variables to create branches in the workflow.
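A workflow's tasks and conditional links can be pictured as a small graph. The sketch below is a hypothetical illustration, not Workflow Manager syntax: it runs tasks from a start point and follows a link only when that link's condition evaluates to true against the results so far.

```python
def run_workflow(tasks, links):
    """Execute tasks starting from 'start', following each link only
    when its condition (a function of the accumulated results) holds.
    Task and link structure here is invented for illustration."""
    results, executed, current = {}, [], "start"
    while current:
        results[current] = tasks[current](results)
        executed.append(current)
        # Follow the first outgoing link whose condition is satisfied.
        current = next(
            (dst for (src, dst, cond) in links
             if src == current and cond(results)),
            None,
        )
    return executed

tasks = {
    "start": lambda r: True,
    "s_session": lambda r: {"rows_loaded": 100},  # a Session task
    "email_ok": lambda r: "mail sent",            # an Email task
}
links = [
    ("start", "s_session", lambda r: True),
    # Conditional link: send the email only if the session loaded rows.
    ("s_session", "email_ok",
     lambda r: r["s_session"]["rows_loaded"] > 0),
]
order = run_workflow(tasks, links)
```

Branching in a real workflow works the same way: a failure-notification task would simply hang off a second link whose condition tests for zero rows or a failed session status.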


4. Workflow Monitor:
Use the Workflow Monitor to monitor scheduled and running workflows for each Integration Service. We can view details about a workflow or task in Gantt chart view or Task view. We can run, stop, abort, and resume workflows from the Workflow Monitor, and view session and workflow log events in the Workflow Monitor Log Viewer.
The Workflow Monitor displays workflows that have run at least once. It continuously receives information from the Integration Service and Repository Service, and it also fetches information from the repository to display historic information.

What is Informatica?

Informatica Power Center is a powerful ETL tool from Informatica Corporation.

Informatica Corporation products are:
  • Informatica Power Center
  • Informatica on demand
  • Informatica B2B Data Exchange
  • Informatica Data Quality
  • Informatica Data Explorer
Informatica Power Center is a single, unified enterprise data integration platform for accessing, discovering, and integrating data from virtually any business system, in any format, and delivering that data throughout the enterprise at any speed.

Informatica Power Center Editions :
Because every data integration project is different, with many variables such as data volumes, latency requirements, IT infrastructure, and methodologies, Informatica offers three Power Center Editions and a suite of Power Center Options to meet your project's and organization's specific needs.
  • Standard Edition
  • Real Time Edition
  • Advanced Edition
Informatica Power Center Standard Edition:
Power Center Standard Edition is a single, unified enterprise data integration platform for discovering, accessing, and integrating data from virtually any business system, in any format, and delivering that data throughout the enterprise to improve operational efficiency.
Key features include:
  • A high-performance data integration server
  • A global metadata infrastructure
  • Visual tools for development and centralized administration
  • Productivity tools to facilitate collaboration among architects, analysts, and developers
Informatica Power Center Real Time Edition :
Packaged for simplicity and flexibility, Power Center Real Time Edition extends Power Center Standard Edition with additional capabilities for integrating and provisioning transactional or operational data in real time. Power Center Real Time Edition provides the ideal platform for developing sophisticated data services and delivering timely information as a service, to support all business needs. It provides the perfect real-time data integration complement to service-oriented architectures and application integration approaches such as enterprise application integration (EAI), enterprise service buses (ESB), and business process management (BPM).
Key features include:
  • Change data capture for relational data sources
  • Integration with messaging systems
  • Built-in support for Web services
  • Dynamic partitioning with data smart parallelism
  • Process orchestration and human workflow capabilities
Informatica Power Center Advanced Edition:
Power Center Advanced Edition addresses requirements for organizations that are standardizing data integration at an enterprise level, across a number of projects and departments. It combines all the capabilities of Power Center Standard Edition and features additional capabilities that are ideal for data governance and Integration Competency Centers.
Key features include:
  • Dynamic partitioning with data smart parallelism
  • Powerful metadata analysis capabilities
  • Web-based data profiling and reporting capabilities
Power Center includes the following components:
  • Power Center domain
  • Administration Console
  • Power Center repository
  • Power Center Client
  • Repository Service
  • Integration Service
  • Web Services Hub
  • SAP BW Service
  • Data Analyzer
  • Metadata Manager
  • Power Center Repository Reports