Aggregator Transformation



  • Active and Connected.
  • The Aggregator transformation performs aggregate calculations, such as averages and sums.
  • The Aggregator transformation uses an aggregate cache to perform aggregations.
  • What is the difference between the Aggregator and Expression transformations: the Aggregator transformation performs calculations on groups of rows, while the Expression transformation permits us to perform calculations on a row-by-row basis only.
  • What is the incremental aggregation option:
    After we have used the Aggregator transformation in our mapping, we can enable the session option Incremental Aggregation. Note that this is set through the session properties, so it can be enabled only at the session level, not in the mapping. When the Integration Service performs incremental aggregation, it passes new source data through the mapping and uses historical cache data to perform new aggregation calculations incrementally.
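Conceptually, incremental aggregation can be pictured with a small Python sketch (an illustration of the idea only, not Informatica internals): a persisted cache holds per-group running state, and each session run folds only the new source rows into it instead of re-reading all history.

```python
# Conceptual sketch of incremental aggregation: the historical cache holds
# per-group running state (sum, count), and each run processes only the new
# source rows, deriving the aggregate (here AVG) from the updated state.

def incremental_aggregate(cache, new_rows):
    """cache: {group_key: [running_sum, running_count]}; new_rows: (key, amount)."""
    for key, amount in new_rows:
        state = cache.setdefault(key, [0, 0])
        state[0] += amount   # update the running sum
        state[1] += 1        # update the running count
    # Derive the current aggregate values from the cached state.
    return {key: s / n for key, (s, n) in cache.items()}

cache = {}
# First run processes the initial load; the second run touches only new rows.
first = incremental_aggregate(cache, [("east", 10), ("east", 30), ("west", 5)])
second = incremental_aggregate(cache, [("west", 15)])
print(first)   # {'east': 20.0, 'west': 5.0}
print(second)  # {'east': 20.0, 'west': 10.0}
```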
  • Explain the components of the Aggregator transformation:

The Aggregator transformation has the following components and options:

  • Aggregate cache. The Integration Service stores data in the aggregate cache until it completes aggregate calculations. It stores group values in an index cache (when we are using group by) and row data in the data cache.

  • Aggregate expression. Enter an expression in an output port. Note that we need to create a new output port and then enter the expression; we cannot do that on an input/output port. The expression can include non-aggregate expressions and conditional clauses.

  • Group by port. Indicate how to create groups. The port can be any input, input/output, output, or variable port. When grouping data, the Aggregator transformation outputs the last row of each group unless otherwise specified.

  • Sorted input. Select this option to improve session performance. To use sorted input, we must pass data to the Aggregator transformation sorted by group by port, in ascending or descending order.
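The index/data cache split and the "last row of each group" behavior described above can be modeled with a short Python sketch (a hypothetical structure for illustration, not the actual cache layout):

```python
# Illustrative model of the aggregate cache: an "index cache" of group-by
# key values and a "data cache" of per-group row data. The Aggregator
# outputs the last row seen for each group alongside the aggregate result.

rows = [
    {"dept": "sales", "name": "Ann", "salary": 100},
    {"dept": "sales", "name": "Bob", "salary": 200},
    {"dept": "hr",    "name": "Cid", "salary": 150},
]

index_cache = set()   # group values (the group-by keys)
data_cache = {}       # per-group state: last row and running SUM(salary)

for row in rows:
    key = row["dept"]
    index_cache.add(key)
    state = data_cache.setdefault(key, {"last_row": None, "sum": 0})
    state["last_row"] = row          # the last row of the group is retained
    state["sum"] += row["salary"]

# One output row per group: the last row's ports plus the aggregate result.
output = {k: (v["last_row"]["name"], v["sum"]) for k, v in data_cache.items()}
print(output)  # {'sales': ('Bob', 300), 'hr': ('Cid', 150)}
```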


Explain Aggregator Transformation properties:



Cache Directory
    Local directory where the Integration Service creates the index and data cache files. Default: $PMCacheDir. If we have enabled incremental aggregation, the Integration Service creates a backup of the files each time we run the session, so the cache directory must contain enough disk space for two sets of the files.

Tracing Level
    Amount of detail displayed in the session log for this transformation.

Sorted Input
    Indicates the input data is presorted by groups. Select this option only if the mapping passes sorted data to the Aggregator transformation.

Aggregator Data Cache Size
    Data cache size for the transformation. The default cache size is 2,000,000 bytes. We can configure the Integration Service to determine the cache size at run time, or we can configure a numeric value. If the Integration Service determines the cache size, we can also configure a maximum amount of memory for it to allocate to the cache.

Aggregator Index Cache Size
    Index cache size for the transformation. The default cache size is 1,000,000 bytes. It is configured in the same way as the data cache size.

Transformation Scope
    Specifies how the Integration Service applies the transformation logic.
    Transaction: applies the transformation logic to all rows in a transaction. Choose Transaction when a row of data depends on all rows in the same transaction but does not depend on rows in other transactions. Transaction commit points can be created using the Transaction Control transformation.
    All Input: applies the transformation logic to all incoming data. When we choose All Input, PowerCenter drops incoming transaction boundaries. Choose All Input when a row of data depends on all rows in the source. This is generally the appropriate setting.

All the above properties can be further overridden at the session level.




  • How does the Aggregator use the aggregate cache:

  1. The Integration Service creates index and data caches in memory to process the transformation. If it requires more space, it stores overflow values in cache files. Note that cache files are created only when more space is required.
  2. We can configure the index and data caches in the Aggregator transformation or in the session properties.
  3. Why is no cache required for sorted input:
    The Integration Service uses memory to process an Aggregator transformation with sorted ports; it does not use cache memory. We do not need to configure cache memory for Aggregator transformations that use sorted ports.


  • Explain Aggregate expressions:
  1. The Designer allows aggregate expressions only in the Aggregator transformation. An aggregate expression can include conditional clauses and non-aggregate functions.
  2. It can also include one aggregate function nested within another aggregate function, such as: MAX( COUNT( ITEM ))
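The nested example MAX( COUNT( ITEM )) means: the inner COUNT runs per group, and the outer MAX runs over the per-group results. In Python terms (illustrative only):

```python
from collections import Counter

# MAX( COUNT( ITEM )): count the occurrences of each item (the inner,
# per-group aggregate), then take the maximum of those counts (the outer
# aggregate over the per-group results).
items = ["pen", "pen", "ink", "pen", "ink", "pad"]

counts = Counter(items)        # inner aggregate: COUNT(ITEM) per item
result = max(counts.values())  # outer aggregate: MAX over those counts
print(result)  # "pen" occurs 3 times, so the result is 3
```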

  • What Aggregate functions can be used:

  1. The following functions can be used with the Aggregator:

AVG
COUNT
FIRST
LAST
MAX
MEDIAN
MIN
PERCENTILE
STDDEV
SUM
VARIANCE

  2. We can nest one aggregate function within another aggregate function.

  3. We can include multiple single-level functions or multiple nested functions in different output ports of an Aggregator transformation. However, we cannot mix single-level and nested functions in the same Aggregator transformation: if any output port contains a single-level function, no other port in that transformation can use a nested function. If we include both, the Designer marks the mapping or mapplet invalid, so if we need both single-level and nested functions, we should create separate Aggregator transformations.

  4. We can use conditional clauses in the aggregate expression to reduce the number of rows used in the aggregation. The conditional clause can be any clause that evaluates to TRUE or FALSE.

Ex: SUM( COMMISSION, COMMISSION > QUOTA )

  5. We can also use non-aggregate functions in the aggregate expression.

  6. When we configure the Integration Service, we can choose how it handles null values in aggregate functions: treat them as NULL or as zero. By default, the Integration Service treats null values as NULL in aggregate functions.

Define a default value for each port in the group to replace null input values. This allows the Integration Service to include null item groups in the aggregation.
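The conditional clause and null handling described above can be sketched in Python (an illustration of the semantics, not Informatica syntax): SUM( COMMISSION, COMMISSION > QUOTA ) sums only the rows where the condition is TRUE, and null commissions are either skipped (treated as NULL) or counted as zero, depending on the Integration Service setting.

```python
# SUM( COMMISSION, COMMISSION > QUOTA ): only rows where the conditional
# clause evaluates to TRUE take part in the aggregation. None models NULL.

rows = [
    {"commission": 500, "quota": 400},
    {"commission": 300, "quota": 400},   # fails the condition, excluded
    {"commission": None, "quota": 400},  # NULL commission
]

def conditional_sum(rows, nulls_as_zero=False):
    total = 0
    for r in rows:
        c = r["commission"]
        if c is None:
            # The null-handling choice: treat NULL as zero, or skip the row.
            c = 0 if nulls_as_zero else None
        if c is not None and c > r["quota"]:
            total += c
    return total

print(conditional_sum(rows))  # only the 500 row qualifies -> 500
```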


  • Explain how to use Group By Ports:
  1. The Aggregator transformation lets us define groups for aggregations, rather than performing the aggregation across all input data.
  2. To define a group for the aggregate expression, select the appropriate input, input/output, output, and variable ports in the Aggregator transformation. We can select multiple group by ports to create a new group for each unique combination of values.

  3. When we group values, the Integration Service produces one row for each group. If we do not group values, the Integration Service returns one row for all input rows. The Integration Service typically returns the last row of each group (or the last row received) with the result of the aggregation, unless we use a function such as FIRST.

  4. When we select multiple group by ports in the Aggregator transformation, the Integration Service uses port order to determine the order by which it groups.

  5. Use non-aggregate expressions in group by ports to modify or replace groups. For example, if we want to replace 'AAA battery' with 'battery' before grouping, we can create a new group by output port, named CORRECTED_ITEM, using the following expression:
IIF( ITEM = 'AAA battery', 'battery', ITEM )
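The effect of feeding a group by port from a non-aggregate expression can be illustrated in Python (the item names and quantities are made up for the example): the group key is rewritten before grouping, so 'AAA battery' rows fall into the 'battery' group.

```python
# Model of a group-by port fed by a non-aggregate expression:
# IIF( ITEM = 'AAA battery', 'battery', ITEM ) rewrites the group key
# before grouping, so 'AAA battery' rows land in the 'battery' group.

rows = [("AAA battery", 2), ("battery", 3), ("pen", 1)]

totals = {}
for item, qty in rows:
    corrected_item = "battery" if item == "AAA battery" else item  # the IIF
    totals[corrected_item] = totals.get(corrected_item, 0) + qty

print(totals)  # {'battery': 5, 'pen': 1}
```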


  • Explain how to Use Sorted Input:

  1. We can improve Aggregator transformation performance by using the sorted input option. When we use sorted input, the Integration Service assumes all data is sorted by group, and it performs aggregate calculations as it reads the rows of a group.
  2. Sorted input uses only memory and no cache files, which is why it benefits performance.
  3. This also prevents the Aggregator from having to wait for the entire source to be read first. When we do not use sorted input, the Integration Service still performs aggregate calculations as it reads, but since the data is not sorted, it must store data for each group until it reads the entire source to ensure all aggregate calculations are accurate.
  4. If we use sorted input and do not presort the data correctly, we receive unexpected results.
  5. Do not use sorted input if either of the following conditions is true:
    The aggregate expression uses nested aggregate functions.
    The session uses incremental aggregation. It is very important to note that when using incremental aggregation we cannot use sorted input.
  6. Data must be sorted in the following ways:
    By the Aggregator group by ports, in the order they appear in the Aggregator transformation.
    Using the same sort order configured for the session.
  7. For relational and file sources, use the Sorter transformation to sort data in the mapping before passing it to the Aggregator transformation.

If the session uses relational sources, we can also use the Number of Sorted Ports option in the Source Qualifier transformation to sort group by columns in the source database. Group by columns must be in the same order in both the Aggregator and Source Qualifier transformations.
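The sorted-input behavior can be sketched with Python's itertools.groupby, which has the same contract as the Aggregator with Sorted Input enabled: adjacent rows with the same key form one group, and each group's result is emitted as soon as the key changes, without caching the whole source.

```python
from itertools import groupby
from operator import itemgetter

# With sorted input the Aggregator can emit a group's result as soon as the
# group-by key changes. groupby makes the same assumption: the input must
# already be sorted by the group key, or repeated non-adjacent keys will
# (incorrectly) form separate groups.

rows = [("hr", 150), ("sales", 100), ("sales", 200)]  # presorted by dept

for dept, group in groupby(rows, key=itemgetter(0)):
    total = sum(salary for _, salary in group)
    print(dept, total)  # each group is emitted incrementally
```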

Note: We cannot add expressions to existing input-output ports. So we need to create new output or variable ports.

Tips:
Limit the number of connected input/output or output ports to reduce the amount of data the Aggregator transformation stores in the data cache.

