Saturday, January 26, 2013

SUM for strings in Oracle

At times I have needed in an aggregate function to build a list of varchar values in single row inside a group. Something like SUM aggregate function but for strings. I wanted to have for each group a list of strings separated by defined delimiter.
To illustrate this I create NEWBORN table
CREATE TABLE newborn (
  birthday    DATE 
, name        VARCHAR2(32)  
)

It contains name and date of the birth of fictional newborns. Let's fill it up.
BEGIN
  INSERT INTO newborn VALUES (DATE '2012-01-10', 'Andrey');
  INSERT INTO newborn VALUES (DATE '2012-01-19', 'Kira');
  INSERT INTO newborn VALUES (DATE '2012-03-02', 'Anastasia');
  INSERT INTO newborn VALUES (DATE '2012-03-15', 'Ivan');
  INSERT INTO newborn VALUES (DATE '2012-03-21', 'Andrey');
  INSERT INTO newborn VALUES (DATE '2012-04-01', 'Sergey');
  INSERT INTO newborn VALUES (DATE '2012-04-17', 'Roman');
  INSERT INTO newborn VALUES (DATE '2012-04-22', 'Sergey');
 
  COMMIT;
END;
 
So I want to know what names have been used in each month. Something like this
MONTH     LIST_OF_NAMES
-----     -------------
April     Roman, Sergey, Sergey
January   Andrey, Kira
March     Anastasia, Andrey, Ivan

Before Oracle 11gR2 it was not trivial task. To implement such functionality we could
  • create user-defined aggregate function
  • use undocumented function WM_CONCAT
  • use some "exotic" ways based on Oracle SQL features
In Oracle 11R2 LISTAGG function was introduced. So let's start with it and then describe other solutions.

Note 

Varchars in Oracle SQL have length restriction - 4000 symbols. So keep in mind that it's also restriction for a result obtained in any way described below.

LISTAGG function

The minimal form of LISTAGG is
LASTAGG (expr) WITHIN GROUP (order_by_expr)

expr is any expression returns varchar (or converted to varchar implicitly) value. order_by_expr is ORDER BY clause to sort values in returned list.

To implement what I talk about above write the following query
SELECT TO_CHAR(birthday, 'Month') AS MONTH
     , LISTAGG(name) WITHIN GROUP (ORDER BY name) AS list_of_names
  FROM newborn
 GROUP BY TO_CHAR(birthday, 'Month')

The result
MONTH     LIST_OF_NAMES
-------   -------------
April     RomanSergeySergey
January   AndreyKira
March     AnastasiaAndreyIvan

Obviously some delimiter should be to separate names in the list.

To achieve it just use the second parameter of LISTAGG where define a symbol-delimiter.
SELECT TO_CHAR(birthday, 'Month') AS MONTH
     , LISTAGG(name, ', ') WITHIN GROUP (ORDER BY name) AS list_of_names
  FROM newborn
 GROUP BY TO_CHAR(birthday, 'Month')

The result
MONTH     LIST_OF_NAMES
-----     -------------
April     Roman, Sergey, Sergey
January   Andrey, Kira
March     Anastasia, Andrey, Ivan

It's not all. Also LISTAGG is analytic function. To use it as analytic one just add OVER operator and describe needed analytic clause. Unfortunately in analytic clause just PARTITION BY can be used.

For example, to show what names were used in the same month for each newborn we can write
SELECT birthday
     , name
     , LISTAGG(name, ', ') WITHIN GROUP (ORDER BY name) over (PARTITION BY TO_CHAR(birthday, 'Month')) AS list_of_names
  FROM newborn
 ORDER BY birthday

The result
BIRTHDAY NAME        LIST_OF_NAMES
--------   ----        -------------
10.01.2012 Andrey      Andrey, Kira
19.01.2012 Kira        Andrey, Kira
02.03.2012 Anastasia   Anastasia, Andrey, Ivan
15.03.2012 Ivan        Anastasia, Andrey, Ivan
21.03.2012 Andrey      Anastasia, Andrey, Ivan
01.04.2012 Sergey      Roman, Sergey, Sergey
17.04.2012 Roman       Roman, Sergey, Sergey
22.04.2012 Sergey      Roman, Sergey, Sergey

Disadvantage

One thing I regret about is that LISTAGG function does not have an ability to make its result to be unique. Kind of LISTAGG(DISTINCT name)...

And a bit about performance... Here LISTAGG's performance is compared with performance of others possible implementations of SUM for strings function. LISTAGG shew itself as "the fastest technique for string aggregation".

User-defined aggregate function

Before Oracle 11R2 the most recommended way to implement SUM for strings is using of user-defined aggregate function.

All what needs is create a type which implements some methods used in any aggregate function and a function based on this type used to call from SQL statements.

Any aggregate function performs the following four operations in their work
  • initialization (ODCIAggregateInitialize method) - It's static factory method - here an instance of the type is created and returned. Created object is also named as aggregation context. The context is used in all following computations. This step is done just once before all computations.
  • iteration (ODCIAggregateIterate method) - Here there are current aggregation context and the value is currently inputed into aggregation function. The context could be updated depending on the value and then is returned. This step is invoked for every not-null value.
  • merging (ODCIAggregateMerge method) - Here two aggregation context are "gummed up" and one merged context is returned. It's optional step but it's needed, for example, when calculations are made in parallel mode - results of each process should be merged in one.
  • termination (ODCIAggregateTerminate method) - Here any final actions are performed, such as, extra calculations with the context, any cleanups, etc. And the context is returned as completed result of aggregation computation. It's last step and is done just once after all.
All ODCIAggregate functions return number value: ODCIConst.Success on success, or ODCIConst.Error on error.

So a type with implementation of operations described above should be created.

Then we should declare aggregate function with single input parameter and link it with created type. It means that in the declaration AGGREGATE USING clause should be added. After that the function can be used as usual aggregate function, like SUM, AVG, COUNT, etc.

As example I want to create SUMSTR function.

First, I create T_SUMSTR type and its body
CREATE OR REPLACE TYPE t_sumstr AS object (
  str    VARCHAR2(4000)
, delimiter  VARCHAR2(4)  
 
, static FUNCTION ODCIAggregateInitialize (ctx  IN OUT t_sumstr)
    RETURN NUMBER 
 
, member FUNCTION ODCIAggregateIterate (self    IN OUT t_sumstr
                                      , newStr  IN VARCHAR2)
    RETURN NUMBER
 
, member FUNCTION ODCIAggregateMerge(self  IN OUT t_sumstr
                                   , ctx2  IN t_sumstr)
    RETURN NUMBER
 
, member FUNCTION ODCIAggregateTerminate (self       IN t_sumstr
                                        , returnStr  OUT VARCHAR2
                                        , flag       IN NUMBER)
    RETURN NUMBER
);
 
 
CREATE OR REPLACE TYPE BODY t_sumstr 
IS 
 
  static FUNCTION ODCIAggregateInitialize (ctx  IN OUT t_sumstr)
    RETURN NUMBER 
  IS 
  BEGIN
    ctx := t_sumstr(NULL, ', ');
    RETURN ODCIConst.Success;
  END;
 
  member FUNCTION ODCIAggregateIterate (self    IN OUT t_sumstr
                                      , newStr  IN VARCHAR2)
    RETURN NUMBER
  IS 
  BEGIN
    self.str := self.str || newStr || self.delimiter;
    RETURN ODCIConst.Success;
  END;
 
  member FUNCTION ODCIAggregateMerge(self  IN OUT t_sumstr
                                   , ctx2  IN t_sumstr)
    RETURN NUMBER
  IS 
  BEGIN
    self.str := self.str || ctx2.str;
    RETURN ODCIConst.Success;
  END;
 
  member FUNCTION ODCIAggregateTerminate (self       IN t_sumstr
                                        , returnStr  OUT VARCHAR2
                                        , flag       IN NUMBER)
    RETURN NUMBER
  IS 
  BEGIN
    returnStr := RTRIM(self.str, self.delimiter);
    RETURN ODCIConst.Success;
  END;
 
END;

Some comments concerning T_SUMSTR type. Inside I declare STR and DELIMITER fields. STR field accumulates all values in single string (in ODCIAggregateIterate method). DELIMITER field contains a symbol used as delimiter in result string. I set the delimiter when initialize the aggregate context (ODCIAggregateInitialize method). And I use this predefined delimiter through other methods.

Second, I create SUMSTR function and link it with T_SUMSTR type
CREATE OR REPLACE FUNCTION sumstr (str  IN VARCHAR2)
  RETURN VARCHAR2
  deterministic
  parallel_enable
  aggregate using t_sumstr;

Now let's check it out. I use STRSUM function instead of LISTAGG from the previous queries
SELECT TO_CHAR(birthday, 'Month') AS MONTH
     , SUMSTR(name) AS list_of_names
  FROM newborn
 GROUP BY TO_CHAR(birthday, 'Month')

The result
MONTH     LIST_OF_NAMES
-----     -------------
April     Sergey, Sergey, Roman
January   Andrey, Kira
March     Anastasia, Andrey, Ivan

SUMSTR function also supports an ability to make the result list to be unique
SELECT TO_CHAR(birthday, 'Month') AS MONTH
     , SUMSTR(DISTINCT name) AS list_of_names
  FROM newborn
 GROUP BY TO_CHAR(birthday, 'Month')

The result
MONTH     LIST_OF_NAMES
-----     -------------
April     Sergey, Roman
January   Andrey, Kira
March     Anastasia, Andrey, Ivan

And SUMSTR function can be used as analytic one. Without any restriction in analytic clause (unlike LISTAGG function).
SELECT birthday
     , name
     , SUMSTR(name) over (PARTITION BY TO_CHAR(birthday, 'Month') ORDER BY name RANGE BETWEEN unbounded preceding AND unbounded following) AS list_of_names
  FROM newborn
 ORDER BY birthday

The result
BIRTHDAY NAME        LIST_OF_NAMES
--------   ----        -------------
10.01.2012 Andrey      Andrey, Kira
19.01.2012 Kira        Andrey, Kira
02.03.2012 Anastasia   Anastasia, Andrey, Ivan
15.03.2012 Ivan        Anastasia, Andrey, Ivan
21.03.2012 Andrey      Anastasia, Andrey, Ivan
01.04.2012 Sergey      Roman, Sergey, Sergey
17.04.2012 Roman       Roman, Sergey, Sergey
22.04.2012 Sergey      Roman, Sergey, Sergey

Disadvantage

The main thing I don't have enough is only one input parameter in aggregate function. Because of it I can't expand abilities of SUMSTR. For example, to pass desired delimiter or sort order.
As workaround Tom Kyte advices to use sys_context, search here by 'sys_context'.

WM_CONCAT function

Often to achieve SUM for strings behaviour WMSYS.WM_CONCAT function is advised (before 11gR2 of course). But be careful it's undocumented! It's strongly not recommended to use in your production code. Because it's sintax or functionality might be changed in next Oracle versions. For example, in one of minor releases of 10g the returned type was substituted from VARCHAR2 to CLOB.

WM_CONCAT's source code is wrapped, but I suspect it's identical with SUMSTR user-defined function we have created in the previous section. Just as delimiter ',' is used (we used ', ').
Example of using
SELECT TO_CHAR(birthday, 'Month') AS MONTH
     , WMSYS.WM_CONCAT(name) AS list_of_names
  FROM newborn
 GROUP BY TO_CHAR(birthday, 'Month')

The result
MONTH     LIST_OF_NAMES
-----     -------------
April     Sergey, Sergey, Roman
January   Andrey, Kira
March     Anastasia, Andrey, Ivan

And one more time - I would not use WM_CONCAT in product code, the best way is create your own user-defined aggregate function. Nevertheless, nothing prevents to use WM_CONCAT in single-use queries.

Another Oracle SQL tricks to get SUM for strings

Here I show some ways to implement SUM for strings functionality using plain Oracle SQL. It's more like puzzles for the mind than to the recommendations for use. Just to demonstrate the power of Oracle SQL.

I have found all these ideas on different forums and articles and transformed them to use example NEWBORN table.

All queries bellow return the same result (order of strings can be different)
MONTH   LIST_OF_NAMES
-----     -------------
April     Roman, Sergey, Sergey
January   Andrey, Kira
March     Anastasia, Andrey, Ivan

Using hierarchical queries and rownumber analytic function

SELECT t.MONTH
     , LTRIM(SYS_CONNECT_BY_PATH(t.name, ', '), ', ') AS list_of_names
  FROM (SELECT TO_CHAR(birthday, 'Month') AS MONTH
             , name
             , ROW_NUMBER() over (PARTITION BY TO_CHAR(birthday, 'Month') ORDER BY name) AS rn
          FROM newborn) t
 WHERE connect_by_isleaf = 1
 START WITH t.rn = 1
 CONNECT BY PRIOR t.rn = (t.rn -1) 
     AND PRIOR t.MONTH = t.MONTH
 ORDER BY t.MONTH

Using XML related functions

SELECT TO_CHAR(birthday, 'Month') AS MONTH
     , RTRIM(TO_CHAR(xmlcast(XMLAGG(XMLELEMENT("name", name || ', ')) AS clob)), ', ') AS list_of_names
  FROM newborn
 GROUP BY TO_CHAR(birthday, 'Month')
 ORDER BY MONTH

Using MODEL clause

SELECT t.MONTH
     , RTRIM(t.names, ', ') AS list_of_names
  FROM (SELECT *
          FROM newborn
         model 
           PARTITION BY (TO_CHAR(birthday, 'Month') AS MONTH)
           dimension BY (ROW_NUMBER() over (PARTITION BY TO_CHAR(birthday, 'Month') ORDER BY name) AS rn)
           measures (name AS names)
           rules (names[ANY] ORDER BY rn DESC = names[cv()] || ', ' || names[cv() + 1])
       ) t
 WHERE t.rn = 1
 ORDER BY t.MONTH

Code snippets have been tested on Oracle Database 11g Release 11.2.0.1.0

No comments:

Post a Comment