Teradata SQL: PDCR table join: can someone explain the difference in row counts?


I think this may be a continuation of an older discussion, but rather than carrying it on in a comment thread, I figured it does the answering pundit more justice to open it as a separate question.

I am trying to understand why these two queries return slightly different results and, more importantly, why one of them drops an important user. It is a simple report that pulls high-CPU users by database.

Version 2

SELECT b.objectdatabasename,
    a.username,
    CAST(SUM((a.AmpCPUTime (DEC(18,3))) + ZEROIFNULL(a.ParserCPUTime)) AS DECIMAL(18,3))
FROM
    pdcrinfo.dbqlogtbl a
    JOIN
        (
            SELECT queryid,
                logdate,
                MIN(objectdatabasename) AS objectdatabasename
            FROM pdcrinfo.dbqlobjtbl_hst
            WHERE objectdatabasename IN (
                    SELECT child
                    FROM dbc.children
                    WHERE parent = 'findb'
                    GROUP BY 1
                    )
            GROUP BY 1, 2
        ) b ON
            a.queryid = b.queryid
            AND a.logdate = b.logdate
            AND a.logdate BETWEEN x AND y
            AND b.logdate BETWEEN x AND y
GROUP BY 1, 2

This returns 3 more rows than the version below.

Version 1

SELECT b.objectdatabasename,
    a.username,
    CAST(SUM((a.AmpCPUTime (DEC(18,3))) + ZEROIFNULL(a.ParserCPUTime)) AS DECIMAL(18,3))
FROM
    pdcrinfo.dbqlogtbl a
    JOIN
        (
            SELECT queryid,
                logdate,
                MIN(objectdatabasename) AS objectdatabasename
            FROM pdcrinfo.dbqlobjtbl_hst
            GROUP BY 1, 2
        ) b ON
        a.queryid = b.queryid
        AND a.logdate = b.logdate
        AND a.logdate BETWEEN x AND y
        AND b.logdate BETWEEN x AND y
WHERE
    b.objectdatabasename IN
        (
            SELECT child
            FROM dbc.children
            WHERE parent = 'findb'
            GROUP BY 1
        )
GROUP BY 1, 2

The result looks something like this:

+------------+-----------+-----------+
|  Database  |   User    | Total CPU |
+------------+-----------+-----------+
| FinDB      | PSmith    | 500,000   |
| FinDB_B    | PROgers   | 600,000   |
| ClaimDB_CO | BCRPRDUsr | 700,000   |
+------------+-----------+-----------+

Version 1 is the legacy version that has been in use all along (well, an even less efficient form of it was in use), and it misses the user FinDB / PSmith / 500,000. I verified from the query IDs and logdates for PSmith that he really did work against FinDB, and yet he never showed up in the list produced by Version 1. I am sure I am missing something and am trying to understand what causes the discrepancy.
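
The spot check I mean is along these lines, a sketch assuming the standard DBQL column names (UserName on the log table, ObjectDatabaseName/ObjectTableName on the object table):

SELECT q.logdate,
    q.queryid,
    o.objectdatabasename,
    o.objecttablename
FROM pdcrinfo.dbqlogtbl q
JOIN pdcrinfo.dbqlobjtbl_hst o
    ON o.queryid = q.queryid
    AND o.logdate = q.logdate
WHERE q.username = 'PSmith'
    AND q.logdate BETWEEN '2016-01-01' AND '2016-01-11'
ORDER BY q.logdate, q.queryid;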

Version #1
This returns 3-4 fewer rows than Version 2.

    Explain SELECT b.objectdatabasename,
        a.username,
        CAST(SUM((a.AmpCPUTime (DEC(18,3))) + ZEROIFNULL(a.ParserCPUTime)) AS DECIMAL(18,3)) (TITLE '')
    FROM pdcrinfo.dbqlogtbl a
    JOIN
        (
            SELECT queryid,
                logdate,
                MIN(objectdatabasename) AS objectdatabasename
            FROM pdcrinfo.dbqlobjtbl_hst
            GROUP BY 1, 2
        ) b ON
            a.queryid = b.queryid
            AND a.logdate = b.logdate
            AND a.logdate BETWEEN '2016-01-01' AND '2016-01-11'
            AND b.logdate BETWEEN '2016-01-01' AND '2016-01-11'
    WHERE b.objectdatabasename IN
        (
            SELECT child
            FROM dbc.children
            WHERE parent = 'findb'
            GROUP BY 1
        )
    GROUP BY 1, 2
    ORDER BY 3 DESC, 2 ASC, 1 ASC;

 This query is optimized using type 2 profile insert-sel, profileid
 10001.
  1) First, we lock PDCRDATA.DBQLObjTbl_Hst for access, and we lock
     PDCRDATA.DBQLogTbl_Hst in view pdcrinfo.dbqlogtbl for access.
  2) Next, we lock DBC.dbase for access, and we lock DBC.owners for
     access.
  3) We do an all-AMPs SUM step to aggregate from 11 partitions of
     PDCRDATA.DBQLObjTbl_Hst with a condition of (
     "(PDCRDATA.DBQLObjTbl_Hst.LogDate >= DATE '2016-01-01') AND
     (PDCRDATA.DBQLObjTbl_Hst.LogDate <= DATE '2016-01-11')")
     , grouping by field1 ( PDCRDATA.DBQLObjTbl_Hst.QueryID
     ,PDCRDATA.DBQLObjTbl_Hst.LogDate).  Aggregate Intermediate Results
     are computed locally, then placed in Spool 3.  The input table
     will not be cached in memory, but it is eligible for synchronized
     scanning.  The size of Spool 3 is estimated with low confidence to
     be 44,305,297 rows (5,715,383,313 bytes).  The estimated time for
     this step is 8.52 seconds.
  4) We execute the following steps in parallel.
       1) We do an all-AMPs RETRIEVE step from Spool 3 (Last Use) by
          way of an all-rows scan into Spool 1 (used to materialize
          view, derived table, table function or table operator b)
          (all_amps) (compressed columns allowed), which is built
          locally on the AMPs.  The size of Spool 1 is estimated with
          low confidence to be 44,305,297 rows (5,316,635,640 bytes).
          The estimated time for this step is 0.78 seconds.
       2) We do an all-AMPs RETRIEVE step from DBC.dbase by way of an
          all-rows scan with a condition of (
          "(SUBSTRING((TRANSLATE((DBC.dbase.DatabaseName )USING
          UNICODE_TO_LOCALE WITH ERROR )) FROM (1) FOR (30 ))(CHAR(30),
          CHARACTER SET LATIN, NOT CASESPECIFIC))= 'findb '") into Spool
          9 (all_amps) (compressed columns allowed), which is
          redistributed by the hash code of (DBC.dbase.DatabaseId) to
          all AMPs.  Then we do a SORT to order Spool 9 by row hash.
          The size of Spool 9 is estimated with no confidence to be 348
          rows (5,916 bytes).  The estimated time for this step is 0.01
          seconds.
       3) We do an all-AMPs RETRIEVE step from DBC.dbase by way of an
          all-rows scan with no residual conditions locking for access
          into Spool 10 (all_amps) (compressed columns allowed), which
          is redistributed by the hash code of (DBC.dbase.DatabaseId)
          to all AMPs.  Then we do a SORT to order Spool 10 by row hash.
          The size of Spool 10 is estimated with high confidence to be
          3,478 rows (361,712 bytes).  The estimated time for this step
          is 0.01 seconds.
  5) We do an all-AMPs JOIN step from Spool 9 (Last Use) by way of a
     RowHash match scan, which is joined to DBC.owners by way of a
     RowHash match scan with no residual conditions.  Spool 9 and
     DBC.owners are joined using a merge join, with a join condition of
     ("DBC.owners.OwnerId = DatabaseId").  The result goes into Spool
     11 (all_amps) (compressed columns allowed), which is redistributed
     by the hash code of (DBC.owners.OwneeId) to all AMPs.  Then we do
     a SORT to order Spool 11 by row hash.  The size of Spool 11 is
     estimated with no confidence to be 10,450 rows (177,650 bytes).
     The estimated time for this step is 0.02 seconds.
  6) We do an all-AMPs JOIN step from Spool 10 (Last Use) by way of a
     RowHash match scan, which is joined to Spool 11 (Last Use) by way
     of a RowHash match scan.  Spool 10 and Spool 11 are joined using a
     merge join, with a join condition of ("OwneeId = DatabaseId").
     The result goes into Spool 8 (all_amps), which is redistributed by
     the hash code of (SUBSTRING((TRANSLATE((DBC.dbase.DatabaseName
     )USING UNICODE_TO_LOCALE WITH ERROR )) FROM (1) FOR (30
     ))(CHAR(30), CHARACTER SET LATIN, NOT CASESPECIFIC)) to all AMPs.
     Then we do a SORT to order Spool 8 by row hash and the sort key in
     spool field1 eliminating duplicate rows.  The size of Spool 8 is
     estimated with no confidence to be 3,478 rows (191,290 bytes).
     The estimated time for this step is 0.02 seconds.
  7) We do an all-AMPs RETRIEVE step from Spool 8 (Last Use) by way of
     an all-rows scan into Spool 12 (all_amps) (compressed columns
     allowed), which is duplicated on all AMPs.  The size of Spool 12
     is estimated with no confidence to be 1,752,912 rows (227,878,560
     bytes).  The estimated time for this step is 0.06 seconds.
  8) We do an all-AMPs JOIN step from Spool 1 (Last Use) by way of an
     all-rows scan with a condition of ("(b.LOGDATE <= DATE
     '2016-01-11') AND (b.LOGDATE >= DATE '2016-01-01')"), which is
     joined to Spool 12 (Last Use) by way of an all-rows scan.  Spool 1
     and Spool 12 are joined using a inclusion dynamic hash join, with
     a join condition of ("OBJECTDATABASENAME = (TRANSLATE((Field_2
     )USING LATIN_TO_UNICODE))").  The result goes into Spool 13
     (all_amps) (compressed columns allowed), which is redistributed by
     the rowkey of (PDCRDATA.DBQLObjTbl_Hst.LOGDATE,
     PDCRDATA.DBQLObjTbl_Hst.QUERYID) to all AMPs.  Then we do a SORT
     to partition Spool 13 by rowkey.  The size of Spool 13 is
     estimated with no confidence to be 3,865 rows (432,880 bytes).
     The estimated time for this step is 0.29 seconds.
  9) We do an all-AMPs JOIN step from 11 partitions of
     PDCRDATA.DBQLogTbl_Hst in view pdcrinfo.dbqlogtbl by way of a
     RowHash match scan with a condition of ("(PDCRDATA.DBQLogTbl_Hst
     in view pdcrinfo.dbqlogtbl.LogDate <= DATE '2016-01-11') AND
     (PDCRDATA.DBQLogTbl_Hst in view pdcrinfo.dbqlogtbl.LogDate >= DATE
     '2016-01-01')"), which is joined to Spool 13 (Last Use) by way of
     a RowHash match scan.  PDCRDATA.DBQLogTbl_Hst and Spool 13 are
     joined using a rowkey-based merge join, with a join condition of (
     "(PDCRDATA.DBQLogTbl_Hst.LogDate = LOGDATE) AND
     (PDCRDATA.DBQLogTbl_Hst.QueryID = QUERYID)").  The input table
     PDCRDATA.DBQLogTbl_Hst will not be cached in memory, but it is
     eligible for synchronized scanning.  The result goes into Spool 7
     (all_amps) (compressed columns allowed), which is built locally on
     the AMPs.  The size of Spool 7 is estimated with no confidence to
     be 3,816 rows (782,280 bytes).  The estimated time for this step
     is 0.03 seconds.
 10) We do an all-AMPs SUM step to aggregate from Spool 7 (Last Use) by
     way of an all-rows scan , grouping by field1 (
     PDCRDATA.DBQLObjTbl_Hst.Field_4 ,PDCRDATA.DBQLogTbl_Hst.UserName).
     Aggregate Intermediate Results are computed globally, then placed
     in Spool 15.  The size of Spool 15 is estimated with no confidence
     to be 3,478 rows (2,472,858 bytes).  The estimated time for this
     step is 0.02 seconds.
 11) We do an all-AMPs RETRIEVE step from Spool 15 (Last Use) by way of
     an all-rows scan into Spool 5 (group_amps), which is built locally
     on the AMPs.  Then we do a SORT to order Spool 5 by the sort key
     in spool field1 (SUM((PDCRDATA.DBQLogTbl_Hst.AMPCPUTime
     (DECIMAL(18,3)) )+
     (ZEROIFNULL(PDCRDATA.DBQLogTbl_Hst.ParserCPUTime
     )))(DECIMAL(18,3)), PDCRDATA.DBQLogTbl_Hst.UserName,
     PDCRDATA.DBQLObjTbl_Hst.Field_4).  The size of Spool 5 is
     estimated with no confidence to be 3,478 rows (2,201,574 bytes).
     The estimated time for this step is 0.01 seconds.
 12) Finally, we send out an END TRANSACTION step to all AMPs involved
     in processing the request.
  -> The contents of Spool 5 are sent back to the user as the result of
     statement 1.  The total estimated time is 9.75 seconds.


Version 2
This is the EXPLAIN for the version that picks up the missing user. Note the key difference from plan #1: here the findb filter is applied to DBQLObjTbl_Hst (the inclusion hash join in step 7) before the QueryID/LogDate aggregation in step 8, whereas plan #1 aggregates first (step 3) and filters afterwards (step 8).

    Explain SELECT b.objectdatabasename,
        a.username,
        CAST(SUM((a.AmpCPUTime (DEC(18,3))) + ZEROIFNULL(a.ParserCPUTime)) AS DECIMAL(18,3))
    FROM pdcrinfo.dbqlogtbl a
    JOIN
        (
            SELECT queryid,
                logdate,
                MIN(objectdatabasename) AS objectdatabasename
            FROM pdcrinfo.dbqlobjtbl_hst
            WHERE objectdatabasename IN
                (
                    SELECT child
                    FROM dbc.children
                    WHERE parent = 'findb'
                    GROUP BY 1
                )
            GROUP BY 1, 2
        ) b ON
            a.queryid = b.queryid
            AND a.logdate = b.logdate
            AND a.logdate BETWEEN '2016-01-01' AND '2016-01-11'
            AND b.logdate BETWEEN '2016-01-01' AND '2016-01-11'
    GROUP BY 1, 2
    ORDER BY 3 DESC, 1 ASC, 2 ASC;

 This query is optimized using type 2 profile insert-sel, profileid
 10001.
  1) First, we lock PDCRDATA.DBQLObjTbl_Hst for access, and we lock
     PDCRDATA.DBQLogTbl_Hst in view pdcrinfo.dbqlogtbl for access.
  2) Next, we lock DBC.dbase for access, and we lock DBC.owners for
     access.
  3) We execute the following steps in parallel.
       1) We do an all-AMPs RETRIEVE step from DBC.dbase by way of an
          all-rows scan with a condition of (
          "(SUBSTRING((TRANSLATE((DBC.dbase.DatabaseName )USING
          UNICODE_TO_LOCALE WITH ERROR )) FROM (1) FOR (30 ))(CHAR(30),
          CHARACTER SET LATIN, NOT CASESPECIFIC))= 'findb '") into Spool
          5 (all_amps) (compressed columns allowed), which is
          redistributed by the hash code of (DBC.dbase.DatabaseId) to
          all AMPs.  Then we do a SORT to order Spool 5 by row hash.
          The size of Spool 5 is estimated with no confidence to be 348
          rows (5,916 bytes).  The estimated time for this step is 0.01
          seconds.
       2) We do an all-AMPs RETRIEVE step from DBC.dbase by way of an
          all-rows scan with no residual conditions locking for access
          into Spool 6 (all_amps) (compressed columns allowed), which
          is redistributed by the hash code of (DBC.dbase.DatabaseId)
          to all AMPs.  Then we do a SORT to order Spool 6 by row hash.
          The size of Spool 6 is estimated with high confidence to be
          3,478 rows (361,712 bytes).  The estimated time for this step
          is 0.01 seconds.
  4) We do an all-AMPs JOIN step from Spool 5 (Last Use) by way of a
     RowHash match scan, which is joined to DBC.owners by way of a
     RowHash match scan with no residual conditions.  Spool 5 and
     DBC.owners are joined using a merge join, with a join condition of
     ("DBC.owners.OwnerId = DatabaseId").  The result goes into Spool 7
     (all_amps) (compressed columns allowed), which is redistributed by
     the hash code of (DBC.owners.OwneeId) to all AMPs.  Then we do a
     SORT to order Spool 7 by row hash.  The size of Spool 7 is
     estimated with no confidence to be 10,450 rows (177,650 bytes).
     The estimated time for this step is 0.02 seconds.
  5) We execute the following steps in parallel.
       1) We do an all-AMPs JOIN step from Spool 6 (Last Use) by way of
          a RowHash match scan, which is joined to Spool 7 (Last Use)
          by way of a RowHash match scan.  Spool 6 and Spool 7 are
          joined using a merge join, with a join condition of (
          "OwneeId = DatabaseId").  The result goes into Spool 4
          (all_amps), which is redistributed by the hash code of (
          SUBSTRING((TRANSLATE((DBC.dbase.DatabaseName )USING
          UNICODE_TO_LOCALE WITH ERROR )) FROM (1) FOR (30 ))(CHAR(30),
          CHARACTER SET LATIN, NOT CASESPECIFIC)) to all AMPs.  Then we
          do a SORT to order Spool 4 by row hash and the sort key in
          spool field1 eliminating duplicate rows.  The size of Spool 4
          is estimated with no confidence to be 3,478 rows (191,290
          bytes).  The estimated time for this step is 0.02 seconds.
       2) We do an all-AMPs RETRIEVE step from 11 partitions of
          PDCRDATA.DBQLObjTbl_Hst with a condition of (
          "(PDCRDATA.DBQLObjTbl_Hst.LogDate >= DATE '2016-01-01') AND
          (PDCRDATA.DBQLObjTbl_Hst.LogDate <= DATE '2016-01-11')") into
          Spool 8 (all_amps) (compressed columns allowed), which is
          built locally on the AMPs.  The input table will not be
          cached in memory, but it is eligible for synchronized
          scanning.  The size of Spool 8 is estimated with high
          confidence to be 109,751,471 rows (12,292,164,752 bytes).
          The estimated time for this step is 4.29 seconds.
  6) We do an all-AMPs RETRIEVE step from Spool 4 (Last Use) by way of
     an all-rows scan into Spool 9 (all_amps) (compressed columns
     allowed), which is duplicated on all AMPs.  The size of Spool 9 is
     estimated with no confidence to be 1,752,912 rows (227,878,560
     bytes).  The estimated time for this step is 0.06 seconds.
  7) We do an all-AMPs JOIN step from Spool 8 (Last Use) by way of an
     all-rows scan, which is joined to Spool 9 (Last Use) by way of an
     all-rows scan.  Spool 8 and Spool 9 are joined using a single
     partition inclusion hash join, with a join condition of (
     "ObjectDatabaseName = (TRANSLATE((Field_2 )USING
     LATIN_TO_UNICODE))").  The result goes into Spool 3 (all_amps)
     (compressed columns allowed), which is built locally on the AMPs.
     The size of Spool 3 is estimated with no confidence to be
     36,436,341 rows (4,153,742,874 bytes).  The estimated time for
     this step is 1.05 seconds.
  8) We do an all-AMPs SUM step to aggregate from Spool 3 (Last Use) by
     way of an all-rows scan , grouping by field1 (
     PDCRDATA.DBQLObjTbl_Hst.QueryID ,PDCRDATA.DBQLObjTbl_Hst.LogDate).
     Aggregate Intermediate Results are computed locally, then placed
     in Spool 11.  The size of Spool 11 is estimated with no confidence
     to be 36,436,341 rows (4,700,287,989 bytes).  The estimated time
     for this step is 3.10 seconds.
  9) We do an all-AMPs RETRIEVE step from Spool 11 (Last Use) by way of
     an all-rows scan into Spool 1 (used to materialize view, derived
     table, table function or table operator b) (all_amps) (compressed
     columns allowed), which is built locally on the AMPs.  The size of
     Spool 1 is estimated with no confidence to be 36,436,341 rows (
     4,372,360,920 bytes).  The estimated time for this step is 0.65
     seconds.
 10) We do an all-AMPs RETRIEVE step from Spool 1 (Last Use) by way of
     an all-rows scan with a condition of ("(b.LOGDATE <= DATE
     '2016-01-11') AND (b.LOGDATE >= DATE '2016-01-01')") into Spool 16
     (all_amps) (compressed columns allowed), which is redistributed by
     the rowkey of (PDCRDATA.DBQLObjTbl_Hst.QueryID,
     PDCRDATA.DBQLObjTbl_Hst.LogDate) to all AMPs.  Then we do a SORT
     to partition Spool 16 by rowkey.  The size of Spool 16 is
     estimated with no confidence to be 36,436,341 rows (4,080,870,192
     bytes).  The estimated time for this step is 3.86 seconds.
 11) We do an all-AMPs JOIN step from 11 partitions of
     PDCRDATA.DBQLogTbl_Hst in view pdcrinfo.dbqlogtbl by way of a
     RowHash match scan with a condition of ("(PDCRDATA.DBQLogTbl_Hst
     in view pdcrinfo.dbqlogtbl.LogDate <= DATE '2016-01-11') AND
     (PDCRDATA.DBQLogTbl_Hst in view pdcrinfo.dbqlogtbl.LogDate >= DATE
     '2016-01-01')"), which is joined to Spool 16 (Last Use) by way of
     a RowHash match scan.  PDCRDATA.DBQLogTbl_Hst and Spool 16 are
     joined using a rowkey-based merge join, with a join condition of (
     "(PDCRDATA.DBQLogTbl_Hst.QueryID = QUERYID) AND
     (PDCRDATA.DBQLogTbl_Hst.LogDate = LOGDATE)").  The input table
     PDCRDATA.DBQLogTbl_Hst will not be cached in memory, but it is
     eligible for synchronized scanning.  The result goes into Spool 15
     (all_amps) (compressed columns allowed), which is built locally on
     the AMPs.  The size of Spool 15 is estimated with no confidence to
     be 35,969,436 rows (7,373,734,380 bytes).  The estimated time for
     this step is 1.72 seconds.
 12) We do an all-AMPs SUM step to aggregate from Spool 15 (Last Use)
     by way of an all-rows scan , grouping by field1 (
     PDCRDATA.DBQLObjTbl_Hst.ObjectDatabaseName
     ,PDCRDATA.DBQLogTbl_Hst.UserName).  Aggregate Intermediate Results
     are computed globally, then placed in Spool 17.  The size of Spool
     17 is estimated with no confidence to be 6,175,740 rows (
     4,390,951,140 bytes).  The estimated time for this step is 1.61
     seconds.
 13) We do an all-AMPs RETRIEVE step from Spool 17 (Last Use) by way of
     an all-rows scan into Spool 13 (group_amps), which is built
     locally on the AMPs.  Then we do a SORT to order Spool 13 by the
     sort key in spool field1 (SUM((PDCRDATA.DBQLogTbl_Hst.AMPCPUTime
     (DECIMAL(18,3)) )+
     (ZEROIFNULL(PDCRDATA.DBQLogTbl_Hst.ParserCPUTime
     )))(DECIMAL(18,3)), PDCRDATA.DBQLObjTbl_Hst.ObjectDatabaseName,
     PDCRDATA.DBQLogTbl_Hst.UserName).  The size of Spool 13 is
     estimated with no confidence to be 6,175,740 rows (3,909,243,420
     bytes).  The estimated time for this step is 0.43 seconds.
 14) Finally, we send out an END TRANSACTION step to all AMPs involved
     in processing the request.
  -> The contents of Spool 13 are sent back to the user as the result
     of statement 1.  The total estimated time is 16.79 seconds.

1 Answer:

In #2 you filter by database name before the MIN, but in #1 after the MIN.

Assuming the query accessed, say, Bla_DB plus one of the FinDB databases in the tree below, subquery b in #1 returns 'Bla_DB' as the MIN (which the outer WHERE then filters out), while in #2 it returns 'FinDB':

  • dbc
    • sysdba
      • FinDB
        • FinDB_B
        • ClaimDB_CO
        • ...
      • ...
      • Bla_DB
    • ...
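
A minimal way to see the effect outside PDCR (the table contents below are made up for illustration): suppose a single query touched both Bla_DB and FinDB_B.

/* #1 style: aggregate first, filter after.
   MIN over the unfiltered objects is 'Bla_DB', so the outer
   filter throws the whole query away -- returns no rows. */
SELECT t.queryid, t.objectdatabasename
FROM (
    SELECT queryid, MIN(objectdatabasename) AS objectdatabasename
    FROM (SELECT 1001 AS queryid, 'Bla_DB' AS objectdatabasename
          UNION ALL
          SELECT 1001, 'FinDB_B') obj
    GROUP BY 1
) t
WHERE t.objectdatabasename IN ('FinDB', 'FinDB_B', 'ClaimDB_CO');

/* #2 style: filter first, aggregate after.
   'Bla_DB' is removed before the MIN -- returns (1001, 'FinDB_B'). */
SELECT queryid, MIN(objectdatabasename) AS objectdatabasename
FROM (SELECT 1001 AS queryid, 'Bla_DB' AS objectdatabasename
      UNION ALL
      SELECT 1001, 'FinDB_B') obj
WHERE objectdatabasename IN ('FinDB', 'FinDB_B', 'ClaimDB_CO')
GROUP BY 1;

So if the intent is "every query that touched any child of findb", the filter belongs inside the derived table, as in #2; with the filter outside, a query is only counted when the alphabetically smallest of all databases it touched happens to be a findb child.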