2014년 12월 30일 화요일

[impala-user] Impala 2.1 and stats

After upgraded to CDH 5.3 and Impala 2.1 I collected incremental statistics for table:

compute incremental stats table3
show table stats table3

Query: show table stats table3
+-----------+------------+-----------+--------+---------+--------------+---------+-------------------+
| part_year | part_month | #Rows     | #Files | Size    | Bytes Cached | Format  | Incremental stats |
+-----------+------------+-----------+--------+---------+--------------+---------+-------------------+
| 2014      | 12         | 0         | 2      | 7.20MB  | 0B           | PARQUET | true              |
| 2015      | 11         | 0         | 2      | 716B    | 0B           | PARQUET | true              |
| 2015      | 12         | 0         | 7      | 2.54KB  | 0B           | PARQUET | true              |
| 2016      | 11         | 0         | 4      | 7.31MB  | 0B           | PARQUET | true              |
| 2099      | 1          | 0         | 1      | 358B    | 0B           | PARQUET | true              |
| 2199      | 1          | 0         | 1      | 358B    | 0B           | PARQUET | true              |
| Total     |            | 134217758 | 17     | 14.51MB | 0B           |         |                   |
+-----------+------------+-----------+--------+---------+--------------+---------+-------------------+

There are Rows for all partitions = 0

Is it ok for new version?



Also explain query on this table:

WARNING: The following tables are missing relevant table and/or column statistics.
default.table3



certainly not expected. The "SHOW TABLE STATS" command should list the correct #rows. Also, the EXPLAIN plan should not issue a warning if stats have been computed.
Can you try running "drop stats table3" and then doing another incremental stats computation?
Does the regular "COMPUTE STATS" produce the same results?
drop stats table3 show table stats table3 +-----------+------------+-------+--------+---------+--------------+---------+-------------------+ | part_year | part_month | #Rows | #Files | Size    | Bytes Cached | Format  | Incremental stats |+-----------+------------+-------+--------+---------+--------------+---------+-------------------+ | 2014      | 12         | -1    | 2      | 7.20MB  | NOT CACHED   | PARQUET | false             | | 2015      | 11         | -1    | 2      | 716B    | NOT CACHED   | PARQUET | false             | | 2015      | 12         | -1    | 7      | 2.54KB  | NOT CACHED   | PARQUET | false             | | 2016      | 11         | -1    | 4      | 7.31MB  | NOT CACHED   | PARQUET | false             | | 2099      | 1          | -1    | 1      | 358B    | NOT CACHED   | PARQUET | false             | | 2199      | 1          | -1    | 1      | 358B    | NOT CACHED   | PARQUET | false             | | Total     |            | -1    | 17     | 14.51MB | 0B           |         |                   | +-----------+------------+-------+--------+---------+--------------+---------+-------------------+ compute incremental stats table3 show table stats table3 +-----------+------------+-----------+--------+---------+--------------+---------+-------------------+ | part_year | part_month | #Rows     | #Files | Size    | Bytes Cached | Format  | Incremental stats | +-----------+------------+-----------+--------+---------+--------------+---------+-------------------+| 2014      | 12         | 0         | 2      | 7.20MB  | NOT CACHED   | PARQUET | true              | | 2015      | 11         | 0         | 2      | 716B    | NOT CACHED   | PARQUET | true              | | 2015      | 12         | 0         | 7      | 2.54KB  | NOT CACHED   | PARQUET | true              | | 2016      | 11         | 0         | 4      | 7.31MB  | NOT CACHED   | PARQUET | true              | | 2099      | 1          | 0         | 1      | 358B    | NOT CACHED   | PARQUET | true              | | 2199      | 1          | 0         | 1      | 358B    | NOT CACHED   | PARQUET | true              | | Total     |            | 134217758 | 17     | 14.51MB | 0B           |         |                   | +-----------+------------+-----------+--------+---------+--------------+---------+-------------------+ drop stats table3 compute stats table3 show table stats table3 +-----------+------------+-----------+--------+---------+--------------+---------+-------------------+ | part_year | part_month | #Rows     | #Files | Size    | Bytes Cached | Format  | Incremental stats | +-----------+------------+-----------+--------+---------+--------------+---------+-------------------+| 2014      | 12         | 0         | 2      | 7.20MB  | NOT CACHED   | PARQUET | false             | | 2015      | 11         | 0         | 2      | 716B    | NOT CACHED   | PARQUET | false             | | 2015      | 12         | 0         | 7      | 2.54KB  | NOT CACHED   | PARQUET | false             | | 2016      | 11         | 0         | 4      | 7.31MB  | NOT CACHED   | PARQUET | false             | | 2099      | 1          | 0         | 1      | 358B    | NOT CACHED   | PARQUET | false             | | 2199      | 1          | 0         | 1      | 358B    | NOT CACHED   | PARQUET | false             |
Ok, good to know it's apparently not specific to "COMPUTE INCREMENTAL STATS".
How about "SELECT COUNT(*) FROM table3 GROUP BY pary_year, part_month"?
[node1:21000] > SELECT COUNT(*) FROM table3 GROUP BY part_year, part_month; Query: select COUNT(*) FROM table3 GROUP BY part_year, part_month +----------+ | count(*) | +----------+ | 2        | | 1        | | 26       | | 1        | | 67108866 | | 67108862 | +----------+ Fetched 6 row(s) in 5.22s
Interesting. Are those two last counts accurate? Lots of rows for such small files.
Are you sure you have upgraded the Statestored and Catalogd and the Hive Metastore? Did you upgrade through CM?
Counts are accurate. I have upgraded through CM. Tomorrow I will check version of Statestored and Catalogd.
Sounds good.
It will also be interesting to see the output of this command run in Hive:
hive> describe extended table3 partition (part_year=2014, part_month=12);
That will show what is stored in the Metastore (I'm looking for the 'numRows' key). That will help hopefully help us narrow down what is happening.
hive> describe extended table3 partition (part_year=2014, part_month=12);OK t1                      int t2                      timestamp part_year               smallint part_month              tinyint # Partition Information # col_name              data_type               comment part_year               smallint part_month              tinyint Detailed Partition Information  Partition(values:[2014, 12], dbName:default, tableName:table3, createTime:1418920079, lastAccessTime:0, sd:StorageDescriptor(cols:[FieldSchema(name:t1, type:int, comment:null), FieldSchema(name:t2, type:timestamp, comment:null)], location:hdfs://node1:8020/user/hive/warehouse/table3/part_year=2014/part_month=12, inputFormat:parquet.hive.DeprecatedParquetInputFormat, outputFormat:parquet.hive.DeprecatedParquetOutputFormat, compressed:false, numBuckets:0, serdeInfo:SerDeInfo(name:null, serializationLib:parquet.hive.serde.ParquetHiveSerDe, parameters:{}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), parameters:{numFiles=2, impala_intermediate_stats_chunk0=HBYAABsAAA==, transient_lastDdlTime=1418920079, COLUMN_STATS_ACCURATE=true, numRows=0, totalSize=7545894, impala_intermediate_stats_num_chunks=1, rawDataSize=-1}) Time taken: 1.239 seconds, Fetched: 12 row(s)
I checked versions on ports 25010, 25020:

statestored version 2.1.0-cdh5 RELEASE (build e48c2b48c53ea9601b8f47a39373aa83ff7ca6e2)
Built on Tue, 16 Dec 2014 19:26:45 PST

catalogd version 2.1.0-cdh5 RELEASE (build e48c2b48c53ea9601b8f47a39373aa83ff7ca6e2)
Built on Tue, 16 Dec 2014 19:26:45 PST







Thanks for checking and running those commands.
The numRows=0 in the describe output shows that the values in the Metastore and what you see in Impala are at least consistent.
Can you alter the numRows manually in Impala with:
ALTER TABLE table3 PARTITION(year=2014.month=12) SET TBLPROPERTIES('numRows'='10');
Also you may want to inspect the catalogd logs after running a COMPUTE STATS to get more information on what could be going on.
Happy New Year!
I have done your commands and some strange... compute incremental stats table3 show table stats table3 +-----------+------------+-----------+--------+---------+--------------+---------+-------------------+ | part_year | part_month | #Rows     | #Files | Size    | Bytes Cached | Format  | Incremental stats | +-----------+------------+-----------+--------+---------+--------------+---------+-------------------+ | 2014      | 12         | 0         | 2      | 7.20MB  | NOT CACHED   | PARQUET | true              | | 2015      | 11         | 0         | 2      | 716B    | NOT CACHED   | PARQUET | true              | | 2015      | 12         | 0         | 7      | 2.54KB  | NOT CACHED   | PARQUET | true              | | 2016      | 11         | 0         | 4      | 7.31MB  | NOT CACHED   | PARQUET | true              | | 2099      | 1          | 0         | 1      | 358B    | NOT CACHED   | PARQUET | true              | | 2199      | 1          | 0         | 1      | 358B    | NOT CACHED   | PARQUET | true              | | Total     |            | 134217758 | 17     | 14.51MB | 0B           |         |                   | +-----------+------------+-----------+--------+---------+--------------+---------+-------------------+ ALTER TABLE table3 PARTITION(part_year=2014, part_month=12) SET TBLPROPERTIES('numRows'='10') show table stats table3 +-----------+------------+-----------+--------+---------+--------------+---------+-------------------+ | part_year | part_month | #Rows     | #Files | Size    | Bytes Cached | Format  | Incremental stats | +-----------+------------+-----------+--------+---------+--------------+---------+-------------------+| 2014      | 12         | 10        | 2      | 7.20MB  | NOT CACHED   | PARQUET | true              | | 2015      | 11         | 0         | 2      | 716B    | NOT CACHED   | PARQUET | true              | | 2015      | 12         | 0         | 7      | 2.54KB  | NOT CACHED   | PARQUET | true              | | 2016      | 11         | 0         | 4      | 7.31MB  | NOT CACHED   | PARQUET | true              | | 2099      | 1          | 0         | 1      | 358B    | NOT CACHED   | PARQUET | true              | | 2199      | 1          | 0         | 1      | 358B    | NOT CACHED   | PARQUET | true              | | Total     |            | 134217758 | 17     | 14.51MB | 0B           |         |                   | +-----------+------------+-----------+--------+---------+--------------+---------+-------------------+
But if I repeat last Alter command:
ALTER TABLE table3 PARTITION(part_year=2014, part_month=12) SET TBLPROPERTIES('numRows'='10') show table stats table3 +-----------+------------+-----------+--------+---------+--------------+---------+-------------------+ | part_year | part_month | #Rows     | #Files | Size    | Bytes Cached | Format  | Incremental stats | +-----------+------------+-----------+--------+---------+--------------+---------+-------------------+| 2014      | 12         | -1        | 2      | 7.20MB  | NOT CACHED   | PARQUET | true              | | 2015      | 11         | 0         | 2      | 716B    | NOT CACHED   | PARQUET | true              | | 2015      | 12         | 0         | 7      | 2.54KB  | NOT CACHED   | PARQUET | true              | | 2016      | 11         | 0         | 4      | 7.31MB  | NOT CACHED   | PARQUET | true              | | 2099      | 1          | 0         | 1      | 358B    | NOT CACHED   | PARQUET | true              | | 2199      | 1          | 0         | 1      | 358B    | NOT CACHED   | PARQUET | true              | | Total     |            | 134217758 | 17     | 14.51MB | 0B           |         |                   | +-----------+------------+-----------+--------+---------+--------------+---------+-------------------+
I am able to connect to Hive postgres DB.

댓글 없음:

댓글 쓰기