Showing posts with label METHOD_OPT. Show all posts
Showing posts with label METHOD_OPT. Show all posts

Monday, October 19, 2009

How to use histogram in Oracle

I would like to write about Oracle Histogram today. Histogram is very nice feature to help cost based optimizer to make right decision.

What is Histogram? Histograms are feature in CBO and it helps to optimizer to determine how data are skewed(distributed) with in the column. Histogram is good to create for the column which are included in the WHERE clause where the column is highly skewed. Histogram helps to optimizer to decide whether to use an index or full-table scan or help the optimizer determine the fastest table join order.

What are the advantage of Histogram? Histograms are useful in two places.

1. Histograms are useful for Oracle optimizer to choose the right access method in a table.

2. It is also useful for optimizer to decide the correct table join order. When we join multiple tables, histogram helps to minimize the intermediate result set. Since the smaller size of the intermediate result set will improve the performance.

Type of Histograms: Oracle uses two types of histograms for column statistics: height-balanced histograms and frequency histograms.

1. Height - balanced Histograms : The column values are divided into bands so that each band contains approximately the same number of rows. For instances, we have 10 distinct values in the column and only five buckets. It will create height based(Height balanced) histograms and it will evenly spread values through the buckets. A height-based histogram is when there are more distinct values than the number of buckets and the histogram statistics shows a range of rows across the buckets

2. Frequency Histograms : Each value of the column corresponds to a single bucket of the histogram. This is also called value based histogram. Each bucket contains the number of occurrences of that single value. Frequency histograms are automatically created instead of height-balanced histograms when the number of distinct values is less than or equal to the number of histogram buckets specified.

Method_opt Parameter: This is the parameter which tells about creating histogram while collecting the statistics. The default is FOR ALL COLUMNS SIZE AUTO in Oracle10g. But in oracle9i, the default is FOR ALL COLUMN SIZE 1 which will turn off the histogram collection.

FOR ALL [INDEXED HIDDEN] COLUMNS [size_clause]
FOR COLUMNS [size clause] columnattribute [size_clause] [,columnattribute [size_clause]...]

size_clause is defined as size_clause := SIZE {integer REPEAT AUTO SKEWONLY}

integer : Number of histogram buckets. Must be in the range [1,254]

REPEAT : Collects histograms only on the columns that already have histograms.

AUTO : Oracle determines the columns to collect histograms based on data distribution and the workload of the columns. We have a table called sys.col_usage$ that stores information about column usage. dbms_stats use this information to determine whether histogram is required for the columns.

SKEWONLY : Oracle determines the columns to collect histograms based on the data distribution of the columns.

Let me demonstrate how optimizer works with and without histogram as below two scenario. We take the emp table for this demonstration. The table has around 3.6 million records. The table emp_status column is highly skewed. It has two distinct values(Y,N). We have bitmap index on emp_status column.

Scenario 1 Let us generate the statistics without any histogram and see what kind of execution path optimizer is using. Without the histogram, oracle assume that, the data is evenly distributed and optimizer think that, we will have around 1.8 million record for emp_status Y and around another 1.8 million records for emp_status N.

Connected to:
Oracle Database 10g Enterprise Edition Release 10.2.0.3.0 - Production
With the Partitioning, OLAP and Data Mining options

SQL> select count(*),emp_status from scott.emp
2 group by emp_status;

COUNT(*) E
---------- -
1 N
3670016 Y

SQL> execute DBMS_STATS.GATHER_TABLE_STATS(OWNNAME => 'SCOTT', TABNAME => 'EMP',ESTIMATE_PERCENT =>
10, METHOD_OPT => 'FOR ALL COLUMNS SIZE 1',CASCADE => TRUE);

PL/SQL procedure successfully completed.

SQL> select ename from scott.emp where emp_status='Y';

3670016 rows selected.

--------------------------------------------------------------------------
Id Operation Name Rows Bytes Cost (%CPU) Time
--------------------------------------------------------------------------
0 SELECT STATEMENT 1832K 15M 5374 (5) 00:01:05
* 1 TABLE ACCESS FULL EMP 1832K 15M 5374 (5) 00:01:05
--------------------------------------------------------------------------

SQL> select ename from scott.emp where emp_status='N';

--------------------------------------------------------------------------
Id Operation Name Rows Bytes Cost (%CPU) Time
--------------------------------------------------------------------------
0 SELECT STATEMENT 1832K 15M 5374 (5) 00:01:05
* 1 TABLE ACCESS FULL EMP 1832K 15M 5374 (5) 00:01:05
--------------------------------------------------------------------------

Conclusion: Optimizer is using full table scan for the query which returns 3670016 records as well as it using full table scan for query which returns just only one record. This is obvisouly incorrect. This problem will be resolved by collecting histogram. Let us see in the next scenario.

Scenario 2 : Let us generate the statistics with histogram and see what kind of execution path optimizer is using. FOR COLUMN SIZE 2 EMP_STATUS will create two bucket for column emp_status. If we are not sure the distinct number of values in the column, then we can use AUTO option to collect histogram. With this histogram, oracle optimizer knows that, the column emp_status is highly skewed and it has two bucket and one bucket has around 3.6 million records with emp_status Y and another bucket has only one record with emp_status N. Now depends upon the query, optimizer decides whether to use index or Full table scan.

SQL> execute DBMS_STATS.GATHER_TABLE_STATS(OWNNAME => 'SCOTT', TABNAME => 'EMP',ESTIMATE_PERCENT =>
10, METHOD_OPT => 'FOR COLUMNS SIZE 2 EMP_STATUS',CASCADE => TRUE);

PL/SQL procedure successfully completed.

SQL> select ename from scott.emp where emp_status='Y';

3670016 rows selected.

--------------------------------------------------------------------------
Id Operation Name Rows Bytes Cost (%CPU) Time
--------------------------------------------------------------------------
0 SELECT STATEMENT 3681K 31M 5375 (5) 00:01:05
* 1 TABLE ACCESS FULL EMP 3681K 31M 5375 (5) 00:01:05
--------------------------------------------------------------------------

SQL> select ename from scott.emp where emp_status='N';

--------------------------------------------------------------------------
Id Operation Name Rows Bytes Cost (%CPU) Time
--------------------------------------------------------------------------
0 SELECT STATEMENT 1 9 1 (0) 00:00:01
1 TABLE ACCESS BY INDEX ROWID EMP 1 9 1 (0) 00:00:01
2 BITMAP CONVERSION TO ROWIDS
* 3 BITMAP INDEX SINGLE VALUE IDX_EMP
--------------------------------------------------------------------------

Conclusion: Optimizer is using full table scan for the query which returns 3670016 records. At the same time, optimizer is using index scan when for other query which returns one record. This scenario, the optimizer choose the right execution plan based on the query WHERE clause.

Data dictionary objects for Histogram:
user_histograms
user_part_histograms
user_subpart_histograms
user_tab_histograms
user_tab_col_statistics

Thursday, October 1, 2009

Analyze Versus DBMS_STATS

Cost based optimizer is preferred method for oracle optimizer. In order to make good use of the CBO, you need to create accurate statistics. Prior to oracle8i, we use ANALYZE command to gather statistics.

DBMS_STATS package is introduced in oracle8i. Since Oracle8i, Oracle highly recommeds to use DBMS_STATS instead of ANALYZE command. This article is written in oracle10g. I am going to address below topics in this thread....

1. Why oracle recommends to use DBMS_STATS package?
2. What are the advantages of DBMS_STATS compared to ANALYZE?
3. How do we use DBMS_STATS package to analyze the table?
4. What are new features in each version for DBMS_STATS?

Why oracle recommends to use DBMS_STATS since Oracle8i?

1. Gathering statistics can be done in Parallel. This option is not available in ANALYZE command.

2. It is used to collect the stale statistics. I discussed about collecting stale statistics in another topic. Please refer stale statistics to know more about collecting stale statistics.

3. DBMS_STATS is a PLSQL pacakge. So it is easy to call. But ANALYZE does not.

4. It is used to collect statistics for external tables. But ANALYZE does not.

5. DBMS_STATS used to collect system statistics. But ANLAYZE does not.

6. Some time, ANALYZE does not produce accurate statistics. But DBMS_STATS does.

7. We can not use ANLAYZE command to gather statistics for partition or sub partition level. But we can use DBMS_STATS to analyze any specific partition or sub partition. This is especially useful for partition table. We do not need to analyze the Historical data whenever we refresh the current partition.

8. We can transfer statistics from one DB to another DB when we collected statistics through DBMS_STATS. But it can not be done when we use ANALYZE command to collect the statistics. Please refer statistics transfer to know more about trasferring statistics.

What force you to use ANALYZE command in all Oracle versions? ANALYZE can be used to collect the statistics like CHAIN_CNT, AVG_SPACE, and EMPTY_BLOCKS. DBMS_STATS will not collect these statistics. We might need to use ANALYZE in case if we want to see chained rows, average space and empty blocks.

There are several parameter exists for collecting statistics on table level, schema level, database level and system level. But i do not want to explain all the parameters which are already in Oracle help. Still i would like to explain some parameters.

estimate_percent: Percentage of rows or blocks to estimate. The valid rage is 0.000001 to 100. when we pass NULL for this parameter, then it computes. Compute is same as 100% sample. For instance, if we pass 20%, then it takes roughly around 20% of rows or 20% blocks depends on the BLOCK_SAMPLE parameter. This Parameter is used for analyzing on table, index, schema level and database level.

block_sample: This determines whether or not to use random block sampling instead of random row sampling. Block sampling would be slightly less accurate in the case where rows have roughly the same lifecycle and, thus, values are spread non-uniformly throughout the table. In case if you want to drive in deep on this, David Aldridge has a nice article on block sampling. This Parameter is used for analyzing on table, schema level and database level.

method_opt: This parameter tells about histogram in table. It determine which column should have histogram and number of histogram created for the table columns. This Parameter is used for analyzing on table, schema level and database level. Please refer histogram to know more about Histogram in Oracle.

granularity:This parameter is useful when you want to gather statistics on specific partition or sub partition in a table. The valid parameters are ALL, AUTO, GLOBAL, GLOBAL AND PARTITION, PARTITION, SUBPARTITION. This Parameter is used for analyzing on table, index, schema level and database level only if the table or index is partitioned.

no_invalidate: Does not invalidate the dependent cursors or currently parsed SQL statement if set to TRUE. The procedure invalidates the dependent cursors immediately if set to FALSE. Use DBMS_STATS.AUTO_INVALIDATE to have Oracle decide when to invalidate dependent cursors. This Parameter is used for analyzing or deleting statistics on table, index, schema level, database level.

Degree: Degree of parallelism. It has three valid values.

NULL : Oracle takes the value which is specified in degree clause of create or alter table statement.

DBMS_STATS.DEFAULT_DEGREE : It takes the value based on number of CPU's and init parameters.

DBMS_STATS.AUTO_DEGREE : It determines the value automatically. It is either 1 or default degree according to the size of the object.

Options: This parameter is used only for analyzing the data on schema level or DB level. There are multiple values for this parameters. The valid value for this parameters are GATHER, GATHER AUTO, GATHER STALE, GATHER EMPTY, LIST AUTO, LIST STALE, LIST EMPTY. Let me explain these valid values in short. Since these values are important to gather statistics on schema level.

GATHER-Gather statistics for all the objects in a schema or database.

GATHER AUTO-Gather statistics when the statistics are stale or when there is no statistics. It does both GATHER STALE and GATHER EMPTY.

GATHER STALE-Gather statistics only when it is stale. Does not collect when there is no statistics.

GATHER EMPTY-Gather statistics only when no statistics.

LIST AUTO: Returns a list of objects to be processed with GATHER AUTO.

LIST STALE: Returns list of stale objects as determined by looking at the user_tab_modifications

LIST EMPTY: Returns list of objects which currently have no statistics.

Gathering_mode : This parameter is used only for gathering system statistics. The valid modes are NOWORKLOAD, INTERVAL,START and STOP. The default is NOWORKLOAD. The START and STOP is used to stop and start the system statistics.

Example for collecting statistics on table:

DBMS_STATS.GATHER_TABLE_STATS(
OWNNAME => 'TANDEB',
TABNAME => 'CUSTOMER',
PARTNAME => 'PART092009'
GRANULARITY => 'PARTITION',
ESTIMATE_PERCENT => 10,
METHOD_OPT => 'FOR ALL COLUMNS SIZE 1',
CASCADE => TRUE,
NO_INVALIDATE => TRUE);

Example for collecting statistics on Schema:

DBMS_STATS.GATHER_SCHEMA_STATS(
OWNNAME => 'SCOMPPRD',
ESTIMATE_PERCENT => 10,
METHOD_OPT => 'FOR ALL COLUMNS SIZE 1',
OPTIONS => 'GATHER',
CASCADE => TRUE,
NO_VALIDATE => TRUE);


Example for collecting system statistics:

DBMS_STATS.GATHER_SYSTEM_STATS(
GATHERING_MODE => 'INTERVAL',
INTERVAL => 10);

Example for collecting database statistics:

DBMS_STATS.GATHER_DATABASE_STATS(
ESTIMATE_PERCENT => 10,
METHOD_OPT => 'FOR ALL COLUMNS SIZE 1',
CASCADE => TRUE,
NO_VALIDATE => TRUE,
GATHER_SYS => FALSE)


New feature in Oracle9i:

1. Introduced to gather system statistics. Such as I/O and CPU utilization.

2. It can direct the database to select the appropriate sample size to generate accurate statistics. A new value for the ESTIMATE_PERCENT parameter, DBMS_STATS.AUTO_SAMPLE_SIZE will let Oracle decide the sample size necessary to ensure generation of accurate statistics.

3. Oracle9i introduced new values for the size clause in the METHOD_OPT parameter automate the decisions regarding the columns on which histograms need to be created while letting administrators control the factors affecting such decisions. Besides specifying a numeric value for the size clause, administrators have the new options (AUTO, SKEWONLY, REPEAT)

4. Oracle9i introduced new feature to enable or disable the table monitoring in schema level or DB level in one command.

DBMS_STATS.alter_schema_tab_monitoring('MYSCHEMA', TRUE); DBMS_STATS.alter_schema_tab_monitoring('MYSCHEMA', FALSE);

DBMS_STATS.alter_database_tab_monitoring(TRUE); DBMS_STATS.alter_database_tab_monitoring(FALSE);

New feature in Oracle10g:

1. Oracle10g enable table monitoring automatically. Table monitoring is required to collect the stale statisics. We do not need to enable monitoring explicitly. This feature is disabled when statistics_level is BASIC. It enables the table monitoring feature when statistics_level is TYPICAL. ALTER TABLE [NO] MONITORING clauses as well as alter_schema_tab_monitoring and alter_database_tab_monitoring procedures of the dbms_stats package are now obsolete in oracle10g. But still it runs without any error. But there is no effect.

2. Oracle10g introduced two new values for Granularity parameter. These are AUTO and GLOBAL AND PARTITION. This parameter is applicable for analyzing partitioning tables.

AUTO : Oracle collect statistics GLOBAL level, Partition level, and sub-partition level only if sub partition method is LIST. If sub parition is not a LIST, then it collects only GLOBAL, Partition level.

GLOBAL AND PARTITION : Oracle gathers the global and partition level statistics. No sub-partition level statistics are gathered.

3. Oracle10g introduced new value DBMS_STATS.AUTO_DEGREE for Degree parameter. When you specify the auto_degree, Oracle will determine the degree of parallelism automatically. It will be either 1 (serial execution) or default_degree (the system default value based on number of CPUs and initialization parameters), according to the size of the object.

4. Oracle10g has ability to restore the previous statistics. Oracle saves last 31 days statistics by default. We can recover previous days statistics in case, optimizer behaves differently with current statistics. Please refer my another post restoring statistics

5. We can lock the table statistics. This would be helpful if you want to avoid gathering statistics during the maintenance window. Please refer my another post Locking statistics

6. Oracle10g has automatic statistics gathering feature. Oracle gather statistics for the entire database every day during the maintenance window. Please refer my another post automatic statistics gathering

7. The statistics will be collected automatically when we create index. In oracle9i, we need to use compute statistics clause to collect statistics while creating index. Please refer my another post compute index statistics

What is impact when we analyze the tables during the peak hours?

1. Oracle consume more resource when we gather statistics. This will slow down the overall performance in the sever.

2. When statistics are updated for a database object, Oracle invalidates any currently parsed SQL statements that access the object. The next time such a statement executes, the statement is re-parsed and the optimizer automatically chooses a new execution plan based on the new statistics. This will degrade the server performance during the peak hours. But we can control this by using the parameter NO_INVALIDATE. This has three values(TRUE, FALSE, DBMS_STATS.AUTO_INVALIDATE). TRUE will not invalidate the already parsed SQL statement. NO will invalidate parsed SQL statement immediately. AUTO_INVALIDATE decides when to invalidate the already parsed Statement.

Here is the oracle help to know more about dbms_stats procedure. Oracle Help