When a fix breaks something



In a post a couple of weeks ago i described how i fixed a performance problem on cloud control. However i found out this week that the all was not well with this fix.....

Here is a small part of the extract from the tracefile:

DDE: Problem Key 'ORA 600 [13013]' was flood controlled (0x2) (incident: 218235)
ORA-00600: internal error code, arguments: [13013], [5001], [202900], [101068480], [304], [101068480], [17], [], [], [], [], []
*********START PLSQL RUNTIME DUMP************
***Got internal error Exception caught in pl/sql run-time while running PLSQL***
***Got ORA-600 while running PLSQL***
PACKAGE BODY SYSMAN.EM_METRIC:

So that's not looking good (no one likes ORA-00600's...)

It had also created a very large trace file (yep that's 18GB.....)

-rw-r----- 1 oracle oinstall 18697754607 Jan  8 16:18 EETGRID_s020_17691.trc

A quick google of the error reveals that its a corruption of some sort and the 3rd argument of the error is the objectid - we can therefore use this to find the objects

SQL>  Select object_name,object_type,owner from dba_objects where data_object_id=202900;

OBJECT_NAME
--------------------------------------------------------------------------------
OBJECT_TYPE
-----------------------
OWNER
--------------------------------------------------------------------------------
EM_METRIC_ITEMS
TABLE
SYSMAN

SO that's the table i added my magic new index too - lets validate it

SQL> analyze table sysman.em_metric_items validate structure online;

Table analyzed.

So that's OK - let's now validate the table and it's indexes

SQL>  analyze table sysman.em_metric_items validate structure cascade online;
 analyze table sysman.em_metric_items validate structure cascade online
*
ERROR at line 1:
ORA-01499: table/index cross reference failure - see trace file

That's not so good - so there is a mismatch of the index entries and the rows in the table - not nice - lets just double check what indexes there are.

SQL> select index_name from dba_indexes where table_name='EM_METRIC_ITEMS'
  2  /

INDEX_NAME
--------------------------------------------------------------------------------
EM_METRIC_ITEMS_UN1
EM_METRIC_ITEMS_PK
NOT_ORACLE_ADDED_BY_US_IDX

Now the notes i found seem to imply that the index name will be somehow reference in the trace file the analyze produced so i found which file this would be in

SQL> SELECT VALUE FROM V$DIAG_INFO WHERE NAME = 'Default Trace File';

VALUE
--------------------------------------------------------------------------------
/oracle/admin/EETGRID/diag/rdbms/eetgrid/EETGRID/trace/EETGRID_ora_28379.trc

However it didn't seem to reveals the index name though it did have loads of stuff in it.

So i just decided to rebuild all the indexes anyway (in parallel which i then switched off)


SQL> alter index sysman.NOT_ORACLE_ADDED_BY_US_IDX rebuild parallel 8;

Index altered.

SQL>  alter index sysman.NOT_ORACLE_ADDED_BY_US_IDX noparallel;

Index altered.

SQL> alter index sysman.EM_METRIC_ITEMS_UN1 rebuild parallel 8;

Index altered.

SQL>  alter index sysman.EM_METRIC_ITEMS_UN1 noparallel;

Index altered.

SQL>  alter index sysman.EM_METRIC_ITEMS_PK rebuild parallel 8;

Index altered.

SQL>  alter index sysman.EM_METRIC_ITEMS_PK noparallel;

Index altered.

This however did not fix the problem.......

So i decided I'd have to completely drop all the indexes and recreate them - well actually id decided by this point that it must be my added index that was the problem so i'd just drop and recreate that. As this index is pretty much constantly in use (and the fact that i suspected building it online in the first place had caused the issue) i decided to stop cloud control first

 emctl stop oms
Oracle Enterprise Manager Cloud Control 12c Release 5
Copyright (c) 1996, 2015 Oracle Corporation.  All rights reserved.
Stopping WebTier...
WebTier Successfully Stopped
Stopping Oracle Management Server...
Oracle Management Server Successfully Stopped
Oracle Management Server is Down

I then extracted the ddl for the index using dbms_metadata, dropped the inded and then ran the sql to recreate it (adding parallel 8 to speed it up)

  CREATE INDEX "SYSMAN"."NOT_ORACLE_ADDED_BY_US_IDX" ON "SYSMAN"."EM_METRIC_ITEMS" ("TARGET_GUID", "METRIC_GROUP_ID", "IS_CURRENT") parallel 8;
  PCTFREE 10 INITRANS 2 MAXTRANS 255 COMPUTE STATISTICS
  STORAGE(INITIAL 65536 NEXT 1048576 MINEXTENTS 1 MAXEXTENTS 2147483645
  PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1
  BUFFER_POOL DEFAULT FLASH_CACHE DEFAULT CELL_FLASH_CACHE DEFAULT)
  TABLESPACE "MGMT_TABLESPACE"

I then switched it back to noparallel after it was built

alter index "SYSMAN"."NOT_ORACLE_ADDED_BY_US_IDX" noparallel;

now we start the OMS back up

# emctl start oms
Oracle Enterprise Manager Cloud Control 12c Release 5
Copyright (c) 1996, 2015 Oracle Corporation.  All rights reserved.
Starting Oracle Management Server...
Starting WebTier...
WebTier Successfully Started
Oracle Management Server Successfully Started
Oracle Management Server is Up
You have new mail in /var/mail/oracle

and....... i hear you ask

It was fixed!

I suspect there is some bug when building an index online on a very highly used system that caused the problem - i couldn't find any hits on this on metalink though.

If you are going to add an index like i did maybe its safer to take an outage in this specific case.

Comments