Postgres Corruption

Symptoms

Application logs showing

double register of key xxx
ERROR: invalid page in block x of relation pg_tblspc/x/x/x/x
ERROR: missing chunk number * for toast value * in pg_toast_*
checksum mismatch: disk has 0, should be xxx
DETAIL  could not open file “$psql/pg_clog/####”: No such file or directory
cp: cannot stat `pg_xlog/
Error “MultiXactId has not been created yet — apparent wraparound”

Purpose

Help engineers resolve corruption cases. We’d like if engineers could attach the SR’s to these cases and also record the errors messages in a SAP for the case.

Cause

Storage issue. APD. Disk full.

Resolution

Double register of key

This indicates that a primary key violation has been inserted into a table.

export pbin=`ps -ea -o cmd | grep ‘[p]ostgres -D’ | awk  ‘{ gsub (“postgres$”,””,$1); print $1}’`
export pdata=`ps -ea -o cmd | grep ‘[p]ostgres -D’ | awk  ‘{print $3}’`

$pbin/reindexdb -d VCDB -U postgres

  • For windows open the command prompt to where postgres binaries are.For example:C:\Program Files\VMware\vCenter Server\vPostgres\bin>reindexdb -U postgres -d VCDBNote: This will error out on a table if duplicate entries exist for a Primary Key.
  1. Login to the database../psql -U secureall -d secureall
    ./psql -U vc -d VCDB
    C:\Program Files\VMware\vCenter Server\vPostgres\bin>psql -U postgres -d VCDB
  2. Try to re index the suspect table to confirm:REINDEX TABLE vpx_resource_pool;   error rebuilding pk_resources
  3. Find the duplicate entry.select Column1, Column2, count(*)
    from yourTable
    group by Column1, Column2
    HAVING count(*) > 1Note: Column1,Column2 — are the columns in the primary key.
  4. To find columns in the primary key.\dt PK_NAME shows the column(s) in the PKFor example:VCDB=> \d “pk_vpx_resource_pool”
    Index “vc.pk_vpx_resource_pool”
    Column |  Type   | Definition
    ——–+———+————
    id     | integer | id
    primary key, btree, for table “vc.vpx_resource_pool”
    select id, count(*)
    from vpx_resource_pool
    group by id

    HAVING count(*) > 1;

  5. The next step is get the unique ID of the 2 same rowsselect ctid, id
    from vpx_resource_pool
    where id = ‘id returned above’;
  6. Now delete the most recent entry, The largest ctid.delete from vpx_resource_pool
    where ctid in (select max(ctid) from
    vpx_resource_pool where id=’id returned above’);
  7. Once this is removed you should be able to reindex the table.reindex table vpx_resource_pool;

ERROR: invalid page in block x of relation pg_tblspc/x/x/x/x

It’s very important to take a snapshot of your appliance before executing this step.

./psql -U secureall -d secureall — see KB 2147249 for help logging in
./psql -U vc -d VCDB
SET zero_damaged_pages = on;
vacuum full verbose;

ERROR: invalid page in block x of index xxxx

For example: ERROR: invalid page in block 0 of index vpx_text_array_m2

./psql -U vc -d VCDB  — see KB 2147249 for help logging in.
reindex index vpx_text_array_m2;

ERROR: missing chunk number * for toast value * in pg_toast_*  or ERROR: unexpected chunk number 0 (expected 1) for toast

The log may have indicated which toast value is corrupt. or toast value 92347756 in pg_toast_29042;

You need to remove the record with the corrupt toast.

  1. Login to the database../psql -U secureall -d secureall
    ./psql -U vc -d VCDB
  2. To identify the table with the corrupt rows.select 29042::regclass;  –29042 is the appendix after the pg_toast
  3. Another way to identify the corruption is by doing a vacuum fullsecureall=# vacuum full verbose;  — This will throw out any corruption.
  4. Once you find hte table you have to do selects on the table to find which row is corrupt and remove it.
  5. To find the row do some selects.For Example:The corruption is on vpx_vm,VCDB=> select count(*) from vpx_vm;
    count
    ——-
    37
    VCDB=> select * from vpx_vm where id < 10; No problem rows returned. VCDB=> select * from vpx_vm where id < 20; Error – unexpected chunk number  — corruption between 10 and 20 VCDB=> select * from vpx_vm where id < 15; Error – unexpected chunk number  — corruption between 10 and 15 VCDB=> select * from vpx_vm where id =12;
    Error – unexpected chunk number  — This is the corrupt vm.
  6. It’s not safe just to delete one row as there will be dependencies in the application.
  7. Look for other KB’s to see if this is safe.
  8. In the example above, see [Internal] Manually removing a virtual machine from the vCenter Server SQL database (2128217).

Checksum mismatch: disk has 0, should be xxx

  1. This will require you to shutdown the application. For more information on the values on variables inside <> see the ‘How to login’ sectionexport pbin=`ps -ea -o cmd | grep ‘[p]ostgres -D’ | awk  ‘{ gsub (“postgres$”,””,$1); print $1}’`
    export pdata=`ps -ea -o cmd | grep ‘[p]ostgres -D’ | awk  ‘{print $3}’`$pbin/pg_ctl stop -m smart -D $pdataNote: If you get permission p problems running the pg_ctl

    su – vpostgres -c “$pbin/pg_ctl stop -m smart -D $pdata”

  2. For further information, see permission problems executing postgres binaries (2147283)pg_ctl stop -m smart -D $pdata
  3. Then fix the corruption.postgres –single -D $pdata -c block_path“For example:su – postgres –c “$pbin/postgres –single -D $pdata -c fix_block_checksum=”1663/1636/1694/0/978″”
  4. Start up the database againpg_ctl start -D $pdataFor example:su – secureall –c “/opt/vmware/vpostgres/9.1/bin/pg_ctl start -D /common/db/vshield”
    su – vpostgres –c  “$pbin/pg_ctl start -D $pdata”

    For further information on permissions, see permission problems executing postgres binaries (2147283).

cp: cannot stat `pg_xlog/

export pbin=`ps -ea -o cmd | grep ‘[p]ostgres -D’ | awk  ‘{ gsub (“postgres$”,””,$1); print $1}’`
export pdata=`ps -ea -o cmd | grep ‘[p]ostgres -D’ | awk  ‘{print $3}’`

Shutdown postgres.

su –secureall -c “$pbin/pg_resetxlog $pdata”

Windows

Find the postgres executables, search for psql.

For Virtual Center this C:\Program Files\VMware\vCenter Server\vPostgres\bin>
Open a command prompt to this directory

pg_resetxlog.exe c:\programdata\vmware\vCenterServer\data\vpostgres\

For further information on permissions, see  permission problems executing postgres binaries (2147283).

Error “MultiXactId has not been created yet — apparent wraparound”

  1. Login to the database, see Instructions for logging into postgres when its embedded in applications or apppliances. Internal (2147249)./psql -U secureall -d secureall
    ./psql -U vc -d VCDB
  2. For the table the application is complaining about.SELECT count(*) from table for update;”
    followed by “VACUUM FREEZE table;”For example:SET LOCAL lock_timeout = ‘4s’;
    SELECT count(*) from vpx_host for update;”
    “VACUUM FREEZE vpx_host;”

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s