Programming Solutions

Your Source for Information

Update Statement Performance Improvement for Large Table

by Maeenul 14. December 2010 15:56

Often we need to use update statement in our relation database coding. If it is a small table containing several thousand or may be several hundred thousand rows, we may not face any problem with the time required for the update statement. But if it is a large table containing several million data and we try to update it, we will surely face a performance problem. To update a column of several million rows will take a long time. In such cases we have to think some alternative to reduce the time of the update operation. Sometimes, we may not be able to reduce the time of the update operation, but we can use some alternative approach which will give us the same result as we would get from our update operation. But of course, taking a much less time that the update operation. First we will see a typical case. We have several tables as below:

IMPORT_UK_IMS_SALES_24M (22.6 million)

UK_RXI_GEO_DIM_NF (1223) UK_RXI_PRODUCT_DIM_NF(181) UK_RXI_PRACTICE_DIM_NF(22 thousand) UK_RXI_PERIOD_DIM(24) What we need is to have some of the columns in dbo.IMPORT_UK_IMS_SALES_24M table updated with columns from the other 4 tables. PCO_ID column will be updated with GEO_CODE from UK_RXI_GEO_DIM_NF table matching PCO_ID = GEO_DESC Product_Pack_ID column will be updated with PROD_CODE from UK_RXI_PRODUCT_DIM_NF table matching Product_Pack_ID = PRODUCT_DESC Binleys_Practice_ID column will be updated with PRACTICE_CODE from UK_RXI_PRACTICE_DIM_NF table matching Binleys_Practice_ID = PRAC_DESC YearMonth column will be updated with PERIOD_CODE from UK_RXI_PERIOD_DIM table matching YearMonth = actual_Code So a very obvious implementation will be as follows:

----------- update geo level ---------

update dbo.IMPORT_UK_IMS_SALES_24M

set PCO_ID = A.GEO_code

from UK_RXI_GEO_DIM_NF as A

where PCO_ID = A.GEO_desc

 

------------ update product/brand level ---------

update dbo.IMPORT_UK_IMS_SALES_24M

set Product_Pack_ID = B.prod_code

from UK_RXI_PRODUCT_DIM_NF as B

where Product_Pack_ID = B.product_desc

 

 

--------- update PRACTICE level-----------------

update dbo.IMPORT_UK_IMS_SALES_24M

set Binleys_Practice_ID = C.PRACTICE_CODE

from UK_RXI_PRACTICE_DIM_NF as C

where Binleys_Practice_ID = C.PRAC_DESC

 

 

------------- update period ------------------

update dbo.IMPORT_UK_IMS_SALES_24M

set YearMonth = D.PERIOD_CODE

from UK_RXI_PERIOD_DIM as D

where YearMonth = D.actual_CODE But a single update in this case takes around 8:37 minutes. All the 4 update statements take around 30 minutes. This is obviously not acceptable. A second alternative: The second alternative to this typical update statements will be to combine all the 4 updates into 1 update statement. We will have to use join with update statements.

update IMPORT_UK_IMS_SALES_24M

set IMPORT_UK_IMS_SALES_24M.PCO_ID = A.GEO_code,

IMPORT_UK_IMS_SALES_24M.Product_Pack_ID = B.prod_code,

IMPORT_UK_IMS_SALES_24M.Binleys_Practice_ID = C.PRACTICE_CODE,

IMPORT_UK_IMS_SALES_24M.YearMonth = D.PERIOD_CODE

from IMPORT_UK_IMS_SALES_24M as IM

leftjoin UK_RXI_GEO_DIM_NF as A

on IM.PCO_ID = A.GEO_desc

leftjoin UK_RXI_PRODUCT_DIM_NF as B

on IM.Product_Pack_ID = B.product_desc

leftjoin UK_RXI_PRACTICE_DIM_NF as C

on IM.Binleys_Practice_ID = C.PRAC_DESC

leftjoin UK_RXI_PERIOD_DIM as D

on IM.YearMonth = D.actual_CODE

 

Typically we think that 4 joins with the large table will take a long time. But although it has 4 joins, it will take the time of a single update statement. To optimize the time, we have used left joins instead of inner joins. The above update statement takes around 8 minutes. So definitely we have reduced the time to one forth. But still 8 minutes is too long. It would better if we could reduce the time even more. A Third alternative: Probably The Best The third and probably the best alternative is to avoid update statement on this large table. How can we avoid this update but still can get the same result? We can create a recordset having the values that would be the value after the update operation. We can use left joins in the same way as we have used in the second alternative. But this time we will not update the table. We will create a full record set and insert it in a temporary table. After that we can truncate table main large table and then reinsert the full record set from the temporary tbale to our main table. Although this process contains 2 very large insertion, but still it is faster than update statement.

select Per.PERIOD_CODE [YearMonth],

P.prod_code [Product_Pack_ID],

G.geo_code [PCO_ID] ,

Prac.PRACTICE_CODE [Binleys_Practice_ID],

A.Sales_value,

A.Rx_value,

A.Count_Units

into #temp

from IMPORT_UK_IMS_SALES_24M A

leftjoin UK_RXI_GEO_DIM_NF G

on A.PCO_ID = G.GEO_desc

leftjoin UK_RXI_PRODUCT_DIM_NF P

on A.Product_Pack_ID = P.product_desc

leftjoin UK_RXI_PRACTICE_DIM_NF Prac

on A.Binleys_Practice_ID = Prac.PRAC_DESC

leftjoin UK_RXI_PERIOD_DIM Per

on A.YearMonth = Per.actual_CODE

 

truncatetable IMPORT_UK_IMS_SALES_24M

 

insertinto IMPORT_UK_IMS_SALES_24M

select*

from #temp Doing this will give us the same result in the IMPORT_UK_IMS_SALES_24M table and we need around 2 minutes to run this process. So we are again reducing the time to one forth. So, to conclude, we can say that if we have to do many updates in a very large table (table with several million rows), we can avoid update statement and use select and insert.

Tags: , ,

Category: SQL



Comments (1) -

  1. Parallelization for queries (SELECT) is covered quite well by the SQL engine itself, but when it comes to large volume data modifications (UPDATE, INSERT, DELETE), the standard engine does parallelize towards best use of all available resources (disk, multiple cpu-cores, etc.).

    Therefore you may have a look into the approach of SQL Parallel Boost at
    www.ibax.ch/.../default.aspx

    This approach can also be used to execute multiple SQL statements in parallel.

    A purely SQL engine related parallelisation solution takes advantage of minimized complexity and has no 'external' components like SSIS involved, Furthermore it's the best performing solution regarding task splitting and synchronization, as it hasn't potential connection and communication overhead. The overall performance gain thru parallelisation with SQL Parallel Boost is up to 10 !

    In case you don't wan't to rebuild your own solution, SQL Parallel Boost provides a self-contained pure T-SQL based solution, which can be easily embedded in existing applications and ETL process tasks.

Add comment

biuquote
  • Comment
  • Preview
Loading

Alpha Tags