I am trying to import data from a .csv file into an Oracle database using Python. So far it works fine if the .csv file contains 10 records. If I increase the number of records in the .csv file to 1,000,000, the script takes far too long and does not finish even after an hour.
Can anyone tell me how I can optimise my source code?
Kind Regards
Jegor
...
sql_insert = """Merge into TEST_TABLE a
Using (Select
           :ID as ID,
           :COUNTRY as COUNTRY,
           :DATE as DATE
       From Dual) src
on (src.ID = a.ID)
when matched then
    update set
        a.COUNTRY = src.COUNTRY,
        a.DATE = src.DATE
when not matched then
    Insert (a.ID, a.COUNTRY, a.DATE)
    Values (src.ID, src.COUNTRY, src.DATE)"""

# Get S3 file
obj = s3.Object(CDH_S3_Bucket, CDH_Path + '/' + s3_filename)
body = obj.get()['Body'].read().decode('utf-8').splitlines()
# ----------------------------------------------------------
csv_reader = csv.reader(body, delimiter=',')
headings = next(csv_reader)
data = []  # collect all rows for the bulk bind
for line in csv_reader:
    data.append(line)
if data:
    cursor.executemany(sql_insert, data)
connection.commit()
cursor.close()
...
Solution:
A merge is meant to modify one table based on the data in another table; it is not intended for row-by-row processing driven from the client like this. The proper design is to use a normal bulk-bind insert to load a work table, and then run a single merge execution to sync the target table with that work table.
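The two-step design described above could be sketched like this. This is a minimal sketch: the staging table `TEST_TABLE_STG`, the batch size, and the `load_csv_rows` helper are assumptions for illustration, and `connection` is any DB-API connection (e.g. python-oracledb); the `DATE` column is quoted because DATE is a reserved word in Oracle.

```python
# Step 1: bulk-bind the CSV rows into a work table in batches.
# Step 2: run ONE set-based MERGE from the work table to the target.
# TEST_TABLE_STG and BATCH_SIZE are assumptions for illustration.
BATCH_SIZE = 50_000

SQL_STAGE_INSERT = """INSERT INTO TEST_TABLE_STG (ID, COUNTRY, "DATE")
                      VALUES (:1, :2, :3)"""

SQL_MERGE = """MERGE INTO TEST_TABLE a
               USING TEST_TABLE_STG src
                  ON (src.ID = a.ID)
               WHEN MATCHED THEN UPDATE SET
                    a.COUNTRY = src.COUNTRY,
                    a."DATE" = src."DATE"
               WHEN NOT MATCHED THEN
                    INSERT (a.ID, a.COUNTRY, a."DATE")
                    VALUES (src.ID, src.COUNTRY, src."DATE")"""

def load_csv_rows(connection, rows):
    """Bulk-bind rows into the work table, then run one set-based MERGE."""
    cursor = connection.cursor()
    try:
        cursor.execute("TRUNCATE TABLE TEST_TABLE_STG")
        batch = []
        for row in rows:
            batch.append(row)
            if len(batch) >= BATCH_SIZE:
                cursor.executemany(SQL_STAGE_INSERT, batch)
                batch = []
        if batch:
            cursor.executemany(SQL_STAGE_INSERT, batch)
        cursor.execute(SQL_MERGE)  # one statement covers all rows at once
        connection.commit()
    finally:
        cursor.close()
```

Batching the `executemany` calls keeps memory bounded, and the single server-side MERGE replaces the one-million client round trips of the original design.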
Also, when you do use a merge (appropriately), avoid the WHERE and DELETE subclauses of the MERGE's "when matched then update..." clause. In a merge, that WHERE behaves differently from a WHERE in other SQL statements: it is evaluated as part of the update step, so you have already paid the cost of consistent-read block gets and other concurrency mechanisms, including redo, even for rows the WHERE ends up excluding. Instead, filter out unchanged rows inside the USING clause: pre-join to the target table there, discard the rows whose values have not changed, and emit the target table's ROWID so the outer merge can match on it efficiently. The main benefit is that the filtering happens before any DML touches the target.
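Filtering inside the USING clause might look like the following. This is a sketch against the question's tables: the staging table `TEST_TABLE_STG` is an assumption, the `DATE` column is quoted because DATE is reserved in Oracle, and NULL-safe comparisons are omitted for brevity.

```python
# MERGE that pre-joins to the target inside USING, drops unchanged rows
# there, and matches the outer clause on the target's ROWID.
SQL_MERGE_FILTERED = """MERGE INTO TEST_TABLE a
USING (SELECT s.ID, s.COUNTRY, s."DATE", t.ROWID AS target_rid
         FROM TEST_TABLE_STG s
         LEFT JOIN TEST_TABLE t ON t.ID = s.ID
        WHERE t.ID IS NULL                  -- brand-new row
           OR t.COUNTRY <> s.COUNTRY        -- changed row
           OR t."DATE" <> s."DATE") src     -- (NULL-safe checks omitted)
   ON (a.ROWID = src.target_rid)
WHEN MATCHED THEN UPDATE SET
     a.COUNTRY = src.COUNTRY,
     a."DATE" = src."DATE"
WHEN NOT MATCHED THEN
     INSERT (a.ID, a.COUNTRY, a."DATE")
     VALUES (src.ID, src.COUNTRY, src."DATE")"""
```

Rows that already match the target are filtered out before the merge's DML phase, which is where the cost savings come from.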
The DELETE subclause also fires only after the update is processed, and only on the updated row. If a row is being updated, it is arriving in your new data, so by definition it is not old, unless your files are delivering stale data. To delete old records, use a completely separate archiving SQL statement rather than folding it into your loading SQL.
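Such a separate archiving statement could be as simple as the following sketch, where "old" is taken to mean "no longer present in the staging data" (that retention rule, and the `TEST_TABLE_STG` staging table, are assumptions):

```python
# Standalone archiving DELETE, run apart from the loading MERGE.
SQL_ARCHIVE_DELETE = """DELETE FROM TEST_TABLE a
 WHERE NOT EXISTS (SELECT 1
                     FROM TEST_TABLE_STG s
                    WHERE s.ID = a.ID)"""
```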