Solved: How to provide UPSERT condition in PySpark - Databricks - 22952

Constantine · 04-13-2022

I have a table "demo_table_one" into which I want to upsert the following values:

data = [
    (11111, "CA", "2020-01-26"),
    (11111, "CA", "2020-02-26"),
    (88888, "CA", "2020-06-10"),
    (88888, "CA", "2020-05-10"),
    (88888, "WA", "2020-07-10"),
    (88888, "WA", "2020-07-15"),
    (55555, "WA", "2020-05-15"),
    (55555, "CA", "2020-03-15"),
]
columns = ["attom_id", "state_code", "sell_date"]
df = spark.createDataFrame(data, columns)

The logic is that for each attom_id & state_code we only want the latest sell_date.

So the data in my table should look like this:

[(11111, "CA", "2020-02-26"), (88888, "CA", "2020-06-10"), (88888, "WA", "2020-07-15"), (55555, "CA", "2020-03-15")]

I have the code below:

from delta.tables import DeltaTable

deltaTable = DeltaTable.forName(spark, "demo_table_one")

# Perform the upsert: match on state_code and attom_id,
# update only when the incoming sell_date is newer.
(deltaTable.alias("orginal_table")
    .merge(
        df.alias("update_table"),
        "orginal_table.state_code = update_table.state_code "
        "and orginal_table.attom_id = update_table.attom_id")
    .whenNotMatchedInsertAll()
    .whenMatchedUpdateAll("orginal_table.sell_date < update_table.sell_date")
    .execute())

But this inserts all the values into the table.

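Hubert_Dudek1 · 04-13-2022

On the first insert there is no data in the target table yet, so nothing matches and .whenNotMatchedInsertAll() is executed for every record. Likewise, when two new records arrive at once (with the same id and state) in a later upsert, both are inserted. What you need is to aggregate the data before inserting: ('attom_id', 'state_code', MAX('sell_date')).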
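To make the point above concrete, here is a minimal plain-Python sketch (no Spark required) that models how the merge treats the batch from the question. The helpers `simulate_merge` and `latest_per_key` are hypothetical names invented for this illustration and are not part of the Delta Lake API; the model assumes matching is decided against the target as it was before the merge, and that `sell_date` is an ISO-formatted string so string comparison follows date order.

```python
# Plain-Python model of the merge behaviour described above.
# simulate_merge and latest_per_key are hypothetical helpers, not Delta Lake APIs.

def simulate_merge(target, source):
    """Model whenMatchedUpdateAll(target.sell_date < source.sell_date)
    combined with whenNotMatchedInsertAll().

    Matching uses the keys present in the target BEFORE the merge, so
    several source rows sharing a brand-new key are all inserted.
    Simplification: real Delta Lake can fail when several source rows
    match and update the same target row; this model does not raise.
    """
    existing = {(r[0], r[1]): i for i, r in enumerate(target)}
    result = list(target)
    for row in source:
        key = (row[0], row[1])
        if key not in existing:
            result.append(row)               # not matched: insert
        elif result[existing[key]][2] < row[2]:
            result[existing[key]] = row      # matched and newer: update
    return result


def latest_per_key(rows):
    """Keep only the row with the latest sell_date per (attom_id, state_code)."""
    latest = {}
    for attom_id, state_code, sell_date in rows:
        key = (attom_id, state_code)
        if key not in latest or latest[key] < sell_date:
            latest[key] = sell_date
    return [(k[0], k[1], d) for k, d in latest.items()]


data = [
    (11111, "CA", "2020-01-26"), (11111, "CA", "2020-02-26"),
    (88888, "CA", "2020-06-10"), (88888, "CA", "2020-05-10"),
    (88888, "WA", "2020-07-10"), (88888, "WA", "2020-07-15"),
    (55555, "WA", "2020-05-15"), (55555, "CA", "2020-03-15"),
]

naive = simulate_merge([], data)                    # empty target: every row is "not matched"
deduped = simulate_merge([], latest_per_key(data))  # aggregate first, then merge

print(len(naive), len(deduped))  # prints: 8 5
```

In PySpark the same pre-aggregation could be written as `df.groupBy("attom_id", "state_code").agg(F.max("sell_date").alias("sell_date"))`, with `F` being `pyspark.sql.functions`, and the result passed to `merge` in place of `df`.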

Constantine · 04-13-2022

I can't do this in PySpark:

deltaTable.as("orginal_table")
    .merge(
        df.as("update_table"),
        "orginal_table.state_code = update_table.state_code and orginal_table.attom_id = update_table.attom_id")
    .whenMatched("orginal_table.sell_date < update_table.sell_date")
    .updateAll()
    .whenNotMatched()
    .insertAll()
    .execute()

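werners1 · 04-14-2022

@John Constantine, according to the docs, whenMatched can have an optional condition.

So I don't immediately see the issue here. Maybe the whenMatched condition is never true for some reason?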

Hubert_Dudek1 · 04-15-2022

Also, @John Constantine, can you share what data is in demo_table_one? We only have df (aliased as update_table) in the example.
