Solved: Re: Need to remove double dagger delimiter from c… - Page 2 - Databricks - 19908

shamly · 11-29-2022

My CSV data looks like this:

‡‡companyId‡‡,‡‡empId‡‡,‡‡regionId‡‡,‡‡companyVersion‡‡,‡‡question‡‡

I tried this code:

dff = spark.read.option("header", "true").option("inferSchema", "true").option("delimiter", "‡").csv("/mnt/data/path/datafile.csv")

But in the result I get a space between every character:

��!!c o m p a n y I d !!,!!e m p I d !!,!! r e g i o n I d ! ! , ! !

Please help.
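A likely cause of the spaced-out characters is an encoding mismatch: a UTF-16 file decoded as UTF-8. The sketch below reproduces the symptom in plain Python, without Spark. The sample line and the little-endian byte order are assumptions for illustration only.

```python
import codecs

# Hypothetical sample line modelled on the header row quoted above.
line = "‡‡companyId‡‡,‡‡empId‡‡"

# Build UTF-16 little-endian bytes with an explicit byte-order mark (BOM).
raw = codecs.BOM_UTF16_LE + line.encode("utf-16-le")

# Wrong codec: the BOM bytes are invalid UTF-8, each '‡' (U+2021) becomes
# the two bytes 0x21 0x20 ('!' and ' '), and each ASCII letter gains a NUL.
wrong = raw.decode("utf-8", errors="replace")

# Right codec: the text round-trips unchanged.
right = raw.decode("utf-16")

print(repr(wrong[:12]))
print(right)
```

If this is what is happening, passing the file's actual encoding to the reader should remove both the stray characters and the apparent spaces.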

shamly · 11-30-2022

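Hi @Uma Maheswara Rao Desula, there is no data mismatch issue. It was my mistake in validating the data; the data is perfect. The problem is that I have 90 columns. Is there any way to reduce the manual work, something like the code below?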

for i in dffs_headers:
    columnLabel = i[0]
    newColumnLabel = columnLabel.replace('‡‡', '').replace('‡‡', '')
    dff = dff.withColumn(newColumnLabel, regexp_replace(columnLabel, '^\\‡‡|\\‡‡$', ''))
    if columnLabel != newColumnLabel:
        dff = dff.drop(columnLabel)
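To see what the pattern in the loop above does, here is a small plain-Python sketch using the `re` module. The column labels are hypothetical, and `strip_edge_daggers` is a helper name invented for this illustration.

```python
import re

# Same idea as the pattern in the loop above: a leading or trailing '‡‡'.
EDGE_DAGGERS = re.compile(r"^‡‡|‡‡$")


def strip_edge_daggers(text: str) -> str:
    """Remove '‡‡' from the start and/or end of text; the middle is untouched."""
    return EDGE_DAGGERS.sub("", text)


# Hypothetical labels, as they might come out of dff.columns.
labels = ["‡‡companyId", "empId", "regionId", "companyVersion‡‡"]
cleaned = [strip_edge_daggers(label) for label in labels]

print(cleaned)
```

With a Spark DataFrame, one option would be `dff.toDF(*cleaned)` to rename every column in a single call. That only fixes the labels, so the cell values in the first and last columns would still need `regexp_replace`.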

UmaMahesh1 · ‎11-30-2022

Hi @shamly pt,

As an automated approach, if you don't know the schema beforehand, it would look something like this…

import pyspark.sql.functions
from pyspark.sql.functions import col, regexp_replace

dff_1 = spark.read.option("header", "false").option("inferSchema", "true").option("sep", ",").csv("/FileStore/tables/Book1.csv").withColumnRenamed("_c0", "col1")
split_col = pyspark.sql.functions.split(dff_1['col1'], ',')

# header field names
header_uncleaned = (spark.read.option("header", "true").option("inferSchema", "true").option("sep", ",").csv("/FileStore/tables/Book1.csv").columns[0]).split(",")
header = []
for i in header_uncleaned:
    header.append(''.join(e for e in i if e.isalnum()))

# loop over the column names and populate the data
df1 = dff_1
for i in range(len(header)):
    print(i)
    df1 = df1.withColumn(header[i], regexp_replace(split_col.getItem(i), r"[^0-9a-zA-Z_\-]+", ""))

df1 = df1.drop("col1").filter(col("companyId") != "companyId")
display(df1)
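The header-cleaning step in the code above can be tried on its own in plain Python. This is only a sketch: the raw header line is hypothetical, and `region_Id` is included to show that `isalnum()` drops underscores as well as daggers.

```python
# Hypothetical raw header line, modelled on the CSV sample in this thread.
raw_header = "‡‡companyId‡‡,‡‡empId‡‡,‡‡region_Id‡‡"

header = []
for field in raw_header.split(","):
    # Keep only letters and digits; '‡' is punctuation, so it is dropped.
    # Note that '_' is not alphanumeric either, so it disappears too.
    header.append("".join(ch for ch in field if ch.isalnum()))

print(header)
```

If the real column names contain underscores or other characters worth keeping, a narrower filter that removes only `‡` may be safer.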

shamly · 12-01-2022

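Hi @Uma Maheswara Rao Desula

I wrote this code because I have many files in many folders at the same location, and everything is UTF-16. It gives me the proper result, as shown below.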

dff = spark.read.option("header", "true") \
    .option("inferSchema", "true") \
    .option("encoding", "UTF-16") \
    .option("multiline", "true") \
    .option("delimiter", "‡‡,‡‡") \
    .csv("/mnt/data/file.csv")

display(dff)

‡‡CompanyId   CompanyName   CountryId‡‡
‡‡1234        abc           cn‡‡
‡‡2345        def           ‡‡
‡‡3457        ghi           sy‡‡
‡‡7564        lmn           uk‡‡
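The leftover daggers in the table above follow from the choice of delimiter: `‡‡,‡‡` consumes only the daggers between fields, so the ones at the very start and end of each line stay attached to the first and last columns. A plain-Python sketch of the same effect, using a hypothetical row:

```python
# Hypothetical row using the same layout as the file in this thread.
row = "‡‡1234‡‡,‡‡abc‡‡,‡‡cn‡‡"

# Splitting on the full '‡‡,‡‡' token removes only the inner daggers.
split_fields = row.split("‡‡,‡‡")

# Trim the leftovers from the first and last fields.
clean_fields = list(split_fields)
if clean_fields[0].startswith("‡‡"):
    clean_fields[0] = clean_fields[0][2:]
if clean_fields[-1].endswith("‡‡"):
    clean_fields[-1] = clean_fields[-1][:-2]

print(split_fields)
print(clean_fields)
```

This suggests that only the first and last columns need cleaning, in both their labels and their values.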

Now I want to remove the leading and trailing double daggers. I wrote the code below, which gives me the error "IndentationError: expected an indented block".

from pyspark.sql.functions import regexp_replace

dffs_headers = dff.dtypes

for i in dffs_headers:
    columnLabel = i[0]
    newColumnLabel = columnLabel.replace('‡‡', '').replace('‡‡', '')
    dff = dff.withColumn(newColumnLabel, regexp_replace(columnLabel, '^\\‡‡|\\‡‡$', ''))
    if columnLabel != newColumnLabel:
    dff = dff.drop(columnLabel)

display(dff)

Error: "IndentationError: expected an indented block"

UmaMahesh1 · ‎12-01-2022

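Hi @shamly pt

This is an indentation error. Check whether the line after the if is properly indented.

I guess your error is at

if columnLabel != newColumnLabel:
dff = dff.drop(columnLabel)

Just hit Tab or add indentation spaces before dff = dff.drop(columnLabel).

Cheers…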
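For anyone who wants to reproduce the message without Spark, here is a small sketch that compiles an unindented and an indented version of the same `if` block. The variable values and the `dropped` flag are placeholders, not part of the original code.

```python
# The 'if' body below is deliberately left unindented to reproduce the error.
broken = (
    "columnLabel, newColumnLabel = 'a', 'b'\n"
    "if columnLabel != newColumnLabel:\n"
    "dropped = True\n"
)

# Same statements with the body indented by four spaces.
fixed = (
    "columnLabel, newColumnLabel = 'a', 'b'\n"
    "if columnLabel != newColumnLabel:\n"
    "    dropped = True\n"
)

try:
    compile(broken, "<broken>", "exec")
    error = None
except IndentationError as exc:
    error = exc.msg  # wording varies slightly between Python versions

namespace = {}
exec(compile(fixed, "<fixed>", "exec"), namespace)

print(error)
print(namespace["dropped"])
```

Python raises this error at compile time, so none of the cell runs until the indentation is fixed.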

shamly · 12-01-2022

Thanks, it worked.

Need to remove double dagger delimiter from csv using Databricks