Using DBR 10.0
When calling toPandas(), the job fails with an IndexOutOfBoundsException. It looks like ArrowWriter.sizeInBytes (which appears to be a proprietary method, since I can't find it in OSS) calls Arrow's getBufferSizeFor, which fails with the error. What is the root cause of this issue?
Here is a sample of the full stack trace:
java.lang.IndexOutOfBoundsException: index: 16384, length: 4 (expected: range(0, 16384))
	at org.apache.arrow.memory.ArrowBuf.checkIndexD(ArrowBuf.java:318)
	at org.apache.arrow.memory.ArrowBuf.chk(ArrowBuf.java:305)
	at org.apache.arrow.memory.ArrowBuf.getInt(ArrowBuf.java:424)
	at org.apache.arrow.vector.complex.BaseRepeatedValueVector.getBufferSizeFor(BaseRepeatedValueVector.java:229)
	at org.apache.arrow.vector.complex.ListVector.getBufferSizeFor(ListVector.java:621)
	at org.apache.spark.sql.execution.arrow.ArrowFieldWriter.getSizeInBytes(ArrowWriter.scala:165)
	at org.apache.spark.sql.execution.arrow.ArrowWriter.sizeInBytes(ArrowWriter.scala:118)
	at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$1.$anonfun$next$1(ArrowConverters.scala:224)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1647)
	at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$1.next(ArrowConverters.scala:235)
	at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$1.next(ArrowConverters.scala:199)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)
Use toPandas() only with a small dataset.

Please use:

to_pandas_on_spark()

You have to use pandas on Spark instead of plain pandas, so that it works in a distributed way. There is more information here: https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/index.html

So always import pandas like this:

import pyspark.pandas as ps