Spark DataFrame cheat sheet

If you are working with Spark in any language, such as PySpark, Scala, SparkR, or SQL, you will need to get your hands dirty with Hive. In this tutorial I will show you:

·       How to read a Hive table in Spark
·       Convert a Spark SQL data frame to a pandas data frame
·       Convert a pandas data frame to a Spark SQL data frame
·       Write a pandas data frame to a Hive table


Using the SparkSession object, we can read data or tables from a Hive database.
To read a particular Hive table you need to know which database it belongs to, so let's first print the Hive databases.
# Print first 5 Hive databases
spark.sql('show databases').show(5)
+------------+
|databaseName|
+------------+
|  Database_1|
|  Database_2|
|  Database_3|
|  Database_4|
|  Database_5|
+------------+
only showing top 5 rows

Now let's say the table you want to read is inside Database_1,
so let's print the tables available in that database.
# Print first 5 tables from Database_1
spark.sql('show tables from Database_1').show(5)
+----------+---------+-----------+
|  database|tableName|isTemporary|
+----------+---------+-----------+
|Database_1|  table_1|      false|
|Database_1|  table_2|      false|
|Database_1|  table_3|      false|
|Database_1|  table_4|      false|
|Database_1|  table_5|      false|
+----------+---------+-----------+

only showing top 5 rows

Now that you can see all the table names, let's read table_1 from Database_1.
# Read and print hive table
spark.sql("select * from Database_1.table_1").show()
+----+---+
|name|val|
+----+---+
|   a|  1|
|   b|  2|
|   c|  3|
|   d|  4|
|   e|  5|
+----+---+

This gives us a Spark SQL data frame. But what if we want to use it with Python libraries? Put simply, to work with this Spark SQL data frame in Python you need to convert it to a pandas data frame.

Convert Spark SQL data frame to python pandas data frame


# Read Hive table and convert to pandas dataframe
df = spark.sql("select * from Database_1.table_1").toPandas()
df.head()
Note: since we have converted the Spark SQL data frame to a pandas data frame, the .show() method will no longer work; use pandas methods such as .head() instead.
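As a quick sketch of the pandas side (using a local stand-in for the name/val table shown above, since we cannot assume a live Hive connection here), the usual inspection calls look like this:

```python
import pandas as pd

# Stand-in for the result of spark.sql(...).toPandas() — in the tutorial
# this data frame comes from the Hive table; here we build it locally
df = pd.DataFrame({"name": ["a", "b", "c", "d", "e"],
                   "val": [1, 2, 3, 4, 5]})

df.head()        # pandas replacement for .show()
len(df)          # number of rows, like .count() in Spark — 5
df["val"].sum()  # ordinary pandas operations now work — 15
```

Once the data is in pandas, the whole pandas API is available, which is exactly why the conversion is useful.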

Convert pandas data frame to Spark SQL data frame

# Import required library
import pandas as pd

# Create a test pandas dataframe
a = [1,2,3,4,5]
b = ['a','b','c','d','e']
df = pd.DataFrame()
df['name'] = b
df['val'] = a

# Convert pandas dataframe to spark SQL dataframe
spark_df = spark.createDataFrame(df)

Write python pandas data frame to Hive table

Let's say you have a pandas data frame after doing all sorts of analysis in Python, and now you want to insert it into a Hive table (either an existing table or a new one).
You cannot insert a pandas data frame into a Hive table directly. First you need to convert the pandas data frame into a Spark SQL data frame, and only then can you write that converted data frame to the Hive table.
# Import required library
import pandas as pd

# Create a test pandas dataframe
a = [1,2,3,4,5]
b = ['a','b','c','d','e']
df = pd.DataFrame()
df['name'] = b
df['val'] = a

# Convert pandas dataframe to spark SQL dataframe
spark_df = spark.createDataFrame(df)
# Create Hive table and write Spark SQL dataframe into it
spark_df.write.mode("overwrite").saveAsTable("Database_1.testTable")

Conclusion

In this tutorial I have shared my experience working with Spark using Python and PySpark. One point I want to highlight is that you can also write and execute plain Python code in the PySpark shell (the very first time, I did not even think of that).

So in this tutorial you learned:

  • How to read a Hive table in Spark
  • Convert a Spark SQL data frame to a pandas data frame
  • Convert a pandas data frame to a Spark SQL data frame
  • Write a pandas data frame to a Hive table

If you have any questions or suggestions regarding this topic, see you in the comments section. I will try my best to answer.
