Learning dataframes can advance your career in data analytics and data management. This tutorial goes through the key DataFrame concepts in detail: you will learn to create dataframes in multiple ways, add columns to them, and much more.
We use dataframes to store and organize data in a tabular format. Like SQL tables, dataframes hold data in rows and columns. A dataframe is essentially a spreadsheet with named columns. The important difference is that an Excel file lives on a single computer, whereas a dataframe (for example, in Spark) can be distributed across many computers.
Many programming languages and frameworks have adopted the DataFrame concept. In pandas, a popular Python data analysis library, the DataFrame is the primary data type. Spark also uses DataFrames. Both pandas and Spark are widely used in big data and data science. According to ZipRecruiter, big data developers can earn over 120k USD per year on average, so learning dataframes is a sound investment. This DataFrame tutorial will equip you to work with PySpark and pandas DataFrames.
Essentially, a DataFrame is a data structure that stores data in two dimensions. It can be thought of as an SQL table: its schema defines the name and data type of every column. DataFrames are most commonly used in pandas and PySpark.
We use pandas to manage datasets. Pandas is a library that consists of many functions. We can use the functions to analyze and manipulate data effectively.
A pandas DataFrame is a two-dimensional, size-mutable tabular structure with labeled rows and columns. The pandas DataFrame constructor accepts parameters including data, index, columns, and copy.
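As a minimal sketch of the "two-dimensional, size-mutable" idea, the snippet below builds a small DataFrame and then grows it by one column. The row labels, column names, and values are made up for illustration.

```python
import pandas as pd

# Illustrative data; the labels and values are assumptions, not a real dataset.
df = pd.DataFrame(
    data=[[1, "a"], [2, "b"], [3, "c"]],
    index=["r1", "r2", "r3"],
    columns=["num", "letter"],
)
print(df.shape)  # (number of rows, number of columns)

# Size-mutable: adding a column changes the shape of the same object.
df["num_squared"] = df["num"] ** 2
print(df.shape)
```

The data, index, and columns arguments used here are the constructor parameters mentioned above.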
Following are the features of DataFrame:
Installing Pandas
Pandas is a simple package to install. Open the command line (on a PC) or the terminal (on a Mac) and install it with either of the commands below:
conda install pandas
OR
pip install pandas
Alternatively, in the Jupyter Notebook, we can run the following cell:
!pip install pandas
The leading “!” tells the notebook to run the rest of the line as a shell command, as if it had been typed in the terminal.
Because pandas is used so often, we generally import it under a shorter alias:
import pandas as pd1
Core Components of Pandas: Series and DataFrames
Following are the core components of pandas
1) Series
2) DataFrames
A Series is a single column, and a DataFrame is a two-dimensional table made up of a collection of Series.
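The relationship between the two components can be sketched as follows. The names and ages are illustrative values only.

```python
import pandas as pd

# A Series is a single labelled column of values.
names = pd.Series(["Tom", "Jack", "Steve"], name="Name")
ages = pd.Series([33, 25, 41], name="Age")

# Placing Series side by side gives a DataFrame.
people = pd.concat([names, ages], axis=1)

# Selecting one column from the DataFrame returns a Series again.
age_column = people["Age"]
print(type(people).__name__, type(age_column).__name__)
```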
Create the Dataframe from the list
We can create a pandas DataFrame using the constructor below:
pandas.DataFrame(data, index, columns, dtype, copy)
Parameters
How to Create Pandas DataFrame
Create an Empty DataFrame
The most basic DataFrame we can create is an empty one.
import pandas as pd1
# Call the constructor with no arguments to get an empty DataFrame
df = pd1.DataFrame()
print(df)
Output
Empty DataFrame
Columns: []
Index: []
Create a DataFrame from List
We can create a DataFrame from a single list.
import pandas as pd1
data = [6, 8, 9, 10, 11]
df = pd1.DataFrame(data)
print(df)
Create the DataFrame from the Dict of ndarrays and List
Every ndarray must have the same length. If an index is passed, its length must equal the length of the arrays. If no index is passed, the index defaults to range(n), where “n” is the array length.
import pandas as pd1
data = {'Name': ['Tom', 'Jack', 'Steve', 'Ricky'], 'Age': [33, 25, 41, 35]}
df1 = pd1.DataFrame(data)
print(df1)
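The index rule described above can be sketched with NumPy arrays. The index labels rank1 to rank4 are hypothetical and chosen only for this example.

```python
import numpy as np
import pandas as pd

data = {
    "Name": np.array(["Tom", "Jack", "Steve", "Ricky"]),
    "Age": np.array([33, 25, 41, 35]),
}

# Without an index, pandas falls back to range(n), here 0..3.
default_df = pd.DataFrame(data)

# With an index, its length must equal the array length (4 here).
labelled_df = pd.DataFrame(data, index=["rank1", "rank2", "rank3", "rank4"])
print(labelled_df)
```

Passing an index of a different length than the arrays raises a ValueError.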
Create the DataFrame from the Dict of Series
We can pass a dictionary of Series to form a DataFrame. The resulting index is the union of the indexes of all the Series passed.
import pandas as pd1
# Each Series needs as many index labels as values
d = {'one': pd1.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two': pd1.Series([4, 5, 6, 7], index=['d', 'e', 'f', 'g'])}
df1 = pd1.DataFrame(d)
print(df1)
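To see what the index union means in practice, here is a small sketch with two partly overlapping indexes. The values are made up; the point is that a label missing from one Series shows up as NaN in that column.

```python
import pandas as pd

d = {
    "one": pd.Series([1, 2, 3], index=["a", "b", "c"]),
    "two": pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"]),
}
df = pd.DataFrame(d)

# Label "d" exists only in "two", so column "one" holds NaN in that row.
print(df)
```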
Datatypes
The type of a data value is called its data type. In a pandas DataFrame, data types are called dtypes. They matter because they determine how much memory the DataFrame uses, as well as the speed and accuracy of its calculations. Pandas relies largely on the NumPy data types, but pandas 1.0 introduced some additional data types, such as a dedicated string type and a nullable boolean type.
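A short sketch of inspecting dtypes and their effect on memory follows. The column names and values are illustrative, and the "string" and "boolean" dtype names are the extension types that pandas 1.0 added.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Tom", "Jack"], "age": [33, 25], "score": [7.5, 8.0]})
print(df.dtypes)  # one dtype per column

# A smaller integer dtype uses less memory for the same values.
regular = df["age"].memory_usage(index=False)
small = df["age"].astype("int8").memory_usage(index=False)
print(regular, small)

# Extension dtypes: missing values are kept as <NA>.
labels = pd.Series(["a", None], dtype="string")
flags = pd.Series([True, None], dtype="boolean")
print(labels.dtype, flags.dtype)
```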
Handling Rows and Columns
A Dataframe is nothing but a two-dimensional data structure. In Dataframes, Data is arranged in a tabular form as columns and rows. It allows us to make fundamental operations on the data, like adding, renaming, and deleting.
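The three basic operations mentioned here (adding, renaming, and deleting) can be sketched on columns as follows. The data is a made-up two-row sample.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Jai", "Princi"], "Age": [27, 24]})

# Add a column.
df["City"] = ["Delhi", "Kanpur"]

# Rename a column (rename returns a new DataFrame).
df = df.rename(columns={"City": "Address"})

# Delete a column.
df = df.drop(columns=["Age"])
print(df.columns.tolist())
```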
Column Selection: To select columns from a pandas DataFrame, refer to them by their names.
import pandas as pd
data = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Age': [27, 24, 22, 32],
'Address': ['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
'Qualification': ['MSc', 'MA', 'MCA', 'PhD']}
df = pd.DataFrame(data)
print(df[ [ 'Name', 'Qualification']])
Rows Selection: Pandas offers indexers for retrieving rows from a DataFrame. We can use DataFrame.loc[] to fetch rows by label, and DataFrame.iloc[] to fetch rows by integer position.
import pandas as pd1
# Use the "Name" column as the row index so rows can be looked up by player name
data = pd1.read_csv("nba.csv", index_col="Name")
first = data.loc["Avery Bradley"]
second = data.loc["R.J. Hunter"]
print(first, "\n\n\n", second)
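For readers without the CSV file at hand, the same idea can be tried on a small in-memory frame. The sketch below is an assumption-laden stand-in: the two rows and their values are illustrative, not read from nba.csv. It also contrasts label-based loc[] with position-based iloc[].

```python
import pandas as pd

# Stand-in for the CSV; the rows and values are illustrative only.
data = pd.DataFrame(
    {"Team": ["Boston Celtics", "Boston Celtics"], "Number": [0, 28]},
    index=pd.Index(["Avery Bradley", "R.J. Hunter"], name="Name"),
)

by_label = data.loc["R.J. Hunter"]  # lookup by index label
by_position = data.iloc[1]          # lookup by integer position (second row)
print(by_label.equals(by_position))
```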
1) Append()
This function adds the rows of another DataFrame to the end of the given DataFrame. Note that DataFrame.append() was removed in pandas 2.0; pandas.concat() is the replacement.
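A minimal sketch of adding the rows of one DataFrame to the end of another with pandas.concat(), which works on both older and current pandas releases. The two small frames are made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"Name": ["Tom", "Jack"], "Age": [33, 25]})
df2 = pd.DataFrame({"Name": ["Steve"], "Age": [41]})

# ignore_index=True renumbers the rows 0..n-1 in the result.
combined = pd.concat([df1, df2], ignore_index=True)
print(combined)
```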
2) Apply()
This function takes another function as its argument and applies it to every value of a pandas Series.
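As a small sketch, the snippet below passes a function to apply() on a Series. The helper age_group and its cut-off of 25 are hypothetical and exist only for this example.

```python
import pandas as pd

ages = pd.Series([27, 24, 22, 32], name="Age")

def age_group(age):
    """Hypothetical helper: label an age relative to a cut-off of 25."""
    return "under 25" if age < 25 else "25 and over"

# apply() calls age_group once per value and collects the results.
groups = ages.apply(age_group)
print(groups.tolist())
```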
3) Aggregate()
The main task of this function is to apply one or more aggregations across multiple columns. Commonly used aggregations include sum, min, max, mean, and count.
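A sketch of aggregating several columns at once follows. The Age and Score columns are invented sample data.

```python
import pandas as pd

df = pd.DataFrame({"Age": [27, 24, 22, 32], "Score": [80, 90, 70, 60]})

# One list of aggregations applied to every column.
summary = df.aggregate(["min", "max", "mean"])

# Or a different aggregation per column, given as a dict.
per_column = df.aggregate({"Age": "max", "Score": "sum"})
print(summary)
print(per_column)
```

The list form returns a DataFrame with one row per aggregation, while the dict form returns a Series with one entry per column.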
4) Pandas DataFrame.assign()
The assign() function adds new columns to a DataFrame and returns the result as a new DataFrame. If we re-assign an existing column, its values are overwritten.
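Both behaviours, adding a column and overwriting an existing one, can be sketched like this. The column name AgeNextYear is made up for the example.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Tom", "Jack"], "Age": [33, 25]})

# assign() returns a new DataFrame; df itself is left unchanged.
with_new = df.assign(AgeNextYear=lambda d: d["Age"] + 1)

# Re-assigning an existing column overwrites its values in the result.
overwritten = df.assign(Age=[34, 26])
print(with_new)
print(overwritten)
```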
5) Pandas DataFrame.astype()
The astype() function casts a pandas object to a specified dtype. It can convert a chosen column to a given type, which is useful when we have to cast a specific column from one data type to another. We can also pass a Python dictionary to change the types of multiple columns at once.
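A minimal sketch of the dictionary form, assuming a made-up frame in which the id column was read in as text:

```python
import pandas as pd

df = pd.DataFrame({"id": ["1", "2", "3"], "price": [10, 20, 30]})

# The dict maps column names to target dtypes, converting both in one call.
converted = df.astype({"id": "int64", "price": "float64"})
print(converted.dtypes)
```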
How to delete the Indices, Columns, or Rows from the Pandas Dataframe
Deleting the Index from the Dataframe
If we want to remove the index from a DataFrame, we should reconsider, because Series and DataFrames always have an index.
import numpy as np
import pandas as pd1
df = pd1.DataFrame(data=np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [30, 40, 50], [26, 38, 45]]),
                   index=[3.5, 11.6, 4.8, 2.5, 4.8],
                   columns=[50, 51, 52])
# Drop rows with a duplicated index label, keeping the last occurrence
df.reset_index().drop_duplicates(subset='index', keep='last').set_index('index')
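Since an index cannot be removed outright, the closest operation is to replace it with the default integer index. A minimal sketch, with illustrative values:

```python
import pandas as pd

df = pd.DataFrame({"value": [10, 20, 30]}, index=[3.5, 11.6, 4.8])

# drop=True discards the old labels instead of keeping them as a column.
reset = df.reset_index(drop=True)
print(reset)
```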
Apache Spark is written in the Scala programming language. PySpark was released to support the collaboration of Apache Spark and Python; it is the Python API for Spark. Moreover, PySpark lets you work with RDDs (Resilient Distributed Datasets) in Apache Spark from the Python programming language.
A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in Python or R, but with richer optimizations. We can construct DataFrames from a wide range of sources, such as external databases, Hive tables, existing RDDs, and structured data files.
The easiest way to create a DataFrame is from a Python list of data. We can also create a DataFrame from an RDD or by reading files from various sources.
Using createDataFrame()
By using the createDataFrame() function of SparkSession, we can create the DataFrame.
# Empty strings stand in for missing middle or last names
data = [('James', '', 'Smith', '1990-01-04', 'M', 4000),
        ('Michael', 'Rose', '', '2001-07-12', 'M', 5000),
        ('Ravi', '', 'Williams', '2001-06-19', 'M', 6000),
        ('Maria', 'Anne', 'Jones', '1998-01-12', 'F', 7000)]
columns = ["firstname", "middlename", "lastname", "dob", "gender", "salary"]
# Assumes an active SparkSession named spark
df1 = spark.createDataFrame(data=data, schema=columns)
Create DataFrame from RDD
Another common way to create a PySpark DataFrame is from an RDD. Let us create a Spark RDD from a Python list by calling the parallelize() function of the SparkContext.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
rdd = spark.sparkContext.parallelize(data)
Create the DataFrame
import pandas as pd
data = [['Bob', 28], ['Alice', 29], ['Alexa', 45]]
df = pd.DataFrame(data, columns=['Name', 'Age'])
df
Create the DataFrame with the Column
c = pd.DataFrame([4, 5, 6], columns=['Example1'])
Add the newly created column to our DataFrame.
df['Example1'] = c['Example1']
df
In PySpark, we can use the “select()” function to select a single column, multiple columns, columns by index, all the columns from a list, or nested columns from a DataFrame. PySpark select() is a transformation that returns a new DataFrame with the chosen columns. Let us select the first five rows of “User_ID” and “Age” from “train”.
train.select('User_ID', 'Age').show(5)
How to find the number of distinct products in the train and test files?
We can use the “distinct” operation to find the distinct rows in a DataFrame. Let us use it to count the distinct products in the train and test files.
train.select('Product_ID').distinct().count(), test.select('Product_ID').distinct().count()
Output:
(3671,3241)
We have 3671 and 3241 distinct products in train and test files, respectively.
We can create new columns in a PySpark DataFrame in several ways. The following are some of them:
1) Using the Spark Native Functions
The most common way to create new columns in a PySpark DataFrame is with the built-in functions. This is also the most efficient programmatic method. We can use “.withColumn” together with the PySpark SQL functions to create new columns. Date functions, math functions, and string functions are already implemented among the Spark functions. The first function, “F.col”, provides access to a column. Thus, if we have to add 100 to a column, we can use “F.col” as follows:
import pyspark.sql.functions as F
# The input column is named "confirmed" in the cases schema
casesWithNewAffirmed = cases.withColumn("NewAffirmed", 100 + F.col("confirmed"))
casesWithNewAffirmed.show()
We can also use math functions like the “F.exp” function:
casesWithExpAffirmed = cases.withColumn("ExpAffirmed", F.exp("confirmed"))
casesWithExpAffirmed.show()
2) Using the Spark UDFs
Sometimes we have to do something more complicated with one or more columns. We can think of this as a map operation over a single column or multiple columns of a PySpark DataFrame. Spark SQL functions cover many use cases for creating columns, but we can turn to a Spark UDF whenever we need the more advanced functionality of Python. To use a Spark UDF, we convert a regular Python function with F.udf and declare the return type of the function. In the following example, the return type is StringType().
import pyspark.sql.functions as F
from pyspark.sql.types import StringType
def casesHighLow(affirmed):
    # Label a count below 50 as 'low', otherwise 'high'
    if affirmed < 50:
        return 'low'
    else:
        return 'high'
casesHighLowUDF = F.udf(casesHighLow, StringType())
casesWithHighLow = cases.withColumn("HighLow", casesHighLowUDF("confirmed"))
casesWithHighLow.show()
3) Using the RDDs
Sometimes neither the SQL functions nor Spark UDFs are sufficient for a specific use case. In some use cases, RDDs perform better than both. We may want the finer control over partitioning that Spark RDDs provide, or we may want to use the group functions of Spark RDDs. In either case, using RDDs to create new columns is helpful for people who have a deep understanding of RDDs, the fundamental building block of the Spark environment.
This process relies on converting between Row objects and Python dicts. We convert the Row object into a dictionary, work with the dictionary as we are used to, and then convert the dictionary back into a Row. This approach can be useful in a lot of use cases.
import math
from pyspark.sql import Row
def rowwise_function(row):
    # Convert the Row to a dict, add the new field, and convert back to a Row
    row_dict = row.asDict()
    row_dict['expaffirmed'] = float(math.exp(row_dict['confirmed']))
    newrow = Row(**row_dict)
    return newrow
cases_rdd = cases.rdd
cases_rdd_new = cases_rdd.map(lambda row: rowwise_function(row))
# spark is the active SparkSession
casesNewDf = spark.createDataFrame(cases_rdd_new)
casesNewDf.show()
4) Using the Pandas UDF
This functionality was introduced in Spark 2.3. It lets you use pandas functionality with Spark. Generally, we use it when we need to run a groupBy operation on a Spark DataFrame, or when we have to create rolling features and want to use pandas rolling or window functions instead of the Spark versions.
We use the “F.pandas_udf” decorator. Here, the input to the function is a pandas DataFrame, and the function must return a pandas DataFrame in turn.
cases.printSchema()
root
 |-- case_id: integer (nullable = true)
 |-- province: string (nullable = true)
 |-- city: string (nullable = true)
 |-- group: boolean (nullable = true)
 |-- infection_case: string (nullable = true)
 |-- confirmed: integer (nullable = true)
 |-- latitude: string (nullable = true)
 |-- longitude: string (nullable = true)
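The Spark wiring is not shown here, but the function that such a UDF wraps is ordinary pandas code. The sketch below runs in plain pandas without Spark and shows the kind of DataFrame-in, DataFrame-out function described above. The helper name subtract_group_mean, the output column confirmed_centred, and the three sample rows are all assumptions made for illustration.

```python
import pandas as pd

def subtract_group_mean(pdf: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical per-group function: centre 'confirmed' on the group mean."""
    out = pdf.copy()
    out["confirmed_centred"] = out["confirmed"] - out["confirmed"].mean()
    return out

# Made-up rows standing in for the cases data.
cases_pdf = pd.DataFrame(
    {"province": ["Seoul", "Seoul", "Busan"], "confirmed": [10, 30, 5]}
)

# Apply the function to each province separately, as Spark would do per group.
parts = [subtract_group_mean(group) for _, group in cases_pdf.groupby("province")]
result = pd.concat(parts).sort_index()
print(result)
```

In Spark 3.x, a function of this shape is typically handed to groupBy(...).applyInPandas(...) together with an output schema, which is why the schema of the input DataFrame is worth printing first.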
PySpark is widely used in the machine learning and data science community, since many popular data science libraries, such as TensorFlow and NumPy, are written in Python. PySpark is also used to process massive datasets efficiently.
PySpark is the Spark library for Python, used to run Python applications with Apache Spark capabilities. With PySpark, we can run applications in parallel on a distributed cluster or even on a single node. Apache Spark is an analytical processing engine for large-scale, robust, distributed data processing and machine learning applications.
Spark is written in Scala, and because of its industry adoption, its Python API, PySpark, was released using Py4J. Py4J is a Java library integrated with PySpark that enables Python to interact dynamically with JVM objects; thus, to run PySpark, we also need Java installed along with Apache Spark and Python.
1. What is a Dataframe?
It is a data structure that arranges data into a two-dimensional table of rows and columns, so you can easily manipulate the data stored in it. The schema is the blueprint of the data: it defines the name and type of every column in the table.
2. Name the key components of Dataframes.
Below are the key components of Dataframes.
3. Can you mention some key features of the Dataframe?
4. How do you create a Pandas Dataframe?
We can build Dataframes using the following
5. How do you build a Dataframe?
We can build a DataFrame using the pandas DataFrame() constructor.
6. Can you change the size and values of the Dataframes?
Yes. The size and values of a Dataframe are mutable.
In short, a DataFrame is a data structure widely used in data engineering that simplifies working with data by organizing it into two-dimensional tables. This tutorial has covered the pandas DataFrame in depth. If you want to dig deeper into dataframes, you can reach out to MindMajix and take any of its data science-related courses. You will gain a certification that can help you take the next step in your career.
Prasanthi is an expert writer in MongoDB, and has written for various reputable online and print publications. At present, she is working for MindMajix, and writes content not only on MongoDB, but also on Sharepoint, Uipath, and AWS.