DataFrame Tutorial

You can learn DataFrames to advance your career in data analytics and data management. In this tutorial, you will go through the key DataFrame concepts in detail. You will learn to create DataFrames in multiple ways, add columns to them, and much more.

We use DataFrames to store and organize data in a tabular format. As with SQL tables, the data is arranged in rows and named columns, much like a spreadsheet with column headers. An important difference is that an Excel file lives on a single computer, whereas some DataFrame implementations, such as Spark DataFrames, can be distributed across many computers. A pandas DataFrame, by contrast, is held in the memory of one machine.

Many programming languages and frameworks have adopted the DataFrame concept. In pandas, a popular Python data analysis library, the DataFrame is the primary data structure. Spark programming uses DataFrames as well. Both pandas and Spark are widely used in big data and data science. According to ZipRecruiter, big data developers can earn over 120k USD per year on average, so learning DataFrames is a worthwhile investment. This tutorial will equip you to work with both PySpark and pandas DataFrames.


What is a DataFrame?

Essentially, a DataFrame is a data structure that stores data in two dimensions, as rows and columns. It can be thought of as an SQL table: its schema defines the name and data type of every column. DataFrames are most commonly used in pandas and PySpark.

Pandas DataFrame

We use pandas to manage datasets. Pandas is a library that provides many functions for analyzing and manipulating data effectively.

A pandas DataFrame is a two-dimensional, size-mutable tabular structure with labeled rows and columns. Its constructor takes parameters such as data, index, columns, dtype, and copy.


DataFrame Features

Following are the features of DataFrame:

  1. DataFrames support named columns and rows.
  2. They support heterogeneous data: each column can hold a different type.
  3. DataFrames can carry out arithmetic operations on columns and rows.
  4. Both axes (rows and columns) are labeled.
  5. They read flat files such as CSV and JSON, Excel files, and SQL tables.
  6. They handle missing data.


Installing Pandas

Pandas is a simple package to install. Open the command line (for PC users) or your terminal program (for Mac users) and install it with either of the commands below:

conda install pandas
OR
pip install pandas

Alternatively, in the Jupyter Notebook, we can run the following cell:

!pip install pandas

The leading “!” tells Jupyter to run the rest of the line as a shell command rather than as Python code.

Because pandas is used so often, we generally import it under a shorter alias:

import pandas as pd

Core Components of Pandas: Series and DataFrames

The core components of pandas are:

1) Series

2) DataFrames

A Series is a single labeled column, and a DataFrame is a two-dimensional table made up of a collection of Series that share an index.
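The relationship between the two components can be sketched as follows. This is a minimal illustration with made-up names and ages, not data from any real source: two Series are combined into a DataFrame, and selecting one column gives a Series back.

```python
import pandas as pd

# Two Series that share the same default index (0, 1, 2).
names = pd.Series(["Tom", "Jack", "Steve"], name="Name")
ages = pd.Series([33, 25, 41], name="Age")

# Combining them column-wise gives a DataFrame.
people = pd.concat([names, ages], axis=1)

# Selecting a single column gives back a Series.
age_column = people["Age"]

print(type(people).__name__)      # DataFrame
print(type(age_column).__name__)  # Series
```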

Create the Dataframe from the list

We can create a pandas DataFrame using the constructor below:

pandas.DataFrame(data, index, columns, dtype, copy)

Parameters

  1. Index: the row labels of the resulting frame. It is optional and defaults to np.arange(n) if no index is passed.
  2. Columns: the column labels. This is optional too and defaults to np.arange(n) when no column labels are passed.
  3. Data: the data takes several forms, such as Series, ndarray, lists, maps, constants, dict, or another DataFrame.
  4. Copy: whether to copy the data from the input. It matters mostly when the input is an ndarray or another DataFrame.
  5. Dtype: the data type to force on the columns. Only a single dtype is allowed; if omitted, pandas infers a type for each column.
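As a rough sketch of how these parameters fit together, the example below passes data, index, columns, and dtype explicitly, and then builds a second frame without labels to show the defaults. The values and label names are invented for illustration.

```python
import pandas as pd

scores = pd.DataFrame(
    data=[[1, 2], [3, 4], [5, 6]],  # a list of rows
    index=["r1", "r2", "r3"],       # row labels
    columns=["x", "y"],             # column labels
    dtype="float64",                # force one dtype on every column
)
print(scores)

# Without index/columns, pandas falls back to integer labels 0..n-1.
default_labels = pd.DataFrame([[1, 2], [3, 4]])
print(list(default_labels.index), list(default_labels.columns))
```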

How to Create Pandas DataFrame

Create an Empty DataFrame

The most basic DataFrame we can create is an empty one.

import pandas as pd
# Calling the constructor with no arguments gives an empty frame
df = pd.DataFrame()
print(df)
Output
Empty DataFrame
Columns: []
Index: []

Create a DataFrame from List

We can create a DataFrame from a single list. Each list element becomes one row of a single column labeled 0.

import pandas as pd
data = [6, 8, 9, 10, 11]
df = pd.DataFrame(data)
print(df)

Create the DataFrame from the Dict of ndarrays and List

Every ndarray or list in the dict must have the same length. If an index is passed, its length must equal the length of the arrays. If no index is passed, the index defaults to range(n), where n is the array length.

import pandas as pd
# Each dict key becomes a column name
data = {'Name': ['Tom', 'Jack', 'Steve', 'Ricky'], 'Age': [33, 25, 41, 35]}
df1 = pd.DataFrame(data)
print(df1)

Create the DataFrame from the Dict of Series

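We can pass a dictionary of Series to form a DataFrame. The resulting index is the union of all the Series indexes that are passed; positions with no value are filled with NaN.

import pandas as pd
# Each Series needs one index label per value; the label 'g' is added here
# so that the four values of 'two' each have a label
d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two': pd.Series([4, 5, 6, 7], index=['d', 'e', 'f', 'g'])}
df1 = pd.DataFrame(d)
print(df1)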

Datatypes

The type of the data values is called the data type. In a pandas DataFrame, data types are called dtypes. They matter because they determine how much memory the DataFrame uses, how fast calculations run, and how accurate the results are. Pandas relies largely on the NumPy data types, but pandas 1.0 introduced some additional ones:

  • BooleanArray and BooleanDtype support missing Boolean values and Kleene three-valued logic.
  • StringArray and StringDtype provide a dedicated string type.
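To make these dtypes concrete, here is a small sketch that builds a frame with a dedicated string column and a nullable Boolean column. The column names and values are assumptions chosen for illustration; the last lines show Kleene logic, where a missing value combined with True through "or" is still True.

```python
import pandas as pd

flags = pd.DataFrame({
    "name": pd.array(["a", "b", None], dtype="string"),   # dedicated string type
    "ok": pd.array([True, None, False], dtype="boolean"), # nullable Boolean type
    "n": [1, 2, 3],                                       # ordinary NumPy integers
})
print(flags.dtypes)

# Kleene logic: (missing OR True) is True, so no missing value remains.
ok_or_true = flags["ok"] | True
print(ok_or_true.tolist())

# (missing AND True) stays missing.
ok_and_true = flags["ok"] & True
print(ok_and_true.isna().sum())
```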

Handling Rows and Columns

A DataFrame is a two-dimensional data structure: data is arranged in tabular form as rows and columns. We can perform basic operations on those rows and columns, such as selecting, adding, renaming, and deleting.

Column Selection: To select columns in a pandas DataFrame, we access them by their names.

import pandas as pd
data = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Age': [27, 24, 22, 32],
'Address': ['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
'Qualification': ['MSc', 'MA', 'MCA', 'PhD']}
df = pd.DataFrame(data)
print(df[['Name', 'Qualification']])

Row Selection: Pandas offers several ways to retrieve rows from a DataFrame. DataFrame.loc[] fetches rows by index label, while DataFrame.iloc[] selects rows by integer position.

import pandas as pd
# Use the Name column as the index so rows can be looked up by player name
data = pd.read_csv("nba.csv", index_col="Name")
first = data.loc["Avery Bradley"]
second = data.loc["R.J. Hunter"]
print(first, "\n\n\n", second)
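The difference between label-based and position-based selection can be sketched without any external file. The small frame below is hypothetical (the player labels, teams, and numbers are invented), and it only assumes that loc[] looks up index labels while iloc[] counts rows from zero.

```python
import pandas as pd

players = pd.DataFrame(
    {"Team": ["Celtics", "Celtics", "Nets"], "Number": [0, 28, 7]},
    index=["Player A", "Player B", "Player C"],
)

by_label = players.loc["Player B"]  # row whose index label is "Player B"
by_position = players.iloc[1]       # second row, counting from 0
first_two = players.iloc[0:2]       # slice of rows by position

print(by_label)
print(first_two)
```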


DataFrame Functions

1) Append()

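This function adds the rows of another DataFrame to the end of the given DataFrame. Note that DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0; pandas.concat() is the replacement.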
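On pandas versions that no longer provide DataFrame.append(), the same row-wise addition can be written with pd.concat(). The following is a minimal sketch with invented sample rows:

```python
import pandas as pd

first = pd.DataFrame({"Name": ["Tom", "Jack"], "Age": [33, 25]})
second = pd.DataFrame({"Name": ["Steve"], "Age": [41]})

# ignore_index=True renumbers the rows 0..n-1 in the result.
combined = pd.concat([first, second], ignore_index=True)
print(combined)
```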

2) Apply()

This function lets you pass in a function and applies it to every single value of a pandas Series. On a DataFrame, apply() passes each column (or each row, with axis=1) to the function.
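A short sketch of both uses, with made-up values: the first call runs a function on each value of a Series, and the second runs a function on each column of a DataFrame.

```python
import pandas as pd

ages = pd.Series([33, 25, 41], name="Age")
# Series.apply calls the function once per value.
labels = ages.apply(lambda age: "senior" if age >= 40 else "junior")

frame = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})
# DataFrame.apply with the default axis=0 passes one column at a time.
column_ranges = frame.apply(lambda col: col.max() - col.min())

print(labels.tolist())
print(column_ranges)
```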

3) Aggregate()

The main task of this function is to apply one or more aggregations to one or more columns. The most commonly used aggregations are:

  • Sum 
  • Min
  • Max
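These aggregations could be applied as in the sketch below. The sales figures are invented; agg() accepts either a list of aggregation names for all columns or a dict that picks one aggregation per column.

```python
import pandas as pd

sales = pd.DataFrame({"units": [3, 5, 2], "price": [10.0, 12.5, 8.0]})

# The same three aggregations for every column.
summary = sales.agg(["sum", "min", "max"])

# A different aggregation per column.
per_column = sales.agg({"units": "sum", "price": "max"})

print(summary)
print(per_column)
```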

4) Pandas DataFrame.assign()

The assign() function adds new columns to a DataFrame and returns a new DataFrame, leaving the original unchanged. If we re-assign an existing column, its values are overwritten in the result.
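Here is a small sketch of both behaviors, using invented names and ages. The column name AgeNextYear is only an example.

```python
import pandas as pd

people = pd.DataFrame({"Name": ["Tom", "Jack"], "Age": [33, 25]})

# Add a new column computed from an existing one.
with_next = people.assign(AgeNextYear=lambda d: d["Age"] + 1)

# Re-assigning an existing column overrides its values in the result.
overridden = people.assign(Age=[34, 26])

print(with_next)
print(overridden)
```

The original frame keeps its two columns and its old Age values, since assign() works on a copy.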

5) Pandas DataFrame.astype()

The astype() function casts a pandas object to a specified dtype. It can also convert a suitable existing column to a categorical type. It is useful when we have to cast a specific column from one data type to another. By passing a Python dictionary of column names and types, we can change several column types at once.
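The dictionary form might look like the following sketch, where the column names and values are assumptions made for the example:

```python
import pandas as pd

raw = pd.DataFrame({"id": ["1", "2", "3"], "score": [1, 2, 3]})

# Cast two columns at once by mapping column names to target dtypes.
typed = raw.astype({"id": "int64", "score": "float64"})

print(raw.dtypes)
print(typed.dtypes)
```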

How to delete the Indices, Columns, or Rows from the Pandas Dataframe

Deleting the Index from the Dataframe

If we want to remove the index from our DataFrame, we should reconsider, because Series and DataFrames always have an index. What we can do instead is:

  • Reset the index of the DataFrame.
  • Remove the index name, if there is one, by setting “df.index.name = None”.
  • Remove duplicate index values by resetting the index, dropping the duplicates of the index column that was added to the DataFrame, and setting that de-duplicated column as the index again.
  • Finally, remove an index label and, with it, the row.
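import numpy as np
import pandas as pd
df = pd.DataFrame(data=np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [30, 40, 50], [26, 38, 45]]),
                  index=[3.5, 11.6, 4.8, 2.5, 4.8],
                  columns=[50, 51, 52])
# Keep only the last row for each duplicated index value (4.8 appears twice)
df = df.reset_index().drop_duplicates(subset='index', keep='last').set_index('index')
print(df)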
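Columns and rows can be deleted in a similar spirit. The sketch below uses DataFrame.drop() and reset_index() on an invented three-by-three table; it is one possible way to do it, not the only one.

```python
import pandas as pd

table = pd.DataFrame(
    {"A": [1, 4, 7], "B": [2, 5, 8], "C": [3, 6, 9]},
    index=["x", "y", "z"],
)

no_b = table.drop(columns=["B"])           # remove a column by name
no_y = table.drop(index=["y"])             # remove a row by its index label
renumbered = table.reset_index(drop=True)  # discard the labels, back to 0..n-1

print(no_b)
print(no_y)
print(renumbered)
```

Each call returns a new DataFrame, so the original table is left as it was.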

PySpark DataFrame

What is PySpark?

Apache Spark is written in the Scala programming language. PySpark was released to support the collaboration of Python and Apache Spark; it is the Python API for Spark. Moreover, PySpark lets you work with RDDs (Resilient Distributed Datasets) in Apache Spark from the Python programming language.


What is PySpark DataFrame?

A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in an RDBMS or a data frame in Python or R, but with richer optimizations under the hood. We can construct DataFrames from a wide range of sources, such as structured data files, tables in Hive, external databases, and existing RDDs.

DataFrame Creation

The easiest way to create a DataFrame is from a Python list of data. We can also create a DataFrame from an RDD or by reading files from various sources.

Using createDataFrame()

By using the createDataFrame() function of SparkSession, we can create the DataFrame.

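from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
# Each tuple is one row; empty strings mark a missing middle or last name
data = [('James', '', 'Smith', '1990-01-04', 'M', 4000),
        ('Michael', 'Rose', '', '2001-07-12', 'M', 5000),
        ('Ravi', '', 'Williams', '2001-06-19', 'M', 6000),
        ('Maria', 'Anne', 'Jones', '1998-01-12', 'F', 7000)]
columns = ["firstname", "middlename", "lastname", "dob", "gender", "salary"]
df1 = spark.createDataFrame(data=data, schema=columns)
df1.show()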

Create DataFrame from RDD

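One of the best ways to create a PySpark DataFrame is from an RDD. Let us first create a Spark RDD from a Python list by calling the parallelize() function of the SparkContext.

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
rdd = spark.sparkContext.parallelize(data)  # data is the list of tuples defined earlier
df_from_rdd = rdd.toDF(columns)  # columns is the list of column names defined earlier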

Create the DataFrame

import pandas as pd
# A small pandas DataFrame built from a list of rows
data = [['Bob', 28], ['Alice', 29], ['Alexa', 45]]
df = pd.DataFrame(data, columns=['Name', 'Age'])
print(df)

Create the DataFrame with the Column

c = pd.DataFrame([4, 5, 6], columns=['Example1'])

Add the newly generated column to our DataFrame under the same name.

df['Example1'] = c['Example1']
print(df)

PySpark Select Columns from DataFrame

In PySpark, we can use the select() function to select a single column, multiple columns, columns by index, all the columns in a list, or nested columns from a DataFrame. select() is a transformation that returns a new DataFrame with the chosen columns. Let us show the first five rows of “User_ID” and “Age” from the train DataFrame.

train.select('User_ID', 'Age').show(5)

How to find the number of distinct products in the train and test files?

We can use the distinct() operation to get the distinct rows of a DataFrame and count() to count them. Let us apply both to calculate the number of distinct products in each of the train and test files.

train.select('Product_ID').distinct().count(), test.select('Product_ID').distinct().count()

Output:

(3671,3241)

We have 3671 and 3241 distinct products in train and test files, respectively.

How to Create New Columns in PySpark DataFrames

We can create new columns in a PySpark DataFrame in several ways. The following are some of them:

1) Using the Spark Native Functions

The most common way to create new columns in a PySpark DataFrame is by using the built-in functions. This is also the most efficient programmatic method. We can use “.withColumn” together with the PySpark SQL functions to create new columns. Date functions, math functions, and string functions are already implemented among the Spark functions. The first function, “F.col”, provides access to a column. Thus, if we have to add 100 to the confirmed column of the cases DataFrame, we can use “F.col” as follows:

import pyspark.sql.functions as F
# Add 100 to every value of the confirmed column
casesWithNewAffirmed = cases.withColumn("NewAffirmed", 100 + F.col("confirmed"))
casesWithNewAffirmed.show()

We can also use math functions such as “F.exp”:

casesWithExpAffirmed = cases.withColumn("ExpAffirmed", F.exp("confirmed"))
casesWithExpAffirmed.show()

2) Using the Spark UDFs

Sometimes, we have to do more complicated things to one column or to multiple columns. We can think of this as a map operation on the PySpark DataFrame over a single column or multiple columns. Although the Spark SQL functions cover many use cases for creating columns, we can use a Spark UDF whenever we need the more advanced functionality of Python. To use a Spark UDF, we convert a regular Python function with the F.udf function. We also have to define the return type of the function. In the following example, the return type is StringType().

import pyspark.sql.functions as F
from pyspark.sql.types import StringType
def casesHighLow(affirmed):
    # Label a case count as low or high around a threshold of 50
    if affirmed < 50:
        return 'low'
    else:
        return 'high'
# Wrap the Python function as a Spark UDF that returns a string
casesHighLowUDF = F.udf(casesHighLow, StringType())
casesWithHighLow = cases.withColumn("HighLow", casesHighLowUDF("confirmed"))
casesWithHighLow.show()

3) Using the RDDs

Sometimes neither the SQL functions nor Spark UDFs are sufficient for a specific use case. In some use cases, RDDs will perform better than SQL functions and Spark UDFs. We may want the finer control over partitioning that Spark RDDs provide, or we may want to use the group functions in Spark RDDs. In any case, using RDDs to create new columns is helpful for people who have a deep understanding of RDDs, the fundamental building block of the Spark environment.

This process relies on converting between Row objects and Python dicts. We convert the Row object to a dictionary, work with the dictionary as we are used to, and then convert that dictionary back to a Row again. This approach can be useful in a lot of use cases.

import numpy as np
from pyspark.sql import Row
def rowwise_function(row):
    # Convert the Row to a dict, add the new field, and build a new Row
    row_dict = row.asDict()
    row_dict['expaffirmed'] = float(np.exp(row_dict['confirmed']))
    newrow = Row(**row_dict)
    return newrow
cases_rdd = cases.rdd
cases_rdd_new = cases_rdd.map(lambda row: rowwise_function(row))
# spark is the active SparkSession
casesNewDf = spark.createDataFrame(cases_rdd_new)
casesNewDf.show()

4) Using the Pandas UDF

This functionality was introduced in Spark 2.3. It lets you use pandas functionality with Spark. Generally, we use it when we need to run a groupBy operation on a Spark DataFrame, or when we have to create rolling features and want to use pandas rolling or window functions instead of the Spark versions.

We use the “F.pandas_udf” decorator. In this grouped usage, the input to the function is a pandas DataFrame, and the function must return a pandas DataFrame in turn.

cases.printSchema()

root
 |-- case_id: integer (nullable = true)
 |-- province: string (nullable = true)
 |-- city: string (nullable = true)
 |-- group: boolean (nullable = true)
 |-- infection_case: string (nullable = true)
 |-- confirmed: integer (nullable = true)
 |-- latitude: string (nullable = true)
 |-- longitude: string (nullable = true)
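The kind of function such a grouped pandas UDF wraps can be sketched with plain pandas. The example below is an illustration under assumptions: the function name subtract_mean, the choice of grouping by province, and the sample rows are all hypothetical. It takes the pandas DataFrame of one group and returns a pandas DataFrame, and the local loop only mimics what Spark would do for each group.

```python
import pandas as pd

def subtract_mean(pdf: pd.DataFrame) -> pd.DataFrame:
    """Center the 'confirmed' column within one group (hypothetical helper)."""
    centered = pdf.copy()
    centered["confirmed"] = centered["confirmed"] - centered["confirmed"].mean()
    return centered

# Local check with plain pandas: call the function once per group.
sample = pd.DataFrame({"province": ["A", "A", "B"], "confirmed": [10, 30, 5]})
result = pd.concat(
    [subtract_mean(group) for _, group in sample.groupby("province")],
    ignore_index=True,
)
print(result)
```

On Spark 3.0 or later, one way to run such a function per group is groupBy(...).applyInPandas(func, schema), where the schema describes the columns of the returned pandas DataFrame.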

PySpark DataFrame and Pandas DataFrame

PySpark is widely used in the machine learning and data science community, since many popular data science libraries, such as TensorFlow and NumPy, are written for Python. Further, PySpark processes massive datasets efficiently.

PySpark is the Spark library for Python; it lets us run Python applications using Apache Spark capabilities. With PySpark, we can run applications in parallel on a distributed cluster or even on a single node. Apache Spark is an analytical processing engine for large-scale, robust, distributed data processing and machine learning applications.

Spark was written in Scala, and because of its industry adoption, its Python API, PySpark, was released using Py4J. Py4J is a Java library integrated into PySpark that lets Python interact dynamically with JVM objects; thus, to run PySpark, we need Java installed along with Apache Spark and Python.

DataFrame FAQs

1. What is a Dataframe?

It is a data structure that arranges data into a two-dimensional table of rows and columns, so you can easily manipulate the data stored in it. The schema is the blueprint of the data: it defines the name and data type of every column in the table.

2. Name the key components of Dataframes.

Below are the key components of DataFrames:

  • data
  • column
  • index

3. Can you mention some key features of the Dataframe?

  • Different columns can hold different data types.
  • We can label a table’s rows and columns.
  • We can perform arithmetic operations on the rows and columns.

4. How do you create a Pandas Dataframe?

We can build DataFrames from the following:

  • lists
  • NumPy arrays
  • series
  • dict

5. How do you build a Dataframe?

We can build a DataFrame using the pandas DataFrame() constructor.

6. Can you change the size and values of the Dataframes?

Yes. The size and values of a Dataframe are mutable.

Conclusion

In short, a DataFrame is a data structure widely used in the data engineering domain. It simplifies working with data by organizing it into two-dimensional tables. This tutorial has covered the pandas DataFrame in depth and introduced the PySpark DataFrame. If you want to dig deeper into DataFrames, you can reach out to MindMajix and take any of its data science-related courses, which offer certification that can help you take the next step in your career.

Last updated: 21 Nov 2023
About Author

Prasanthi is an expert writer in MongoDB, and has written for various reputable online and print publications. At present, she is working for MindMajix, and writes content not only on MongoDB, but also on Sharepoint, Uipath, and AWS.
