Pandas is an open-source Python library built on top of NumPy, designed specifically for data analysis. It offers tools for data manipulation, cleaning, and visualisation, and for preparing data to build machine learning models. Pandas is fast and gives its users high performance and productivity.
- It allows us to analyze big data and make conclusions based on statistical theories.
- It can be used to clean messy data sets, and make them readable and relevant.
- Fast and efficient DataFrame object with default and customized indexing.
- High performance merging and joining of data.
- Time Series functionality.
Installing pandas can be tricky for new or inexperienced users. The simplest and most popular way to install pandas is as part of the Anaconda distribution, a cross-platform (Linux, macOS, Windows) Python distribution for data analysis and scientific computing. After downloading and running the installer, the user has access to pandas and the rest of the SciPy stack without needing to install anything else.
PyPI, short for Python Package Index, is the official third-party software repository for Python. Pandas can also be installed from PyPI with pip:
pip install pandas
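After installation, a quick way to confirm that pandas is available is to import it and print its version (the exact version string depends on which release was installed):

```python
# Verify the installation by importing pandas and printing its version
import pandas as pd

print(pd.__version__)  # version string depends on the installed release
```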
Series
- A pandas Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, Python objects, etc.). The axis labels are collectively called the index.
DataFrames
- A DataFrame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns.
A Series holds a single list of values with an index, whereas a DataFrame can be made of more than one Series; in other words, a DataFrame is a collection of Series that can be used to analyse the data.
# Creating a pandas series
import pandas as pd

s = pd.Series([2, 4, 5, 6, 9])
print(s)
print(type(s))
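By default the index is 0, 1, 2, and so on; a custom index can be passed instead. A small sketch (the labels and values here are made up for illustration):

```python
import pandas as pd

# Series with a custom string index instead of the default 0..n-1
marks = pd.Series([88, 92, 79], index=["maths", "physics", "chemistry"])
print(marks)
print(marks["physics"])  # access a value by its label
```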
Creating a DataFrame using a dictionary in Python. A Python dictionary is a collection of items, where each item is a key and value pair separated by a ':'.
#Dictionary
person = {"first_name": ['Harry', "James"],
          "last_name": ['Potter', 'Witch'],
          "email": ['[email protected]', "[email protected]"]}
print("person dictionary :\n", person)
#converting dictionary to DataFrame
df = pd.DataFrame(person)
print("DataFrame :\n", df)
Creating a DataFrame by importing external files like Excel, CSV, etc.
#From Excel file
pd.read_excel('file.xlsx')
df.to_excel('dir/myDataFrame.xlsx', sheet_name='Sheet1')
#From csv file
df=pd.read_csv('datasets_1026_1855_My Uber Drives - 2016.csv')
print(df)
Pro-tip for settings
#pd.set_option()
#display up to 85 columns before truncating the output
pd.set_option('display.max_columns', 85)
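Other display options work the same way; for example (option names below are from pandas' standard option registry):

```python
import pandas as pd

# Show up to 100 rows before truncating the display
pd.set_option('display.max_rows', 100)

# Inspect the current value of an option
print(pd.get_option('display.max_rows'))

# Restore the option to its default value
pd.reset_option('display.max_rows')
```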
To view the first few rows of the DataFrame we use the head() function, and to view the last few rows we use the tail() function.
#From the top first 5 rows
df.head()
#From the bottom last 5 rows
df.tail()
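Both functions accept an optional row count. A self-contained sketch on a small made-up frame:

```python
import pandas as pd

# A small made-up frame with 10 rows
df_demo = pd.DataFrame({"n": range(10)})

print(df_demo.head(3))  # first 3 rows
print(df_demo.tail(2))  # last 2 rows
```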
Accessing
- Accessing pandas dataframe columns, rows, and cells. At this point you know how to load CSV data in Python.
#For accessing columns in DataFrame
#for single column access
print(df['CATEGORY*'])
#for multiple column access use list
print(df[['CATEGORY*','MILES*']])
#tip - to print the column names in the DataFrame
print(df.columns)
To select rows in the DataFrame we use .iloc or .loc, also known as indexing in pandas: simply selecting particular rows from a DataFrame. This is also known as subset selection.
- Selecting data by row numbers (.iloc)
- Selecting data by label or by a conditional statement (.loc)
#accessing rows in DataFrame
#by row numbers
#using df.iloc[index]
print("rows 0 and 1:", df.iloc[[0,1]])
To access data using both rows and columns, the syntax is df.iloc[row_index, column_index].
#for row 0,1 and column 1,2
print(df.iloc[[0,1] , [1,2]])
#using .loc
#for the same rows 0,1 and columns 1,2, but accessing via labels
print(df.loc[[0,1],["END_DATE*","CATEGORY*"]])
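The difference between the two accessors shows up clearly on a frame with a non-default index (toy data here, not the Uber dataset):

```python
import pandas as pd

# toy frame with a non-default (string) index
toy = pd.DataFrame({"a": [10, 20, 30], "b": [1, 2, 3]},
                   index=["x", "y", "z"])

print(toy.iloc[0, 0])     # by integer position
print(toy.loc["y", "b"])  # by label
```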
#tip - to count the occurrences of each value in a column
#df['col_name'].value_counts()
print(df['CATEGORY*'].value_counts())
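value_counts() returns counts in descending order; with normalize=True it returns proportions instead. A sketch on a made-up column:

```python
import pandas as pd

cats = pd.Series(["Business", "Business", "Personal"])
print(cats.value_counts())                # counts, descending
print(cats.value_counts(normalize=True))  # proportions instead of counts
```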
Pandas has a function named melt() that can be used to reshape a DataFrame from wide to long format, where one or more columns are used as identifiers.
#Gather columns into rows
reshaped_data = pd.melt(df)
print(reshaped_data)
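Calling pd.melt(df) with no arguments stacks every column; in practice one usually keeps identifier columns fixed with id_vars. A sketch on made-up data:

```python
import pandas as pd

# made-up wide data: one row per person, one column per month
wide = pd.DataFrame({"name": ["Ann", "Bob"],
                     "jan": [10, 20],
                     "feb": [30, 40]})

# keep 'name' fixed and gather the month columns into rows
long_df = pd.melt(wide, id_vars=["name"],
                  var_name="month", value_name="miles")
print(long_df)
```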
To sort the data frame in pandas, the function sort_values() is used. It can sort the data frame in ascending or descending order.
#sort values according to End Date
df.sort_values('END_DATE*')
#sort values in descending orders
df.sort_values('END_DATE*',ascending=False)
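sort_values() also accepts a list of columns, with a per-column sort direction. A sketch on toy data:

```python
import pandas as pd

trips = pd.DataFrame({"category": ["B", "A", "B", "A"],
                      "miles": [5.0, 2.0, 1.0, 7.0]})

# sort by category ascending, then miles descending within each category
out = trips.sort_values(["category", "miles"], ascending=[True, False])
print(out)
```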
#renaming the CATEGORY* name to C
df.rename(columns = {"CATEGORY*" : "C"},inplace=True)
print(df["C"])
#drop column C (returns a new DataFrame; df itself is unchanged)
df.drop(columns = ["C"])
#tip - pass inplace=True to make the change persist on df
To summarize data we can use the built-in shape attribute and the info() and describe() methods.
#to get information about DataFrame
#shape is an attribute
print("Shape: ",df.shape)
#info is a method
print("information: ",df.info())
#to describe the DataFrame
print("Describe : ",df.describe())
df.dropna() : Drop rows with any column having NA/null data.
df.fillna(value) : Replace all NA/null data with value.
#drop NA/null values
df.dropna()
#replace NA/null values to 1
df.fillna(1)
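fillna() also accepts a per-column mapping, which is often more useful than a single scalar. A sketch with invented column names:

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame({"miles": [5.0, np.nan, 3.0],
                    "purpose": ["Meeting", None, "Errand"]})

# fill each column with its own default value
clean = raw.fillna({"miles": 0.0, "purpose": "Unknown"})
print(clean)
```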
In pandas DataFrame we can filter the dataset based on our needs.
#for single filter
my_filter = (df['MILES*'] == 5.0)
print(df.loc[my_filter,'C'])
#use & (and) to combine multiple filters
my_filter2 = ((df['MILES*'] == 5.0) & (df["START*"] == "Fort Pierce"))
df.loc[my_filter2]
#isin - keep rows whose value is in the given list
my_filter3 = df["START*"].isin(["Fort Pierce","Gampaha"])
df.loc[my_filter3]
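Conditions can also be combined with | (or) and negated with ~; each condition must be wrapped in parentheses. A sketch on toy data:

```python
import pandas as pd

rides = pd.DataFrame({"start": ["Cary", "Fort Pierce", "Cary"],
                      "miles": [12.0, 5.0, 3.0]})

# OR: short trips or trips starting in Fort Pierce
print(rides[(rides["miles"] < 4) | (rides["start"] == "Fort Pierce")])

# NOT: everything except trips starting in Cary
print(rides[~rides["start"].isin(["Cary"])])
```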
In pandas DataFrame groupby() function is used to split the data into groups based on some criteria. Pandas objects can be split on any of their axes. The abstract definition of grouping is to provide a mapping of labels to group names.
#grouping on a single column (numeric_only restricts the mean to numeric columns)
df.groupby("START_DATE*").mean(numeric_only=True).tail()
#group multiple and reset column names
res=df.groupby("START*").agg({"MILES*":["mean","sum"],"START_DATE*": ["min","max"]})
res.columns=["AVG_DISTANCE","TOTAL_DIST","EARLY","RECENT"]
print(res)
#applying grouping to the data
#the split-apply-combine (sac) operation is known as grouping
#here: split on START*/STOP*, apply mean and sum to MILES*, then combine
df.groupby(["START*","STOP*"])["MILES*"].agg(["mean","sum"]).head()
#applying conditions(filter) and reset index
df1= df[( df["MILES*"]>10 ) & (df["START*"].isin(["Cary","New York"]))]
df1.reset_index(inplace=True,drop=True)
df1
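The same split-apply-combine pattern on a small self-contained frame, for readers without the Uber CSV at hand:

```python
import pandas as pd

trips = pd.DataFrame({"start": ["Cary", "Cary", "New York"],
                      "miles": [4.0, 6.0, 10.0]})

# split on 'start', apply mean and sum to 'miles', combine into one frame
summary = trips.groupby("start")["miles"].agg(["mean", "sum"])
print(summary)
```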
We can use pd.to_datetime() to convert a date column to a datetime data type; the same works for time columns.
#convert the date and time
df["START_DATE*"] = pd.to_datetime(df["START_DATE*"], errors="coerce")
df.info()
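Once a column has the datetime64 type, the .dt accessor exposes parts like year, month, and day. A sketch on made-up ISO-format timestamps:

```python
import pandas as pd

# made-up timestamps in ISO format
log = pd.DataFrame({"start_date": ["2016-01-01 21:11", "2016-01-02 01:25"]})

# errors="coerce" turns unparseable strings into NaT instead of raising
log["start_date"] = pd.to_datetime(log["start_date"], errors="coerce")

print(log["start_date"].dt.year)  # the .dt accessor extracts date parts
print(log["start_date"].dt.day)
```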