Python is a great language for data analysis, primarily because of its fantastic ecosystem of data-centric Python packages. Pandas is one of those packages, and it makes importing and analyzing data much easier.
Importing the Pandas Module
import pandas as pd
DataFrame
A DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Arithmetic operations align on both row and column labels. It can be thought of as a dict-like container for Series objects. It is the primary data structure of Pandas.
read_csv is an important pandas function that reads a CSV file into a DataFrame so that operations can be performed on it.
Parameter | Use |
---|---|
filepath_or_buffer | URL or path of the file to read |
sep | Stands for separator; the default is ',' as in CSV |
index_col | Uses the passed column as the index instead of the default integer index |
header | Uses the passed row(s) [int or list of int] as the header |
usecols | Uses only the passed columns [list of strings] to build the data frame |
squeeze | If True and only one column is passed, returns a pandas Series (removed in pandas 2.0) |
skiprows | Skips the passed rows when building the data frame |
data = pd.read_csv("filename.csv")
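Several of the parameters in the table can be combined in one call. The following is a minimal sketch, assuming a small semicolon-separated text held in memory instead of a file on disk; the column names and values are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical semicolon-separated data standing in for a file on disk.
raw = io.StringIO(
    "exported by a hypothetical tool\n"
    "id;name;age;city\n"
    "1;Asha;31;Pune\n"
    "2;Ravi;27;Delhi\n"
    "3;Mina;45;Chennai\n"
)

sample_df = pd.read_csv(
    raw,
    sep=";",                        # fields are separated by semicolons
    skiprows=1,                     # skip the first line, which is not data
    usecols=["id", "name", "age"],  # load only these columns
    index_col="id",                 # use the "id" column as the row index
)
print(sample_df)
```

Passing a path or URL string instead of the in-memory buffer works the same way.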
head() method is used to return the top n (5 by default) rows of a data frame or series.
Syntax : DataFrame.head(n)
Parameters: (optional) n is an integer value, the number of rows to be returned.
Return: DataFrame with the top n rows.
data.head()
Output :
TAIL
tail() method is used to return the bottom n (5 by default) rows of a data frame or series.
Syntax : DataFrame.tail(n)
Parameters: (optional) n is an integer value, the number of rows to be returned.
Return: DataFrame with the bottom n rows.
data.tail()
Output :
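The effect of the n parameter on head() and tail() can be sketched on a small frame. The frame below is hypothetical and only serves to make the row selection easy to follow.

```python
import pandas as pd

# Hypothetical ten-row frame used only for illustration.
toy = pd.DataFrame({"value": range(10)})

first_three = toy.head(3)        # rows 0, 1, 2
last_two = toy.tail(2)           # rows 8, 9
all_but_last_two = toy.head(-2)  # negative n: everything except the last 2 rows

print(first_three)
print(last_two)
```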
dataframe.info() function is used to get a concise summary of the dataframe: the index range, each column's name, non-null count and dtype, and the memory usage. It comes in really handy for a quick overview of the dataset during exploratory analysis.
Syntax : DataFrame.info(verbose=None, buf=None, max_cols=None, memory_usage=None, show_counts=None) (show_counts is called null_counts in older pandas versions)
data.info()
Output :
DataFrame.dtypes attribute returns the dtypes in the DataFrame. It returns a Series with the data type of each column.
Syntax : DataFrame.dtypes
Parameter : None
Returns : dtype of each column
data.dtypes
data
Output :
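As a rough illustration of what dtypes reports, the sketch below builds a hypothetical frame with one integer, one float and one text column, then uses select_dtypes() to keep only the numeric columns.

```python
import pandas as pd

# Hypothetical frame mixing integer, float and text columns.
mixed = pd.DataFrame({
    "count": [1, 2],
    "price": [9.5, 3.0],
    "label": ["a", "b"],
})

types = mixed.dtypes  # Series: column name -> dtype
# The text column is reported as object (or a dedicated string dtype
# in newer pandas versions).
print(types)

# dtypes pairs naturally with select_dtypes() to pick columns by kind.
numeric_only = mixed.select_dtypes(include="number")
```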
describe() is used to view some basic statistical details like percentiles, mean, std etc. of a data frame or a series of numeric values. When this method is applied to a series of strings, it returns a different summary: count, number of unique values, most frequent value (top) and its frequency (freq).
Syntax : DataFrame.describe(percentiles=None, include=None, exclude=None)
Return type : Statistical summary of data frame.
data.describe()
Output :
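The difference between the numeric and the string summary can be seen on a small frame. This is a sketch with made-up values, not the dataset used above.

```python
import pandas as pd

# Hypothetical frame with one numeric and one text column.
people = pd.DataFrame({
    "age": [22, 38, 26, 35],
    "sex": ["male", "female", "female", "female"],
})

numeric_summary = people["age"].describe()   # count, mean, std, min, quartiles, max
string_summary = people["sex"].describe()    # count, unique, top, freq
everything = people.describe(include="all")  # both kinds of columns in one table

print(numeric_summary)
print(string_summary)
```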
Pandas .size and .shape are used to return size and shape of data frames and series.
Syntax : dataframe.size
Return : Returns size of dataframe/series which is equivalent to total number of elements.
data.size
Output :
10692
Syntax : dataframe.shape
Return : Returns tuple of shape (Rows, columns) of dataframe/series
data.shape
Output :
(891, 12)
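The two outputs above are consistent with each other: 891 rows times 12 columns gives 10692 elements. The same relationship can be checked on any frame; the one below is a hypothetical example.

```python
import pandas as pd

# Hypothetical 3 x 2 frame.
grid = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

rows, cols = grid.shape        # shape is a (rows, columns) tuple
total = grid.size              # size is the total number of elements
series_shape = grid["a"].shape # a Series has a one-element shape tuple

print(rows, cols, total, series_shape)
```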
sample() is used to return a random sample of rows (or columns, with axis=1) from the data frame it is called on.
Syntax : DataFrame.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None)
Return type : New object of same type as caller
data.sample(n=1)
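The n, frac, replace and random_state parameters can be sketched as follows. The frame is hypothetical, and the seed value 0 is an arbitrary choice made only so that the result is repeatable.

```python
import pandas as pd

# Hypothetical 100-row frame.
toy = pd.DataFrame({"value": range(100)})

three_rows = toy.sample(n=3, random_state=0)   # exactly 3 random rows
tenth = toy.sample(frac=0.1, random_state=0)   # 10% of the rows
again = toy.sample(n=3, random_state=0)        # same seed, same rows

# Sampling more rows than exist requires replace=True.
with_replacement = toy.sample(n=150, replace=True, random_state=0)
```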
isnull() checks every value for NULL and returns a boolean object of the same shape, which stores True for every NaN value and False for every non-null value.
Syntax: pandas.isnull(obj) or DataFrame.isnull()
Parameters: Object to check for null values
Return Type: DataFrame of Boolean values which are True for NaN values
data.isnull()
Output :
Pandas dataframe.isna() function is used to detect missing values. It returns a boolean object of the same size indicating whether the values are NA. NA values (None or numpy.NaN) get mapped to True. isnull() is an alias of isna().
Syntax: DataFrame.isna()
Returns: Mask of bool values for each element in DataFrame that indicates whether an element is an NA value or not.
data.isna()
Output :
isnull().sum() returns the number of missing values in each column of the data set.
Syntax : DataFrame.isna().sum() or, equivalently, DataFrame.isnull().sum()
Example:
data.isnull().sum()
Output :
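The relationship between the boolean mask and the per-column counts can be sketched on a small frame with made-up gaps.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with missing values in two columns.
gaps = pd.DataFrame({
    "age": [22.0, np.nan, 35.0],
    "cabin": [None, None, "C85"],
})

mask = gaps.isna()               # same shape as gaps, True where a value is missing
per_column = gaps.isna().sum()   # True counts as 1, so this counts missing values
total_missing = int(per_column.sum())

print(per_column)
```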
nunique() returns the number of unique elements in the object. Called on a DataFrame, it returns a Series holding the count of distinct values in each column; called on a Series or Index, it returns a single integer. By default the NaN values are not included in the count. If the dropna parameter is set to False, NaN is counted as a value.
Syntax: DataFrame.nunique(axis=0, dropna=True)
Parameters : dropna : Don’t include NaN in the count.
Returns : Series (int for a Series or Index)
data.nunique()
Output :
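The effect of dropna can be sketched on a hypothetical frame in which one column contains a NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; "embarked" has one missing value.
ports = pd.DataFrame({
    "embarked": ["S", "C", "S", np.nan],
    "pclass": [1, 3, 3, 2],
})

without_nan = ports.nunique()            # NaN ignored (default)
with_nan = ports.nunique(dropna=False)   # NaN counted as its own value

print(without_nan)
print(with_nan)
```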
DataFrame.index returns the index (row labels) of the DataFrame. An Index is an immutable sequence used for indexing and alignment; it is the basic object storing axis labels for all pandas objects.
Syntax: pandas.Index(data=None, dtype=None, copy=False, name=None, tupleize_cols=True, **kwargs)
An Index instance can only contain hashable objects.
data.index
Output :
RangeIndex(start=0, stop=891, step=1)
Pandas DataFrame.columns attribute returns the column labels of the given DataFrame.
Syntax: DataFrame.columns
Parameter : None
Returns : column names
data.columns
Output :
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'], dtype='object')
Pandas dataframe.memory_usage() function returns the memory usage of each column in bytes. The memory usage can optionally include the contribution of the index and of elements of object dtype. The total is displayed by DataFrame.info by default.
Syntax: DataFrame.memory_usage(index=True, deep=False)
Parameters :
index : Specifies whether to include the memory usage of the DataFrame’s index in the returned Series. If index=True, the memory usage of the index is the first item in the output.
deep : If True, introspect the data deeply by interrogating object dtypes for system-level memory consumption, and include it in the returned values.
Returns : A Series whose index is the original column names and whose values are the memory usage of each column in bytes
data.memory_usage()
Output :
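The index and deep parameters can be sketched as follows, assuming a hypothetical frame with an integer column and a text column. The exact byte counts for text depend on the pandas version and platform, so only the integer column is commented with a concrete number.

```python
import pandas as pd

# Hypothetical 1000-row frame with an int64 column and a text column.
names = pd.DataFrame({
    "id": range(1000),
    "name": ["passenger"] * 1000,
})

shallow = names.memory_usage()            # first entry is labeled "Index"
deep = names.memory_usage(deep=True)      # also measures the strings themselves
no_index = names.memory_usage(index=False)

# 1000 int64 values take 8 bytes each.
print(shallow["id"])
print(deep["name"], shallow["name"])
```

For object-dtype text columns the deep figure can be noticeably larger than the shallow one, because the shallow figure only counts the references to the strings.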
nsmallest() method is used to get the n smallest values from a data frame or a series.
Syntax : DataFrame.nsmallest(n, columns, keep='first')
df = data.nsmallest(5,'Fare')
df
Output :
nlargest()
nlargest() method is used to get n highest values from a data frame or a series.
Syntax : DataFrame.nlargest(n, columns, keep='first')
df = data.nlargest(5,'Fare')
df
Output :
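The keep parameter decides what happens when several rows tie for the last place. A minimal sketch, using a hypothetical frame in which two rows share the lowest fare:

```python
import pandas as pd

# Hypothetical fares; rows "a" and "c" tie for the smallest value.
fares = pd.DataFrame({
    "name": ["a", "b", "c", "d", "e"],
    "fare": [7.25, 71.28, 7.25, 53.10, 8.05],
})

first_tie = fares.nsmallest(1, "fare")               # keep='first' -> row "a"
last_tie = fares.nsmallest(1, "fare", keep="last")   # -> row "c"
all_ties = fares.nsmallest(1, "fare", keep="all")    # -> rows "a" and "c"
top_two = fares.nlargest(2, "fare")                  # -> rows "b" and "d"
```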
loc[] and iloc[] are used to slice data from a Pandas DataFrame. They are indexers, used with square brackets rather than called like functions, and they allow convenient selection of data from the DataFrame. loc[] can also filter the data according to a condition.
loc | iloc |
---|---|
Access a group of rows and columns by label(s) or a boolean array. | Purely integer-location based indexing for selection by position. |
Syntax : df.loc[row_indexer,column_indexer]
df = data.loc[10:15,['Fare']]
df
Output :
iloc
df = data.iloc[3:7,:5]
df
Output :
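One difference that often surprises is that a loc slice includes its end label while an iloc slice excludes its stop position. The sketch below uses hypothetical frames to show this, along with boolean filtering through loc.

```python
import pandas as pd

# Hypothetical frame with string row labels.
scores = pd.DataFrame(
    {"math": [90, 80, 70, 60], "art": [65, 75, 85, 95]},
    index=["ann", "bob", "cy", "dee"],
)

by_label = scores.loc["bob":"cy", ["math"]]     # both end labels are included
by_position = scores.iloc[1:3, :1]              # stop position 3 is excluded
by_condition = scores.loc[scores["art"] > 80]   # boolean mask selects rows

# With the default integer index the same slice gives different row counts.
nums = pd.DataFrame({"v": range(20)})
label_rows = len(nums.loc[10:15])    # labels 10..15 inclusive
position_rows = len(nums.iloc[10:15])  # positions 10..14
```

This is why data.loc[10:15, ['Fare']] above returns six rows rather than five.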
Slicing using the [] operator selects a set of rows and/or columns from a DataFrame. To slice out a set of rows, you use the following.
syntax: data[start:stop]
df = data[1:6]
df
Output :
groupby() function is used to split the data into groups based on some criteria.
Syntax : DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=False, **kwargs)
Returns : DataFrameGroupBy
df = data[['Fare','Age','Survived']].groupby(['Fare']).mean()
df
Output :
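Grouping is usually most informative on a column with few distinct values. The following sketch assumes a hypothetical frame with a passenger-class column; the column names and values are illustrative only.

```python
import pandas as pd

# Hypothetical trips: two first-class and three third-class passengers.
trips = pd.DataFrame({
    "pclass": [1, 1, 3, 3, 3],
    "fare": [80.0, 60.0, 8.0, 7.0, 9.0],
    "survived": [1, 0, 0, 1, 0],
})

# Split by class, then take the mean of the selected columns per group.
mean_by_class = trips.groupby("pclass")[["fare", "survived"]].mean()

# Named aggregation: several statistics with explicit output column names.
summary = trips.groupby("pclass").agg(
    mean_fare=("fare", "mean"),
    passengers=("fare", "count"),
)
print(mean_by_class)
print(summary)
```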
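sort_values() function sorts a data frame in ascending or descending order of the passed column. It differs from Python's built-in sorted() function, which cannot sort a data frame and does not let you select a particular column to sort by. The related sort_index() function sorts by the row labels, or by the column labels when axis=1.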
Syntax : DataFrame.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last', ignore_index=False, key=None)
Returns : DataFrame or None
df = data.sort_index(axis = 1, ascending = True)
df
Output :
data.sort_index(axis = 1, ascending = False)
Output :
df = data.sort_values(by='Fare')
df
Output :
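The ascending, na_position and multi-column options can be sketched on a small frame. The data below is hypothetical and includes one NaN so that its placement is visible.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; row 1 has a missing fare.
fares = pd.DataFrame({
    "pclass": [3, 1, 2, 1],
    "fare": [8.05, np.nan, 13.0, 71.28],
})

ascending = fares.sort_values(by="fare")  # NaN is placed last by default
descending_nan_first = fares.sort_values(
    by="fare", ascending=False, na_position="first"
)
# Sort by class first, then by fare (highest first) within each class.
multi = fares.sort_values(by=["pclass", "fare"], ascending=[True, False])

# sort_index(axis=1) reorders the columns by their labels instead.
columns_sorted = fares.sort_index(axis=1)
```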
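The dropna() function is used to remove missing values. The axis parameter determines whether rows or columns that contain missing values are removed. Note that the example here uses the related drop() method, which removes the named column ('Fare') regardless of whether it contains missing values.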
Syntax : DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
Returns : DataFrame or None
df = ['Fare']
data.drop(df, axis = 1, inplace = True)
data
Output :
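The axis, subset and thresh parameters of dropna() can be sketched as follows, with drop() alongside for contrast. The frame and its gaps are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps in "age" and "cabin".
gaps = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, np.nan],
    "cabin": [None, "B20", "C85", None],
    "fare": [7.25, 71.28, 53.10, 8.05],
})

complete_rows = gaps.dropna()               # keep rows with no missing value
complete_cols = gaps.dropna(axis=1)         # keep columns with no missing value
known_age = gaps.dropna(subset=["age"])     # only look at "age" when deciding
at_least_two = gaps.dropna(thresh=2)        # keep rows with >= 2 non-missing values

# drop() removes by label, independent of missing values.
without_fare = gaps.drop(columns=["fare"])
```

None of these calls modify gaps, because inplace defaults to False; each returns a new DataFrame.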
query() filters the rows of a DataFrame using a boolean expression written as a string. It is called with “dot syntax”: type the name of the DataFrame you want to subset, then a dot, and then the name of the method, query().
Syntax : DataFrame.query(expr, inplace=False, **kwargs)
Returns : DataFrame or None
data.query('18 < Age < 23')[:10]
Output :
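A query expression is equivalent to a boolean mask, and it can refer to Python variables with the @ prefix. A minimal sketch on a hypothetical frame (the variable name limit is an illustrative choice):

```python
import pandas as pd

# Hypothetical frame of ages.
people = pd.DataFrame({
    "age": [17, 19, 22, 23, 40],
    "sex": ["male", "female", "female", "male", "female"],
})

young = people.query("18 < age < 23")

# The same filter written as a boolean mask.
young_mask = people[(people["age"] > 18) & (people["age"] < 23)]

# Local variables are referenced with @ inside the expression.
limit = 30
older_women = people.query("age > @limit and sex == 'female'")
```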
min() function returns the minimum of the values in the given object. If the input is a series, the method will return a scalar which will be the minimum of the values in the series.
Syntax : DataFrame.min(axis=None, skipna=None, level=None, numeric_only=None, **kwargs)
Returns : Series or DataFrame (if level specified)
data['Age'].min()
Output :
0.42
max() function returns the maximum of the values over the requested axis. If the input is a series, the method returns a scalar. All NA/null values are excluded by default. To get the index of the first occurrence of the maximum instead, use idxmax().
Syntax : DataFrame.max(axis=None, skipna=None, level=None, numeric_only=None, **kwargs)
Returns : Series or DataFrame (if level specified)
data['Age'].max()
Output :
80
mean() function is used to return the mean of the values for the requested axis. If we apply this method on a Series object, then it returns a scalar value, which is the mean of all the non-missing observations in the Series.
Syntax : DataFrame.mean(axis=None, skipna=None, level=None, numeric_only=None, **kwargs)
Returns : Series or DataFrame (if level specified)
data['Age'].mean()
Output :
29.69911764705882
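The way min(), max() and mean() treat missing values, and the difference between max() and idxmax(), can be sketched on a hypothetical Series that contains one NaN.

```python
import math

import numpy as np
import pandas as pd

# Hypothetical ages with one missing value.
ages = pd.Series([1.0, 80.0, np.nan, 30.0])

lowest = ages.min()      # NaN is skipped by default
highest = ages.max()
average = ages.mean()    # mean of the three non-missing values

strict_average = ages.mean(skipna=False)  # NaN, because a value is missing
position_of_max = ages.idxmax()           # index label of the maximum, not its value

print(lowest, highest, average, position_of_max)
```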
- Nagashree M S
- Prajakta
- Rammya Dharshini K