Pandas Library

pandas is an open-source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
Importing the pandas library

import pandas as pd
Convert list elements into a Series of elements ranging from 10 to 50

import pandas as pd
my_data = [10, 20, 30, 40, 50]
pd.Series(data=my_data)  # convert the list into a Series; the default index runs from 0 to 4
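The expected output, with the auto-generated integer index on the left:

# 0    10
# 1    20
# 2    30
# 3    40
# 4    50
# dtype: int64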
Convert dictionary values into a Series of elements ranging from 10 to 40

import pandas as pd
d = {'a': 10, 'b': 20, 'c': 30, 'd': 40}
pd.Series(d)  # the dictionary keys act as the index; the value under each key becomes the Series value
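The expected output, indexed by the dictionary keys:

# a    10
# b    20
# c    30
# d    40
# dtype: int64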
Addition of two Series

import pandas as pd
ser1 = pd.Series([1, 2, 3, 4], ['USA', 'Germany', 'Japan', 'USSR'])  # create a series with values 1 to 4, indexed by country names
ser2 = pd.Series([1, 2, 5, 4], ['USA', 'Germany', 'Italy', 'USSR'])
ser1 + ser2  # values are added where the indexes match; an index that appears in only one of the series produces NaN
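For reference, the expected result: Germany, USA, and USSR appear in both series and get added, while Italy and Japan each appear in only one and come out as NaN.

# Germany    4.0
# Italy      NaN
# Japan      NaN
# USA        2.0
# USSR       8.0
# dtype: float64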
Generate a 5x4 table filled with random values with labels for rows and columns (similar to a dataset), and pick particular columns for processing

import numpy as np
import pandas as pd
from numpy.random import randn
np.random.seed(101)  # seed the generator so the random numbers are reproducible
df = pd.DataFrame(randn(5, 4), ['A', 'B', 'C', 'D', 'E'], ['W', 'X', 'Y', 'Z'])  # random numbers for 5 rows and 4 columns; row labels A to E, column labels W to Z
df
df['W']         # select a single column (returns a Series)
df[['W', 'Z']]  # select multiple columns (returns a DataFrame)
Modifying datasets using the pandas library

df.drop('W', axis=1, inplace=True)  # drop the selected column; inplace=True makes the modification permanent
df.loc['A']   # rows are also Series; fetch the row with index label 'A'
df.iloc[3]    # fetch the row at integer position 3 (the fourth row)
df.loc[['A', 'C'], ['X', 'Z']]  # fetch a subset of the dataset: [[rows], [columns]]
df > 0      # boolean DataFrame: positions holding values greater than 0 are True, the rest are False
df[df > 0]  # positions where the condition is False come back as NaN
df = pd.DataFrame(randn(5, 4), ['A', 'B', 'C', 'D', 'E'], ['W', 'X', 'Y', 'Z'])  # rebuild df, since column 'W' was dropped above
df[df['W'] > 0]  # select the rows where column 'W' is greater than 0 and fetch the other elements in those rows
df[df['W'] > 0][['X', 'Y']]  # from those rows, fetch only the desired columns 'X' and 'Y'
df.reset_index()   # move the row labels into a column and assign a fresh 0-based index
df.set_index('Z')  # use column 'Z' as the index; returns a new DataFrame unless inplace=True is passed
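The same bracket notation also accepts combined conditions; a minimal sketch, assuming the df built above (each condition goes in parentheses, with & for AND and | for OR):

df[(df['W'] > 0) & (df['Y'] < 1)]  # rows where 'W' is positive AND 'Y' is below 1
df[(df['W'] > 0) | (df['Z'] < 0)]  # rows where 'W' is positive OR 'Z' is negative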
Drop missing elements from a dataset

import numpy as np
import pandas as pd
d = {'A': [1, 2, np.nan],
     'B': [1, np.nan, np.nan],
     'C': [1, 2, 3]}  # np.nan marks the missing elements in the DataFrame
df = pd.DataFrame(d)
df.dropna()          # drop any row that contains a missing value
df.dropna(axis=1)    # drop any column that contains a missing value
df.dropna(thresh=2)  # thresh is the threshold: a row is kept only if it has at least 2 non-NA values, otherwise it is dropped
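For this particular DataFrame, row 0 has three non-NA values, row 1 has two, and row 2 has only one, so dropna(thresh=2) keeps rows 0 and 1 and drops row 2:

#      A    B  C
# 0  1.0  1.0  1
# 1  2.0  NaN  2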
Filling a suitable value in place of NaN in the dataset

df.fillna(value='FILL VALUE')  # every NaN is replaced by the string 'FILL VALUE'
df['A'].fillna(value=df['A'].mean())  # select column 'A' and fill its missing value with the mean of column 'A'
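Column 'A' holds [1, 2, NaN], so its mean over the non-missing values is 1.5, and that is what the NaN becomes:

# 0    1.0
# 1    2.0
# 2    1.5
# Name: A, dtype: float64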
Calculating mean and standard deviation on the values in a given dataset

data = {'Company': ['GOOG', 'GOOG', 'MSFT', 'MSFT', 'FB', 'FB'],  # create a temporary dataset
        'Person': ['Sam', 'Charlie', 'Amy', 'Vanessa', 'Carl', 'Sarah'],
        'Sales': [200, 120, 340, 124, 243, 350]}
df = pd.DataFrame(data)  # build the data into a DataFrame with row and column labels
df
comp = df.groupby("Company").mean(numeric_only=True)
# rows of the dataset are grouped by the label "Company", i.e. elements with the same company
# name are grouped together, and the aggregate 'mean' is then applied to their values;
# a mean cannot be computed on the string column 'Person', so numeric_only=True skips it
# (older pandas versions dropped such columns silently; newer versions require the flag)
comp
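The expected grouped means, one row per company:

#          Sales
# Company
# FB       296.5
# GOOG     160.0
# MSFT     232.0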
comp1 = df.groupby("Company")  # grouping done using the label "Company"
comp1.std(numeric_only=True)   # apply standard deviation to the grouped data
Fetching particular data from the dataset after applying an aggregate function

data = {'Company': ['GOOG', 'GOOG', 'MSFT', 'MSFT', 'FB', 'FB'],  # create a temporary dataset
        'Person': ['Sam', 'Charlie', 'Amy', 'Vanessa', 'Carl', 'Sarah'],
        'Sales': [200, 120, 340, 124, 243, 350]}
df = pd.DataFrame(data)  # build the data into a DataFrame with row and column labels
df.groupby("Company").sum(numeric_only=True).loc["FB"]
# group the data by the 'Company' label, apply the sum function so that all sales of the
# same company are added together, and then fetch the value for company "FB" after summation
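FB's two sales rows are 243 and 350, so the expected result is:

# Sales    593
# Name: FB, dtype: int64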
Finding the maximum value under each label in the dataset

data = {'Company': ['GOOG', 'GOOG', 'MSFT', 'MSFT', 'FB', 'FB'],  # create a temporary dataset
        'Person': ['Sam', 'Charlie', 'Amy', 'Vanessa', 'Carl', 'Sarah'],
        'Sales': [200, 120, 340, 124, 243, 350]}
df = pd.DataFrame(data)  # build the data into a DataFrame with row and column labels
df.groupby("Company").max()
# group the dataset by the 'Company' label and pick the maximum value in each column per group;
# each column is reduced independently, so for the string column 'Person' the "maximum" is the alphabetically last name
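The expected output (note that a group's Person and Sales maxima need not come from the same row):

#          Person  Sales
# Company
# FB        Sarah    350
# GOOG        Sam    200
# MSFT    Vanessa    340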
Finding the unique values and the number of occurrences of values in a dataset

df = pd.DataFrame({'col1': [1, 2, 3, 4],
                   'col2': [444, 555, 666, 444],
                   'col3': ['abc', 'def', 'ghi', 'xyz']})
# col1, col2, and col3 are column labels; each column has its own values
df['col2'].unique()        # fetch the unique values available in column 'col2'
df['col2'].value_counts()  # count the number of occurrences of every value in the column and display the counts
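The expected results: unique() returns array([444, 555, 666]), and value_counts() sorts by frequency, so 444 (which appears twice) comes first:

# 444    2
# 555    1
# 666    1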
Save a DataFrame to a CSV file or an Excel file

df.to_csv('example.csv', index=False)  # index=False leaves the row index out of the file
df.to_excel('Excel_Sample.xlsx', sheet_name='Sheet1')  # writing Excel files requires an engine such as openpyxl to be installed
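To verify the round trip, the files can be read back; a minimal sketch (pd.read_excel likewise needs an Excel engine installed):

pd.read_csv('example.csv')  # reload the CSV as a DataFrame
pd.read_excel('Excel_Sample.xlsx', sheet_name='Sheet1')  # reload the named sheet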
Hope you have enjoyed doing basic data munging with the pandas library.
Stay calm and learn. Please post your comments below :-)