Thursday, 7 December 2017

Python Programming for Data Analytics: Pandas library Basics

Pandas Library
pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
Importing panda library
import pandas as pd

Convert list elements into series of elements ranging from 10—50
import numpy as np
import pandas as pd
my_data=[10,20,30,40,50]
pd.Series(data=my_data)           
# convert element lists into series of elements, which have index from 0—5

Convert dictionary values into series of elements ranging from 10—50
import numpy as np
import pandas as pd
d={'a':10,'b':20,'c':30,'d':40}     
#dictionary keys act as index and values with every key act as series values
pd.Series(d)

Addition of two series
import numpy as np
import pandas as pd
ser1=pd.Series([1,2,3,4],['USA','Germany','Japan','USSR’])  
#create series from 1—4 with index as country names
ser2=pd.Series([1,2,5,4],['USA','Germany','Italy','USSR’])
ser1+ser2 
#addition of 2 series; values get added where index match but if same index is not available in either of series, it generates NaN value

Generate 5x4 table filled with random values having labels for rows and columns (similar to dataset) and pick a particular columns for processing.
from numpy.random import randn
import pandas as pd
np.random.seed(101)
df=pd.DataFrame(randn(5,4),['A','B','C','D','E'],['W','X','Y','Z’]) 
#generate random number for 5 rows and 4 columns, row labels: A—E, columns labels: W—Z. 
df
df[‘W’]
df[[‘W’,’Z’]]

Modifying datasets using panda library

df.drop('W',axis=1,inplace=True)                          
#Droping selected column from dataset; inplace=True, permanently accepts modification
df.loc['A']                          # rows are also series; fetch particular row from dataset having index ‘A’
df.iloc[3]                           # fetch 3rd row from dataset
df.loc[['A','C'],['X','Z’]]                  #fetch a subset of data from given dataset; [[row][column]]
df > 0                                 # all those locations in dataset which are less than threshold is taken False
df[df>0]                            #all those positions in datasets which are less than threshold is taken as NaN
df=pd.DataFrame(randn(5,4),['A','B','C','D','E'],['W','X','Y','Z’]) 
df[df['W']>0]                   
#Select the row where True is available in W column and fetch other elements in same row
df[df['W']>0][['X','Y']]                  
# fetch out desired frame of X & Y from dataset, for those rows where value is more than 0 in ‘W’ column;
df.reset_index()   #assign natural index
df.set_index(‘Z’)    #set ‘Z’ column as index value

Drop missing elements from dataset
import pandas as pd
d={'A':[1,2,np.NaN], 'B':[1,np.NaN,np.NaN],'C':[1,2,3]}    
# np.NaN is the missing element in DataFrame
df=pd.DataFrame(d)
df.dropna()                                     #pandas would drop any row with missing value
df.dropna(axis=1)                          #drop column with NULL value
df.dropna(thresh=2)      
#thresh is the variable for threshold which says that it would not drop row until unless it has atleast 2 NonNA values else it would drop row.

Filling suitable value instead of NaN in dataset
df.fillna(value='FILL VALUE')                      #NaN is replaced by value=FILL VALUE
df['A'].fillna(value=df['A'].mean())    
#Select column "A" and fill the missing value with mean value of the column A

Calculating mean and standard deviation on values in given dataset

data = {'Company':['GOOG','GOOG','MSFT','MSFT','FB','FB'],        #create temporary dataset
'Person':['Sam','Charlie','Amy','Vanessa','Carl','Sarah'],
'Sales':[200,120,340,124,243,350]}
df=pd.DataFrame(data)               #connect data as dataframe which have row and column label
df
comp=df.groupby("Company").mean() 
# data in dataset is grouped by label “Company” i.e. elements with similar company names are grouped together and after grouping data, aggregate ‘mean’ function is applied on its value; mean could not be applied on second column "person name" due to string type
comp
comp1=df.groupby("Company")              #grouping done using label name “Company”
comp1.std()                                                  #apply standard deviation on grouped data

Fetching particular data from dataset after applying aggregate function

data = {'Company':['GOOG','GOOG','MSFT','MSFT','FB','FB'],        #create temporary dataset
'Person':['Sam','Charlie','Amy','Vanessa','Carl','Sarah'],
'Sales':[200,120,340,124,243,350]}
df=pd.DataFrame(data)               #connect data as dataframe which have row and column label
df.groupby("Company").sum().loc["FB"]              
#group data by ‘company’ label, apply sum function such that all data of same company gets added and then fetch Company “FB” value after summation

Finding maximum value in each label in dataset
data = {'Company':['GOOG','GOOG','MSFT','MSFT','FB','FB'],        #create temporary dataset
'Person':['Sam','Charlie','Amy','Vanessa','Carl','Sarah'],
'Sales':[200,120,340,124,243,350]}
df=pd.DataFrame(data)               #connect data as dataframe which have row and column label
df.groupby("Company").max()  
#group dataset based on ‘company’ label and pick maximum value in each label

Find unique value and number of occurrence of values in dataset

df = pd.DataFrame({'col1':[1,2,3,4],'col2':[444,555,666,444],'col3':['abc','def','ghi','xyz’]})
# col1, col2 & col3 are column labels, each column have their own values
df['col2'].unique()           #fetches the unique values available in column ‘col2’
df['col2'].value_counts()             
# count number of occurance of every value in column and display its count

Save Data frame into CSV file or Excel file
df.to_csv('example.csv',index=False)

df.to_excel('Excel_Sample.xlsx',sheet_name='Sheet1')

Hope you have enjoyed doing basic Data Munging with Pandas Libraries. 

Stay Calm and learn. Please post your comments below :-) 

No comments:

Post a Comment