
Thursday, 7 December 2017

Python Programming for Data Analytics: Pandas library Basics

Pandas Library
pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
Importing the pandas library
import pandas as pd

Convert a list into a Series of elements ranging from 10 to 50
import numpy as np
import pandas as pd
my_data=[10,20,30,40,50]
pd.Series(data=my_data)           
# convert the list into a Series with a default integer index from 0 to 4

Convert dictionary values into a Series of elements ranging from 10 to 40
import numpy as np
import pandas as pd
d={'a':10,'b':20,'c':30,'d':40}     
#dictionary keys become the index and the value of each key becomes the corresponding Series value
pd.Series(d)

Addition of two series
import numpy as np
import pandas as pd
ser1=pd.Series([1,2,3,4],['USA','Germany','Japan','USSR'])
#create a Series with values 1 to 4 and country names as the index
ser2=pd.Series([1,2,5,4],['USA','Germany','Italy','USSR'])
ser1+ser2
#addition of two Series; values are added where the index labels match, but any label missing from either Series produces a NaN value
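
If NaN for unmatched labels is not what you want, the same addition can be done with the add method and a fill value; a small optional sketch using the two Series above:
ser1.add(ser2, fill_value=0)
#missing labels are treated as 0 instead of producing NaN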

Generate a 5x4 table filled with random values, with labels for rows and columns (similar to a dataset), and pick particular columns for processing.
import numpy as np
from numpy.random import randn
import pandas as pd
np.random.seed(101)
df=pd.DataFrame(randn(5,4),['A','B','C','D','E'],['W','X','Y','Z'])
#generate random numbers for 5 rows and 4 columns; row labels A to E, column labels W to Z
df
df['W']                              #select a single column
df[['W','Z']]                        #select multiple columns

Modifying datasets using the pandas library

df.drop('W',axis=1,inplace=True)
#Drop the selected column from the dataset; inplace=True makes the change permanent on df itself
df.loc['A']                          # rows are also Series; fetch the row labelled 'A' from the dataset
df.iloc[3]                           # fetch the row at integer position 3 (the fourth row)
df.loc[['A','C'],['X','Z']]          #fetch a subset of the dataset; [[rows],[columns]]
df > 0                               # boolean DataFrame: True where a value is greater than the threshold 0, False otherwise
df[df>0]                             # positions where the condition is False are returned as NaN
df=pd.DataFrame(randn(5,4),['A','B','C','D','E'],['W','X','Y','Z'])
df[df['W']>0]                   
#Select only the rows where the value in column 'W' is greater than 0
df[df['W']>0][['X','Y']]                  
# from the rows where column 'W' is greater than 0, fetch only columns X and Y
df.reset_index()     #reset to the default integer index
df.set_index('Z')    #set column 'Z' as the index
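
Multiple conditions can also be combined when filtering rows. A small optional sketch on the same DataFrame; note that pandas uses & and | instead of the Python keywords and/or, and each condition needs its own parentheses:
df[(df['W']>0) & (df['Y']>1)]        #rows where both conditions hold
df[(df['W']>0) | (df['Y']>1)]        #rows where either condition holds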

Drop missing elements from dataset
import numpy as np
import pandas as pd
d={'A':[1,2,np.nan], 'B':[1,np.nan,np.nan],'C':[1,2,3]}
# np.nan marks a missing element in the DataFrame
df=pd.DataFrame(d)
df.dropna()                          #drop any row with a missing value
df.dropna(axis=1)                    #drop any column with a missing value
df.dropna(thresh=2)
#thresh sets a threshold: a row is kept only if it has at least 2 non-NA values, otherwise it is dropped

Filling suitable value instead of NaN in dataset
df.fillna(value='FILL VALUE')        #every NaN is replaced by the string 'FILL VALUE'
df['A'].fillna(value=df['A'].mean())
#select column 'A' and fill its missing values with the mean of column 'A'

Calculating the mean and standard deviation of values in a given dataset

data = {'Company':['GOOG','GOOG','MSFT','MSFT','FB','FB'],        #create temporary dataset
'Person':['Sam','Charlie','Amy','Vanessa','Carl','Sarah'],
'Sales':[200,120,340,124,243,350]}
df=pd.DataFrame(data)               #build a DataFrame with row and column labels from the dictionary
df
comp=df.groupby("Company").mean(numeric_only=True)
#rows are grouped by the "Company" label, i.e. rows with the same company name are grouped together, and the aggregate 'mean' is then applied to each group; numeric_only=True skips the string column "Person", on which a mean cannot be computed
comp
comp1=df.groupby("Company")                       #group rows by the "Company" label
comp1.std(numeric_only=True)                      #apply standard deviation to the grouped numeric data

Fetching particular data from the dataset after applying an aggregate function

data = {'Company':['GOOG','GOOG','MSFT','MSFT','FB','FB'],        #create temporary dataset
'Person':['Sam','Charlie','Amy','Vanessa','Carl','Sarah'],
'Sales':[200,120,340,124,243,350]}
df=pd.DataFrame(data)               #build a DataFrame with row and column labels from the dictionary
df.groupby("Company").sum(numeric_only=True).loc["FB"]
#group rows by the 'Company' label, apply the sum function so that all sales of the same company are added, then fetch the totals for company "FB"
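
Several aggregate functions can also be applied in a single pass. A small optional sketch on the same df (the choice of functions here is only for illustration):
df.groupby("Company")['Sales'].agg(['mean','sum','max'])    #mean, total and maximum Sales per company
df.groupby("Company").describe()                            #summary statistics for the numeric columns, per company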

Finding the maximum value for each group in the dataset
data = {'Company':['GOOG','GOOG','MSFT','MSFT','FB','FB'],        #create temporary dataset
'Person':['Sam','Charlie','Amy','Vanessa','Carl','Sarah'],
'Sales':[200,120,340,124,243,350]}
df=pd.DataFrame(data)               #build a DataFrame with row and column labels from the dictionary
df.groupby("Company").max()
#group the dataset by the 'Company' label and pick the maximum value in each column for every group

Find unique values and the number of occurrences of values in a dataset

df = pd.DataFrame({'col1':[1,2,3,4],'col2':[444,555,666,444],'col3':['abc','def','ghi','xyz']})
# col1, col2 & col3 are column labels; each column has its own values
df['col2'].unique()                  #fetch the unique values available in column 'col2'
df['col2'].value_counts()
# count the number of occurrences of every value in the column and display the counts
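
A closely related call returns just the number of distinct values; a small optional add-on on the same column:
df['col2'].nunique()                 #number of distinct values in 'col2' (3 here: 444, 555 and 666)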

Save a DataFrame into a CSV file or an Excel file
df.to_csv('example.csv',index=False)

df.to_excel('Excel_Sample.xlsx',sheet_name='Sheet1')
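
Reading the files back in works the same way. A minimal sketch, assuming the two files written above exist in the working directory (to_excel and read_excel typically need the openpyxl package installed):
df_csv = pd.read_csv('example.csv')                               #load the CSV back into a DataFrame
df_xls = pd.read_excel('Excel_Sample.xlsx', sheet_name='Sheet1')  #load the Excel sheet back into a DataFrame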

Hope you have enjoyed doing basic data munging with the pandas library.

Stay Calm and learn. Please post your comments below :-) 

Python Programming for Data Analytics: Matplotlib library Basics

Matplotlib Tool
Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy. It provides an object-oriented API for embedding plots into applications using general-purpose GUI toolkits like TKinter, wxPython, Qt, or GTK+.
Source: Wikipedia

Importing matplotlib and numpy to plot line graph
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
x = np.linspace(0, 5, 11)               #generate a 1-D array of 11 evenly spaced values from 0 to 5
y = x ** 2                              # y = x squared
plt.plot(x, y, 'r')                     # 'r' is the color red
plt.xlabel('X Axis Title Here')         #label the x axis
plt.ylabel('Y Axis Title Here')         #label the y axis
plt.title('String Title Here')          #give the plot a title
plt.show()                              #show the plot

Create subplots in a single figure (like MATLAB)
plt.subplot(1,2,1)                      # 1 row, 2 columns, 1st figure in the row
plt.plot(x, y, 'r--')                   # r = red color, -- = dashed line
plt.subplot(1,2,2)                      # 1 row, 2 columns, 2nd figure in the row
plt.plot(y, x, 'g*-');                  # g = green color, *- = star markers with a solid line

Use the subplots function (object-oriented style) with MATLAB-style line colors
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
x = np.linspace(0, 5, 11)
fig, ax = plt.subplots()                #create a Figure object and a single Axes object
ax.plot(x, x**2, 'b.-')                 # blue line with dot markers
ax.plot(x, x**3, 'g--')                 # green dashed line
ax.set_xlabel('x axis')
ax.set_ylabel('y axis')
ax.set_title('plot title')
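
To tell the two curves apart, a legend can be added. A small optional sketch in the same style; the label texts are chosen here only for illustration:
fig, ax = plt.subplots()
ax.plot(x, x**2, 'b.-', label='x squared')   #label each curve so the legend can identify it
ax.plot(x, x**3, 'g--', label='x cubed')
ax.legend()                                  #draw the legend from the plot labels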

Plot various types of plots (scatter, step, bar and fill) in same figure
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

n = np.array([0,1,2,3,4,5])
xx = np.linspace(-0.75, 1., 100)
fig, axes = plt.subplots(1, 4, figsize=(12,3))        #create figure of dimension 12x3 inch, with 4 subplots
axes[0].scatter(xx, xx + 0.25*np.random.randn(len(xx)))
axes[0].set_title("scatter")
axes[1].step(n, n**2, lw=2)                       #lw=line width of 2
axes[1].set_title("step")
axes[2].bar(n, n**2, align="center", width=0.5, alpha=0.5)     #alpha = transparency level
axes[2].set_title("bar")
axes[3].fill_between(xx, xx**2, xx**3, color="green", alpha=0.5);
axes[3].set_title("fill_between");

Saving a plot to a file
fig.savefig("filename.png")                  #save the figure as a PNG file

fig.savefig("filename.png", dpi=200)         #dpi sets the resolution of the saved image

Hope you have enjoyed working with the Matplotlib library and plotting beautiful graphs of your data :-)
In the next post, I will come up with another popular and important library used for data analytics.
Keep learning. Keep growing :-) Please post your comments below!

Python Programming for Data Analytics: NumPy Library Basics

NumPy Library
NumPy is the core library for scientific computing in Python. It provides a high-performance multidimensional array object, and tools for working with these arrays.

Create 2-D array using numpy from list
import numpy as np
mylist=[[1,2,3],[4,5,6],[7,8,9]]
np.array(mylist)

Create 1-D and 2-D array using random values
np.random.rand(5)             #1-D array of 5 random values from a uniform distribution over [0, 1)
np.random.rand(4,4)           #4x4 2-D array of uniformly distributed random values
np.random.randn(4)            #4 values from the standard normal distribution, centered around 0

Fetch the max or min value from an array
arr1=np.random.randint(0,100,10)     #10 random integers between 0 and 99
arr1.max()                           #maximum value; arr1.min() returns the minimum
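
The positions of those extreme values are just as easy to get; a small optional add-on on the same array:
arr1.argmax()                        #index of the maximum value in arr1
arr1.argmin()                        #index of the minimum value in arr1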

Playing with the data using numpy library
arr=np.arange(0,11)                            #create an array of 11 elements from 0 to 10
arr[:5]                                        #fetch the first 5 elements of arr
arr[5:]                                        #fetch elements from index 5 to the end
arr[:2]=100                                    #replace the first two elements of the array with the value 100
arr=np.arange(0,25)
arr.reshape(5,5)                               #reshape the 25-element 1-D array into a 5x5 2-D array
arr_2d=np.array([[1,2,3],[4,5,6],[7,8,9]])     #create a 2-D array
arr_2d[:2,1:]                                  #select the first two rows and the columns from index 1 onward
arr=np.arange(0,11)
arr > 5                                        #boolean array: True at positions where the element is greater than 5
arr[arr<5]
#build a boolean array for the condition "less than 5" and pick the array values at the True positions
arr+arr                                        #element-by-element array addition
mat = np.arange(1,26).reshape(5,5)             #generate a 1-D array of 25 elements (1 to 25) and reshape it into a 5x5 2-D array
mat.std()                                      #standard deviation of all elements

mat.sum(axis=0)                        #sum the elements column-wise (down each column) in mat
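
For comparison, a small sketch of the other summing options on the same array:
mat.sum(axis=1)                        #sum the elements row-wise (across each row)
mat.sum()                              #sum of all elements in the array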

Hope you have enjoyed working with these basics of the NumPy library. In the next post, I will come up with another popular library used in data analytics.

Happy Learning! Please post your comments below :-)