Introduction To Pandas

Pandas is one of the widely used Python libraries for working with data. it is built on libraries like Matplotlib and NumPy. Pandas is great for data manipulation, data analysis, and data visualization.

In this tutorial we will see how pandas makes life really easy for a data analysis. Pandas can read and write data from and to CSV files or even databases easily.

Data Structures in Pandas

Series

A series is a one-dimensional object, like an array, list or could be understood as a column in table. similar to the array or list index each element in a series is assigned with a labeled index. By default, each item is given an numerical index label from 0 to N, where N is the length of the Series minus one.

How to create Series The basic method to create a Series is to call .Series()

# import pandas
import pandas as pd

# create a Series with an arbitrary list
X1 = pd.Series([7, 'develbyte', 3.14, 'Happy Learnning'])
X1
0                  7
1          develbyte
2               3.14
3    Happy Learnning
dtype: object

Note:- when the Series contains elements of multiple different datatypes the dtype of the series will be the higher datatype

int32 > int64 > flot64 >.....>object

X2 = pd.Series([7, 5, 4, 3])
print(X2)

X3 = pd.Series([7, 5, 4., 3.])
print(X3)
0    7
1    5
2    4
3    3
dtype: int64
0    7.0
1    5.0
2    4.0
3    3.0
dtype: float64

creating a series with index index of the series elements can also be changes by simply passing a list of indexes, the list of elements and the list of indexes should be of same length or you will end up with error

X1 = pd.Series([7, 'develbyte', 3.14, 'Happy Learnning'],
              index=['A', 'B', 'C', 'D'])

print(X1)
A                  7
B          develbyte
C               3.14
D    Happy Learnning
dtype: object

DataFrame

DataFrame is a two-dimensional labeled data structures with columns of same or different data types. Similar to tables in a database the DataFrame can hold multiple columns with multiple data types. You can also think of a DataFrame as a group of Series objects that share an index.

How to import Data in Dataframe

# Importing the dataset
dataset = pd.read_csv('../data/data.csv')
dataset
Country Age Salary Purchased
0 France 44.0 72000.0 No
1 Spain 27.0 48000.0 Yes
2 Germany 30.0 54000.0 No
3 Spain 38.0 61000.0 No
4 Germany 40.0 NaN Yes
5 France 35.0 58000.0 Yes
6 Spain NaN 52000.0 No
7 France 48.0 79000.0 Yes
8 Germany 50.0 83000.0 No
9 France 37.0 67000.0 Yes

How to inspect Data in Dataframe

  • Very first information what we would like to know in a dataframe are - number of columns - number of records - attribute names - datatype of each attribute
    we can get all these information by calling just one function info() it will give Concise summary of a DataFrame
dataset.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 4 columns):
Country      10 non-null object
Age          9 non-null float64
Salary       9 non-null float64
Purchased    10 non-null object
dtypes: float64(2), object(2)
memory usage: 392.0+ bytes

we can also use dtypes to get the datatypes of each attribute

dataset.dtypes
Country       object
Age          float64
Salary       float64
Purchased     object
dtype: object

for just geting the column names in a dataframe use dataset.columns

# columns.values gives the column names in the DataFrame
dataset.columns.values
array(['Country', 'Age', 'Salary', 'Purchased'], dtype=object)

similarly for index values

# index.values gives the list of row indices
dataset.index.values
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
  • Probably the most useful function for inspecting a data set in a DataFrame is describe() it will return basic statistics about the dataset’s numeric columns
dataset.describe()
Age Salary
count 9.000000 9.000000
mean 38.777778 63777.777778
std 7.693793 12265.579662
min 27.000000 48000.000000
25% 35.000000 54000.000000
50% 38.000000 61000.000000
75% 44.000000 72000.000000
max 50.000000 83000.000000
  • For take a look at the actual data head() and tail() are the most useful function - head method shows first n rows from the DataFrame, default value of n is 5 - tail method shows last n rows from the DataFrame, default value of n is 5
dataset.head()
Country Age Salary Purchased
0 France 44.0 72000.0 No
1 Spain 27.0 48000.0 Yes
2 Germany 30.0 54000.0 No
3 Spain 38.0 61000.0 No
4 Germany 40.0 NaN Yes
dataset.tail()
Country Age Salary Purchased
5 France 35.0 58000.0 Yes
6 Spain NaN 52000.0 No
7 France 48.0 79000.0 Yes
8 Germany 50.0 83000.0 No
9 France 37.0 67000.0 Yes

Select and Index in DataFrames

There are three main options in pandas, which allows us to access the data in DataFrame, these are based on index and location of the rows and column, these options could be confusing for beginners but its quit simple once understood.

The selection methods are:

  • Selecting data by row numbers called integer-location based indexing/selection (.iloc)

    dataset.iloc[<row selection>, <column selection>]

  • Selecting data by label or by a conditional statement (.loc)

    dataset.loc[<row selection>, <column selection>]

  • Selecting in a hybrid approach(.ix)

    dataset.ix[<row selection>, <column selection>]

Single selections using iloc and DataFrame

Rows:

R1 = dataset.iloc[0] # first row of data frame - Note a Series data type output.
R2 = dataset.iloc[1] # second row of data frame
R3 = dataset.iloc[-1] # last row of data frame
print ("---------------------------------------")
print (R1)
print ("---------------------------------------")
print (R2)
print ("---------------------------------------")
print (R3)
---------------------------------------
Country      France
Age              44
Salary        72000
Purchased        No
Name: 0, dtype: object
---------------------------------------
Country      Spain
Age             27
Salary       48000
Purchased      Yes
Name: 1, dtype: object
---------------------------------------
Country      France
Age              37
Salary        67000
Purchased       Yes
Name: 9, dtype: object

Columns:

C1 = dataset.iloc[:,0] # first column of data frame (first_name)
C2 = dataset.iloc[:,1] # second column of data frame (last_name)
C3 = dataset.iloc[:,-1] # last column of data frame (id)
print ("---------------------------------------")
print (C1)
print ("---------------------------------------")
print (C2)
print ("---------------------------------------")
print (C3)
---------------------------------------
0     France
1      Spain
2    Germany
3      Spain
4    Germany
5     France
6      Spain
7     France
8    Germany
9     France
Name: Country, dtype: object
---------------------------------------
0    44.0
1    27.0
2    30.0
3    38.0
4    40.0
5    35.0
6     NaN
7    48.0
8    50.0
9    37.0
Name: Age, dtype: float64
---------------------------------------
0     No
1    Yes
2     No
3     No
4    Yes
5    Yes
6     No
7    Yes
8     No
9    Yes
Name: Purchased, dtype: object

Multiple row and column selections using iloc and DataFrame

MR1 = dataset.iloc[0:5] # first five rows of dataframe
MR2 = dataset.iloc[:, 0:2] # first two columns of data frame with all rows
MR3 = dataset.iloc[[0, 3, 6, 9], [0, 3]] # 1st, 4th, 7th, 9th row + 1st 3th 4th columns.
MR4 = dataset.iloc[0:5, 1:3] # first 5 rows and 4th, 5th columns of data frame.

print ("---------------------------------------")
print (MR1)
print ("---------------------------------------")
print (MR2)
print ("---------------------------------------")
print (MR3)
print ("---------------------------------------")
print (MR4)
---------------------------------------
   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
4  Germany  40.0      NaN       Yes
---------------------------------------
   Country   Age
0   France  44.0
1    Spain  27.0
2  Germany  30.0
3    Spain  38.0
4  Germany  40.0
5   France  35.0
6    Spain   NaN
7   France  48.0
8  Germany  50.0
9   France  37.0
---------------------------------------
  Country Purchased
0  France        No
3   Spain        No
6   Spain        No
9  France       Yes
---------------------------------------
    Age   Salary
0  44.0  72000.0
1  27.0  48000.0
2  30.0  54000.0
3  38.0  61000.0
4  40.0      NaN

.iloc returns a Pandas Series when one only row or Column is selected

print(".iloc returns a Pandas Series when one only row or Column is selected")
print(type(dataset.iloc[0]))
print(type(dataset.iloc[:,1]))
.iloc returns a Pandas Series when one only row or Column is selected
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>

.iloc returns a Pandas DataFrame when multiple rows or Columns are selected

print("\n .iloc returns a Pandas DataFrame when multiple rows or Columns are selected")
print(type(dataset.iloc[0:2]))
print(type(dataset.iloc[:,1:3]))
print(type(dataset.iloc[1:2, 3:6]))
 .iloc returns a Pandas DataFrame when multiple rows or Columns are selected
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>

.iloc returns a Pandas Series when multiple rows are selected with only one column

print("\n .iloc returns a Pandas Series when multiple rows are selected with only one column")
print(type(dataset.iloc[1:2, 3]))
 .iloc returns a Pandas Series when multiple rows are selected with only one column
<class 'pandas.core.series.Series'>

above output could easily converted to Pandas DataFrame by passing a single-valued list as column index

print("\n Above output could easly converted to Pandas DataFrame by passing a single-valued list as column index")
print(type(dataset.iloc[1:2, [3]]))
 Above output could easily converted to Pandas DataFrame by passing a single-valued list as column index
<class 'pandas.core.frame.DataFrame'>

some more fun with iloc

X = dataset.iloc[:, :-1]
Y = dataset.iloc[:, 3]

print ("---------------------------------------")
print(type(X))
print(X)
print ("---------------------------------------")
print(type(Y))
print(Y)
---------------------------------------
<class 'pandas.core.frame.DataFrame'>
   Country   Age   Salary
0   France  44.0  72000.0
1    Spain  27.0  48000.0
2  Germany  30.0  54000.0
3    Spain  38.0  61000.0
4  Germany  40.0      NaN
5   France  35.0  58000.0
6    Spain   NaN  52000.0
7   France  48.0  79000.0
8  Germany  50.0  83000.0
9   France  37.0  67000.0
---------------------------------------
<class 'pandas.core.series.Series'>
0     No
1    Yes
2     No
3     No
4    Yes
5    Yes
6     No
7    Yes
8     No
9    Yes
Name: Purchased, dtype: object

Selecting pandas data using loc:- The Pandas loc indexer can be used with DataFrames in two different scenarios:

 a.) Selecting rows by label/index
 b.) Selecting rows with a Boolean/conditional lookup

The loc indexer is used with the same syntax as iloc: data.loc[<row selection>, <column selection>]

Index can be in a DataFrame by using set_index() method

dataset.set_index("Country", inplace=True)
dataset
Age Salary Purchased
Country
France 44.0 72000.0 No
Spain 27.0 48000.0 Yes
Germany 30.0 54000.0 No
Spain 38.0 61000.0 No
Germany 40.0 NaN Yes
France 35.0 58000.0 Yes
Spain NaN 52000.0 No
France 48.0 79000.0 Yes
Germany 50.0 83000.0 No
France 37.0 67000.0 Yes

Selecting rows by index

dataset.loc['France']
Age Salary Purchased
Country
France 44.0 72000.0 No
France 35.0 58000.0 Yes
France 48.0 79000.0 Yes
France 37.0 67000.0 Yes

Selecting rows by label/index

dataset.loc[['France', 'Spain'], ['Age', 'Salary']]
Age Salary
Country
France 44.0 72000.0
France 35.0 58000.0
France 48.0 79000.0
France 37.0 67000.0
Spain 27.0 48000.0
Spain 38.0 61000.0
Spain NaN 52000.0
dataset.loc[['France', 'Spain'], 'Age':'Purchased']
Age Salary Purchased
Country
France 44.0 72000.0 No
France 35.0 58000.0 Yes
France 48.0 79000.0 Yes
France 37.0 67000.0 Yes
Spain 27.0 48000.0 Yes
Spain 38.0 61000.0 No
Spain NaN 52000.0 No

resetting the index and setting a new index

dataset.reset_index(inplace=True)

dataset.set_index('Age', inplace=True)
dataset
Country Salary Purchased
Age
44.0 France 72000.0 No
27.0 Spain 48000.0 Yes
30.0 Germany 54000.0 No
38.0 Spain 61000.0 No
40.0 Germany NaN Yes
35.0 France 58000.0 Yes
NaN Spain 52000.0 No
48.0 France 79000.0 Yes
50.0 Germany 83000.0 No
37.0 France 67000.0 Yes

Selecting rows by label/index

dataset.loc[[44.0, 27.0], ['Country', 'Salary']]
Country Salary
Age
44.0 France 72000.0
27.0 Spain 48000.0

Selecting rows with a Boolean/conditional lookup

dataset.loc[dataset['Country'] == 'France', ['Country', 'Salary']]
Country Salary
Age
44.0 France 72000.0
35.0 France 58000.0
48.0 France 79000.0
37.0 France 67000.0

Selections can be achieved outside of the main .loc for clarity Form a separate variable with your selections: like in the example below Select only the True values in idx and only the 3 columns specified:

idx = dataset['Country'].apply(lambda x: x.lower() == 'france')
dataset.loc[idx, ['Country', 'Salary']]
Country Salary
Age
44.0 France 72000.0
35.0 France 58000.0
48.0 France 79000.0
37.0 France 67000.0

Selecting pandas data using ix:-

  • ix[] indexer is a hybrid of .loc and .iloc,
  • ix is label based indexer, it behave just like .loc, it also supports integer based indexing like .iloc
  • ix indexing works just the same as .loc when passed strings
dataset.ix[['Country']] == dataset.loc[['Country']]
Country Salary Purchased
Age
Country False False False

ix indexing works the same as .iloc when passed integers

dataset.reset_index(inplace=True)
dataset.ix[[2]] == dataset.iloc[[2]]
Age Country Salary Purchased
2 True True True True

Add and Delete in DataFrame

  • Adding row in DataFrame

Note:- General recommendation for adding a row is to use .loc to insert rows in DataFrame If you would use .ix, you might try to reference a numerically valued index with the index value and accidentally overwrite an existing row of your DataFrame.

dataset.loc[10] = ['India', 27, 65000, 'Yes']
dataset.ix[11] = ['India', 26, 60000, 'Yes']
dataset
Age Country Salary Purchased
0 44 France 72000.0 No
1 27 Spain 48000.0 Yes
2 30 Germany 54000.0 No
3 38 Spain 61000.0 No
4 40 Germany NaN Yes
5 35 France 58000.0 Yes
6 NaN Spain 52000.0 No
7 48 France 79000.0 Yes
8 50 Germany 83000.0 No
9 37 France 67000.0 Yes
10 India 27 65000.0 Yes
11 India 26 60000.0 Yes
  • Adding column in DataFrame

Columns in DataFrame is basically a series,so adding a column in a DataFrame is as simple as assigning a new column to a DataFrame

column = pd.Series(range(1,13), dtype=float)
column
0      1.0
1      2.0
2      3.0
3      4.0
4      5.0
5      6.0
6      7.0
7      8.0
8      9.0
9     10.0
10    11.0
11    12.0
dtype: float64
#adding column in DataFrame
dataset['id'] = column
dataset
Age Country Salary Purchased id
0 44 France 72000.0 No 1.0
1 27 Spain 48000.0 Yes 2.0
2 30 Germany 54000.0 No 3.0
3 38 Spain 61000.0 No 4.0
4 40 Germany NaN Yes 5.0
5 35 France 58000.0 Yes 6.0
6 NaN Spain 52000.0 No 7.0
7 48 France 79000.0 Yes 8.0
8 50 Germany 83000.0 No 9.0
9 37 France 67000.0 Yes 10.0
10 India 27 65000.0 Yes 11.0
11 India 26 60000.0 Yes 12.0
  • Delete a column from DataFrame by column name
df = dataset.drop('Purchased', axis=1)
df
Age Country Salary id
0 44 France 72000.0 1.0
1 27 Spain 48000.0 2.0
2 30 Germany 54000.0 3.0
3 38 Spain 61000.0 4.0
4 40 Germany NaN 5.0
5 35 France 58000.0 6.0
6 NaN Spain 52000.0 7.0
7 48 France 79000.0 8.0
8 50 Germany 83000.0 9.0
9 37 France 67000.0 10.0
10 India 27 65000.0 11.0
11 India 26 60000.0 12.0
  • Delete a column from DataFrame by row index

one thing to be noted here is if the in-place is set to be True the deleting happens the the existing dataframe by default in-place is False, which creates new DataFrame with deleted rows

dataset.drop(dataset.index[2:7], inplace=True)
dataset
Age Country Salary Purchased id
0 44 France 72000.0 No 1.0
1 27 Spain 48000.0 Yes 2.0
7 48 France 79000.0 Yes 8.0
8 50 Germany 83000.0 No 9.0
9 37 France 67000.0 Yes 10.0
10 India 27 65000.0 Yes 11.0
11 India 26 60000.0 Yes 12.0