Pandas Basic Analysis 03: Data Frame

Python
Pandas
Jupyter
Data Analytics
Part three of data analysis with Python Pandas: the data frame.
Author

Dennis Chua

Published

May 31, 2025

Open In Colab

03 Pandas Data Structure: Data Frame

Content Outline

  • Introducton
  • Relationship between data frames and series
  • Taking slices columns
  • Taking slices of rows
  • Filtering rows using boolean logic
  • Vector operations on data frames
  • Handling date time types
  • Quick note about Numpy

Introduction

In Pandas, a data frame is a two-dimensional structure analogous to tabulated data enclosing rows and columns. The header of a data frame series is known as the column while the header of a data frame row is known as an index.

import pandas as pd
df = pd.read_csv("sample_data/ww2_leaders.csv")
print(f"Type of df: {type(df)}\n")
df
Type of df: <class 'pandas.core.frame.DataFrame'>
Name Born Died Age Title Country
0 Franklin Roosevelt 1882-01-30 1945-04-12 63 President United States
1 Joseph Stalin 1878-12-06 1953-03-05 74 Great Leader Soviet Union
2 Adolph Hitler 1889-04-20 1945-04-30 56 Fuhrer Germany
3 Michinomiya Hirohito 1901-04-29 1989-01-07 87 Emperor Japan
4 Charles de Gaulle 1890-11-22 1970-11-09 79 President France
5 Winston Churchill 1874-11-30 1965-01-24 90 Prime Minister United Kingdom
6 Manuel Camacho 1897-04-24 1955-10-13 58 President Mexico
7 Jan Smuts 1870-05-24 1950-09-11 80 Prime Minister South Africa
8 Ibn Saud 1875-01-15 1953-11-09 78 King Saudi Arabia
9 Plaek Phibunsongkhram 1897-07-14 1965-06-11 66 Prime Minister Thailand
10 John Curtin 1885-01-08 1945-07-05 60 Prime Minister Australia
11 Haile Selassie 1892-07-23 1975-08-27 83 Emperor Ethiopia

In our input file for heads of state during World War 2, [“Name”, “Born”, “Died”, “Age”, “Title”, “Country”] make up the list of data frame columns. Since we didn’t provide any labels for the row indices, by default Pandas supplies them for us, labeling the row headers from zero onwards.

Our first operation on this data frame is to isolate a single column. We do so using the dictionary-like [ ] operator, supplying the column label.

df['Name']
0        Franklin Roosevelt
1             Joseph Stalin
2             Adolph Hitler
3      Michinomiya Hirohito
4         Charles de Gaulle
5         Winston Churchill
6            Manuel Camacho
7                 Jan Smuts
8                  Ibn Saud
9     Plaek Phibunsongkhram
10              John Curtin
11           Haile Selassie
Name: Name, dtype: object

Similarly we can quickly isolate a data frame row by means of the Panda’s loc[] method.

df.loc[10]
Name          John Curtin
Born           1885-01-08
Died           1945-07-05
Age                    60
Title      Prime Minister
Country         Australia
Name: 10, dtype: object

We’ll have more to say about extracting elements of a data frame in the subsequent sections on rows and columns.

Relationship Between Data Frames and Series

In most Pandas tutorials, an individual column in a data frame is a Pandas series. But we have to more flexible with that definition, as Pandas reports both the type of a row and the type of a column as series. It’s probably best to think of a series as a vector of data – heterogenous data in a row, and homogeneous data in a column – with an attached header, called “index” and “column”.

print(type(df['Name']))
<class 'pandas.core.series.Series'>
print(type(df.loc[0]))
<class 'pandas.core.series.Series'>

We can create a Pandas data frame from using Python dictionaries. Conceptually, the dictionary key becomes the data frame column label and the dictionary value becomes the collection of column values. In practice, we use Pandas series to provison a column, with data and (optionally) the index label that comes with it.

leader_name_series = pd.Series(["Franklin Roosevelt", "Joseph Stalin", "Adolph Hitler"])
leader_age_series = pd.Series([63, 74, 56])
leader_country_series = pd.Series(["United States", "Soviet Union", "Germany"])
leader_df = pd.DataFrame({"Name": leader_name_series, "Age": leader_age_series, "Country": leader_country_series})
leader_df
Name Age Country
0 Franklin Roosevelt 63 United States
1 Joseph Stalin 74 Soviet Union
2 Adolph Hitler 56 Germany
print(f"Type of leader_df: {type(leader_df)}")
Type of leader_df: <class 'pandas.core.frame.DataFrame'>

Pandas handles the cases where the shape of the columnar series do not line up. In our example below, the two series have indices that line up, but their data don’t. When creating the data frame, Pandas makes up for the incongruity by slotting NaN in the missing data for col02.

col01_series = pd.Series(["aaa", "bbb", "ccc", "ddd", "eee"], index = ["row01", "row02", "row03", "row04", "row05"])
col02_series = pd.Series([100, 200, 300], index = ["row02", "row03", "row05"])
string_int_df = pd.DataFrame({"col01": col01_series, "col02": col02_series})
string_int_df
col01 col02
row01 aaa NaN
row02 bbb 100.0
row03 ccc 200.0
row04 ddd NaN
row05 eee 300.0

Taking Slices of Columns

Let’s head back to our World War 2 leaders data frame.


df = pd.read_csv("sample_data/ww2_leaders.csv")
print(f"Type of df: {type(df)}\n")
df
Type of df: <class 'pandas.core.frame.DataFrame'>
Name Born Died Age Title Country
0 Franklin Roosevelt 1882-01-30 1945-04-12 63 President United States
1 Joseph Stalin 1878-12-06 1953-03-05 74 Great Leader Soviet Union
2 Adolph Hitler 1889-04-20 1945-04-30 56 Fuhrer Germany
3 Michinomiya Hirohito 1901-04-29 1989-01-07 87 Emperor Japan
4 Charles de Gaulle 1890-11-22 1970-11-09 79 President France
5 Winston Churchill 1874-11-30 1965-01-24 90 Prime Minister United Kingdom
6 Manuel Camacho 1897-04-24 1955-10-13 58 President Mexico
7 Jan Smuts 1870-05-24 1950-09-11 80 Prime Minister South Africa
8 Ibn Saud 1875-01-15 1953-11-09 78 King Saudi Arabia
9 Plaek Phibunsongkhram 1897-07-14 1965-06-11 66 Prime Minister Thailand
10 John Curtin 1885-01-08 1945-07-05 60 Prime Minister Australia
11 Haile Selassie 1892-07-23 1975-08-27 83 Emperor Ethiopia

Using the column labels we can take a series, one at a time from the data frame.

df['Title']
0          President
1       Great Leader
2             Fuhrer
3            Emperor
4          President
5     Prime Minister
6          President
7     Prime Minister
8               King
9     Prime Minister
10    Prime Minister
11           Emperor
Name: Title, dtype: object
df[['Title', 'Name']]
Title Name
0 President Franklin Roosevelt
1 Great Leader Joseph Stalin
2 Fuhrer Adolph Hitler
3 Emperor Michinomiya Hirohito
4 President Charles de Gaulle
5 Prime Minister Winston Churchill
6 President Manuel Camacho
7 Prime Minister Jan Smuts
8 King Ibn Saud
9 Prime Minister Plaek Phibunsongkhram
10 Prime Minister John Curtin
11 Emperor Haile Selassie

Alternately, we can isolate a data frame column by its attribute name and sort it accordingly.

df.Name.sort_values()
2             Adolph Hitler
4         Charles de Gaulle
0        Franklin Roosevelt
11           Haile Selassie
8                  Ibn Saud
7                 Jan Smuts
10              John Curtin
1             Joseph Stalin
6            Manuel Camacho
3      Michinomiya Hirohito
9     Plaek Phibunsongkhram
5         Winston Churchill
Name: Name, dtype: object

Taking Slices of Rows

A quick way to display n number of rows beginning with the first one is to use the head() method. In its default form it displays the first five rows.

df.head()
Name Born Died Age Title Country
0 Franklin Roosevelt 1882-01-30 1945-04-12 63 President United States
1 Joseph Stalin 1878-12-06 1953-03-05 74 Great Leader Soviet Union
2 Adolph Hitler 1889-04-20 1945-04-30 56 Fuhrer Germany
3 Michinomiya Hirohito 1901-04-29 1989-01-07 87 Emperor Japan
4 Charles de Gaulle 1890-11-22 1970-11-09 79 President France

If we wanted to change the number of rows returned, we’ll need to pass the n parameter.

df.head(n = 3)
Name Born Died Age Title Country
0 Franklin Roosevelt 1882-01-30 1945-04-12 63 President United States
1 Joseph Stalin 1878-12-06 1953-03-05 74 Great Leader Soviet Union
2 Adolph Hitler 1889-04-20 1945-04-30 56 Fuhrer Germany

When we pass a negative n value, Pandas will return all rows except for the last n ones.

df.head(n = -4)
Name Born Died Age Title Country
0 Franklin Roosevelt 1882-01-30 1945-04-12 63 President United States
1 Joseph Stalin 1878-12-06 1953-03-05 74 Great Leader Soviet Union
2 Adolph Hitler 1889-04-20 1945-04-30 56 Fuhrer Germany
3 Michinomiya Hirohito 1901-04-29 1989-01-07 87 Emperor Japan
4 Charles de Gaulle 1890-11-22 1970-11-09 79 President France
5 Winston Churchill 1874-11-30 1965-01-24 90 Prime Minister United Kingdom
6 Manuel Camacho 1897-04-24 1955-10-13 58 President Mexico
7 Jan Smuts 1870-05-24 1950-09-11 80 Prime Minister South Africa

We can also take slices (ranges) of data frame rows using the loc[ ] or iloc[ ] operators. What is the difference between the two?

By default, all rows in a data frame have zero-based indices associated with them. They also have label-based indices. If we supply these labels when the data frame is created, then the series will have the zero-based and the label-based indices. But if we skip the label indices, Pandas will duplicate the default zero-based indices into label indices.

The loc[ ] is easier to grasp but no less idiosyncratic. This operator isolates rows based on labels and includes the last element of the range passed in it – very un-Python like semantics here. Meanwhile the iloc[ ] operator relies on the zero-based indices and does not include the last element of the range passed in it.

df.loc[1]
Name       Joseph Stalin
Born          1878-12-06
Died          1953-03-05
Age                   74
Title       Great Leader
Country     Soviet Union
Name: 1, dtype: object
df.loc['2':'5']
Name Born Died Age Title Country
2 Adolph Hitler 1889-04-20 1945-04-30 56 Fuhrer Germany
3 Michinomiya Hirohito 1901-04-29 1989-01-07 87 Emperor Japan
4 Charles de Gaulle 1890-11-22 1970-11-09 79 President France
5 Winston Churchill 1874-11-30 1965-01-24 90 Prime Minister United Kingdom
df.loc['1':'1']
Name Born Died Age Title Country
1 Joseph Stalin 1878-12-06 1953-03-05 74 Great Leader Soviet Union

Because our data frame of World War 2 leaders did not come with row labels, Pandas supplied default labels based on the zero-based indices. Note that we had to use loc[‘1’:‘1’] and not loc[‘1’]. The later will result in KeyError exception.

The loc[ ] operator also allows us to enumerate combination of rows and columns.

df.loc[[0, 5, 10]]
Name Born Died Age Title Country
0 Franklin Roosevelt 1882-01-30 1945-04-12 63 President United States
5 Winston Churchill 1874-11-30 1965-01-24 90 Prime Minister United Kingdom
10 John Curtin 1885-01-08 1945-07-05 60 Prime Minister Australia
df.loc[0, 'Name']
'Franklin Roosevelt'
df.loc[[0, 5, 10], ['Name', 'Title']]
Name Title
0 Franklin Roosevelt President
5 Winston Churchill Prime Minister
10 John Curtin Prime Minister

As the following examples show, with the iloc[ ] operator we’re closer to familiar ground. It’s syntax for isolating a data frame row and for taking ranges of rows are closer to canonical Python.

df.iloc[2]
Name       Adolph Hitler
Born          1889-04-20
Died          1945-04-30
Age                   56
Title             Fuhrer
Country          Germany
Name: 2, dtype: object
df.iloc[2:5]
Name Born Died Age Title Country
2 Adolph Hitler 1889-04-20 1945-04-30 56 Fuhrer Germany
3 Michinomiya Hirohito 1901-04-29 1989-01-07 87 Emperor Japan
4 Charles de Gaulle 1890-11-22 1970-11-09 79 President France
df.iloc[[2,4]]
Name Born Died Age Title Country
2 Adolph Hitler 1889-04-20 1945-04-30 56 Fuhrer Germany
4 Charles de Gaulle 1890-11-22 1970-11-09 79 President France
df.iloc[4:1:-1]
Name Born Died Age Title Country
4 Charles de Gaulle 1890-11-22 1970-11-09 79 President France
3 Michinomiya Hirohito 1901-04-29 1989-01-07 87 Emperor Japan
2 Adolph Hitler 1889-04-20 1945-04-30 56 Fuhrer Germany
df.iloc[[2, 11]]
Name Born Died Age Title Country
2 Adolph Hitler 1889-04-20 1945-04-30 56 Fuhrer Germany
11 Haile Selassie 1892-07-23 1975-08-27 83 Emperor Ethiopia

We can also use iloc[ ] to list combination of rows and columns, but we have to keep in mind the iloc[ ] can not handle label-based indices.

df.iloc[[2, 11], [4, 0]]
Title Name
2 Fuhrer Adolph Hitler
11 Emperor Haile Selassie

Filtering Rows Using Boolean Logic

We can pick a column to filter the data frame rows. For example, say we want to isolate the rows of leaders whose age is less than or equal to the average age for the whole data set. We do so also by means of the [ ] operator, but this time passing a filtering statement within.

df[df['Title'] == 'President']
Name Born Died Age Title Country
0 Franklin Roosevelt 1882-01-30 1945-04-12 63 President United States
4 Charles de Gaulle 1890-11-22 1970-11-09 79 President France
6 Manuel Camacho 1897-04-24 1955-10-13 58 President Mexico

In the example above, we filter rows that match the string ‘President’. Let’s turn to another example that involves the data on age.

df.describe()
Age
count 12.000000
mean 72.833333
std 11.784684
min 56.000000
25% 62.250000
50% 76.000000
75% 80.750000
max 90.000000
df[df['Age'] <= df['Age'].mean()]
Name Born Died Age Title Country
0 Franklin Roosevelt 1882-01-30 1945-04-12 63 President United States
2 Adolph Hitler 1889-04-20 1945-04-30 56 Fuhrer Germany
6 Manuel Camacho 1897-04-24 1955-10-13 58 President Mexico
9 Plaek Phibunsongkhram 1897-07-14 1965-06-11 66 Prime Minister Thailand
10 John Curtin 1885-01-08 1945-07-05 60 Prime Minister Australia

Behind the scenes the statement df[‘Age’] <= df[‘Age’].mean() evaluates to a list of boolean values which Pandas uses to filter the data frame rows.

We can manually create the vector of booleans ourselves to filter the data frame without referencing any column.

potentates = [False, True, True, True, False, False, False, False, True, False, False, True]
df[potentates]
Name Born Died Age Title Country
1 Joseph Stalin 1878-12-06 1953-03-05 74 Great Leader Soviet Union
2 Adolph Hitler 1889-04-20 1945-04-30 56 Fuhrer Germany
3 Michinomiya Hirohito 1901-04-29 1989-01-07 87 Emperor Japan
8 Ibn Saud 1875-01-15 1953-11-09 78 King Saudi Arabia
11 Haile Selassie 1892-07-23 1975-08-27 83 Emperor Ethiopia
democratic_leaders = list(map(lambda x: not x, potentates))
df[democratic_leaders]
Name Born Died Age Title Country
0 Franklin Roosevelt 1882-01-30 1945-04-12 63 President United States
4 Charles de Gaulle 1890-11-22 1970-11-09 79 President France
5 Winston Churchill 1874-11-30 1965-01-24 90 Prime Minister United Kingdom
6 Manuel Camacho 1897-04-24 1955-10-13 58 President Mexico
7 Jan Smuts 1870-05-24 1950-09-11 80 Prime Minister South Africa
9 Plaek Phibunsongkhram 1897-07-14 1965-06-11 66 Prime Minister Thailand
10 John Curtin 1885-01-08 1945-07-05 60 Prime Minister Australia

Vector Operations on Data Frames

We can also perform vectorized operations on data frames. In the following example, multiplying by two doubles the values in every numeric cell. Similarly, strings are concatenated with each other, as per the rules in Python.

df * 2
Name Born Died Age Title Country
0 Franklin RooseveltFranklin Roosevelt 1882-01-301882-01-30 1945-04-121945-04-12 126 PresidentPresident United StatesUnited States
1 Joseph StalinJoseph Stalin 1878-12-061878-12-06 1953-03-051953-03-05 148 Great LeaderGreat Leader Soviet UnionSoviet Union
2 Adolph HitlerAdolph Hitler 1889-04-201889-04-20 1945-04-301945-04-30 112 FuhrerFuhrer GermanyGermany
3 Michinomiya HirohitoMichinomiya Hirohito 1901-04-291901-04-29 1989-01-071989-01-07 174 EmperorEmperor JapanJapan
4 Charles de GaulleCharles de Gaulle 1890-11-221890-11-22 1970-11-091970-11-09 158 PresidentPresident FranceFrance
5 Winston ChurchillWinston Churchill 1874-11-301874-11-30 1965-01-241965-01-24 180 Prime MinisterPrime Minister United KingdomUnited Kingdom
6 Manuel CamachoManuel Camacho 1897-04-241897-04-24 1955-10-131955-10-13 116 PresidentPresident MexicoMexico
7 Jan SmutsJan Smuts 1870-05-241870-05-24 1950-09-111950-09-11 160 Prime MinisterPrime Minister South AfricaSouth Africa
8 Ibn SaudIbn Saud 1875-01-151875-01-15 1953-11-091953-11-09 156 KingKing Saudi ArabiaSaudi Arabia
9 Plaek PhibunsongkhramPlaek Phibunsongkhram 1897-07-141897-07-14 1965-06-111965-06-11 132 Prime MinisterPrime Minister ThailandThailand
10 John CurtinJohn Curtin 1885-01-081885-01-08 1945-07-051945-07-05 120 Prime MinisterPrime Minister AustraliaAustralia
11 Haile SelassieHaile Selassie 1892-07-231892-07-23 1975-08-271975-08-27 166 EmperorEmperor EthiopiaEthiopia

Handling Date and Time Types

Our information about World War 2 leaders presents a good opportunity to discuss date and time types in Pandas. Right now the Born and Died columns are of type object, which is essentially of type string. To carry out operations on these, we first need to convert them to Timestamp types. We accomplish this using the Pandas to_datetime() method.

born_dt = pd.to_datetime(df.Born, format = '%Y-%m-%d')
print(f"Type of born_dt: {type(born_dt)}\nType of each born_dt element: {type(born_dt[0])}")
Type of born_dt: <class 'pandas.core.series.Series'>
Type of each born_dt element: <class 'pandas._libs.tslibs.timestamps.Timestamp'>

The to_datetime() method returns a series of time stamps. The dtype is datetime64 which allows us to apply arithmetic operations on these time stamps.

born_dt
0    1882-01-30
1    1878-12-06
2    1889-04-20
3    1901-04-29
4    1890-11-22
5    1874-11-30
6    1897-04-24
7    1870-05-24
8    1875-01-15
9    1897-07-14
10   1885-01-08
11   1892-07-23
Name: Born, dtype: datetime64[ns]

Up till now we haven’t showed how to update data frames. Let’s now create a new recalculate the age of a leader and append these as a new column to our data frame. To preserve our original data frame, lets make a distinct copy before we proceed.

df_copy = df.copy()
df_copy['Born_DT'] = born_dt
died_dt = pd.to_datetime(df.Died, format = '%Y-%m-%d')
df_copy['Died_DT'] = died_dt
df_copy
Name Born Died Age Title Country Born_DT Died_DT
0 Franklin Roosevelt 1882-01-30 1945-04-12 63 President United States 1882-01-30 1945-04-12
1 Joseph Stalin 1878-12-06 1953-03-05 74 Great Leader Soviet Union 1878-12-06 1953-03-05
2 Adolph Hitler 1889-04-20 1945-04-30 56 Fuhrer Germany 1889-04-20 1945-04-30
3 Michinomiya Hirohito 1901-04-29 1989-01-07 87 Emperor Japan 1901-04-29 1989-01-07
4 Charles de Gaulle 1890-11-22 1970-11-09 79 President France 1890-11-22 1970-11-09
5 Winston Churchill 1874-11-30 1965-01-24 90 Prime Minister United Kingdom 1874-11-30 1965-01-24
6 Manuel Camacho 1897-04-24 1955-10-13 58 President Mexico 1897-04-24 1955-10-13
7 Jan Smuts 1870-05-24 1950-09-11 80 Prime Minister South Africa 1870-05-24 1950-09-11
8 Ibn Saud 1875-01-15 1953-11-09 78 King Saudi Arabia 1875-01-15 1953-11-09
9 Plaek Phibunsongkhram 1897-07-14 1965-06-11 66 Prime Minister Thailand 1897-07-14 1965-06-11
10 John Curtin 1885-01-08 1945-07-05 60 Prime Minister Australia 1885-01-08 1945-07-05
11 Haile Selassie 1892-07-23 1975-08-27 83 Emperor Ethiopia 1892-07-23 1975-08-27

Just as with adding new entries in Python dictionaries, we’ve now extended our data frame with the two new columns with Timestamp types. From here we can calculate the leader’s age and apply it to a third new column, first as days then finally recalculated as years.

age_dt = died_dt - born_dt
print(age_dt)
0    23082 days
1    27117 days
2    20463 days
3    32030 days
4    29206 days
5    32927 days
6    21355 days
7    29329 days
8    28787 days
9    24803 days
10   22092 days
11   30349 days
dtype: timedelta64[ns]
df_copy["Age_DT"] = age_dt // pd.Timedelta('365 days')
df_copy
Name Born Died Age Title Country Born_DT Died_DT Age_DT
0 Franklin Roosevelt 1882-01-30 1945-04-12 63 President United States 1882-01-30 1945-04-12 63
1 Joseph Stalin 1878-12-06 1953-03-05 74 Great Leader Soviet Union 1878-12-06 1953-03-05 74
2 Adolph Hitler 1889-04-20 1945-04-30 56 Fuhrer Germany 1889-04-20 1945-04-30 56
3 Michinomiya Hirohito 1901-04-29 1989-01-07 87 Emperor Japan 1901-04-29 1989-01-07 87
4 Charles de Gaulle 1890-11-22 1970-11-09 79 President France 1890-11-22 1970-11-09 80
5 Winston Churchill 1874-11-30 1965-01-24 90 Prime Minister United Kingdom 1874-11-30 1965-01-24 90
6 Manuel Camacho 1897-04-24 1955-10-13 58 President Mexico 1897-04-24 1955-10-13 58
7 Jan Smuts 1870-05-24 1950-09-11 80 Prime Minister South Africa 1870-05-24 1950-09-11 80
8 Ibn Saud 1875-01-15 1953-11-09 78 King Saudi Arabia 1875-01-15 1953-11-09 78
9 Plaek Phibunsongkhram 1897-07-14 1965-06-11 66 Prime Minister Thailand 1897-07-14 1965-06-11 67
10 John Curtin 1885-01-08 1945-07-05 60 Prime Minister Australia 1885-01-08 1945-07-05 60
11 Haile Selassie 1892-07-23 1975-08-27 83 Emperor Ethiopia 1892-07-23 1975-08-27 83

Quick Note About Numpy

Finally we can create Pandas data frames from numPy array.

import numpy as np
np_data = np.random.random((4,5))
df = pd.DataFrame(np_data, index = ['Sample_01', 'Sample_02', 'Sample_03', 'Sample_04'], columns = np.arange(1,6))
df
1 2 3 4 5
Sample_01 0.963639 0.712412 0.760183 0.962716 0.711379
Sample_02 0.606878 0.190531 0.024942 0.951223 0.337302
Sample_03 0.916357 0.655245 0.402755 0.986677 0.074729
Sample_04 0.694711 0.036303 0.994629 0.515983 0.456669