Pandas Basic Analysis 03: Data Frame

Python

Pandas

Jupyter

Data Analytics

Part three of data analysis with Python Pandas: the data frame.

Author

Dennis Chua

Published

May 31, 2025

03 Pandas Data Structure: Data Frame

Content Outline

Introducton
Relationship between data frames and series
Taking slices columns
Taking slices of rows
Filtering rows using boolean logic
Vector operations on data frames
Handling date time types
Quick note about Numpy

Jupyter notebook

WW2 Leaders CSV

Introduction

In Pandas, a data frame is a two-dimensional structure analogous to tabulated data enclosing rows and columns. The header of a data frame series is known as the column while the header of a data frame row is known as an index.

import pandas as pd

df = pd.read_csv("sample_data/ww2_leaders.csv")
print(f"Type of df: {type(df)}\n")
df

Type of df: <class 'pandas.core.frame.DataFrame'>

	Name	Born	Died	Age	Title	Country
0	Franklin Roosevelt	1882-01-30	1945-04-12	63	President	United States
1	Joseph Stalin	1878-12-06	1953-03-05	74	Great Leader	Soviet Union
2	Adolph Hitler	1889-04-20	1945-04-30	56	Fuhrer	Germany
3	Michinomiya Hirohito	1901-04-29	1989-01-07	87	Emperor	Japan
4	Charles de Gaulle	1890-11-22	1970-11-09	79	President	France
5	Winston Churchill	1874-11-30	1965-01-24	90	Prime Minister	United Kingdom
6	Manuel Camacho	1897-04-24	1955-10-13	58	President	Mexico
7	Jan Smuts	1870-05-24	1950-09-11	80	Prime Minister	South Africa
8	Ibn Saud	1875-01-15	1953-11-09	78	King	Saudi Arabia
9	Plaek Phibunsongkhram	1897-07-14	1965-06-11	66	Prime Minister	Thailand
10	John Curtin	1885-01-08	1945-07-05	60	Prime Minister	Australia
11	Haile Selassie	1892-07-23	1975-08-27	83	Emperor	Ethiopia

In our input file for heads of state during World War 2, [“Name”, “Born”, “Died”, “Age”, “Title”, “Country”] make up the list of data frame columns. Since we didn’t provide any labels for the row indices, by default Pandas supplies them for us, labeling the row headers from zero onwards.

Our first operation on this data frame is to isolate a single column. We do so using the dictionary-like [ ] operator, supplying the column label.

df['Name']

0        Franklin Roosevelt
1             Joseph Stalin
2             Adolph Hitler
3      Michinomiya Hirohito
4         Charles de Gaulle
5         Winston Churchill
6            Manuel Camacho
7                 Jan Smuts
8                  Ibn Saud
9     Plaek Phibunsongkhram
10              John Curtin
11           Haile Selassie
Name: Name, dtype: object

Similarly we can quickly isolate a data frame row by means of the Panda’s loc[] method.

df.loc[10]

Name          John Curtin
Born           1885-01-08
Died           1945-07-05
Age                    60
Title      Prime Minister
Country         Australia
Name: 10, dtype: object

We’ll have more to say about extracting elements of a data frame in the subsequent sections on rows and columns.

Relationship Between Data Frames and Series

In most Pandas tutorials, an individual column in a data frame is a Pandas series. But we have to more flexible with that definition, as Pandas reports both the type of a row and the type of a column as series. It’s probably best to think of a series as a vector of data – heterogenous data in a row, and homogeneous data in a column – with an attached header, called “index” and “column”.

print(type(df['Name']))

<class 'pandas.core.series.Series'>

print(type(df.loc[0]))

<class 'pandas.core.series.Series'>

We can create a Pandas data frame from using Python dictionaries. Conceptually, the dictionary key becomes the data frame column label and the dictionary value becomes the collection of column values. In practice, we use Pandas series to provison a column, with data and (optionally) the index label that comes with it.

leader_name_series = pd.Series(["Franklin Roosevelt", "Joseph Stalin", "Adolph Hitler"])
leader_age_series = pd.Series([63, 74, 56])
leader_country_series = pd.Series(["United States", "Soviet Union", "Germany"])
leader_df = pd.DataFrame({"Name": leader_name_series, "Age": leader_age_series, "Country": leader_country_series})
leader_df

	Name	Age	Country
0	Franklin Roosevelt	63	United States
1	Joseph Stalin	74	Soviet Union
2	Adolph Hitler	56	Germany

print(f"Type of leader_df: {type(leader_df)}")

Type of leader_df: <class 'pandas.core.frame.DataFrame'>

Pandas handles the cases where the shape of the columnar series do not line up. In our example below, the two series have indices that line up, but their data don’t. When creating the data frame, Pandas makes up for the incongruity by slotting NaN in the missing data for col02.

col01_series = pd.Series(["aaa", "bbb", "ccc", "ddd", "eee"], index = ["row01", "row02", "row03", "row04", "row05"])
col02_series = pd.Series([100, 200, 300], index = ["row02", "row03", "row05"])
string_int_df = pd.DataFrame({"col01": col01_series, "col02": col02_series})
string_int_df

	col01	col02
row01	aaa	NaN
row02	bbb	100.0
row03	ccc	200.0
row04	ddd	NaN
row05	eee	300.0

Taking Slices of Columns

Let’s head back to our World War 2 leaders data frame.

df = pd.read_csv("sample_data/ww2_leaders.csv")
print(f"Type of df: {type(df)}\n")
df

Type of df: <class 'pandas.core.frame.DataFrame'>

	Name	Born	Died	Age	Title	Country
0	Franklin Roosevelt	1882-01-30	1945-04-12	63	President	United States
1	Joseph Stalin	1878-12-06	1953-03-05	74	Great Leader	Soviet Union
2	Adolph Hitler	1889-04-20	1945-04-30	56	Fuhrer	Germany
3	Michinomiya Hirohito	1901-04-29	1989-01-07	87	Emperor	Japan
4	Charles de Gaulle	1890-11-22	1970-11-09	79	President	France
5	Winston Churchill	1874-11-30	1965-01-24	90	Prime Minister	United Kingdom
6	Manuel Camacho	1897-04-24	1955-10-13	58	President	Mexico
7	Jan Smuts	1870-05-24	1950-09-11	80	Prime Minister	South Africa
8	Ibn Saud	1875-01-15	1953-11-09	78	King	Saudi Arabia
9	Plaek Phibunsongkhram	1897-07-14	1965-06-11	66	Prime Minister	Thailand
10	John Curtin	1885-01-08	1945-07-05	60	Prime Minister	Australia
11	Haile Selassie	1892-07-23	1975-08-27	83	Emperor	Ethiopia

Using the column labels we can take a series, one at a time from the data frame.

df['Title']

0          President
1       Great Leader
2             Fuhrer
3            Emperor
4          President
5     Prime Minister
6          President
7     Prime Minister
8               King
9     Prime Minister
10    Prime Minister
11           Emperor
Name: Title, dtype: object

df[['Title', 'Name']]

	Title	Name
0	President	Franklin Roosevelt
1	Great Leader	Joseph Stalin
2	Fuhrer	Adolph Hitler
3	Emperor	Michinomiya Hirohito
4	President	Charles de Gaulle
5	Prime Minister	Winston Churchill
6	President	Manuel Camacho
7	Prime Minister	Jan Smuts
8	King	Ibn Saud
9	Prime Minister	Plaek Phibunsongkhram
10	Prime Minister	John Curtin
11	Emperor	Haile Selassie

Alternately, we can isolate a data frame column by its attribute name and sort it accordingly.

df.Name.sort_values()

2             Adolph Hitler
4         Charles de Gaulle
0        Franklin Roosevelt
11           Haile Selassie
8                  Ibn Saud
7                 Jan Smuts
10              John Curtin
1             Joseph Stalin
6            Manuel Camacho
3      Michinomiya Hirohito
9     Plaek Phibunsongkhram
5         Winston Churchill
Name: Name, dtype: object

Taking Slices of Rows

A quick way to display n number of rows beginning with the first one is to use the head() method. In its default form it displays the first five rows.

df.head()

	Name	Born	Died	Age	Title	Country
0	Franklin Roosevelt	1882-01-30	1945-04-12	63	President	United States
1	Joseph Stalin	1878-12-06	1953-03-05	74	Great Leader	Soviet Union
2	Adolph Hitler	1889-04-20	1945-04-30	56	Fuhrer	Germany
3	Michinomiya Hirohito	1901-04-29	1989-01-07	87	Emperor	Japan
4	Charles de Gaulle	1890-11-22	1970-11-09	79	President	France

If we wanted to change the number of rows returned, we’ll need to pass the n parameter.

df.head(n = 3)

	Name	Born	Died	Age	Title	Country
0	Franklin Roosevelt	1882-01-30	1945-04-12	63	President	United States
1	Joseph Stalin	1878-12-06	1953-03-05	74	Great Leader	Soviet Union
2	Adolph Hitler	1889-04-20	1945-04-30	56	Fuhrer	Germany

When we pass a negative n value, Pandas will return all rows except for the last n ones.

df.head(n = -4)

	Name	Born	Died	Age	Title	Country
0	Franklin Roosevelt	1882-01-30	1945-04-12	63	President	United States
1	Joseph Stalin	1878-12-06	1953-03-05	74	Great Leader	Soviet Union
2	Adolph Hitler	1889-04-20	1945-04-30	56	Fuhrer	Germany
3	Michinomiya Hirohito	1901-04-29	1989-01-07	87	Emperor	Japan
4	Charles de Gaulle	1890-11-22	1970-11-09	79	President	France
5	Winston Churchill	1874-11-30	1965-01-24	90	Prime Minister	United Kingdom
6	Manuel Camacho	1897-04-24	1955-10-13	58	President	Mexico
7	Jan Smuts	1870-05-24	1950-09-11	80	Prime Minister	South Africa

We can also take slices (ranges) of data frame rows using the loc[ ] or iloc[ ] operators. What is the difference between the two?

By default, all rows in a data frame have zero-based indices associated with them. They also have label-based indices. If we supply these labels when the data frame is created, then the series will have the zero-based and the label-based indices. But if we skip the label indices, Pandas will duplicate the default zero-based indices into label indices.

The loc[ ] is easier to grasp but no less idiosyncratic. This operator isolates rows based on labels and includes the last element of the range passed in it – very un-Python like semantics here. Meanwhile the iloc[ ] operator relies on the zero-based indices and does not include the last element of the range passed in it.

df.loc[1]

Name       Joseph Stalin
Born          1878-12-06
Died          1953-03-05
Age                   74
Title       Great Leader
Country     Soviet Union
Name: 1, dtype: object

df.loc['2':'5']

	Name	Born	Died	Age	Title	Country
2	Adolph Hitler	1889-04-20	1945-04-30	56	Fuhrer	Germany
3	Michinomiya Hirohito	1901-04-29	1989-01-07	87	Emperor	Japan
4	Charles de Gaulle	1890-11-22	1970-11-09	79	President	France
5	Winston Churchill	1874-11-30	1965-01-24	90	Prime Minister	United Kingdom

df.loc['1':'1']

	Name	Born	Died	Age	Title	Country
1	Joseph Stalin	1878-12-06	1953-03-05	74	Great Leader	Soviet Union

Because our data frame of World War 2 leaders did not come with row labels, Pandas supplied default labels based on the zero-based indices. Note that we had to use loc[‘1’:‘1’] and not loc[‘1’]. The later will result in KeyError exception.

The loc[ ] operator also allows us to enumerate combination of rows and columns.

df.loc[[0, 5, 10]]

	Name	Born	Died	Age	Title	Country
0	Franklin Roosevelt	1882-01-30	1945-04-12	63	President	United States
5	Winston Churchill	1874-11-30	1965-01-24	90	Prime Minister	United Kingdom
10	John Curtin	1885-01-08	1945-07-05	60	Prime Minister	Australia

df.loc[0, 'Name']

'Franklin Roosevelt'

df.loc[[0, 5, 10], ['Name', 'Title']]

	Name	Title
0	Franklin Roosevelt	President
5	Winston Churchill	Prime Minister
10	John Curtin	Prime Minister

As the following examples show, with the iloc[ ] operator we’re closer to familiar ground. It’s syntax for isolating a data frame row and for taking ranges of rows are closer to canonical Python.

df.iloc[2]

Name       Adolph Hitler
Born          1889-04-20
Died          1945-04-30
Age                   56
Title             Fuhrer
Country          Germany
Name: 2, dtype: object

df.iloc[2:5]

	Name	Born	Died	Age	Title	Country
2	Adolph Hitler	1889-04-20	1945-04-30	56	Fuhrer	Germany
3	Michinomiya Hirohito	1901-04-29	1989-01-07	87	Emperor	Japan
4	Charles de Gaulle	1890-11-22	1970-11-09	79	President	France

df.iloc[[2,4]]

	Name	Born	Died	Age	Title	Country
2	Adolph Hitler	1889-04-20	1945-04-30	56	Fuhrer	Germany
4	Charles de Gaulle	1890-11-22	1970-11-09	79	President	France

df.iloc[4:1:-1]

	Name	Born	Died	Age	Title	Country
4	Charles de Gaulle	1890-11-22	1970-11-09	79	President	France
3	Michinomiya Hirohito	1901-04-29	1989-01-07	87	Emperor	Japan
2	Adolph Hitler	1889-04-20	1945-04-30	56	Fuhrer	Germany

df.iloc[[2, 11]]

	Name	Born	Died	Age	Title	Country
2	Adolph Hitler	1889-04-20	1945-04-30	56	Fuhrer	Germany
11	Haile Selassie	1892-07-23	1975-08-27	83	Emperor	Ethiopia

We can also use iloc[ ] to list combination of rows and columns, but we have to keep in mind the iloc[ ] can not handle label-based indices.

df.iloc[[2, 11], [4, 0]]

	Title	Name
2	Fuhrer	Adolph Hitler
11	Emperor	Haile Selassie

Filtering Rows Using Boolean Logic

We can pick a column to filter the data frame rows. For example, say we want to isolate the rows of leaders whose age is less than or equal to the average age for the whole data set. We do so also by means of the [ ] operator, but this time passing a filtering statement within.

df[df['Title'] == 'President']

	Name	Born	Died	Age	Title	Country
0	Franklin Roosevelt	1882-01-30	1945-04-12	63	President	United States
4	Charles de Gaulle	1890-11-22	1970-11-09	79	President	France
6	Manuel Camacho	1897-04-24	1955-10-13	58	President	Mexico

In the example above, we filter rows that match the string ‘President’. Let’s turn to another example that involves the data on age.

df.describe()

	Age
count	12.000000
mean	72.833333
std	11.784684
min	56.000000
25%	62.250000
50%	76.000000
75%	80.750000
max	90.000000

df[df['Age'] <= df['Age'].mean()]

	Name	Born	Died	Age	Title	Country
0	Franklin Roosevelt	1882-01-30	1945-04-12	63	President	United States
2	Adolph Hitler	1889-04-20	1945-04-30	56	Fuhrer	Germany
6	Manuel Camacho	1897-04-24	1955-10-13	58	President	Mexico
9	Plaek Phibunsongkhram	1897-07-14	1965-06-11	66	Prime Minister	Thailand
10	John Curtin	1885-01-08	1945-07-05	60	Prime Minister	Australia

Behind the scenes the statement df[‘Age’] <= df[‘Age’].mean() evaluates to a list of boolean values which Pandas uses to filter the data frame rows.

We can manually create the vector of booleans ourselves to filter the data frame without referencing any column.

potentates = [False, True, True, True, False, False, False, False, True, False, False, True]
df[potentates]

	Name	Born	Died	Age	Title	Country
1	Joseph Stalin	1878-12-06	1953-03-05	74	Great Leader	Soviet Union
2	Adolph Hitler	1889-04-20	1945-04-30	56	Fuhrer	Germany
3	Michinomiya Hirohito	1901-04-29	1989-01-07	87	Emperor	Japan
8	Ibn Saud	1875-01-15	1953-11-09	78	King	Saudi Arabia
11	Haile Selassie	1892-07-23	1975-08-27	83	Emperor	Ethiopia

democratic_leaders = list(map(lambda x: not x, potentates))
df[democratic_leaders]

	Name	Born	Died	Age	Title	Country
0	Franklin Roosevelt	1882-01-30	1945-04-12	63	President	United States
4	Charles de Gaulle	1890-11-22	1970-11-09	79	President	France
5	Winston Churchill	1874-11-30	1965-01-24	90	Prime Minister	United Kingdom
6	Manuel Camacho	1897-04-24	1955-10-13	58	President	Mexico
7	Jan Smuts	1870-05-24	1950-09-11	80	Prime Minister	South Africa
9	Plaek Phibunsongkhram	1897-07-14	1965-06-11	66	Prime Minister	Thailand
10	John Curtin	1885-01-08	1945-07-05	60	Prime Minister	Australia

Vector Operations on Data Frames

We can also perform vectorized operations on data frames. In the following example, multiplying by two doubles the values in every numeric cell. Similarly, strings are concatenated with each other, as per the rules in Python.

df * 2

	Name	Born	Died	Age	Title	Country
0	Franklin RooseveltFranklin Roosevelt	1882-01-301882-01-30	1945-04-121945-04-12	126	PresidentPresident	United StatesUnited States
1	Joseph StalinJoseph Stalin	1878-12-061878-12-06	1953-03-051953-03-05	148	Great LeaderGreat Leader	Soviet UnionSoviet Union
2	Adolph HitlerAdolph Hitler	1889-04-201889-04-20	1945-04-301945-04-30	112	FuhrerFuhrer	GermanyGermany
3	Michinomiya HirohitoMichinomiya Hirohito	1901-04-291901-04-29	1989-01-071989-01-07	174	EmperorEmperor	JapanJapan
4	Charles de GaulleCharles de Gaulle	1890-11-221890-11-22	1970-11-091970-11-09	158	PresidentPresident	FranceFrance
5	Winston ChurchillWinston Churchill	1874-11-301874-11-30	1965-01-241965-01-24	180	Prime MinisterPrime Minister	United KingdomUnited Kingdom
6	Manuel CamachoManuel Camacho	1897-04-241897-04-24	1955-10-131955-10-13	116	PresidentPresident	MexicoMexico
7	Jan SmutsJan Smuts	1870-05-241870-05-24	1950-09-111950-09-11	160	Prime MinisterPrime Minister	South AfricaSouth Africa
8	Ibn SaudIbn Saud	1875-01-151875-01-15	1953-11-091953-11-09	156	KingKing	Saudi ArabiaSaudi Arabia
9	Plaek PhibunsongkhramPlaek Phibunsongkhram	1897-07-141897-07-14	1965-06-111965-06-11	132	Prime MinisterPrime Minister	ThailandThailand
10	John CurtinJohn Curtin	1885-01-081885-01-08	1945-07-051945-07-05	120	Prime MinisterPrime Minister	AustraliaAustralia
11	Haile SelassieHaile Selassie	1892-07-231892-07-23	1975-08-271975-08-27	166	EmperorEmperor	EthiopiaEthiopia

Handling Date and Time Types

Our information about World War 2 leaders presents a good opportunity to discuss date and time types in Pandas. Right now the Born and Died columns are of type object, which is essentially of type string. To carry out operations on these, we first need to convert them to Timestamp types. We accomplish this using the Pandas to_datetime() method.

born_dt = pd.to_datetime(df.Born, format = '%Y-%m-%d')
print(f"Type of born_dt: {type(born_dt)}\nType of each born_dt element: {type(born_dt[0])}")

Type of born_dt: <class 'pandas.core.series.Series'>
Type of each born_dt element: <class 'pandas._libs.tslibs.timestamps.Timestamp'>

The to_datetime() method returns a series of time stamps. The dtype is datetime64 which allows us to apply arithmetic operations on these time stamps.

born_dt

0    1882-01-30
1    1878-12-06
2    1889-04-20
3    1901-04-29
4    1890-11-22
5    1874-11-30
6    1897-04-24
7    1870-05-24
8    1875-01-15
9    1897-07-14
10   1885-01-08
11   1892-07-23
Name: Born, dtype: datetime64[ns]

Up till now we haven’t showed how to update data frames. Let’s now create a new recalculate the age of a leader and append these as a new column to our data frame. To preserve our original data frame, lets make a distinct copy before we proceed.

df_copy = df.copy()
df_copy['Born_DT'] = born_dt
died_dt = pd.to_datetime(df.Died, format = '%Y-%m-%d')
df_copy['Died_DT'] = died_dt
df_copy

	Name	Born	Died	Age	Title	Country	Born_DT	Died_DT
0	Franklin Roosevelt	1882-01-30	1945-04-12	63	President	United States	1882-01-30	1945-04-12
1	Joseph Stalin	1878-12-06	1953-03-05	74	Great Leader	Soviet Union	1878-12-06	1953-03-05
2	Adolph Hitler	1889-04-20	1945-04-30	56	Fuhrer	Germany	1889-04-20	1945-04-30
3	Michinomiya Hirohito	1901-04-29	1989-01-07	87	Emperor	Japan	1901-04-29	1989-01-07
4	Charles de Gaulle	1890-11-22	1970-11-09	79	President	France	1890-11-22	1970-11-09
5	Winston Churchill	1874-11-30	1965-01-24	90	Prime Minister	United Kingdom	1874-11-30	1965-01-24
6	Manuel Camacho	1897-04-24	1955-10-13	58	President	Mexico	1897-04-24	1955-10-13
7	Jan Smuts	1870-05-24	1950-09-11	80	Prime Minister	South Africa	1870-05-24	1950-09-11
8	Ibn Saud	1875-01-15	1953-11-09	78	King	Saudi Arabia	1875-01-15	1953-11-09
9	Plaek Phibunsongkhram	1897-07-14	1965-06-11	66	Prime Minister	Thailand	1897-07-14	1965-06-11
10	John Curtin	1885-01-08	1945-07-05	60	Prime Minister	Australia	1885-01-08	1945-07-05
11	Haile Selassie	1892-07-23	1975-08-27	83	Emperor	Ethiopia	1892-07-23	1975-08-27

Just as with adding new entries in Python dictionaries, we’ve now extended our data frame with the two new columns with Timestamp types. From here we can calculate the leader’s age and apply it to a third new column, first as days then finally recalculated as years.

age_dt = died_dt - born_dt
print(age_dt)

0    23082 days
1    27117 days
2    20463 days
3    32030 days
4    29206 days
5    32927 days
6    21355 days
7    29329 days
8    28787 days
9    24803 days
10   22092 days
11   30349 days
dtype: timedelta64[ns]

df_copy["Age_DT"] = age_dt // pd.Timedelta('365 days')
df_copy

	Name	Born	Died	Age	Title	Country	Born_DT	Died_DT	Age_DT
0	Franklin Roosevelt	1882-01-30	1945-04-12	63	President	United States	1882-01-30	1945-04-12	63
1	Joseph Stalin	1878-12-06	1953-03-05	74	Great Leader	Soviet Union	1878-12-06	1953-03-05	74
2	Adolph Hitler	1889-04-20	1945-04-30	56	Fuhrer	Germany	1889-04-20	1945-04-30	56
3	Michinomiya Hirohito	1901-04-29	1989-01-07	87	Emperor	Japan	1901-04-29	1989-01-07	87
4	Charles de Gaulle	1890-11-22	1970-11-09	79	President	France	1890-11-22	1970-11-09	80
5	Winston Churchill	1874-11-30	1965-01-24	90	Prime Minister	United Kingdom	1874-11-30	1965-01-24	90
6	Manuel Camacho	1897-04-24	1955-10-13	58	President	Mexico	1897-04-24	1955-10-13	58
7	Jan Smuts	1870-05-24	1950-09-11	80	Prime Minister	South Africa	1870-05-24	1950-09-11	80
8	Ibn Saud	1875-01-15	1953-11-09	78	King	Saudi Arabia	1875-01-15	1953-11-09	78
9	Plaek Phibunsongkhram	1897-07-14	1965-06-11	66	Prime Minister	Thailand	1897-07-14	1965-06-11	67
10	John Curtin	1885-01-08	1945-07-05	60	Prime Minister	Australia	1885-01-08	1945-07-05	60
11	Haile Selassie	1892-07-23	1975-08-27	83	Emperor	Ethiopia	1892-07-23	1975-08-27	83

Quick Note About Numpy

Finally we can create Pandas data frames from numPy array.

import numpy as np
np_data = np.random.random((4,5))
df = pd.DataFrame(np_data, index = ['Sample_01', 'Sample_02', 'Sample_03', 'Sample_04'], columns = np.arange(1,6))
df

	1	2	3	4	5
Sample_01	0.748195	0.611704	0.304087	0.119463	0.598989
Sample_02	0.995958	0.949418	0.965212	0.977375	0.772594
Sample_03	0.375183	0.268562	0.189580	0.596075	0.397792
Sample_04	0.825419	0.020657	0.463567	0.731586	0.669615