import pandas as pd
Pandas Basic Analysis 03: Data Frame
03 Pandas Data Structure: Data Frame
Content Outline
- Introducton
- Relationship between data frames and series
- Taking slices columns
- Taking slices of rows
- Filtering rows using boolean logic
- Vector operations on data frames
- Handling date time types
- Quick note about Numpy
Introduction
In Pandas, a data frame is a two-dimensional structure analogous to tabulated data enclosing rows and columns. The header of a data frame series is known as the column while the header of a data frame row is known as an index.
= pd.read_csv("sample_data/ww2_leaders.csv")
df print(f"Type of df: {type(df)}\n")
df
Type of df: <class 'pandas.core.frame.DataFrame'>
Name | Born | Died | Age | Title | Country | |
---|---|---|---|---|---|---|
0 | Franklin Roosevelt | 1882-01-30 | 1945-04-12 | 63 | President | United States |
1 | Joseph Stalin | 1878-12-06 | 1953-03-05 | 74 | Great Leader | Soviet Union |
2 | Adolph Hitler | 1889-04-20 | 1945-04-30 | 56 | Fuhrer | Germany |
3 | Michinomiya Hirohito | 1901-04-29 | 1989-01-07 | 87 | Emperor | Japan |
4 | Charles de Gaulle | 1890-11-22 | 1970-11-09 | 79 | President | France |
5 | Winston Churchill | 1874-11-30 | 1965-01-24 | 90 | Prime Minister | United Kingdom |
6 | Manuel Camacho | 1897-04-24 | 1955-10-13 | 58 | President | Mexico |
7 | Jan Smuts | 1870-05-24 | 1950-09-11 | 80 | Prime Minister | South Africa |
8 | Ibn Saud | 1875-01-15 | 1953-11-09 | 78 | King | Saudi Arabia |
9 | Plaek Phibunsongkhram | 1897-07-14 | 1965-06-11 | 66 | Prime Minister | Thailand |
10 | John Curtin | 1885-01-08 | 1945-07-05 | 60 | Prime Minister | Australia |
11 | Haile Selassie | 1892-07-23 | 1975-08-27 | 83 | Emperor | Ethiopia |
In our input file for heads of state during World War 2, [“Name”, “Born”, “Died”, “Age”, “Title”, “Country”] make up the list of data frame columns. Since we didn’t provide any labels for the row indices, by default Pandas supplies them for us, labeling the row headers from zero onwards.
Our first operation on this data frame is to isolate a single column. We do so using the dictionary-like [ ] operator, supplying the column label.
'Name'] df[
0 Franklin Roosevelt
1 Joseph Stalin
2 Adolph Hitler
3 Michinomiya Hirohito
4 Charles de Gaulle
5 Winston Churchill
6 Manuel Camacho
7 Jan Smuts
8 Ibn Saud
9 Plaek Phibunsongkhram
10 John Curtin
11 Haile Selassie
Name: Name, dtype: object
Similarly we can quickly isolate a data frame row by means of the Panda’s loc[] method.
10] df.loc[
Name John Curtin
Born 1885-01-08
Died 1945-07-05
Age 60
Title Prime Minister
Country Australia
Name: 10, dtype: object
We’ll have more to say about extracting elements of a data frame in the subsequent sections on rows and columns.
Relationship Between Data Frames and Series
In most Pandas tutorials, an individual column in a data frame is a Pandas series. But we have to more flexible with that definition, as Pandas reports both the type of a row and the type of a column as series. It’s probably best to think of a series as a vector of data – heterogenous data in a row, and homogeneous data in a column – with an attached header, called “index” and “column”.
print(type(df['Name']))
<class 'pandas.core.series.Series'>
print(type(df.loc[0]))
<class 'pandas.core.series.Series'>
We can create a Pandas data frame from using Python dictionaries. Conceptually, the dictionary key becomes the data frame column label and the dictionary value becomes the collection of column values. In practice, we use Pandas series to provison a column, with data and (optionally) the index label that comes with it.
= pd.Series(["Franklin Roosevelt", "Joseph Stalin", "Adolph Hitler"])
leader_name_series = pd.Series([63, 74, 56])
leader_age_series = pd.Series(["United States", "Soviet Union", "Germany"])
leader_country_series = pd.DataFrame({"Name": leader_name_series, "Age": leader_age_series, "Country": leader_country_series})
leader_df leader_df
Name | Age | Country | |
---|---|---|---|
0 | Franklin Roosevelt | 63 | United States |
1 | Joseph Stalin | 74 | Soviet Union |
2 | Adolph Hitler | 56 | Germany |
print(f"Type of leader_df: {type(leader_df)}")
Type of leader_df: <class 'pandas.core.frame.DataFrame'>
Pandas handles the cases where the shape of the columnar series do not line up. In our example below, the two series have indices that line up, but their data don’t. When creating the data frame, Pandas makes up for the incongruity by slotting NaN in the missing data for col02.
= pd.Series(["aaa", "bbb", "ccc", "ddd", "eee"], index = ["row01", "row02", "row03", "row04", "row05"])
col01_series = pd.Series([100, 200, 300], index = ["row02", "row03", "row05"])
col02_series = pd.DataFrame({"col01": col01_series, "col02": col02_series})
string_int_df string_int_df
col01 | col02 | |
---|---|---|
row01 | aaa | NaN |
row02 | bbb | 100.0 |
row03 | ccc | 200.0 |
row04 | ddd | NaN |
row05 | eee | 300.0 |
Taking Slices of Columns
Let’s head back to our World War 2 leaders data frame.
= pd.read_csv("sample_data/ww2_leaders.csv")
df print(f"Type of df: {type(df)}\n")
df
Type of df: <class 'pandas.core.frame.DataFrame'>
Name | Born | Died | Age | Title | Country | |
---|---|---|---|---|---|---|
0 | Franklin Roosevelt | 1882-01-30 | 1945-04-12 | 63 | President | United States |
1 | Joseph Stalin | 1878-12-06 | 1953-03-05 | 74 | Great Leader | Soviet Union |
2 | Adolph Hitler | 1889-04-20 | 1945-04-30 | 56 | Fuhrer | Germany |
3 | Michinomiya Hirohito | 1901-04-29 | 1989-01-07 | 87 | Emperor | Japan |
4 | Charles de Gaulle | 1890-11-22 | 1970-11-09 | 79 | President | France |
5 | Winston Churchill | 1874-11-30 | 1965-01-24 | 90 | Prime Minister | United Kingdom |
6 | Manuel Camacho | 1897-04-24 | 1955-10-13 | 58 | President | Mexico |
7 | Jan Smuts | 1870-05-24 | 1950-09-11 | 80 | Prime Minister | South Africa |
8 | Ibn Saud | 1875-01-15 | 1953-11-09 | 78 | King | Saudi Arabia |
9 | Plaek Phibunsongkhram | 1897-07-14 | 1965-06-11 | 66 | Prime Minister | Thailand |
10 | John Curtin | 1885-01-08 | 1945-07-05 | 60 | Prime Minister | Australia |
11 | Haile Selassie | 1892-07-23 | 1975-08-27 | 83 | Emperor | Ethiopia |
Using the column labels we can take a series, one at a time from the data frame.
'Title'] df[
0 President
1 Great Leader
2 Fuhrer
3 Emperor
4 President
5 Prime Minister
6 President
7 Prime Minister
8 King
9 Prime Minister
10 Prime Minister
11 Emperor
Name: Title, dtype: object
'Title', 'Name']] df[[
Title | Name | |
---|---|---|
0 | President | Franklin Roosevelt |
1 | Great Leader | Joseph Stalin |
2 | Fuhrer | Adolph Hitler |
3 | Emperor | Michinomiya Hirohito |
4 | President | Charles de Gaulle |
5 | Prime Minister | Winston Churchill |
6 | President | Manuel Camacho |
7 | Prime Minister | Jan Smuts |
8 | King | Ibn Saud |
9 | Prime Minister | Plaek Phibunsongkhram |
10 | Prime Minister | John Curtin |
11 | Emperor | Haile Selassie |
Alternately, we can isolate a data frame column by its attribute name and sort it accordingly.
df.Name.sort_values()
2 Adolph Hitler
4 Charles de Gaulle
0 Franklin Roosevelt
11 Haile Selassie
8 Ibn Saud
7 Jan Smuts
10 John Curtin
1 Joseph Stalin
6 Manuel Camacho
3 Michinomiya Hirohito
9 Plaek Phibunsongkhram
5 Winston Churchill
Name: Name, dtype: object
Taking Slices of Rows
A quick way to display n number of rows beginning with the first one is to use the head() method. In its default form it displays the first five rows.
df.head()
Name | Born | Died | Age | Title | Country | |
---|---|---|---|---|---|---|
0 | Franklin Roosevelt | 1882-01-30 | 1945-04-12 | 63 | President | United States |
1 | Joseph Stalin | 1878-12-06 | 1953-03-05 | 74 | Great Leader | Soviet Union |
2 | Adolph Hitler | 1889-04-20 | 1945-04-30 | 56 | Fuhrer | Germany |
3 | Michinomiya Hirohito | 1901-04-29 | 1989-01-07 | 87 | Emperor | Japan |
4 | Charles de Gaulle | 1890-11-22 | 1970-11-09 | 79 | President | France |
If we wanted to change the number of rows returned, we’ll need to pass the n parameter.
= 3) df.head(n
Name | Born | Died | Age | Title | Country | |
---|---|---|---|---|---|---|
0 | Franklin Roosevelt | 1882-01-30 | 1945-04-12 | 63 | President | United States |
1 | Joseph Stalin | 1878-12-06 | 1953-03-05 | 74 | Great Leader | Soviet Union |
2 | Adolph Hitler | 1889-04-20 | 1945-04-30 | 56 | Fuhrer | Germany |
When we pass a negative n value, Pandas will return all rows except for the last n ones.
= -4) df.head(n
Name | Born | Died | Age | Title | Country | |
---|---|---|---|---|---|---|
0 | Franklin Roosevelt | 1882-01-30 | 1945-04-12 | 63 | President | United States |
1 | Joseph Stalin | 1878-12-06 | 1953-03-05 | 74 | Great Leader | Soviet Union |
2 | Adolph Hitler | 1889-04-20 | 1945-04-30 | 56 | Fuhrer | Germany |
3 | Michinomiya Hirohito | 1901-04-29 | 1989-01-07 | 87 | Emperor | Japan |
4 | Charles de Gaulle | 1890-11-22 | 1970-11-09 | 79 | President | France |
5 | Winston Churchill | 1874-11-30 | 1965-01-24 | 90 | Prime Minister | United Kingdom |
6 | Manuel Camacho | 1897-04-24 | 1955-10-13 | 58 | President | Mexico |
7 | Jan Smuts | 1870-05-24 | 1950-09-11 | 80 | Prime Minister | South Africa |
We can also take slices (ranges) of data frame rows using the loc[ ] or iloc[ ] operators. What is the difference between the two?
By default, all rows in a data frame have zero-based indices associated with them. They also have label-based indices. If we supply these labels when the data frame is created, then the series will have the zero-based and the label-based indices. But if we skip the label indices, Pandas will duplicate the default zero-based indices into label indices.
The loc[ ] is easier to grasp but no less idiosyncratic. This operator isolates rows based on labels and includes the last element of the range passed in it – very un-Python like semantics here. Meanwhile the iloc[ ] operator relies on the zero-based indices and does not include the last element of the range passed in it.
1] df.loc[
Name Joseph Stalin
Born 1878-12-06
Died 1953-03-05
Age 74
Title Great Leader
Country Soviet Union
Name: 1, dtype: object
'2':'5'] df.loc[
Name | Born | Died | Age | Title | Country | |
---|---|---|---|---|---|---|
2 | Adolph Hitler | 1889-04-20 | 1945-04-30 | 56 | Fuhrer | Germany |
3 | Michinomiya Hirohito | 1901-04-29 | 1989-01-07 | 87 | Emperor | Japan |
4 | Charles de Gaulle | 1890-11-22 | 1970-11-09 | 79 | President | France |
5 | Winston Churchill | 1874-11-30 | 1965-01-24 | 90 | Prime Minister | United Kingdom |
'1':'1'] df.loc[
Name | Born | Died | Age | Title | Country | |
---|---|---|---|---|---|---|
1 | Joseph Stalin | 1878-12-06 | 1953-03-05 | 74 | Great Leader | Soviet Union |
Because our data frame of World War 2 leaders did not come with row labels, Pandas supplied default labels based on the zero-based indices. Note that we had to use loc[‘1’:‘1’] and not loc[‘1’]. The later will result in KeyError exception.
The loc[ ] operator also allows us to enumerate combination of rows and columns.
0, 5, 10]] df.loc[[
Name | Born | Died | Age | Title | Country | |
---|---|---|---|---|---|---|
0 | Franklin Roosevelt | 1882-01-30 | 1945-04-12 | 63 | President | United States |
5 | Winston Churchill | 1874-11-30 | 1965-01-24 | 90 | Prime Minister | United Kingdom |
10 | John Curtin | 1885-01-08 | 1945-07-05 | 60 | Prime Minister | Australia |
0, 'Name'] df.loc[
'Franklin Roosevelt'
0, 5, 10], ['Name', 'Title']] df.loc[[
Name | Title | |
---|---|---|
0 | Franklin Roosevelt | President |
5 | Winston Churchill | Prime Minister |
10 | John Curtin | Prime Minister |
As the following examples show, with the iloc[ ] operator we’re closer to familiar ground. It’s syntax for isolating a data frame row and for taking ranges of rows are closer to canonical Python.
2] df.iloc[
Name Adolph Hitler
Born 1889-04-20
Died 1945-04-30
Age 56
Title Fuhrer
Country Germany
Name: 2, dtype: object
2:5] df.iloc[
Name | Born | Died | Age | Title | Country | |
---|---|---|---|---|---|---|
2 | Adolph Hitler | 1889-04-20 | 1945-04-30 | 56 | Fuhrer | Germany |
3 | Michinomiya Hirohito | 1901-04-29 | 1989-01-07 | 87 | Emperor | Japan |
4 | Charles de Gaulle | 1890-11-22 | 1970-11-09 | 79 | President | France |
2,4]] df.iloc[[
Name | Born | Died | Age | Title | Country | |
---|---|---|---|---|---|---|
2 | Adolph Hitler | 1889-04-20 | 1945-04-30 | 56 | Fuhrer | Germany |
4 | Charles de Gaulle | 1890-11-22 | 1970-11-09 | 79 | President | France |
4:1:-1] df.iloc[
Name | Born | Died | Age | Title | Country | |
---|---|---|---|---|---|---|
4 | Charles de Gaulle | 1890-11-22 | 1970-11-09 | 79 | President | France |
3 | Michinomiya Hirohito | 1901-04-29 | 1989-01-07 | 87 | Emperor | Japan |
2 | Adolph Hitler | 1889-04-20 | 1945-04-30 | 56 | Fuhrer | Germany |
2, 11]] df.iloc[[
Name | Born | Died | Age | Title | Country | |
---|---|---|---|---|---|---|
2 | Adolph Hitler | 1889-04-20 | 1945-04-30 | 56 | Fuhrer | Germany |
11 | Haile Selassie | 1892-07-23 | 1975-08-27 | 83 | Emperor | Ethiopia |
We can also use iloc[ ] to list combination of rows and columns, but we have to keep in mind the iloc[ ] can not handle label-based indices.
2, 11], [4, 0]] df.iloc[[
Title | Name | |
---|---|---|
2 | Fuhrer | Adolph Hitler |
11 | Emperor | Haile Selassie |
Filtering Rows Using Boolean Logic
We can pick a column to filter the data frame rows. For example, say we want to isolate the rows of leaders whose age is less than or equal to the average age for the whole data set. We do so also by means of the [ ] operator, but this time passing a filtering statement within.
'Title'] == 'President'] df[df[
Name | Born | Died | Age | Title | Country | |
---|---|---|---|---|---|---|
0 | Franklin Roosevelt | 1882-01-30 | 1945-04-12 | 63 | President | United States |
4 | Charles de Gaulle | 1890-11-22 | 1970-11-09 | 79 | President | France |
6 | Manuel Camacho | 1897-04-24 | 1955-10-13 | 58 | President | Mexico |
In the example above, we filter rows that match the string ‘President’. Let’s turn to another example that involves the data on age.
df.describe()
Age | |
---|---|
count | 12.000000 |
mean | 72.833333 |
std | 11.784684 |
min | 56.000000 |
25% | 62.250000 |
50% | 76.000000 |
75% | 80.750000 |
max | 90.000000 |
'Age'] <= df['Age'].mean()] df[df[
Name | Born | Died | Age | Title | Country | |
---|---|---|---|---|---|---|
0 | Franklin Roosevelt | 1882-01-30 | 1945-04-12 | 63 | President | United States |
2 | Adolph Hitler | 1889-04-20 | 1945-04-30 | 56 | Fuhrer | Germany |
6 | Manuel Camacho | 1897-04-24 | 1955-10-13 | 58 | President | Mexico |
9 | Plaek Phibunsongkhram | 1897-07-14 | 1965-06-11 | 66 | Prime Minister | Thailand |
10 | John Curtin | 1885-01-08 | 1945-07-05 | 60 | Prime Minister | Australia |
Behind the scenes the statement df[‘Age’] <= df[‘Age’].mean() evaluates to a list of boolean values which Pandas uses to filter the data frame rows.
We can manually create the vector of booleans ourselves to filter the data frame without referencing any column.
= [False, True, True, True, False, False, False, False, True, False, False, True]
potentates df[potentates]
Name | Born | Died | Age | Title | Country | |
---|---|---|---|---|---|---|
1 | Joseph Stalin | 1878-12-06 | 1953-03-05 | 74 | Great Leader | Soviet Union |
2 | Adolph Hitler | 1889-04-20 | 1945-04-30 | 56 | Fuhrer | Germany |
3 | Michinomiya Hirohito | 1901-04-29 | 1989-01-07 | 87 | Emperor | Japan |
8 | Ibn Saud | 1875-01-15 | 1953-11-09 | 78 | King | Saudi Arabia |
11 | Haile Selassie | 1892-07-23 | 1975-08-27 | 83 | Emperor | Ethiopia |
= list(map(lambda x: not x, potentates))
democratic_leaders df[democratic_leaders]
Name | Born | Died | Age | Title | Country | |
---|---|---|---|---|---|---|
0 | Franklin Roosevelt | 1882-01-30 | 1945-04-12 | 63 | President | United States |
4 | Charles de Gaulle | 1890-11-22 | 1970-11-09 | 79 | President | France |
5 | Winston Churchill | 1874-11-30 | 1965-01-24 | 90 | Prime Minister | United Kingdom |
6 | Manuel Camacho | 1897-04-24 | 1955-10-13 | 58 | President | Mexico |
7 | Jan Smuts | 1870-05-24 | 1950-09-11 | 80 | Prime Minister | South Africa |
9 | Plaek Phibunsongkhram | 1897-07-14 | 1965-06-11 | 66 | Prime Minister | Thailand |
10 | John Curtin | 1885-01-08 | 1945-07-05 | 60 | Prime Minister | Australia |
Vector Operations on Data Frames
We can also perform vectorized operations on data frames. In the following example, multiplying by two doubles the values in every numeric cell. Similarly, strings are concatenated with each other, as per the rules in Python.
* 2 df
Name | Born | Died | Age | Title | Country | |
---|---|---|---|---|---|---|
0 | Franklin RooseveltFranklin Roosevelt | 1882-01-301882-01-30 | 1945-04-121945-04-12 | 126 | PresidentPresident | United StatesUnited States |
1 | Joseph StalinJoseph Stalin | 1878-12-061878-12-06 | 1953-03-051953-03-05 | 148 | Great LeaderGreat Leader | Soviet UnionSoviet Union |
2 | Adolph HitlerAdolph Hitler | 1889-04-201889-04-20 | 1945-04-301945-04-30 | 112 | FuhrerFuhrer | GermanyGermany |
3 | Michinomiya HirohitoMichinomiya Hirohito | 1901-04-291901-04-29 | 1989-01-071989-01-07 | 174 | EmperorEmperor | JapanJapan |
4 | Charles de GaulleCharles de Gaulle | 1890-11-221890-11-22 | 1970-11-091970-11-09 | 158 | PresidentPresident | FranceFrance |
5 | Winston ChurchillWinston Churchill | 1874-11-301874-11-30 | 1965-01-241965-01-24 | 180 | Prime MinisterPrime Minister | United KingdomUnited Kingdom |
6 | Manuel CamachoManuel Camacho | 1897-04-241897-04-24 | 1955-10-131955-10-13 | 116 | PresidentPresident | MexicoMexico |
7 | Jan SmutsJan Smuts | 1870-05-241870-05-24 | 1950-09-111950-09-11 | 160 | Prime MinisterPrime Minister | South AfricaSouth Africa |
8 | Ibn SaudIbn Saud | 1875-01-151875-01-15 | 1953-11-091953-11-09 | 156 | KingKing | Saudi ArabiaSaudi Arabia |
9 | Plaek PhibunsongkhramPlaek Phibunsongkhram | 1897-07-141897-07-14 | 1965-06-111965-06-11 | 132 | Prime MinisterPrime Minister | ThailandThailand |
10 | John CurtinJohn Curtin | 1885-01-081885-01-08 | 1945-07-051945-07-05 | 120 | Prime MinisterPrime Minister | AustraliaAustralia |
11 | Haile SelassieHaile Selassie | 1892-07-231892-07-23 | 1975-08-271975-08-27 | 166 | EmperorEmperor | EthiopiaEthiopia |
Handling Date and Time Types
Our information about World War 2 leaders presents a good opportunity to discuss date and time types in Pandas. Right now the Born and Died columns are of type object, which is essentially of type string. To carry out operations on these, we first need to convert them to Timestamp types. We accomplish this using the Pandas to_datetime() method.
= pd.to_datetime(df.Born, format = '%Y-%m-%d')
born_dt print(f"Type of born_dt: {type(born_dt)}\nType of each born_dt element: {type(born_dt[0])}")
Type of born_dt: <class 'pandas.core.series.Series'>
Type of each born_dt element: <class 'pandas._libs.tslibs.timestamps.Timestamp'>
The to_datetime() method returns a series of time stamps. The dtype is datetime64 which allows us to apply arithmetic operations on these time stamps.
born_dt
0 1882-01-30
1 1878-12-06
2 1889-04-20
3 1901-04-29
4 1890-11-22
5 1874-11-30
6 1897-04-24
7 1870-05-24
8 1875-01-15
9 1897-07-14
10 1885-01-08
11 1892-07-23
Name: Born, dtype: datetime64[ns]
Up till now we haven’t showed how to update data frames. Let’s now create a new recalculate the age of a leader and append these as a new column to our data frame. To preserve our original data frame, lets make a distinct copy before we proceed.
= df.copy()
df_copy 'Born_DT'] = born_dt
df_copy[= pd.to_datetime(df.Died, format = '%Y-%m-%d')
died_dt 'Died_DT'] = died_dt
df_copy[ df_copy
Name | Born | Died | Age | Title | Country | Born_DT | Died_DT | |
---|---|---|---|---|---|---|---|---|
0 | Franklin Roosevelt | 1882-01-30 | 1945-04-12 | 63 | President | United States | 1882-01-30 | 1945-04-12 |
1 | Joseph Stalin | 1878-12-06 | 1953-03-05 | 74 | Great Leader | Soviet Union | 1878-12-06 | 1953-03-05 |
2 | Adolph Hitler | 1889-04-20 | 1945-04-30 | 56 | Fuhrer | Germany | 1889-04-20 | 1945-04-30 |
3 | Michinomiya Hirohito | 1901-04-29 | 1989-01-07 | 87 | Emperor | Japan | 1901-04-29 | 1989-01-07 |
4 | Charles de Gaulle | 1890-11-22 | 1970-11-09 | 79 | President | France | 1890-11-22 | 1970-11-09 |
5 | Winston Churchill | 1874-11-30 | 1965-01-24 | 90 | Prime Minister | United Kingdom | 1874-11-30 | 1965-01-24 |
6 | Manuel Camacho | 1897-04-24 | 1955-10-13 | 58 | President | Mexico | 1897-04-24 | 1955-10-13 |
7 | Jan Smuts | 1870-05-24 | 1950-09-11 | 80 | Prime Minister | South Africa | 1870-05-24 | 1950-09-11 |
8 | Ibn Saud | 1875-01-15 | 1953-11-09 | 78 | King | Saudi Arabia | 1875-01-15 | 1953-11-09 |
9 | Plaek Phibunsongkhram | 1897-07-14 | 1965-06-11 | 66 | Prime Minister | Thailand | 1897-07-14 | 1965-06-11 |
10 | John Curtin | 1885-01-08 | 1945-07-05 | 60 | Prime Minister | Australia | 1885-01-08 | 1945-07-05 |
11 | Haile Selassie | 1892-07-23 | 1975-08-27 | 83 | Emperor | Ethiopia | 1892-07-23 | 1975-08-27 |
Just as with adding new entries in Python dictionaries, we’ve now extended our data frame with the two new columns with Timestamp types. From here we can calculate the leader’s age and apply it to a third new column, first as days then finally recalculated as years.
= died_dt - born_dt
age_dt print(age_dt)
0 23082 days
1 27117 days
2 20463 days
3 32030 days
4 29206 days
5 32927 days
6 21355 days
7 29329 days
8 28787 days
9 24803 days
10 22092 days
11 30349 days
dtype: timedelta64[ns]
"Age_DT"] = age_dt // pd.Timedelta('365 days')
df_copy[ df_copy
Name | Born | Died | Age | Title | Country | Born_DT | Died_DT | Age_DT | |
---|---|---|---|---|---|---|---|---|---|
0 | Franklin Roosevelt | 1882-01-30 | 1945-04-12 | 63 | President | United States | 1882-01-30 | 1945-04-12 | 63 |
1 | Joseph Stalin | 1878-12-06 | 1953-03-05 | 74 | Great Leader | Soviet Union | 1878-12-06 | 1953-03-05 | 74 |
2 | Adolph Hitler | 1889-04-20 | 1945-04-30 | 56 | Fuhrer | Germany | 1889-04-20 | 1945-04-30 | 56 |
3 | Michinomiya Hirohito | 1901-04-29 | 1989-01-07 | 87 | Emperor | Japan | 1901-04-29 | 1989-01-07 | 87 |
4 | Charles de Gaulle | 1890-11-22 | 1970-11-09 | 79 | President | France | 1890-11-22 | 1970-11-09 | 80 |
5 | Winston Churchill | 1874-11-30 | 1965-01-24 | 90 | Prime Minister | United Kingdom | 1874-11-30 | 1965-01-24 | 90 |
6 | Manuel Camacho | 1897-04-24 | 1955-10-13 | 58 | President | Mexico | 1897-04-24 | 1955-10-13 | 58 |
7 | Jan Smuts | 1870-05-24 | 1950-09-11 | 80 | Prime Minister | South Africa | 1870-05-24 | 1950-09-11 | 80 |
8 | Ibn Saud | 1875-01-15 | 1953-11-09 | 78 | King | Saudi Arabia | 1875-01-15 | 1953-11-09 | 78 |
9 | Plaek Phibunsongkhram | 1897-07-14 | 1965-06-11 | 66 | Prime Minister | Thailand | 1897-07-14 | 1965-06-11 | 67 |
10 | John Curtin | 1885-01-08 | 1945-07-05 | 60 | Prime Minister | Australia | 1885-01-08 | 1945-07-05 | 60 |
11 | Haile Selassie | 1892-07-23 | 1975-08-27 | 83 | Emperor | Ethiopia | 1892-07-23 | 1975-08-27 | 83 |
Quick Note About Numpy
Finally we can create Pandas data frames from numPy array.
import numpy as np
= np.random.random((4,5))
np_data = pd.DataFrame(np_data, index = ['Sample_01', 'Sample_02', 'Sample_03', 'Sample_04'], columns = np.arange(1,6))
df df
1 | 2 | 3 | 4 | 5 | |
---|---|---|---|---|---|
Sample_01 | 0.748195 | 0.611704 | 0.304087 | 0.119463 | 0.598989 |
Sample_02 | 0.995958 | 0.949418 | 0.965212 | 0.977375 | 0.772594 |
Sample_03 | 0.375183 | 0.268562 | 0.189580 | 0.596075 | 0.397792 |
Sample_04 | 0.825419 | 0.020657 | 0.463567 | 0.731586 | 0.669615 |