import pandas as pd
import matplotlib.pyplot as plt
Pandas Basic Analysis 02: Series
02 Pandas Data Structure: Series
Content Outline
- Introduction
- Create a series from a Pandas data frame
- Create a series from a Python collection
- Taking sections of a series
- Statistical operations on a series
- Vector operations on a series
Introduction
In Pandas the series is one of the core data structures for computation. A series is a one-dimensional array with a labeled index. Like a Python list, a series is an ordered data type: its elements can be indexed with the [ ] notation. Element types can be heterogeneous; the index labels must be of a hashable type. While each element of a series is mutable, the series is designed to be size-immutable: most operations that change its length return a new series.
For this notebook, we’re going to demonstrate Pandas series using the “ww2_leaders.csv” file that we uploaded beforehand to Google Colab.
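To make these properties concrete, here is a minimal sketch of a labeled series whose value is updated in place. The `temps` series and its weekday labels are hypothetical illustration data, not part of the “ww2_leaders.csv” file.

```python
import pandas as pd

# Hypothetical data for illustration only.
temps = pd.Series([21.5, 19.0, 23.2], index=["Mon", "Tue", "Wed"])

# Elements are addressed by label with the [ ] notation.
print(temps["Tue"])

# Values are mutable: assigning to an existing label updates it in place.
temps["Tue"] = 20.0
print(temps)

# The update leaves the length of the series unchanged.
print(len(temps))
```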
Create a Series From a Data Frame
df = pd.read_csv("sample_data/ww2_leaders.csv")
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 12 non-null object
1 Born 12 non-null object
2 Died 12 non-null object
3 Age 12 non-null int64
4 Title 12 non-null object
5 Country 12 non-null object
dtypes: int64(1), object(5)
memory usage: 708.0+ bytes
df
| | Name | Born | Died | Age | Title | Country |
---|---|---|---|---|---|---|
0 | Franklin Roosevelt | 1882-01-30 | 1945-04-12 | 63 | President | United States |
1 | Joseph Stalin | 1878-12-06 | 1953-03-05 | 74 | Great Leader | Soviet Union |
2 | Adolph Hitler | 1889-04-20 | 1945-04-30 | 56 | Fuhrer | Germany |
3 | Michinomiya Hirohito | 1901-04-29 | 1989-01-07 | 87 | Emperor | Japan |
4 | Charles de Gaulle | 1890-11-22 | 1970-11-09 | 79 | President | France |
5 | Winston Churchill | 1874-11-30 | 1965-01-24 | 90 | Prime Minister | United Kingdom |
6 | Manuel Camacho | 1897-04-24 | 1955-10-13 | 58 | President | Mexico |
7 | Jan Smuts | 1870-05-24 | 1950-09-11 | 80 | Prime Minister | South Africa |
8 | Ibn Saud | 1875-01-15 | 1953-11-09 | 78 | King | Saudi Arabia |
9 | Plaek Phibunsongkhram | 1897-07-14 | 1965-06-11 | 66 | Prime Minister | Thailand |
10 | John Curtin | 1885-01-08 | 1945-07-05 | 60 | Prime Minister | Australia |
11 | Haile Selassie | 1892-07-23 | 1975-08-27 | 83 | Emperor | Ethiopia |
Recall that a Pandas data frame is a two-dimensional collection made up of rows and columns. Conceptually a data frame row is equivalent to a Pandas series. Using the loc[] indexer, we can select a row by its label and obtain a series from it.
deGaule = 4
s = pd.Series(df.loc[deGaule])
print(f"{s}\n\nType of s: {type(s)}")
Name Charles de Gaulle
Born 1890-11-22
Died 1970-11-09
Age 79
Title President
Country France
Name: 4, dtype: object
Type of s: <class 'pandas.core.series.Series'>
A series is made up of a collection of labels (here, the column names of the data frame) and a collection of values (the entries of the selected row). We use the keys() method and the values attribute to retrieve each collection. Just like lists in Python, we can index into specific elements in the keys and values of a series.
print(f"{s.keys()}\n\n{s.keys()[0]}\n{s.keys()[3]}")
Index(['Name', 'Born', 'Died', 'Age', 'Title', 'Country'], dtype='object')
Name
Age
print(s.values)
['Charles de Gaulle' '1890-11-22' '1970-11-09' np.int64(79) 'President'
'France']
print(f"Name:\t\t{s.values[0]}\nCountry:\t{s.values[5]}\nAge:\t\t{s.values[3]}")
Name: Charles de Gaulle
Country: France
Age: 79
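As a small sketch of the same idea, the labels of a series are also available through its index attribute, and items() walks through label and value pairs together. The `row` series below is built by hand from three fields of the de Gaulle row so that the example runs without the CSV file.

```python
import pandas as pd

# Hand-built stand-in for part of the row selected above.
row = pd.Series({"Name": "Charles de Gaulle", "Age": 79, "Country": "France"})

# keys() is an alias for the index attribute.
print(row.keys().equals(row.index))

# items() yields (label, value) pairs in order.
for label, value in row.items():
    print(f"{label}: {value}")

# Looking a value up by its label avoids hard-coding a position.
print(row["Country"])
```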
Create a Series From a Python Collection
Can we create our own series object on the fly? Yes, by calling the Pandas Series() constructor and passing it a list of values, of homogeneous or heterogeneous types.
s = pd.Series(range(100, 120, 5))
print(f"{s}\n\nType of s: {type(s)}")
0 100
1 105
2 110
3 115
dtype: int64
Type of s: <class 'pandas.core.series.Series'>
s = pd.Series(['AAA', 32.4907, 100])
print(f"{s}\n\nType of s: {type(s)}")
0 AAA
1 32.4907
2 100
dtype: object
Type of s: <class 'pandas.core.series.Series'>
By default Pandas assigns zero-based integers as the index labels of the series we’ve just created. If we want to create a series with our own labels, we do so by passing the index parameter to Series().
s = pd.Series(['AAA', 32.4907, 100], index=['word', 'float', 'integer'])
print(f"{s}\n\nType of s: {type(s)}")
word AAA
float 32.4907
integer 100
dtype: object
Type of s: <class 'pandas.core.series.Series'>
Alternatively, we can pass a Python dictionary to create a Pandas series. The dictionary keys become the index labels.
us_states_admission = {"Alabama": "1819-12-04", "Illinois": "1818-12-03", "Nevada": "1864-10-31"}
s = pd.Series(us_states_admission)
print(f"{s}\n\nType of s: {type(s)}")
Alabama 1819-12-04
Illinois 1818-12-03
Nevada 1864-10-31
dtype: object
Type of s: <class 'pandas.core.series.Series'>
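The dictionary and index approaches can also be combined. In the sketch below, the index parameter selects and orders entries from the dictionary by key; a label with no matching key is filled with NaN. The label "Texas" is a hypothetical addition, chosen only because it is absent from the dictionary.

```python
import pandas as pd

admission = {"Alabama": "1819-12-04", "Illinois": "1818-12-03", "Nevada": "1864-10-31"}

# "Texas" is not a key in the dictionary, so its value is missing (NaN).
selected = pd.Series(admission, index=["Nevada", "Alabama", "Texas"])
print(selected)

# isna() flags the labels that had no matching key.
print(selected.isna())
```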
Taking Sections of a Series
Conceptually a Pandas series behaves like a list. As with ordered Python collections, we can take slices of a Pandas series. For starters, we can treat a series like a Python list, indexing elements using the [ ] operator.
s = pd.Series(range(1,10))
s
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
dtype: int64
Again, since we did not supply an index, Pandas assigns default zero-based labels to our series, as if it were a Python list. Now we can access individual elements.
print(s[5])
6
We can also take slices.
s[2:6]
2 3
3 4
4 5
5 6
dtype: int64
s[:3]
0 1
1 2
2 3
dtype: int64
s[1:7:2]
1 2
3 4
5 6
dtype: int64
If our series has custom labels, we can still access elements by position, although in recent Pandas versions positional access on a labeled series is best done through iloc[]. Alternatively, we can treat the series like a Python dictionary, with the labels analogous to dictionary keys, and access elements accordingly. Note that a slice by label includes its end label.
s = pd.Series(['AAA', 32.4907, 100], index=['word', 'float', 'integer'])
s['float']
32.4907
s['word':'float']
word AAA
float 32.4907
dtype: object
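The difference between label-based and position-based slicing can be made explicit with the loc[] and iloc[] indexers. This is a minimal sketch on the same three-element series; the variable names `by_label` and `by_position` are ours.

```python
import pandas as pd

s = pd.Series(['AAA', 32.4907, 100], index=['word', 'float', 'integer'])

# A label slice includes its end label, 'float'.
by_label = s.loc['word':'float']

# A positional slice excludes its end position, 2.
by_position = s.iloc[0:2]

# Both slices select the same two elements.
print(by_label.equals(by_position))

# iloc[] also accepts negative positions, like a Python list.
print(s.iloc[-1])
```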
Statistical Operations on Series
Let’s go back to our table of WW2 leaders.
df = pd.read_csv("sample_data/ww2_leaders.csv")
df
| | Name | Born | Died | Age | Title | Country |
---|---|---|---|---|---|---|
0 | Franklin Roosevelt | 1882-01-30 | 1945-04-12 | 63 | President | United States |
1 | Joseph Stalin | 1878-12-06 | 1953-03-05 | 74 | Great Leader | Soviet Union |
2 | Adolph Hitler | 1889-04-20 | 1945-04-30 | 56 | Fuhrer | Germany |
3 | Michinomiya Hirohito | 1901-04-29 | 1989-01-07 | 87 | Emperor | Japan |
4 | Charles de Gaulle | 1890-11-22 | 1970-11-09 | 79 | President | France |
5 | Winston Churchill | 1874-11-30 | 1965-01-24 | 90 | Prime Minister | United Kingdom |
6 | Manuel Camacho | 1897-04-24 | 1955-10-13 | 58 | President | Mexico |
7 | Jan Smuts | 1870-05-24 | 1950-09-11 | 80 | Prime Minister | South Africa |
8 | Ibn Saud | 1875-01-15 | 1953-11-09 | 78 | King | Saudi Arabia |
9 | Plaek Phibunsongkhram | 1897-07-14 | 1965-06-11 | 66 | Prime Minister | Thailand |
10 | John Curtin | 1885-01-08 | 1945-07-05 | 60 | Prime Minister | Australia |
11 | Haile Selassie | 1892-07-23 | 1975-08-27 | 83 | Emperor | Ethiopia |
In our earlier example, we used the data frame's loc[] indexer to isolate a row and obtain a new series from it. Pandas also allows us to create a series from a data frame column.
age = df['Age']
print(f"Type of age: {type(age)}")
Type of age: <class 'pandas.core.series.Series'>
We can use the describe() method to report descriptive statistics for this series.
age.describe()
count 12.000000
mean 72.833333
std 11.784684
min 56.000000
25% 62.250000
50% 76.000000
75% 80.750000
max 90.000000
Name: Age, dtype: float64
Furthermore, if we just need the average age, we can use the mean() method.
age.mean()
np.float64(72.83333333333333)
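Other individual statistics are available in the same way. The sketch below rebuilds the Age column by hand from the table above, so that it runs without the CSV file, and calls a few further Series methods: median(), min(), max() and idxmax(), which returns the index label of the largest value.

```python
import pandas as pd

# Ages copied from the table of WW2 leaders, in row order.
age = pd.Series([63, 74, 56, 87, 79, 90, 58, 80, 78, 66, 60, 83], name="Age")

print(age.median())          # same value as the 50% row of describe()
print(age.min(), age.max())

# idxmax() and idxmin() return index labels rather than values.
oldest = age.idxmax()
youngest = age.idxmin()
print(oldest, youngest)
```

With the default integer index, the labels returned by idxmax() and idxmin() can be passed to df.loc[] to look up the matching rows.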
Vector Operations on Series
Let’s say we’re interested in the age values that are less than or equal to the average age in this series. Pandas lets us use the [ ] notation to apply a filter condition, screening out the values for which the condition evaluates to false.
age[age <= age.mean()]
0 63
2 56
6 58
9 66
10 60
Name: Age, dtype: int64
Behind the scenes, the filter condition evaluates to a series of true or false values. Pandas applies this series as a mask and returns only the elements corresponding to true.
print(age <= age.mean())
0 True
1 False
2 True
3 False
4 False
5 False
6 True
7 False
8 False
9 True
10 True
11 False
Name: Age, dtype: bool
We can see how this works using our own ad-hoc true-and-false list as a mask. In Pandas lingo this list is also known as a vector of boolean values.
names = pd.Series(df['Name'])
potentates = [False, True, True, True, False, False, False, False, True, False, False, True]
names[potentates]
1 Joseph Stalin
2 Adolph Hitler
3 Michinomiya Hirohito
8 Ibn Saud
11 Haile Selassie
Name: Name, dtype: object
democratic_leaders = list(map(lambda x: not x, potentates))
names[democratic_leaders]
0 Franklin Roosevelt
4 Charles de Gaulle
5 Winston Churchill
6 Manuel Camacho
7 Jan Smuts
9 Plaek Phibunsongkhram
10 John Curtin
Name: Name, dtype: object
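When the mask is itself a boolean series rather than a plain Python list, the ~ operator inverts it without the need for map() and a lambda. The following is a minimal sketch on a four-name subset of the table; the `mask` values are arbitrary illustration data.

```python
import pandas as pd

names = pd.Series(["Franklin Roosevelt", "Joseph Stalin",
                   "Adolph Hitler", "Winston Churchill"])

# An arbitrary boolean mask, stored as a series instead of a list.
mask = pd.Series([False, True, True, False])

selected = names[mask]     # elements where the mask is True
inverted = names[~mask]    # ~ flips every True/False in the mask

print(selected)
print(inverted)
```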
Now it’s time to look at vector operations applied to series. We can add or multiply a series by a scalar value and Pandas will apply the operation to each element individually.
age = df['Age']
age + 1000
0 1063
1 1074
2 1056
3 1087
4 1079
5 1090
6 1058
7 1080
8 1078
9 1066
10 1060
11 1083
Name: Age, dtype: int64
age * 2
0 126
1 148
2 112
3 174
4 158
5 180
6 116
7 160
8 156
9 132
10 120
11 166
Name: Age, dtype: int64
Using the age series itself as the second operand, we can carry out vector addition on the series. Pandas pairs up the elements that share the same index label. In the example below, the outcome is the same as doubling each age, as in the notebook cell above.
age + age
0 126
1 148
2 112
3 174
4 158
5 180
6 116
7 160
8 156
9 132
10 120
11 166
Name: Age, dtype: int64
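The pairing is done by index label, not by position. A short sketch with two hypothetical series, `a` and `b`, that carry the same labels in opposite order shows this:

```python
import pandas as pd

# Hypothetical series with the same labels in different order.
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["z", "y", "x"])

# Each element of a is added to the element of b with the same label.
total = a + b
print(total["x"])   # 1 + 30
print(total["y"])   # 2 + 20
print(total["z"])   # 3 + 10
```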
What happens when we add a series that does not have the same length?
age + pd.Series([100, 100])
0 163.0
1 174.0
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
11 NaN
dtype: float64
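As we see here, Pandas aligns the two series by index label and carries out the operation element by element, but leaves the result undefined (NaN) wherever a label has no matching element in the other series. Because NaN is a floating-point value, the result is converted to float64.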
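If NaN is not the outcome we want, one option is the add() method with its fill_value parameter, which substitutes a default for the missing side before adding. This is a sketch on the first four ages only; the names `bonus` and `filled` are ours.

```python
import pandas as pd

age = pd.Series([63, 74, 56, 87])
bonus = pd.Series([100, 100])

# The + operator leaves NaN where bonus has no matching label.
plain = age + bonus
print(plain)

# add() with fill_value=0 treats the missing elements of bonus as 0.
filled = age.add(bonus, fill_value=0)
print(filled)
```

The result is still float64, since the alignment step introduces missing values before they are filled.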