import pandas as pd
Pandas Prep of DnD5e Monsters Data
D&D 5th Edition Monsters Preparation (Kaggle Dataset)
Content Outline
- Introduction
- Loading the Data
- Data Reorganization and Cleanup
- Split ‘Race + alignment’ column
- Remove duplicate row
- Split ‘Challenge rating (XP)’ column
- Split ‘Armor’ column
Introduction
Dungeons and Dragons (DnD) is a cooperative role-playing game with roots in the tabletop war gaming community. At fifty years old as of this writing, DnD has enjoyed phenomenal success among players world wide. Of all the many challenges player face in the game, the most deadly ones are monsters. Originally drawn from mythology and works of fiction, a myriad of fantastical creatures have emerged from the imagination of countless fans. The catalog has only grown since the first Monster Manual was published in 1977.
In this notebook we take a look at monster game stats for the 5th edition of DnD. The format of the Kaggle data set isn’t suitable for data exploration and analysis (EDA). Using Python language features such as lambda expressions and regex, we’ll pre-process the data before we analyze it any further.
Loading the Data
Kaggle has a handy dataset for D&D 5th Edition monsters. Before going to work on this, we download the CSV file, rename it as “DnD5e_monsters_orig.csv” and then upload it to Colab.
= pd.read_csv('sample_data/DnD5e_monsters_orig.csv')
df df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 324 entries, 0 to 323
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 324 non-null object
1 Size 324 non-null object
2 Race + alignment 324 non-null object
3 HP 324 non-null object
4 Armor 324 non-null object
5 Speed 324 non-null object
6 Challenge rating (XP) 324 non-null object
dtypes: object(7)
memory usage: 17.8+ KB
df.head()
Name | Size | Race + alignment | HP | Armor | Speed | Challenge rating (XP) | |
---|---|---|---|---|---|---|---|
0 | Aboleth | Large | aberration, Lawful Evil | 135 (18d10+36) | 17 (Natural Armor) | 10 ft., swim 40 ft. | 10 (5,900 XP) |
1 | Acolyte | Medium | humanoid (any race), Any Alignment | 9 (2d8) | 10 | 30 ft. | 1/4 (50 XP) |
2 | Adult Black Dragon | Huge | dragon, Chaotic Evil | 195 (17d12+85) | 19 (Natural Armor) | 40 ft., fly 80 ft., swim 40 ft. | 14 (11,500 XP) |
3 | Adult Blue Dragon | Huge | dragon, Lawful Evil | 225 (18d12+108) | 19 | 40 ft., burrow 30 ft., fly 80 ft. | 16 (15,000 XP) |
4 | Adult Brass Dragon | Huge | dragon, Chaotic Good | 172 (15d12+75) | 18 (Natural Armor) | 40 ft., burrow 40 ft., fly 80 ft. | 13 (10,000 XP) |
Data Reorganization and Cleanup
The data set has eight columns and 324 rows. Looking at the column headers, we’d like to split the Race + alignment information into two separate column. We note that the monster race and alignment and separated by a comma. We can instruct Pandas to divide the infomration along this delimeter. This, then, is the first alteration we perform on our data frame.
def split_racealignment_column(df_orig, inplace=True):
"""
Splits the 'Race + alignment' column into 'Race' and 'Alignment'.
Args:
df_orig: The input DataFrame.
Returns:
A new DataFrame with the 'Race + alignment' column split.
"""
= df_orig.copy()
retval_df = retval_df['Race + alignment'].str.split(r',', expand=True)
split_columns_df 'Race'] = split_columns_df[0].astype(str).str.strip()
retval_df['Alignment'] = split_columns_df[1].astype(str).str.strip()
retval_df[=['Race + alignment'], inplace=True)
retval_df.drop(columnsreturn retval_df
= split_racealignment_column(df)
split_race_alignment_df split_race_alignment_df.head()
Name | Size | HP | Armor | Speed | Challenge rating (XP) | Race | Alignment | |
---|---|---|---|---|---|---|---|---|
0 | Aboleth | Large | 135 (18d10+36) | 17 (Natural Armor) | 10 ft., swim 40 ft. | 10 (5,900 XP) | aberration | Lawful Evil |
1 | Acolyte | Medium | 9 (2d8) | 10 | 30 ft. | 1/4 (50 XP) | humanoid (any race) | Any Alignment |
2 | Adult Black Dragon | Huge | 195 (17d12+85) | 19 (Natural Armor) | 40 ft., fly 80 ft., swim 40 ft. | 14 (11,500 XP) | dragon | Chaotic Evil |
3 | Adult Blue Dragon | Huge | 225 (18d12+108) | 19 | 40 ft., burrow 30 ft., fly 80 ft. | 16 (15,000 XP) | dragon | Lawful Evil |
4 | Adult Brass Dragon | Huge | 172 (15d12+75) | 18 (Natural Armor) | 40 ft., burrow 40 ft., fly 80 ft. | 13 (10,000 XP) | dragon | Chaotic Good |
Problems immediately arise. An extraneous comma chacter is embedded in the subrace data for ‘Imp’ and the various were-monsters (‘Werebear’, ‘Wereboar’, ‘Wererat’, ‘Weretiger’, ‘Werewolf’). When it encounters this, our Pandas str.split() operation fails to correctly distinguish the race from the alignment.
Here are the offending rows isolated with the help of regular expressions.
import re
= df[df['Race + alignment'].apply(lambda x: True if re.search('.*,.*,', x) else False)]
bad_racealign_df bad_racealign_df
Name | Size | Race + alignment | HP | Armor | Speed | Challenge rating (XP) | |
---|---|---|---|---|---|---|---|
173 | Imp | Tiny | fiend (devil, shapechanger), Lawful Evil | 10 (3d4+3) | 13 | 20 ft., fly 40 ft. | 1 (200 XP) |
299 | Werebear | Medium | humanoid (human, shapechanger), Neutral Good | 135 (18d8+54) | 10 (In Humanoid Form, 11 In Bear And Hybrid Fo... | 30 ft. (40 ft., climb 30 ft. in bear or hybrid... | 5 (1,800 XP) |
300 | Wereboar | Medium | humanoid (human, shapechanger), Neutral Evil | 78 (12d8+24) | 10 (In Humanoid Form, 11 In Boar And Hybrid Fo... | 30 ft. (40 ft. in boar form) | 4 (1,100 XP) |
301 | Wererat | Medium | humanoid (human, shapechanger), Lawful Evil | 33 (6d8+6) | 12 | 30 ft. | 2 (450 XP) |
302 | Weretiger | Medium | humanoid (human, shapechanger), Neutral | 120 (16d8+48) | 12 | 30 ft. (40 ft. in tiger form) | 4 (1,100 XP) |
303 | Werewolf | Medium | humanoid (human, shapechanger), Chaotic Evil | 58 (9d8+18) | 11 (In Humanoid Form, 12 In Wolf Or Hybrid Form) | 30 ft. (40 ft. in wolf form) | 3 (700 XP) |
As you can see, re.split() will divide the column along the first comma of the two. This runs counter to our expectation that there should only be one comma delimiter in all the values of the Race + alignment column. To deal with this obstacle, let’s first get the list of indices of the offending rows.
= list(bad_racealign_df.index)
bad_racealign_idx bad_racealign_idx
[173, 299, 300, 301, 302, 303]
With this list, it’s a simple matter up applying a Python lambda that uses regex to subsitute the first comma with a space, updating only the rows in our bad_realign_idx list.
'Race + alignment'] = df.loc[bad_racealign_idx, 'Race + alignment'].apply(lambda str: re.sub(',', ' ', str, count=1))
df.loc[bad_racealign_idx, df.loc[bad_racealign_idx]
Name | Size | Race + alignment | HP | Armor | Speed | Challenge rating (XP) | |
---|---|---|---|---|---|---|---|
173 | Imp | Tiny | fiend (devil shapechanger), Lawful Evil | 10 (3d4+3) | 13 | 20 ft., fly 40 ft. | 1 (200 XP) |
299 | Werebear | Medium | humanoid (human shapechanger), Neutral Good | 135 (18d8+54) | 10 (In Humanoid Form, 11 In Bear And Hybrid Fo... | 30 ft. (40 ft., climb 30 ft. in bear or hybrid... | 5 (1,800 XP) |
300 | Wereboar | Medium | humanoid (human shapechanger), Neutral Evil | 78 (12d8+24) | 10 (In Humanoid Form, 11 In Boar And Hybrid Fo... | 30 ft. (40 ft. in boar form) | 4 (1,100 XP) |
301 | Wererat | Medium | humanoid (human shapechanger), Lawful Evil | 33 (6d8+6) | 12 | 30 ft. | 2 (450 XP) |
302 | Weretiger | Medium | humanoid (human shapechanger), Neutral | 120 (16d8+48) | 12 | 30 ft. (40 ft. in tiger form) | 4 (1,100 XP) |
303 | Werewolf | Medium | humanoid (human shapechanger), Chaotic Evil | 58 (9d8+18) | 11 (In Humanoid Form, 12 In Wolf Or Hybrid Form) | 30 ft. (40 ft. in wolf form) | 3 (700 XP) |
With the two-comma problem resolved, we can go ahead and split the Race + alignment column into two, removing the original column in the process.
= split_racealignment_column(df) split_racealignment_df
'Race').count() split_racealignment_df.groupby(
Name | Size | HP | Armor | Speed | Challenge rating (XP) | Alignment | |
---|---|---|---|---|---|---|---|
Race | |||||||
Fiend (Shapechanger) | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
aberration | 5 | 5 | 5 | 5 | 5 | 5 | 5 |
beast | 87 | 87 | 87 | 87 | 87 | 87 | 87 |
celestial | 6 | 6 | 6 | 6 | 6 | 6 | 6 |
construct | 9 | 9 | 9 | 9 | 9 | 9 | 9 |
dragon | 43 | 43 | 43 | 43 | 43 | 43 | 43 |
elemental | 16 | 16 | 16 | 16 | 16 | 16 | 16 |
fey | 6 | 6 | 6 | 6 | 6 | 6 | 6 |
fiend | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
fiend (demon) | 8 | 8 | 8 | 8 | 8 | 8 | 8 |
fiend (devil shapechanger) | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
fiend (devil) | 9 | 9 | 9 | 9 | 9 | 9 | 9 |
fiend (shapechanger) | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
giant | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
humanoid (any race) | 21 | 21 | 21 | 21 | 21 | 21 | 21 |
humanoid (dwarf) | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
humanoid (elf) | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
humanoid (gnoll) | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
humanoid (gnome) | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
humanoid (goblinoid) | 3 | 3 | 3 | 3 | 3 | 3 | 3 |
humanoid (grimlock) | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
humanoid (human shapechanger) | 5 | 5 | 5 | 5 | 5 | 5 | 5 |
humanoid (human) | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
humanoid (kobold) | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
humanoid (lizardfolk) | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
humanoid (merfolk) | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
humanoid (orc) | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
humanoid (sahuagin) | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
monstrosity | 35 | 35 | 35 | 35 | 35 | 35 | 35 |
monstrosity (shapechanger) | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
monstrosity (titan) | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
ooze | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
plant | 6 | 6 | 6 | 6 | 6 | 6 | 6 |
swarm of tiny beasts | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
undead | 17 | 17 | 17 | 17 | 17 | 17 | 17 |
undead (shapechanger) | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
OK we come to the second update that we need to perform. Scanning that output above, we’re alerted to two entries that potentially overlap: * Fiend (Shapechanger) * fiend (shapechanger)
In fact, examining the split data frame, we isolate the three suspect rows.
'Race'].isin(['Fiend (Shapechanger)','fiend (shapechanger)'])] split_racealignment_df.loc[split_racealignment_df[
Name | Size | HP | Armor | Speed | Challenge rating (XP) | Race | Alignment | |
---|---|---|---|---|---|---|---|---|
174 | Incubus | Medium | 66 (12d8 + 12) | 15 (Natural Armor) | 30 ft., fly 60 ft. | 4 (1,100 XP) | Fiend (Shapechanger) | Neutral Evil |
268 | Succubus | Medium | 66 (12d8 + 12) | 15 (Natural Armor) | 30 ft., fly 60 ft. | 4 (1,100 XP) | Fiend (Shapechanger) | Neutral Evil |
269 | Succubus (Incubus) | Medium | 66 (12d8+12) | 15 (Natural Armor) | 30 ft., fly 60 ft. | 4 (1,100 XP) | fiend (shapechanger) | Neutral Evil |
It’s clear that index 269 duplicates the information in rows 174 and 268. (In the traditional interpretation of the game, a succcubus is female while an incubus is male. Both monsters are different.) Let’s go ahead and delete row 269.
=269, inplace=True) split_racealignment_df.drop(index
We proceed to the third alteration we want to make to our data frame. The Challenge rating (XP) column holds two distinct pieces of information. Let’s break that into Challenge and XP column, removing the original source column as we go.
def split_challengexp_column(df_orig, inplace=True):
"""
Splits the 'Challenge rating (XP)' column into 'Challenge' and 'XP'.
Args:
df_orig: The input DataFrame.
Returns:
A new DataFrame with the 'Challenge rating (XP)' column split.
"""
= df_orig.copy()
retval_df = retval_df['Challenge rating (XP)'].str.split(r' \(', expand=True)
split_columns_df 'Challenge'] = split_columns_df[0].astype(str).str.strip()
retval_df['XP'] = (split_columns_df[1].str.split(' XP\)', expand=True))[0].astype(str).str.strip()
retval_df[=['Challenge rating (XP)'], inplace=True)
retval_df.drop(columnsreturn retval_df
<>:14: SyntaxWarning: invalid escape sequence '\)'
<>:14: SyntaxWarning: invalid escape sequence '\)'
/tmp/ipykernel_10330/2349129103.py:14: SyntaxWarning: invalid escape sequence '\)'
retval_df['XP'] = (split_columns_df[1].str.split(' XP\)', expand=True))[0].astype(str).str.strip()
= split_challengexp_column(split_racealignment_df)
split_challengexp_column_df split_challengexp_column_df.head()
Name | Size | HP | Armor | Speed | Race | Alignment | Challenge | XP | |
---|---|---|---|---|---|---|---|---|---|
0 | Aboleth | Large | 135 (18d10+36) | 17 (Natural Armor) | 10 ft., swim 40 ft. | aberration | Lawful Evil | 10 | 5,900 |
1 | Acolyte | Medium | 9 (2d8) | 10 | 30 ft. | humanoid (any race) | Any Alignment | 1/4 | 50 |
2 | Adult Black Dragon | Huge | 195 (17d12+85) | 19 (Natural Armor) | 40 ft., fly 80 ft., swim 40 ft. | dragon | Chaotic Evil | 14 | 11,500 |
3 | Adult Blue Dragon | Huge | 225 (18d12+108) | 19 | 40 ft., burrow 30 ft., fly 80 ft. | dragon | Lawful Evil | 16 | 15,000 |
4 | Adult Brass Dragon | Huge | 172 (15d12+75) | 18 (Natural Armor) | 40 ft., burrow 40 ft., fly 80 ft. | dragon | Chaotic Good | 13 | 10,000 |
The last change we need to make to our data frame is to separate the Armor information into Armor_Class and related notes, called Armor_Info. In the process we create two new columns and remove the original one. Again we call on regular expressions but this time we use a more complex matching pattern: ()(\((.*)\))?
Our regex pattern has three components, or groups, of regular and special characters.
- () - Matches one or more integers representing the Armor_Class
- () - Matches zero or more white space.
- ((.*)\))? - Matches the Armor_Info, a parenthetical sequence made of a series of characters followed by a right parenthesis. For example, “(Leather Armor)”. The surrounding ( )? groups the pattern and indicates that we’re looking for at most a single occurence of this.
def split_armor_column(df_orig):
"""
Splits the 'Armor' column into 'Armor_Class' and 'Armor_Type'.
Handles cases where Armor has no type specified.
Args:
df_orig: The input DataFrame.
Returns:
A new DataFrame with the 'Armor' column split.
"""
= df_orig.copy()
retval_df
# Initialize new columns
'Armor_Class'] = None
retval_df['Armor_Info'] = None
retval_df[
for index, row in retval_df.iterrows():
= str(row['Armor']).strip()
armor_value
# Use regex to split, handling potential missing parentheses
= re.match(r'(\d+)(\s*)(\((.*)\))?', armor_value)
match if match:
'Armor_Class'] = match.group(1).strip()
retval_df.loc[index, # Check if there was a type specified in parentheses
if match.group(3):
'Armor_Info'] = match.group(3).strip().lstrip('(').rstrip(')')
retval_df.loc[index, else:
# If no type in parentheses, assume Natural Armor
'Armor_Info'] = "Natural Armor"
retval_df.loc[index, else:
# Handle cases that don't match the expected pattern
'Armor_Class'] = armor_value
retval_df.loc[index, 'Armor_Info'] = "Unknown" # Or handle as appropriate
retval_df.loc[index,
# Drop the original 'Armor' column after processing
=['Armor'], inplace=True)
retval_df.drop(columnsreturn retval_df
= split_armor_column(split_challengexp_column_df)
split_armor_column_df split_armor_column_df.head()
Name | Size | HP | Speed | Race | Alignment | Challenge | XP | Armor_Class | Armor_Info | |
---|---|---|---|---|---|---|---|---|---|---|
0 | Aboleth | Large | 135 (18d10+36) | 10 ft., swim 40 ft. | aberration | Lawful Evil | 10 | 5,900 | 17 | Natural Armor |
1 | Acolyte | Medium | 9 (2d8) | 30 ft. | humanoid (any race) | Any Alignment | 1/4 | 50 | 10 | Natural Armor |
2 | Adult Black Dragon | Huge | 195 (17d12+85) | 40 ft., fly 80 ft., swim 40 ft. | dragon | Chaotic Evil | 14 | 11,500 | 19 | Natural Armor |
3 | Adult Blue Dragon | Huge | 225 (18d12+108) | 40 ft., burrow 30 ft., fly 80 ft. | dragon | Lawful Evil | 16 | 15,000 | 19 | Natural Armor |
4 | Adult Brass Dragon | Huge | 172 (15d12+75) | 40 ft., burrow 40 ft., fly 80 ft. | dragon | Chaotic Good | 13 | 10,000 | 18 | Natural Armor |
We’ve reached a good checkpoint in our work. Before we draw things to a close, let’s store the cleaned-up data to a CSV file. As we don’t need the row headers – the index – we strip this away with the index=False option before writing the data frame to the file.
'sample_data/DnD5e_monsters_cleaned.csv', index=False) split_armor_column_df.to_csv(