Pandas Basic Analysis 04: File I/O

Python

Pandas

Jupyter

Data Analytics

Part four of data analysis with Python Pandas: file read and writes.

Author

Dennis Chua

Published

June 5, 2025

Pandas 04: File Input and Output

Content Outline

Introduction
Read data from a CSV file
Store data to a binary file
Store data to a CSV text file

Jupyter notebook

WW2 Leaders CSV

Office Workers TSV

Introduction

Being able to load and store data, as well as display graphs and plots are crucial tasks in data analysis. Fortunately Pandas can handle a variety of file formats, from CSV, tab-delimited and comma-delimited text files, to HTML, JSON and HDF5. In this tutorial we’ll introduce some of the basics for handling CSV files.

Read Data From a CSV File

To read a CSV data file into a DataFrame, call the read_csv() function with the path to the CSV file, along with the appropriate keyword arguments * delimiter - This parameter specifies the character separating the data fields. The comma character (,) is the default. Other common delimiters include tabs (, semicolons (;), spaces (), or even custom characters. * header - Determines how column names are handled when reading the file data. By default (header = ‘infer’) Pandas assumes the first row in the file lists the column labels. If the CSV file doesn’t have labels in the first row (header = None), Pandas will assign the default numerical labels, starting with zero onwards. With (header = 0) Pandas will treat the first row as a list of column labels. * index_col - Set one of the columns of the CSV file to turn it into the index for the data frame (the label for the rows). * skiprows - Use this to skip rows in the CSV file. We can set it to a single integer or to a list of integers. * names - Used together with header. To ignore the first row and supply a list of column labels, pass (header = 0, names = [‘col_01’, ‘col_02’, etc.]) to Pandas.

Let’s run some examples using the CSV file with information about World War 2 leaders (“ww2_leaders.csv”).

Calling read_csv() with the default parameters instructs Pandas to read the CSV and treat the first row as the labels for the columns.

import pandas as pd

df = pd.read_csv("sample_data/ww2_leaders.csv")
print(type(df))
df

<class 'pandas.core.frame.DataFrame'>

	Name	Born	Died	Age	Title	Country
0	Franklin Roosevelt	1882-01-30	1945-04-12	63	President	United States
1	Joseph Stalin	1878-12-06	1953-03-05	74	Great Leader	Soviet Union
2	Adolph Hitler	1889-04-20	1945-04-30	56	Fuhrer	Germany
3	Michinomiya Hirohito	1901-04-29	1989-01-07	87	Emperor	Japan
4	Charles de Gaulle	1890-11-22	1970-11-09	79	President	France
5	Winston Churchill	1874-11-30	1965-01-24	90	Prime Minister	United Kingdom
6	Manuel Camacho	1897-04-24	1955-10-13	58	President	Mexico
7	Jan Smuts	1870-05-24	1950-09-11	80	Prime Minister	South Africa
8	Ibn Saud	1875-01-15	1953-11-09	78	King	Saudi Arabia
9	Plaek Phibunsongkhram	1897-07-14	1965-06-11	66	Prime Minister	Thailand
10	John Curtin	1885-01-08	1945-07-05	60	Prime Minister	Australia
11	Haile Selassie	1892-07-23	1975-08-27	83	Emperor	Ethiopia

Alternately we can ignore the headers supplied in the CSV file. Pandas will instead supply zero-based headers for the data frame columns.

df = pd.read_csv("sample_data/ww2_leaders.csv", header = None)
df.loc[0:5]

	0	1	2	3	4	5
0	Name	Born	Died	Age	Title	Country
1	Franklin Roosevelt	1882-01-30	1945-04-12	63	President	United States
2	Joseph Stalin	1878-12-06	1953-03-05	74	Great Leader	Soviet Union
3	Adolph Hitler	1889-04-20	1945-04-30	56	Fuhrer	Germany
4	Michinomiya Hirohito	1901-04-29	1989-01-07	87	Emperor	Japan
5	Charles de Gaulle	1890-11-22	1970-11-09	79	President	France

We can also skip the row of columns labels in the CSV file altogether and substitute another list of labels.

df = pd.read_csv("sample_data/ww2_leaders.csv", header = 0, names = ['col_01', 'col_02', 'col_03', 'col_04', 'col_05', 'col_06'])
df.loc[0:5]

	col_01	col_02	col_03	col_04	col_05	col_06
0	Franklin Roosevelt	1882-01-30	1945-04-12	63	President	United States
1	Joseph Stalin	1878-12-06	1953-03-05	74	Great Leader	Soviet Union
2	Adolph Hitler	1889-04-20	1945-04-30	56	Fuhrer	Germany
3	Michinomiya Hirohito	1901-04-29	1989-01-07	87	Emperor	Japan
4	Charles de Gaulle	1890-11-22	1970-11-09	79	President	France
5	Winston Churchill	1874-11-30	1965-01-24	90	Prime Minister	United Kingdom

Now let’s use the last column, ‘Country’, as the list for row indices. Note how we use the country names as the row labels in the loc[ ] operator.

df = pd.read_csv("sample_data/ww2_leaders.csv", index_col = 5)
df.loc['United States':'United Kingdom']

	Name	Born	Died	Age	Title
Country
United States	Franklin Roosevelt	1882-01-30	1945-04-12	63	President
Soviet Union	Joseph Stalin	1878-12-06	1953-03-05	74	Great Leader
Germany	Adolph Hitler	1889-04-20	1945-04-30	56	Fuhrer
Japan	Michinomiya Hirohito	1901-04-29	1989-01-07	87	Emperor
France	Charles de Gaulle	1890-11-22	1970-11-09	79	President
United Kingdom	Winston Churchill	1874-11-30	1965-01-24	90	Prime Minister

So far the World War 2 leaders file we’re using is a comma-delimited data file. To show how we read a tab-delimited file, we’ll use this publicly available sample of office workers that we load from “office_workers.tsv”.

df = pd.read_csv("sample_data/office_workers.tsv", delimiter = '\t')
print(type(df))
df.loc[0:5]

<class 'pandas.core.frame.DataFrame'>

	Name	Position	Office	Age	Start date	Salary
0	Airi Satou	Accountant	Tokyo	33	2008-11-28	$162,700
1	Angelica Ramos	Chief Executive Officer (CEO)	London	47	2009-10-09	$1,200,000
2	Ashton Cox	Junior Technical Author	San Francisco	66	2009-01-12	$86,000
3	Bradley Greer	Software Engineer	London	41	2012-10-13	$132,000
4	Brenden Wagner	Software Engineer	San Francisco	28	2011-06-07	$206,850
5	Brielle Williamson	Integration Specialist	New York	61	2012-12-02	$372,000

Store Data to a Binary File

Pickle is a Python module for storage and retrieval of file data in binary format. The Pickle binary format is compact, therefore fast for I/O operations, but is not universally portable as Python is required.

Let’s derive a subset of our workers data frame. Let’s then save it to a pickle file.

df_pickle = df[['Name', 'Age']]
print(type(df_pickle))
pd.to_pickle(df_pickle, "sample_data/df_workers.pickle")

<class 'pandas.core.frame.DataFrame'>

Now let’s re-read the derived data frame from the pickle file. Note that unlike read_csv() Panda’s read_pickle() doesn’t have as much flexibility to ignore or alter column headers or to assign row indices.

df_unpickle = pd.read_pickle("sample_data/df_workers.pickle")
print(type(df_unpickle))
df_unpickle[0:5]

<class 'pandas.core.frame.DataFrame'>

	Name	Age
0	Airi Satou	33
1	Angelica Ramos	47
2	Ashton Cox	66
3	Bradley Greer	41
4	Brenden Wagner	28

Likewise we can take a set of series and perform binary I/O to and from a pickle file.

ser_pickle = df.loc[10]
print(type(ser_pickle))
pd.to_pickle(ser_pickle, "sample_data/ser_workers.pickle")

<class 'pandas.core.series.Series'>

ser_unpickle = pd.read_pickle("sample_data/ser_workers.pickle")
print(type(ser_unpickle))
ser_unpickle

<class 'pandas.core.series.Series'>

Name            Charde Marshall
Position      Regional Director
Office            San Francisco
Age                          36
Start date           2008-10-16
Salary                 $470,600
Name: 10, dtype: object

Store Data to a CSV Text File

Just as we can use read_csv() to load comma-delimited data files, we can also write out using that format. The Pandas command is simple: to_csv().

df_text = df[['Name', 'Age']]

df_text.to_csv("sample_data/df_workers.csv")

If we were to substitute the comma delimiter with another character, we simply supply a sep parameter.

df_text.to_csv("sample_data/df_workers.tsv", sep = '\t')

Finally Pandas lets us write out our data to Excel format.

df_text.to_excel("sample_data/df_workers.xlsx", sheet_name = "wokers", index = False)

The sheet_name and index parameters are optional. As the name implies, sheet_name stores our data to a new Excel sheet. Without index, Pandas will store our set of row indices as a column in the file. By setting this to False, we omit the row headers.