Pandas Basic Analysis 04: File I/O

Python
Pandas
Jupyter
Data Analytics
Part four of data analysis with Python Pandas: file read and writes.
Author

Dennis Chua

Published

June 5, 2025

Open In Colab

Pandas 04: File Input and Output

Content Outline

  • Introduction
  • Read data from a CSV file
  • Store data to a binary file
  • Store data to a CSV text file

Introduction

Being able to load and store data, as well as display graphs and plots are crucial tasks in data analysis. Fortunately Pandas can handle a variety of file formats, from CSV, tab-delimited and comma-delimited text files, to HTML, JSON and HDF5. In this tutorial we’ll introduce some of the basics for handling CSV files.

Read Data From a CSV File

To read a CSV data file into a DataFrame, call the read_csv() function with the path to the CSV file, along with the appropriate keyword arguments * delimiter - This parameter specifies the character separating the data fields. The comma character (,) is the default. Other common delimiters include tabs (, semicolons (;), spaces (), or even custom characters. * header - Determines how column names are handled when reading the file data. By default (header = ‘infer’) Pandas assumes the first row in the file lists the column labels. If the CSV file doesn’t have labels in the first row (header = None), Pandas will assign the default numerical labels, starting with zero onwards. With (header = 0) Pandas will treat the first row as a list of column labels. * index_col - Set one of the columns of the CSV file to turn it into the index for the data frame (the label for the rows). * skiprows - Use this to skip rows in the CSV file. We can set it to a single integer or to a list of integers. * names - Used together with header. To ignore the first row and supply a list of column labels, pass (header = 0, names = [‘col_01’, ‘col_02’, etc.]) to Pandas.

Let’s run some examples using the CSV file with information about World War 2 leaders (“ww2_leaders.csv”).

Calling read_csv() with the default parameters instructs Pandas to read the CSV and treat the first row as the labels for the columns.

import pandas as pd

df = pd.read_csv("sample_data/ww2_leaders.csv")
print(type(df))
df
<class 'pandas.core.frame.DataFrame'>
Name Born Died Age Title Country
0 Franklin Roosevelt 1882-01-30 1945-04-12 63 President United States
1 Joseph Stalin 1878-12-06 1953-03-05 74 Great Leader Soviet Union
2 Adolph Hitler 1889-04-20 1945-04-30 56 Fuhrer Germany
3 Michinomiya Hirohito 1901-04-29 1989-01-07 87 Emperor Japan
4 Charles de Gaulle 1890-11-22 1970-11-09 79 President France
5 Winston Churchill 1874-11-30 1965-01-24 90 Prime Minister United Kingdom
6 Manuel Camacho 1897-04-24 1955-10-13 58 President Mexico
7 Jan Smuts 1870-05-24 1950-09-11 80 Prime Minister South Africa
8 Ibn Saud 1875-01-15 1953-11-09 78 King Saudi Arabia
9 Plaek Phibunsongkhram 1897-07-14 1965-06-11 66 Prime Minister Thailand
10 John Curtin 1885-01-08 1945-07-05 60 Prime Minister Australia
11 Haile Selassie 1892-07-23 1975-08-27 83 Emperor Ethiopia

Alternately we can ignore the headers supplied in the CSV file. Pandas will instead supply zero-based headers for the data frame columns.

df = pd.read_csv("sample_data/ww2_leaders.csv", header = None)
df.loc[0:5]
0 1 2 3 4 5
0 Name Born Died Age Title Country
1 Franklin Roosevelt 1882-01-30 1945-04-12 63 President United States
2 Joseph Stalin 1878-12-06 1953-03-05 74 Great Leader Soviet Union
3 Adolph Hitler 1889-04-20 1945-04-30 56 Fuhrer Germany
4 Michinomiya Hirohito 1901-04-29 1989-01-07 87 Emperor Japan
5 Charles de Gaulle 1890-11-22 1970-11-09 79 President France

We can also skip the row of columns labels in the CSV file altogether and substitute another list of labels.

df = pd.read_csv("sample_data/ww2_leaders.csv", header = 0, names = ['col_01', 'col_02', 'col_03', 'col_04', 'col_05', 'col_06'])
df.loc[0:5]
col_01 col_02 col_03 col_04 col_05 col_06
0 Franklin Roosevelt 1882-01-30 1945-04-12 63 President United States
1 Joseph Stalin 1878-12-06 1953-03-05 74 Great Leader Soviet Union
2 Adolph Hitler 1889-04-20 1945-04-30 56 Fuhrer Germany
3 Michinomiya Hirohito 1901-04-29 1989-01-07 87 Emperor Japan
4 Charles de Gaulle 1890-11-22 1970-11-09 79 President France
5 Winston Churchill 1874-11-30 1965-01-24 90 Prime Minister United Kingdom

Now let’s use the last column, ‘Country’, as the list for row indices. Note how we use the country names as the row labels in the loc[ ] operator.

df = pd.read_csv("sample_data/ww2_leaders.csv", index_col = 5)
df.loc['United States':'United Kingdom']
Name Born Died Age Title
Country
United States Franklin Roosevelt 1882-01-30 1945-04-12 63 President
Soviet Union Joseph Stalin 1878-12-06 1953-03-05 74 Great Leader
Germany Adolph Hitler 1889-04-20 1945-04-30 56 Fuhrer
Japan Michinomiya Hirohito 1901-04-29 1989-01-07 87 Emperor
France Charles de Gaulle 1890-11-22 1970-11-09 79 President
United Kingdom Winston Churchill 1874-11-30 1965-01-24 90 Prime Minister

So far the World War 2 leaders file we’re using is a comma-delimited data file. To show how we read a tab-delimited file, we’ll use this publicly available sample of office workers that we load from “office_workers.tsv”.

df = pd.read_csv("sample_data/office_workers.tsv", delimiter = '\t')
print(type(df))
df.loc[0:5]
<class 'pandas.core.frame.DataFrame'>
Name Position Office Age Start date Salary
0 Airi Satou Accountant Tokyo 33 2008-11-28 $162,700
1 Angelica Ramos Chief Executive Officer (CEO) London 47 2009-10-09 $1,200,000
2 Ashton Cox Junior Technical Author San Francisco 66 2009-01-12 $86,000
3 Bradley Greer Software Engineer London 41 2012-10-13 $132,000
4 Brenden Wagner Software Engineer San Francisco 28 2011-06-07 $206,850
5 Brielle Williamson Integration Specialist New York 61 2012-12-02 $372,000

Store Data to a Binary File

Pickle is a Python module for storage and retrieval of file data in binary format. The Pickle binary format is compact, therefore fast for I/O operations, but is not universally portable as Python is required.

Let’s derive a subset of our workers data frame. Let’s then save it to a pickle file.

df_pickle = df[['Name', 'Age']]
print(type(df_pickle))
pd.to_pickle(df_pickle, "sample_data/df_workers.pickle")
<class 'pandas.core.frame.DataFrame'>

Now let’s re-read the derived data frame from the pickle file. Note that unlike read_csv() Panda’s read_pickle() doesn’t have as much flexibility to ignore or alter column headers or to assign row indices.

df_unpickle = pd.read_pickle("sample_data/df_workers.pickle")
print(type(df_unpickle))
df_unpickle[0:5]
<class 'pandas.core.frame.DataFrame'>
Name Age
0 Airi Satou 33
1 Angelica Ramos 47
2 Ashton Cox 66
3 Bradley Greer 41
4 Brenden Wagner 28

Likewise we can take a set of series and perform binary I/O to and from a pickle file.

ser_pickle = df.loc[10]
print(type(ser_pickle))
pd.to_pickle(ser_pickle, "sample_data/ser_workers.pickle")
<class 'pandas.core.series.Series'>
ser_unpickle = pd.read_pickle("sample_data/ser_workers.pickle")
print(type(ser_unpickle))
ser_unpickle
<class 'pandas.core.series.Series'>
Name            Charde Marshall
Position      Regional Director
Office            San Francisco
Age                          36
Start date           2008-10-16
Salary                 $470,600
Name: 10, dtype: object

Store Data to a CSV Text File

Just as we can use read_csv() to load comma-delimited data files, we can also write out using that format. The Pandas command is simple: to_csv().

df_text = df[['Name', 'Age']]
df_text.to_csv("sample_data/df_workers.csv")

If we were to substitute the comma delimiter with another character, we simply supply a sep parameter.

df_text.to_csv("sample_data/df_workers.tsv", sep = '\t')

Finally Pandas lets us write out our data to Excel format.

df_text.to_excel("sample_data/df_workers.xlsx", sheet_name = "wokers", index = False)

The sheet_name and index parameters are optional. As the name implies, sheet_name stores our data to a new Excel sheet. Without index, Pandas will store our set of row indices as a column in the file. By setting this to False, we omit the row headers.