Being able to load and store data, as well as display graphs and plots are crucial tasks in data analysis. Fortunately Pandas can handle a variety of file formats, from CSV, tab-delimited and comma-delimited text files, to HTML, JSON and HDF5. In this tutorial we’ll introduce some of the basics for handling CSV files.
Read Data From a CSV File
To read a CSV data file into a DataFrame, call the read_csv() function with the path to the CSV file, along with the appropriate keyword arguments * delimiter - This parameter specifies the character separating the data fields. The comma character (,) is the default. Other common delimiters include tabs (, semicolons (;), spaces (), or even custom characters. * header - Determines how column names are handled when reading the file data. By default (header = ‘infer’) Pandas assumes the first row in the file lists the column labels. If the CSV file doesn’t have labels in the first row (header = None), Pandas will assign the default numerical labels, starting with zero onwards. With (header = 0) Pandas will treat the first row as a list of column labels. * index_col - Set one of the columns of the CSV file to turn it into the index for the data frame (the label for the rows). * skiprows - Use this to skip rows in the CSV file. We can set it to a single integer or to a list of integers. * names - Used together with header. To ignore the first row and supply a list of column labels, pass (header = 0, names = [‘col_01’, ‘col_02’, etc.]) to Pandas.
Let’s run some examples using the CSV file with information about World War 2 leaders (“ww2_leaders.csv”).
Calling read_csv() with the default parameters instructs Pandas to read the CSV and treat the first row as the labels for the columns.
import pandas as pddf = pd.read_csv("sample_data/ww2_leaders.csv")print(type(df))df
<class 'pandas.core.frame.DataFrame'>
Name
Born
Died
Age
Title
Country
0
Franklin Roosevelt
1882-01-30
1945-04-12
63
President
United States
1
Joseph Stalin
1878-12-06
1953-03-05
74
Great Leader
Soviet Union
2
Adolph Hitler
1889-04-20
1945-04-30
56
Fuhrer
Germany
3
Michinomiya Hirohito
1901-04-29
1989-01-07
87
Emperor
Japan
4
Charles de Gaulle
1890-11-22
1970-11-09
79
President
France
5
Winston Churchill
1874-11-30
1965-01-24
90
Prime Minister
United Kingdom
6
Manuel Camacho
1897-04-24
1955-10-13
58
President
Mexico
7
Jan Smuts
1870-05-24
1950-09-11
80
Prime Minister
South Africa
8
Ibn Saud
1875-01-15
1953-11-09
78
King
Saudi Arabia
9
Plaek Phibunsongkhram
1897-07-14
1965-06-11
66
Prime Minister
Thailand
10
John Curtin
1885-01-08
1945-07-05
60
Prime Minister
Australia
11
Haile Selassie
1892-07-23
1975-08-27
83
Emperor
Ethiopia
Alternately we can ignore the headers supplied in the CSV file. Pandas will instead supply zero-based headers for the data frame columns.
So far the World War 2 leaders file we’re using is a comma-delimited data file. To show how we read a tab-delimited file, we’ll use this publicly available sample of office workers that we load from “office_workers.tsv”.
Pickle is a Python module for storage and retrieval of file data in binary format. The Pickle binary format is compact, therefore fast for I/O operations, but is not universally portable as Python is required.
Let’s derive a subset of our workers data frame. Let’s then save it to a pickle file.
Now let’s re-read the derived data frame from the pickle file. Note that unlike read_csv() Panda’s read_pickle() doesn’t have as much flexibility to ignore or alter column headers or to assign row indices.
Finally Pandas lets us write out our data to Excel format.
df_text.to_excel("sample_data/df_workers.xlsx", sheet_name ="wokers", index =False)
The sheet_name and index parameters are optional. As the name implies, sheet_name stores our data to a new Excel sheet. Without index, Pandas will store our set of row indices as a column in the file. By setting this to False, we omit the row headers.