Common Data Input Methods in Pandas

This notebook demonstrates the most common data input methods in Pandas.

Read CSV File

One of the most common ways to populate a pandas DataFrame is to read data from a comma-separated values (CSV) file.

Pandas provides a native function read_csv to perform this task. Though it supports rich features, the most common and simplest way is to pass it the file name, and it will read the file into a DataFrame:

```python
df = pd.read_csv(file_name)
df.head()
```

```python
import pandas as pd
```

```python
file_name = 'iris.csv'
df = pd.read_csv(file_name)
df.head()
```

sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa

If the file does not have column names in its first row, we can specify them with the names parameter:

```python
file_name = 'Running.csv'
df = pd.read_csv(file_name, names=['DateTime', 'HeartRate', 'Pace'])
df.head()
```

DateTime HeartRate Pace
0 2018-02-22T02:24:14Z 100 2.6
1 2018-02-22T02:24:16Z 104 2.7
2 2018-02-22T02:24:18Z 104 2.6
3 2018-02-22T02:24:20Z 107 2.6
4 2018-02-22T02:24:22Z 108 2.5

Some other common/useful parameters (a combined example follows the list):

  • sep: str, default ','
    useful when the data file uses another separator such as a space, '|', '\t', etc.

  • skiprows: list-like, int or callable, optional
    useful when the file has some explanatory rows at the top

  • skipfooter: int, default 0
    useful when the file has some extra info at the end

  • nrows: int, optional
    useful when you only want to read part of a large file
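
A minimal sketch of these parameters in action; the file names and layouts here are hypothetical:

```python
# Hypothetical '|'-separated file with two explanatory rows at the top
# and one summary row at the end; skipfooter requires the python engine.
df_pipe = pd.read_csv('data.txt', sep='|', skiprows=2,
                      skipfooter=1, engine='python')

# Hypothetical large file: read only the first 1000 rows.
# (Note that nrows cannot be combined with skipfooter.)
df_sample = pd.read_csv('big_file.csv', nrows=1000)
```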

Although read_csv has many parameters for date parsing, I personally prefer to read dates as plain strings in a first step, then parse them separately with a dedicated function. My favourite trick is to employ a lambda function, as sketched below.
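
For example, a minimal sketch of that approach on the Running.csv data loaded above; the format string is assumed from the sample output:

```python
# Parse the DateTime column after loading; the ISO 8601 format with a
# trailing 'Z' is assumed from the sample output above.
df['DateTime'] = df['DateTime'].apply(
    lambda s: pd.to_datetime(s, format='%Y-%m-%dT%H:%M:%SZ'))
```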

We can also pass a URL instead of a local file name, as below.

```python
url = 'http://cocl.us/Geospatial_data'
geo_info = pd.read_csv(url)
```

```python
geo_info.head()
```

Postal Code Latitude Longitude
0 M1B 43.806686 -79.194353
1 M1C 43.784535 -79.160497
2 M1E 43.763573 -79.188711
3 M1G 43.770992 -79.216917
4 M1H 43.773136 -79.239476

Read Excel File

The function read_excel() provides the capability to read Excel files into a pandas DataFrame. Similar to read_csv, the most basic use case is passing the file name of the Excel file and the sheet_name; it will read the sheet into a DataFrame:

```python
df = pd.read_excel(file_name, 'Sheet1')
df.head()
```

```python
df = pd.read_excel('SYS_DD.xlsx', 'Sheet1')
df.head()
```

sys_dd_id dd_category_cd dd_value dd_name
0 101 GPRS_DATA_TYPE R1 数据上传
1 102 GPRS_DATA_TYPE R2 保留
2 103 GPRS_DATA_TYPE R3 校准上传
3 104 GPRS_DATA_TYPE R4 版本上传
4 105 GPRS_DATA_TYPE R5 传感器限值上传
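
As a side note, sheet_name can also be a sheet index, a list of sheets, or None; with sheet_name=None, read_excel returns a dict of DataFrames keyed by sheet name. A minimal sketch, reusing the same file:

```python
# sheet_name=None loads every sheet into a dict keyed by sheet name.
sheets = pd.read_excel('SYS_DD.xlsx', sheet_name=None)
for name, sheet_df in sheets.items():
    print(name, sheet_df.shape)
```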

Read HTML Page

The function read_html() is not as commonly used as read_csv/read_excel, but it can make the job really easy when you want to fetch table-like data from a website.

This function searches for <table> elements, and only for <tr> and <th> rows and <td> elements within each <tr> or <th> element in the table. <td> stands for "table data".

Please note that the function will always either return a list of DataFrames or fail; it will NOT return an empty list.

```python
try:
    dfs = pd.read_html(url)
except Exception as e:
    print('Error:{}'.format(e))
```
```python
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
try:
    dfs = pd.read_html(url)
    print('Found {} tables in URL:{}'.format(len(dfs), url))
except Exception as e:
    print('Error:{}'.format(e))
```

Found 3 tables in URL:https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

Now we can safely check what's inside these DataFrames and find out which one we want.

```python
for idx, df in enumerate(dfs):
    print('DataFrame[{}]:{}'.format(idx, df.shape))

dfs[0].head()
```

DataFrame[0]:(180, 3)
DataFrame[1]:(4, 18)
DataFrame[2]:(2, 18)

Postal Code Borough Neighborhood
0 M1A Not assigned NaN
1 M2A Not assigned NaN
2 M3A North York Parkwoods
3 M4A North York Victoria Village
4 M5A Downtown Toronto Regent Park, Harbourfront

Let's wrap read_html in a function and check the following scenarios:
(1) URL not found
(2) No table in the specified URL

```python
def demo_read_html(url):
    try:
        dfs = pd.read_html(url)
        print('Found {} tables in URL:{}'.format(len(dfs), url))
        return dfs
    except Exception as e:
        print('Error:{}'.format(e))
        return None
```

(1) URL not found

```python
url = 'https://en.wikipedia.org/wiki/xxxxx'
ret = demo_read_html(url)
```

Error:HTTP Error 404: Not Found

(2) No table in the specified URL

```python
url = 'https://maps.google.com'
ret = demo_read_html(url)
```

Error:No tables found

Read Data from DB

A database is the most important data source for many applications; most companies persist their business data in various databases. In this notebook, we will demonstrate how to use the pandas function read_sql to fetch data from a PostgreSQL database.

The usage of read_sql is very similar to the functions described above:

```python
df = pd.read_sql(' select * from oridata ', conn)
df.head()
```

Assume that we have already set up the PostgreSQL database with data populated in the table **oridata**, and that the PostgreSQL adapter **Psycopg** is installed as well.

```python
import psycopg2 as pg
```

For data security, the best practice is to never hard-code the user name and password in the code. Instead, we employ a simple trick here: save the credentials in a local file and make sure that file is not visible to anyone else. More advanced security solutions are out of the scope of this demo and will be discussed separately.

```python
import pickle

with open('db.info', 'rb') as input:
    dbConnInfo = pickle.load(input)

dbConnInfo.database
```

'zlydb'

Now we can connect to the database 'zlydb' and read the table oridata from it.

```python
conn = pg.connect(database=dbConnInfo.database, user=dbConnInfo.user, password=dbConnInfo.password)
```

```python
df = pd.read_sql(' select * from oridata ', conn)
df.head()
```

stockid dt jbcj
0 SH600000 2018-01-02 8954.56
1 SH600000 2018-01-03 10182.75
2 SH600000 2018-01-04 7726.19
3 SH600000 2018-01-05 8126.44
4 SH600000 2018-01-08 9226.89
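
As a side note, read_sql also accepts query parameters, which is safer than string formatting when the query depends on user input. A minimal sketch reusing the connection above; the %(name)s placeholder style is the one psycopg2 uses:

```python
# Parameterized query: psycopg2 substitutes %(stock)s safely.
df_one = pd.read_sql(
    'select * from oridata where stockid = %(stock)s',
    conn,
    params={'stock': 'SH600000'},
)
```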

Appendix

Below please find the Python code to prepare the database credential file:

```python
import pickle

class dbinfo:
    database = "your DB Name"
    user = "your DB user name"
    password = "your password"

# Note: pickle stores classes by reference (module and name), so the
# loading code must be able to resolve 'dbinfo' when it runs in a
# different session (e.g. by defining the class there as well).
with open('db.info', 'wb') as output:
    pickle.dump(dbinfo, output)
```

You may also add the file to your .gitignore to avoid it being pushed to a public repo:

```shell
$ echo db.info >> .gitignore
```