Common Data Input Methods in Pandas

This notebook demonstrates the most common data input methods in Pandas.

Read CSV File

One of the most common ways to populate a pandas DataFrame is to read data from a comma-separated values (CSV) file.

Pandas provides a native function read_csv to perform this task. Though it supports rich features, the most common and simplest way is to pass it the file name, and it will read the file into a DataFrame:

```python
df = pd.read_csv(file_name)
df.head()
```

```python
import pandas as pd
```

```python
file_name = 'iris.csv'
df = pd.read_csv(file_name)
df.head()
```

sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa

If the file does not have column names in its first row, we can specify them with the names parameter:

```python
file_name = 'Running.csv'
df = pd.read_csv(file_name, names=['DateTime', 'HeartRate', 'Pace'])
df.head()
```

DateTime HeartRate Pace
0 2018-02-22T02:24:14Z 100 2.6
1 2018-02-22T02:24:16Z 104 2.7
2 2018-02-22T02:24:18Z 104 2.6
3 2018-02-22T02:24:20Z 107 2.6
4 2018-02-22T02:24:22Z 108 2.5

Some other common/useful parameters (a combined example follows the list):

  • sep: str, default ','
    useful when the data file uses another separator such as a space, '|', '\t', etc.

  • skiprows: list-like, int or callable, optional
    useful when the file has some explanatory rows at the top

  • skipfooter: int, default 0
    useful when the file has some extra info at the end

  • nrows: int, optional
    useful when you only want to read part of a large file
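
A minimal sketch of these parameters in action; the file names and layouts here are hypothetical:

```python
# Hypothetical '|'-separated file with two explanatory rows at the top
# and one summary row at the end; skipfooter requires the python engine.
df_pipe = pd.read_csv('data.txt', sep='|', skiprows=2,
                      skipfooter=1, engine='python')

# Hypothetical large file: read only the first 1000 rows.
# (Note that nrows cannot be combined with skipfooter.)
df_sample = pd.read_csv('big_file.csv', nrows=1000)
```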

Although read_csv has many parameters for date parsing, I personally prefer to read dates as plain strings in a first step, then parse them separately with a dedicated function. My favourite trick is to employ a lambda function, as sketched below.
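
For example, a minimal sketch of that approach on the Running.csv data loaded above; the format string is assumed from the sample output:

```python
# Parse the DateTime column after loading; the ISO 8601 format with a
# trailing 'Z' is assumed from the sample output above.
df['DateTime'] = df['DateTime'].apply(
    lambda s: pd.to_datetime(s, format='%Y-%m-%dT%H:%M:%SZ'))
```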

We can also pass a URL instead of a local file name, as below.

```python
url = 'http://cocl.us/Geospatial_data'
geo_info = pd.read_csv(url)
```

```python
geo_info.head()
```

Postal Code Latitude Longitude
0 M1B 43.806686 -79.194353
1 M1C 43.784535 -79.160497
2 M1E 43.763573 -79.188711
3 M1G 43.770992 -79.216917
4 M1H 43.773136 -79.239476

Read Excel File

The function read_excel() provides the capability to read Excel files into a pandas DataFrame. Similar to read_csv, the most basic use case is passing the file name of the Excel file and the sheet_name; it will read the sheet into a DataFrame:

```python
df = pd.read_excel(file_name, 'Sheet1')
df.head()
```

```python
df = pd.read_excel('SYS_DD.xlsx', 'Sheet1')
df.head()
```

sys_dd_id dd_category_cd dd_value dd_name
0 101 GPRS_DATA_TYPE R1 数据上传
1 102 GPRS_DATA_TYPE R2 保留
2 103 GPRS_DATA_TYPE R3 校准上传
3 104 GPRS_DATA_TYPE R4 版本上传
4 105 GPRS_DATA_TYPE R5 传感器限值上传
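
As a side note, sheet_name can also be a sheet index, a list of sheets, or None; with sheet_name=None, read_excel returns a dict of DataFrames keyed by sheet name. A minimal sketch, reusing the same file:

```python
# sheet_name=None loads every sheet into a dict keyed by sheet name.
sheets = pd.read_excel('SYS_DD.xlsx', sheet_name=None)
for name, sheet_df in sheets.items():
    print(name, sheet_df.shape)
```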

Read HTML Page

The function read_html() is not as commonly used as read_csv/read_excel, but it can make the job really easy when you want to fetch table-like data from a website.

This function searches for <table> elements, and only for <tr> and <th> rows and <td> elements within each <tr> or <th> element in the table. <td> stands for "table data".

Please note that the function will always either return a list of DataFrames or fail; it will NOT return an empty list.

```python
try:
    dfs = pd.read_html(url)
except Exception as e:
    print('Error:{}'.format(e))
```
```python
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
try:
    dfs = pd.read_html(url)
    print('Found {} tables in URL:{}'.format(len(dfs), url))
except Exception as e:
    print('Error:{}'.format(e))
```

Found 3 tables in URL:https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

Now we can safely check what's inside these DataFrames and find out which one we want.

```python
for idx, df in enumerate(dfs):
    print('DataFrame[{}]:{}'.format(idx, df.shape))

dfs[0].head()
```

DataFrame[0]:(180, 3)
DataFrame[1]:(4, 18)
DataFrame[2]:(2, 18)

Postal Code Borough Neighborhood
0 M1A Not assigned NaN
1 M2A Not assigned NaN
2 M3A North York Parkwoods
3 M4A North York Victoria Village
4 M5A Downtown Toronto Regent Park, Harbourfront

Let's wrap read_html in a function and check the following scenarios:
(1) URL not found
(2) No table in the specified URL

```python
def demo_read_html(url):
    try:
        dfs = pd.read_html(url)
        print('Found {} tables in URL:{}'.format(len(dfs), url))
        return dfs
    except Exception as e:
        print('Error:{}'.format(e))
        return None
```

(1) URL not found

```python
url = 'https://en.wikipedia.org/wiki/xxxxx'
ret = demo_read_html(url)
```

Error:HTTP Error 404: Not Found

(2) No table in the specified URL

```python
url = 'https://maps.google.com'
ret = demo_read_html(url)
```

Error:No tables found

Read Data from DB

A database is the most important data source for many applications; most companies persist their business data in various databases. In this notebook, we will demonstrate how to use the pandas function read_sql to fetch data from a PostgreSQL database.

The usage of read_sql is very similar to the functions described above:

```python
df = pd.read_sql(' select * from oridata ', conn)
df.head()
```

Assume that we have already set up the PostgreSQL database with data populated in the table **oridata**, and that the PostgreSQL adapter **Psycopg** is installed as well.

```python
import psycopg2 as pg
```

For data security, the best practice is to never hard-code the user name and password in the code. Instead, we employ a simple trick here: save the credentials in a local file and make sure that file is not visible to anyone else. More advanced security solutions are out of the scope of this demo and will be discussed separately.

```python
import pickle

with open('db.info', 'rb') as input:
    dbConnInfo = pickle.load(input)

dbConnInfo.database
```

'zlydb'

Now we can connect to the database 'zlydb' and read the table oridata from it.

```python
conn = pg.connect(database=dbConnInfo.database, user=dbConnInfo.user, password=dbConnInfo.password)
```

```python
df = pd.read_sql(' select * from oridata ', conn)
df.head()
```

stockid dt jbcj
0 SH600000 2018-01-02 8954.56
1 SH600000 2018-01-03 10182.75
2 SH600000 2018-01-04 7726.19
3 SH600000 2018-01-05 8126.44
4 SH600000 2018-01-08 9226.89
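
As a side note, read_sql also accepts query parameters, which is safer than string formatting when the query depends on user input. A minimal sketch reusing the connection above; the %(name)s placeholder style is the one psycopg2 uses:

```python
# Parameterized query: psycopg2 substitutes %(stock)s safely.
df_one = pd.read_sql(
    'select * from oridata where stockid = %(stock)s',
    conn,
    params={'stock': 'SH600000'},
)
```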

Appendix

Below please find the Python code to prepare the database credential file:

```python
import pickle

class dbinfo:
    database = "your DB Name"
    user = "your DB user name"
    password = "your password"

# Note: pickle stores classes by reference (module and name), so the
# loading code must be able to resolve 'dbinfo' when it runs in a
# different session (e.g. by defining the class there as well).
with open('db.info', 'wb') as output:
    pickle.dump(dbinfo, output)
```

You may also add the file to your .gitignore to avoid it being pushed to a public repo:

```shell
$ echo db.info >> .gitignore
```