One of the most common ways to populate a pandas DataFrame is to read data from a comma-separated values (CSV) file.
Pandas provides a native function read_csv to perform this task. Though it supports rich features, the most common and simplest way is to pass it the file name, and it will read the file into a DataFrame:
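A minimal sketch, assuming a local file named `data.csv` (the file name is a placeholder):

```python
import pandas as pd

# read the whole CSV file into a DataFrame
df = pd.read_csv('data.csv')
df.head()
```

Beyond this basic call, a few parameters are worth knowing: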
- `sep` (str, default ','): useful when the data file uses another separator, like a space, '|', '\t', etc.
- `skiprows` (list-like, int or callable, optional): useful when the file has some explanatory rows at the top
- `skipfooter` (int, default 0): useful when the file has some extra info at the end
- `nrows` (int, optional): useful when you just want to read part of a large file (see the combined sketch below)
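For example (a sketch only; the file name and its layout are assumptions), these parameters can be used as follows. Note that `skipfooter` requires the Python engine and cannot be combined with `nrows`:

```python
# hypothetical pipe-delimited file with two explanatory rows at the top
df = pd.read_csv('data.txt', sep='|', skiprows=2)

# drop a one-line footer; skipfooter needs engine='python'
df = pd.read_csv('data.txt', sep='|', skiprows=2, skipfooter=1, engine='python')

# read only the first 100 rows of a large file
df = pd.read_csv('data.txt', sep='|', nrows=100)
```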
Although we have many parameters to deal with date parsing, personally I prefer to read the dates as-is in a first step, then apply some specific function to parse them separately. My favourite is to employ a lambda function to do the trick.
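A minimal sketch of this approach; the column name `dt` and the date format are assumptions:

```python
# step 1: read the date column as a plain string
df = pd.read_csv('data.csv', dtype={'dt': str})

# step 2: parse it separately with a lambda
df['dt'] = df['dt'].apply(lambda s: pd.to_datetime(s, format='%Y-%m-%d'))
```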
We can also specify a URL as the file name, as shown below.
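The URL here is a placeholder for any CSV file served over HTTP:

```python
url = 'https://example.com/data.csv'  # placeholder URL
df = pd.read_csv(url)
```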
The function read_excel() provides the capability to read Excel files into a pandas DataFrame. Similar to read_csv, the most basic use case is passing the file name of the Excel file, plus the sheet_name, and it will read the file into a DataFrame:
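A minimal sketch, assuming a file `data.xlsx` with a sheet named 'Sheet1' (both names are placeholders):

```python
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
df.head()
```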
The function read_html() is not as common as read_csv/read_excel, but sometimes it really eases the job when you want to fetch table-like data from a website.
This function searches for <table> elements, and only for <tr> and <th> rows and <td> elements within each <tr> or <th> element in the table. <td> stands for "table data".
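To see which elements read_html picks up, here is a tiny sketch using an inline HTML string (an HTML parser such as lxml must be installed; the table content is made up):

```python
from io import StringIO

html = """
<table>
  <tr><th>city</th><th>code</th></tr>
  <tr><td>Toronto</td><td>M5A</td></tr>
</table>
"""

# parse every <table> found in the HTML string
dfs = pd.read_html(StringIO(html))
print(dfs[0])
```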
Please note that the function will always either return a list of DataFrames or fail; it will NOT return an empty list.
```python
try:
    dfs = pd.read_html(url)
except Exception as e:
    print('Error:{}'.format(e))
```
```python
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
try:
    dfs = pd.read_html(url)
    print('Found {} tables in URL:{}'.format(len(dfs), url))
except Exception as e:
    print('Error:{}'.format(e))
```
Found 3 tables in URL:https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
Now we can safely check what's inside these DataFrames and find out which one we want.
```python
for idx, df in enumerate(dfs):
    print('DataFrame[{}]:{}'.format(idx, df.shape))
```
Let's wrap read_html in a function and check the following scenarios: (1) URL not found; (2) no table at the specified URL.
```python
def demo_read_html(url):
    try:
        dfs = pd.read_html(url)
        print('Found {} tables in URL:{}'.format(len(dfs), url))
        return dfs
    except Exception as e:
        print('Error:{}'.format(e))
        return None
```
(1) URL not found
```python
url = 'https://en.wikipedia.org/wiki/xxxxx'
ret = demo_read_html(url)
```
Error:HTTP Error 404: Not Found
(2) No table at the specified URL
```python
url = 'https://maps.google.com'
ret = demo_read_html(url)
```
Error:No tables found
Read Data from DB
Databases are the most important data source for many applications. Most companies persist their business data in various databases. In this notebook, we will demonstrate how to use the pandas function read_sql to fetch data from a PostgreSQL database.
The usage of read_sql is very similar to that of the functions described above:
```python
df = pd.read_sql(' select * from oridata ', conn)
df.head()
```
Assume that we have already set up the PostgreSQL database with data populated in the table **oridata**, and that the PostgreSQL adapter **Psycopg** is installed as well.
```python
import psycopg2 as pg
```
For data security, the best practice is to never hard-code the user name and password in the code. Instead, we employ a simple trick here: save the credentials in a local file, and make sure this file is not visible to anyone else. More advanced security solutions are out of the scope of this demo and will be discussed separately.
```python
import pickle

# load the pickled credentials object from the local file
with open('db.info', 'rb') as f:
    dbConnInfo = pickle.load(f)

dbConnInfo.database
```
'zlydb'
Now we can connect to the database 'zlydb' and read the table oridata from it.
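The connection call itself is not shown in the original snippet; a minimal sketch, using the attributes the appendix class defines and assuming the server runs on the local machine:

```python
# build the connection from the loaded credentials;
# host defaults to the local socket
conn = pg.connect(database=dbConnInfo.database,
                  user=dbConnInfo.user,
                  password=dbConnInfo.password)
```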
```python
df = pd.read_sql(' select * from oridata ', conn)
df.head()
```
|   | stockid  | dt         | jbcj     |
|---|----------|------------|----------|
| 0 | SH600000 | 2018-01-02 | 8954.56  |
| 1 | SH600000 | 2018-01-03 | 10182.75 |
| 2 | SH600000 | 2018-01-04 | 7726.19  |
| 3 | SH600000 | 2018-01-05 | 8126.44  |
| 4 | SH600000 | 2018-01-08 | 9226.89  |
Appendix
Below please find the Python code to prepare the database credentials file:
```python
import pickle

# note: pickling a class object stores a reference by name, so a class
# with the same name must be resolvable when the file is loaded back
class dbinfo:
    database = "your DB Name"
    user = "your DB user name"
    password = "your password"

with open('db.info', 'wb') as output:
    pickle.dump(dbinfo, output)
```
You may also add the file to your .gitignore to avoid it being pushed into a public repo:
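The entry is simply the credentials file name:

```
# .gitignore
db.info
```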