This post is part of a 5-part series on the foundations of using the Pandas library for data science. In this post, I will touch on the basics of using Pandas. I assume you have already installed Python and Pandas on your computer. If not, I would suggest downloading and installing Anaconda, or simply typing this into your shell/terminal:
conda install -c anaconda python
Once Anaconda is installed, simply run the following to install Pandas:
conda install pandas
Before using Pandas, you need to import the library:
import pandas as pd
1. Data Structures
In the Pandas library, there are two main data structures: Series and DataFrame.
Series: a one-dimensional object that contains a sequence of values and their data labels (the index). The basic syntax for creating one is pd.Series(data, index=index).
You can create a Pandas Series from, for example, a plain Python list of values.
In this case, notice that we have only provided the data. The index is optional and you can leave it out, in which case Pandas numbers the values starting from zero. Providing an index is handy when you want your own labels, be it numbers or any other alphanumeric sequence.
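As a minimal sketch of the two cases described above (the values and labels are made up for illustration), the first Series below is built from data only, while the second passes a custom index:

```python
import pandas as pd

# Data only: Pandas assigns a default integer index (0, 1, 2, ...)
s_default = pd.Series([10, 20, 30])

# Data plus a custom index of string labels (hypothetical labels)
s_labeled = pd.Series([10, 20, 30], index=["a", "b", "c"])

print(s_default)
print(s_labeled)

# Values can be looked up by their label
print(s_labeled["b"])  # → 20
```

With a custom index you can select values by a meaningful label instead of by position.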
DataFrame: holds two-dimensional data consisting of rows and columns. It is tabular data, like an Excel spreadsheet. Columns can be of different types, as we will see (strings, numeric, datetime, etc.).
The syntax for creating a DataFrame is similar to creating a Series: pd.DataFrame(data, index=index, columns=columns).
To create a DataFrame, you have many options. You might read it from files (Excel, CSV, etc.), from a database or from the web. Let us start by creating a DataFrame from scratch to understand the basics.
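One common way to build a DataFrame from scratch is to pass a dictionary of columns. The sketch below assumes made-up city names and figures, purely for illustration:

```python
import pandas as pd

# Each dictionary key becomes a column name; each value holds that column's data.
# The city names and figures are placeholder values for illustration only.
data = {
    "city": ["Alpha", "Beta", "Gamma"],
    "population": [1200000, 350000, 87000],
    "founded": pd.to_datetime(["1850-01-01", "1902-06-15", "1977-03-30"]),
}
df = pd.DataFrame(data)

print(df)
print(df.dtypes)  # each column keeps its own type
```

Printing df.dtypes shows that a single DataFrame can mix string, integer and datetime columns.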
However, most of the time, you will read data from a file. So let us see how to read different types of data in Pandas. First, we can read CSV files in Pandas by using pd.read_csv().
In order to successfully load the data, you need to know the directory of the data. In this case we have used this dataset of world cities and their population. Download the data, create a directory and give it a name (in this example, the directory is Data and the name of the CSV file is worldcitiespop.csv).
We have saved our data as csv_df. To display the data you can simply type the variable name (csv_df), but in this case we have only shown the first 7 rows by using the csv_df.head() method, which takes the number of rows as an argument and defaults to 5.
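Here is a rough sketch of the loading step. So that it runs without downloading anything, it reads a small in-memory sample instead of the real file; the column names and all the rows are my assumptions for illustration, not the actual contents of worldcitiespop.csv:

```python
import io
import pandas as pd

# With the downloaded file in place, the call would look like:
#   csv_df = pd.read_csv("Data/worldcitiespop.csv")
# Here a small in-memory sample stands in for the file. The column
# names and all rows are assumptions made for illustration.
sample_csv = io.StringIO(
    "Country,City,Population,Latitude,Longitude\n"
    "aa,alpha,1200000,10.5,20.5\n"
    "bb,beta,,11.0,21.0\n"
    "cc,gamma,87000,12.5,22.5\n"
)
csv_df = pd.read_csv(sample_csv)

# head(n) returns the first n rows; without an argument it returns 5
print(csv_df.head(2))
```

Note that the empty Population field in the second row is read as a missing value (NaN).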
Although CSV is most likely the data format that you will use most often, there are times when you might need to read data from other sources. Pandas has many options for reading different data formats. You can read Excel files, JSON files, HTML tables directly from the web, as well as many other types. They work much like the method we have seen above for reading CSV files. But let us take another example: reading data from a database. I use PostgreSQL here, but other databases work in a similar way.
Before reading the database tables into a Pandas DataFrame, you need to set up the connection. Here we use psycopg2 and create a connection by calling psycopg2.connect() with the database name, user and password. Then we create a cursor and use it to execute an SQL query that selects all attributes from a table called cities.
You can now create a Pandas DataFrame by fetching the rows from the cursor created above. We also pass the optional columns parameter to name our DataFrame columns.
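The connect, cursor, execute and fetch steps can be sketched as follows. Since psycopg2 needs a running PostgreSQL server, this sketch swaps in Python's built-in sqlite3 module with an in-memory database, which follows the same connection and cursor pattern. The contents of the cities table and its column names are made up for illustration:

```python
import sqlite3
import pandas as pd

# sqlite3 stands in for psycopg2 here (an assumption made so the sketch runs
# without a database server); with PostgreSQL the connection would come from
# psycopg2.connect(...) with your database name, user and password.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Build a tiny placeholder "cities" table
cur.execute("CREATE TABLE cities (name TEXT, country TEXT, population INTEGER)")
cur.executemany(
    "INSERT INTO cities VALUES (?, ?, ?)",
    [("Alpha", "aa", 1200000), ("Beta", "bb", 350000)],
)

# Select all attributes from the table, then fetch the rows into a DataFrame
cur.execute("SELECT * FROM cities")
db_df = pd.DataFrame(cur.fetchall(), columns=["name", "country", "population"])

conn.close()
print(db_df)
```

Without the columns parameter, the DataFrame columns would simply be numbered 0, 1, 2.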
Lastly, reading data directly from websites is not always straightforward and often requires some web scraping, but Pandas makes it easy to read HTML tables directly into DataFrames. Here we will read data directly from Wikipedia into a DataFrame by using pd.read_html().
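A small sketch of how pd.read_html() behaves: it returns a list of DataFrames, one per table it finds. To avoid depending on a live page, the example parses a made-up HTML snippet instead of a Wikipedia URL. It also assumes an optional HTML parser such as lxml is installed, and falls back to an empty list if not:

```python
from io import StringIO
import pandas as pd

# For a live page you would pass its URL instead, e.g. pd.read_html(url).
# This made-up HTML snippet stands in for a web page so no network is needed.
html = """
<table>
  <tr><th>City</th><th>Population</th></tr>
  <tr><td>Alpha</td><td>1200000</td></tr>
  <tr><td>Beta</td><td>350000</td></tr>
</table>
"""

try:
    # read_html returns a list of DataFrames, one per <table> found
    tables = pd.read_html(StringIO(html))
except ImportError:
    # read_html relies on an optional parser such as lxml
    tables = []

if tables:
    html_df = tables[0]
    print(html_df)
```

On a real page with several tables, you would pick the one you need from the list by its position.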
2. Basic Data Exploration and Descriptive Statistics
One fundamental routine in data science is an early sanity check of the data at hand. Here are some basic functions that will allow you to have a quick look at your data and spot discrepancies.
I normally look first at the number of rows and columns in my data, using csv_df.shape.
csv_df.shape returns a tuple containing (rows, columns). In this example, we have 3173958 rows and 7 columns. In case you want to access only one of them, rows are at index 0 and columns are at index 1.
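To illustrate on a toy DataFrame (placeholder data standing in for csv_df), shape can be unpacked or indexed like any tuple:

```python
import pandas as pd

# Placeholder data standing in for csv_df
csv_df = pd.DataFrame({
    "City": ["alpha", "beta", "gamma"],
    "Population": [1200000.0, None, 87000.0],
})

n_rows, n_cols = csv_df.shape  # unpack the (rows, columns) tuple
print(csv_df.shape)     # → (3, 2)
print(csv_df.shape[0])  # number of rows → 3
print(csv_df.shape[1])  # number of columns → 2
```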
Second, you can use the csv_df.info() method to see the types of the columns and some other information about the DataFrame.
Here, csv_df.info() returns information about the columns and their types as well as the range of the index. In this particular data, 3 of the 7 columns are of float type while the other 4 are objects.
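A quick sketch of both calls on placeholder data (again standing in for csv_df; the exact type names printed for text columns can vary between Pandas versions):

```python
import pandas as pd

# Placeholder data standing in for csv_df
csv_df = pd.DataFrame({
    "Country": ["aa", "bb", "cc"],
    "City": ["alpha", "beta", "gamma"],
    "Population": [1200000.0, None, 87000.0],
})

# info() prints the index range, each column's non-null count and dtype,
# and the approximate memory usage
csv_df.info()

# dtypes gives just the column types as a Series
print(csv_df.dtypes)
```

The non-null counts in the info() output are a quick way to spot columns with missing values.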
If you want to have a quick look at some descriptive statistics of your data, look no further than the .describe() method.
Remember that we had only 3 float columns in our dataset. By default, .describe() summarises only these numeric columns, giving you descriptive statistics about them including their count, mean, standard deviation, minimum, quartiles and maximum.
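As a small illustration on made-up numbers, the result of .describe() is itself a DataFrame, so individual statistics can be pulled out by name:

```python
import pandas as pd

# Placeholder data standing in for csv_df
csv_df = pd.DataFrame({
    "City": ["alpha", "beta", "gamma", "delta"],
    "Population": [1000.0, 2000.0, 3000.0, None],
    "Latitude": [10.0, 20.0, 30.0, 40.0],
})

# By default describe() summarises only the numeric columns and
# ignores missing values when counting
stats = csv_df.describe()
print(stats)

print(stats.loc["max", "Population"])   # → 3000.0
print(stats.loc["mean", "Population"])  # → 2000.0
```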
Well, you can see the maximum population of a city is 31 million, but which city is it? The next part will deal with data manipulation, transformation and cleaning. To whet your appetite, here is the code to get the city with the maximum population.
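One way to do this, which may differ from the code in the notebook, is to combine idxmax() with loc. The sketch uses placeholder rows and assumes the columns are named City and Population:

```python
import pandas as pd

# Placeholder data standing in for csv_df; column names are assumptions
csv_df = pd.DataFrame({
    "City": ["alpha", "beta", "gamma"],
    "Population": [1200000.0, None, 87000.0],
})

# idxmax() returns the index label of the largest value (NaN is skipped);
# loc then selects that whole row
largest = csv_df.loc[csv_df["Population"].idxmax()]
print(largest["City"])  # → alpha
```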
Stay tuned for the next part. You will learn data cleaning, one of the most crucial tasks in data science.
The Jupyter notebook is available at this link.