# Pandas DataFrame by Example

Last updated:Pandas is a very versatile tool for data analysis in Python and you must definitely know how to do, at the bare minimum, simple operations on it.

View this notebook for live examples of techniques seen here

Here are a couple of examples to help you quickly get productive using Pandas' main data structure: **the DataFrame**

Pandas version used: 0.18.0

So here are some of the most common things you'll want to do with a `DataFrame`

:

### 1) Read a CSV file into a DataFrame

```
# needed all around
import pandas as pd
df = pd.read_csv("data.csv",sep=",")
```

This is what our sample dataset looks like:

### 1.5) Write a DataFrame into a CSV file

```
df.to_csv("data-out.csv")
```

### 2) Select rows based on the value of a column

```
# people whose "age" is greater than 30
df[df["age"] > 30]
# people who have more pets than children
df[ df["num_pets"] > df[ "num_children"] ]
# people older than 40 who own pets
df[ (df["age"] > 40) & (df["num_pets"] > 0) ]
```

### 3) Drop a column

```
# df itself is not modified; a copy is returned instead
df.drop(["age","num_children"],axis=1)
```

### 4) Apply an aggregate function to every column

For every numeric column, what is the average over *all rows*?

Note that our resultset contains 3 rows (one for each numeric column in the original dataset).

```
df[["age","num_pets","num_children"]].apply(lambda row: np.mean(row),axis=0)
# >>
# age 36.666667
# num_pets 1.500000
# num_children 1.333333
```

### 5) Apply an aggregate function to every row

The same as above, but note that `axis=1`

is used.

For *every row*, what is the sum of all numeric columns?

```
# note that our resultset will have 6 rows (one for every row in the
# original dataset)
df[["age","num_pets","num_children"]].apply(lambda row: np.sum(row),axis=1)
# >>
# 0 25
# 1 84
# 2 22
# 3 25
# 4 47
# 5 34
```

### 6) Transform a column

For numerical data, it's straightforward:

Remember:

`df[['colname']]`

returns a new DataFrame, while`df['colname']`

returns a Series

```
# returns a new dataframe where every age
# is double its old value
df[["age"]].apply(lambda value: value*2)
```

For text data (or otherwise non-numerical) you need to cast:

```
# returns a new dataframe where every name is
# the old name to uppercase
df[["name"]].apply(lambda value: value.str.upper())
```

### 6.5) Another way to transform a column is to use map()

```
# convert all names to uppercase
df['name'] = df['name'].map(lambda name: name.upper())
```

### 7) Add a new column based on a another one

It's very common to *add* new columns using derived data. You just need to assign to a new column:

`assign()`

is the immutable version of this method

```
# df now has an extra column at the end of the dataframe
# which amount to the sum of num_pets and num_children
# for that row
df["pets_and_children"] = df["num_pets"] + df["num_children"]
```

You can also use custom elementwise functions to help create the new column:

```
# df now has an extra column at the end, which is just an uppercase
# version of the name (the original column is untouched)
df["name_uppercase"] = df[["name"]].apply(lambda name: name.str.upper())
```

### 8) Shuffle the rows in a DataFrame

Method `reindex()`

can be used to reindex your data and, if you pass random indices, you'll have shuffled your data:

```
# using a random permutation of the original indices
df = df.reindex(np.random.permutation(df.index))
```

### 9) Iterate over all rows in a DataFrame

Using `for ... in`

. This is similar to iterating over Python dictionaries (think `iteritems()`

or `items()`

in Python 3):

```
for index,row in df.iterrows():
print("{0} has name: {1}".format(index,row["name"]))
# >>
# 0 has name: john
# 1 has name: mary
# 2 has name: peter
# 3 has name: jeff
# 4 has name: bill
# 5 has name: lisa
```

### 10) Randomly sample rows from a DataFrame

Sampling is commonly used in Machine Learning tasks and many others.

```
# sample 4 rows from df
random_indices = np.random.choice(df.index.values, 4, replace=False)
# ix allows you to retrieve rows by their numeric indices
sampled_df = df.ix[random_indices]
```

### 11) Sort a DataFrame according to column values

This is pretty self-explanatory:

```
# sort by "age" column, larger to smaller
df.sort_values( "age",ascending=False )
```

You can also use multiple columns to break ties:

```
# sort by column "monthly_salary" descending and in case there
# are ties, use "age" ascending to sort those
df.sort_values( ["num_pets","age"], ascending=[False,True] )
```

### 12) Plot data using **matplotlib**

You need to have the

matplotlibmodule installed for this!

You can plot data *directly* from your DataFrame using the `plot()`

method:

```
# a scatter plot comparing num_children and num_pets
df.plot(kind='scatter',x='num_children',y='num_pets')
```

A lot of other types of plot are available. See all of them here

```
# a simple line plot
df.plot(kind='line',x='name',y='age')
```

### 13) Drop rows based on some predicate

```
# remove all rows where num_pets is zero
df = df.drop(df['num_pets'] == 0)
```

### 14) Perform complex selections using lambdas

If you need something more complex that the regular `df[df['somecolumn'] == 'somevalue']`

, you can use `apply`

with a lambda function too:

```
# select rows whose name begins with the letter 'j'
df[df.apply(lambda row: row['name'].startswith('j'),axis=1)]
```

**See also**