statslink

linking statistics to all

Some Python interview questions (Part III)

In Part II we talked about decorators and how they can be used to modify the behavior of a function or class without repeating code. Today we will continue to the final interview question:

  1. What is the difference between *args and **kwargs?
  2. What is a decorator?
  3. What is iterrows() used for?

The method iterrows() comes from pandas, a popular data science library. The core data structure in pandas is the DataFrame, a two-dimensional table of rows and columns. The rows of a DataFrame can be indexed with numbers or strings, and columns can be of different types such as integer, float, or string. DataFrames are suited to real-world data because they allow heterogeneous data types across columns (whereas a NumPy array holds a single, homogeneous data type).

Since pandas is an external library, you need to import it first, conventionally under the alias pd. Suppose you want a DataFrame consisting of three people’s names and their ages. First create a Python dictionary called mydict with the keys ‘Name’ and ‘Age’, each mapped to a list of corresponding values (strings for one, integers for the other). To convert this dictionary to a pandas DataFrame, use pd.DataFrame. DataFrame is a class, and by passing in mydict you create an instance of it:

import pandas as pd

# Create a simple DataFrame
mydict = {'Name': ['Alan', 'Bob', 'Charlie'], 'Age': [28, 30, 40]}
df = pd.DataFrame(mydict)

Each column in the DataFrame is called a Series, which is essentially a (1D) array. Each element of a Series has a label and a position. The interesting thing is that pandas uses NumPy to store the data. A DataFrame consists of one or more Series that all share the same index. There are two main ways to index a DataFrame: by label or by numeric position. To index a column by label, pass in the label string directly, or pass in a list of strings to index multiple columns, noting that order matters. To index a column numerically, you need to use iloc (the i stands for integer), which lets you index by position.

#indexes the first column using the label
df['Name']

0       Alan
1        Bob
2    Charlie
Name: Name, dtype: object

#if you want multiple columns, pass in a list
df[['Age','Name']]

   Age     Name
0   28     Alan
1   30      Bob
2   40  Charlie

#indexes the second column using the numeric index
#note that 1 is the second column and 0 is the first
df.iloc[:,1]

0    28
1    30
2    40
Name: Age, dtype: int64

#indexes the second row 
df.iloc[1,:]

Name    Bob
Age      30
Name: 1, dtype: object

#to see all the inferred data types use attribute dtypes
df.dtypes

Name    object
Age      int64
dtype: object
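
The points above about Series, the shared index, and NumPy storage can be checked directly. The following is a minimal sketch; the names ages and arr are chosen here for illustration and are not part of the original example.

```python
import numpy as np
import pandas as pd

mydict = {'Name': ['Alan', 'Bob', 'Charlie'], 'Age': [28, 30, 40]}
df = pd.DataFrame(mydict)

# Each column of the DataFrame is a Series
ages = df['Age']
print(type(ages))

# The values of a numeric column can be obtained as a NumPy array
print(type(ages.to_numpy()))

# Every column shares the DataFrame's row index
print(ages.index.equals(df.index))

# A NumPy array holds a single dtype, so mixing text and numbers
# converts every element to a string
arr = np.array(['Alan', 28])
print(arr.dtype.kind)  # 'U' is NumPy's code for Unicode strings
```

The DataFrame keeps a separate type per column, which is why df.dtypes lists one entry for each column.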

As a side note, DataFrames are column-first, which means df['Name'] looks for a column labeled ‘Name’. No special attribute is needed to index columns. However, to index rows we need either .loc for label-based indexing or .iloc for position-based indexing.

(Row) indexing DataFrames is similar to indexing NumPy arrays, since pandas uses NumPy under the hood. The indexing order is [rows, columns]. With .iloc, index with an integer, a list of integers, or a slice. Rather than typing out a whole sequence, you can select an entire row or column with the : symbol. Positions start at 0, so for example [:,0] means give me all rows in the first column, [0,:] all columns in the first row, and [:,:] all rows and columns. The slice syntax is start:stop:step, where stop is exclusive and the default step is 1, so [0:1:1] is the same as [0:1], which excludes position 1 and therefore selects only position 0, the first element.

If no column indexer is given, .iloc selects all columns, so df.iloc[0:2] is the same as df.iloc[0:2,:], which means give me the first and second rows. Interestingly, df.loc[0:2] works too, not because it is position-based, but because by default the row labels are integers. One difference is that .loc slices include the stop label, so df.loc[0:2] returns rows 0, 1, and 2. You can try this yourself: if you rename the labels via df = pd.DataFrame(mydict, index=['row1', 'row2', 'row3']), then df.loc[0:2] stops working, while df.iloc[0:2] is unaffected.
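
To make the difference between position-based and label-based slicing concrete, here is a small sketch. The names by_position, by_label, and df_named are illustrative, and the exact exception raised by the last call is an assumption based on recent pandas versions.

```python
import pandas as pd

mydict = {'Name': ['Alan', 'Bob', 'Charlie'], 'Age': [28, 30, 40]}
df = pd.DataFrame(mydict)

# .iloc is position-based and excludes the stop value: rows 0 and 1
by_position = df.iloc[0:2]
print(len(by_position))

# .loc is label-based and includes the stop label: rows 0, 1 and 2
by_label = df.loc[0:2]
print(len(by_label))

# With string row labels, .loc expects those labels
df_named = pd.DataFrame(mydict, index=['row1', 'row2', 'row3'])
print(df_named.loc['row1':'row2'])

# .iloc still works because it ignores the labels
print(df_named.iloc[0:2])

# An integer slice no longer matches the string labels
# (assumption: recent pandas versions raise TypeError here)
try:
    df_named.loc[0:2]
except (TypeError, KeyError) as err:
    print('df_named.loc[0:2] failed with', type(err).__name__)
```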

Now that we’ve gone over pandas DataFrames and how to index rows, let’s answer the question: what is iterrows()? It is a DataFrame method, typically used with a loop, that lets you iterate over the rows of a DataFrame as (index, row) pairs. The first question that comes to mind is why can’t we use range? You certainly can loop through a DataFrame with range, but iterrows() is more convenient because each iteration automatically hands you the index label and the row data as a Series object.

It’s not necessary to use iterrows() in a for loop, although that is what it was designed for. iterrows() returns an iterator, so to access the first index and row of the DataFrame you can use next(), a built-in function that takes an iterator and returns its next item, here a tuple of the index and the row.

#store the output of iterrows() as an object
myiter = df.iterrows()


#first index 
print(next(myiter))

(0, Name    Alan
Age       28
Name: 0, dtype: object)

#second index
print(next(myiter))

(1, Name    Bob
Age      30
Name: 1, dtype: object)
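
The iterator eventually runs out of rows. As a minimal sketch of what happens at that point (the names first_index and first_row are illustrative):

```python
import pandas as pd

mydict = {'Name': ['Alan', 'Bob', 'Charlie'], 'Age': [28, 30, 40]}
df = pd.DataFrame(mydict)

myiter = df.iterrows()

# Each call to next() returns one (index, row) tuple, which can be unpacked
first_index, first_row = next(myiter)
print(first_index, first_row['Name'])

next(myiter)  # second pair (Bob)
next(myiter)  # third pair (Charlie)

# The iterator is now exhausted. A further next(myiter) would raise
# StopIteration, unless a default is passed as the second argument.
print(next(myiter, None))
```

A for loop handles this end-of-iteration signal for you, which is one reason it is the usual way to consume iterrows().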

Although we can continue this procedure until the last row and index, it’s more convenient to loop through iterrows() with a for loop. Since iterrows() yields both the index and the row (in that order), think of it as a sequence of tuples (ordered collections of elements) to loop over. In this case we have three rows of data, one per person, so the following for loop visits all three index and row pairs. (True or false: iterrows() loops over the name and age. False: iterrows() loops over the index and row.)

#loop through iterrows
for myindex, myrow in df.iterrows():
    print(myindex, myrow)

0 Name    Alan
Age       28
Name: 0, dtype: object
1 Name    Bob
Age      30
Name: 1, dtype: object
2 Name    Charlie
Age          40
Name: 2, dtype: object
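
For comparison with the range approach mentioned earlier, the sketch below collects the names both ways. The list names names_with_range and names_with_iterrows are illustrative.

```python
import pandas as pd

mydict = {'Name': ['Alan', 'Bob', 'Charlie'], 'Age': [28, 30, 40]}
df = pd.DataFrame(mydict)

# range-based loop: each row has to be looked up by position
names_with_range = []
for i in range(len(df)):
    row = df.iloc[i]
    names_with_range.append(row['Name'])

# iterrows() loop: the index and row arrive already unpacked
names_with_iterrows = []
for myindex, myrow in df.iterrows():
    names_with_iterrows.append(myrow['Name'])

# Both loops visit the same rows in the same order
print(names_with_range == names_with_iterrows)
```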

Now we have looped over all three index and row pairs in the DataFrame and printed their contents without manually invoking next(). Remember that myrow is a Series object, and after the loop it holds only the last row that was assigned to it. So len(myrow) returns 2, not because it’s counting the index and the row, but because that row has two elements, namely Name and Age. Since Charlie is the last person iterated, myrow['Name'] returns 'Charlie'. The positional equivalent is myrow.iloc[0]; plain myrow[0] relies on a positional fallback that recent pandas versions have deprecated.
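
A short sketch to confirm what the loop variables hold once the loop has finished:

```python
import pandas as pd

mydict = {'Name': ['Alan', 'Bob', 'Charlie'], 'Age': [28, 30, 40]}
df = pd.DataFrame(mydict)

for myindex, myrow in df.iterrows():
    pass  # nothing to do here; we only want the final values

print(myindex)        # label of the last row
print(len(myrow))     # number of elements in the row (one per column)
print(myrow['Name'])  # access by label
print(myrow.iloc[0])  # access by position
```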

For your reference, Python.org includes some pretty thorough documentation and learning modules on basic Python concepts.

This concludes the series on Python interview questions. If you are interested in reading more of these posts or want a more structured walkthrough of key Python concepts, comment below!