CHAPTER
SIXTEEN
WORKING WITH MISSING DATA
In this section, we will discuss missing (also referred to as NA) values in pandas.
Note: The choice of using NaN internally to denote missing data was largely for simplicity and performance reasons.
It differs from the MaskedArray approach of, for example, scikits.timeseries. We are hopeful that NumPy
will soon be able to provide a native NA type solution (similar to R) performant enough to be used in pandas.
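To make the contrast in the note concrete, the following is a minimal sketch, not taken from the original text, that stores the same values once with a NaN sentinel in a pandas Series and once in a NumPy masked array. The sample values and variable names are illustrative assumptions.

```python
import numpy as np
import numpy.ma as ma
import pandas as pd

# Illustrative data (an assumption): the middle value is missing.
values = [1.0, np.nan, 3.0]

# NaN approach: the missing marker lives inside the float array itself.
s = pd.Series(values)
nan_flags = s.isnull().tolist()

# Masked-array approach: the data are paired with a separate boolean mask.
m = ma.masked_invalid(np.array(values))
mask_flags = m.mask.tolist()

print(nan_flags)
print(mask_flags)
```

Both representations flag the same position as missing. The difference is that the Series needs no extra mask array alongside the data.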
See the cookbook for some advanced strategies.
16.1 Missing data basics
16.1.1 When / why does data become missing?
Some might quibble over our usage of missing. By “missing” we simply mean null or “not present for whatever
reason”. Many data sets simply arrive with missing data, either because it exists and was not collected or it never
existed. For example, in a collection of financial time series, some of the time series might start on different dates.
Thus, values prior to the start date would generally be marked as missing.
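The time series case can be sketched as follows. This is a hedged illustration only: the dates, prices, and the names series_a, series_b, and prices are hypothetical and do not come from the original text.

```python
import pandas as pd

# Hypothetical series: series_b starts two business days after series_a.
dates_a = pd.date_range("2000-01-03", periods=5, freq="B")
dates_b = pd.date_range("2000-01-05", periods=3, freq="B")

series_a = pd.Series([10.0, 10.5, 10.2, 10.8, 11.0], index=dates_a)
series_b = pd.Series([20.0, 20.4, 19.9], index=dates_b)

# Building a DataFrame aligns both series on the union of their indexes.
# Dates before series_b starts have no value, so they are filled with NaN.
prices = pd.DataFrame({"a": series_a, "b": series_b})

print(prices)
print(prices["b"].isnull().sum())  # count of missing values in column "b"
```

Here column "b" is missing on the two dates that precede its start date, while column "a" is complete.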
In pandas, one of the most common ways that missing data is introduced into a data set is by reindexing. For example:
In [1]: df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
   ...:                   columns=['one', 'two', 'three'])
   ...:
In [2]: df['four'] = 'bar'
In [3]: df['five'] = df['one'] > 0
In [4]: df
Out[4]:
        one       two     three four   five
a  0.469112 -0.282863 -1.509059  bar   True
c -1.135632  1.212112 -0.173215  bar  False
e  0.119209 -1.044236 -0.861849  bar   True
f -2.104569 -0.494929  1.071804  bar  False
h  0.721555 -0.706771 -1.039575  bar   True
In [5]: df2 = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
In [6]: df2
Out[6]:
        one       two     three four   five