pandas: powerful Python data analysis toolkit, Release 0.18.1
10.5.4 Discretization and quantiling
Continuous values can be discretized using the cut() (bins based on values) and qcut() (bins based on sample
quantiles) functions:
In [113]: arr = np.random.randn(20)
In [114]: factor = pd.cut(arr, 4)
In [115]: factor
Out[115]:
[(-0.645, 0.336], (-2.61, -1.626], (-1.626, -0.645], (-1.626, -0.645], (-1.626, -0.645], ..., (0.336, 1.316], (0.336, 1.316], (0.336, 1.316], (0.336, 1.316], (-2.61, -1.626]]
Length: 20
Categories (4, object): [(-2.61, -1.626] < (-1.626, -0.645] < (-0.645, 0.336] < (0.336, 1.316]]
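When an integer is passed as the number of bins, cut() splits the observed range into equal-width intervals. The sketch below is illustrative only: the sample data and the variable names are assumptions, not part of the original session. It uses the retbins=True option to return the computed edges so they can be inspected or reused.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data spanning the range 0..8.
data = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])

# Ask for 4 equal-width bins and also return the edges that were computed.
binned, edges = pd.cut(data, 4, retbins=True)

# There is one more edge than there are bins.
print(len(edges))

# The lowest edge is nudged slightly below the minimum so that the
# minimum value itself falls inside the first right-closed interval.
print(edges)

# Integer bin index assigned to each input value.
print(binned.codes)
```

The returned edges can be passed back to cut() as an explicit bin list, which is one way to apply the same binning to a second data set.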
In [116]: factor = pd.cut(arr, [-5, -1, 0, 1, 5])
In [117]: factor
Out[117]:
[(-1, 0], (-5, -1], (-1, 0], (-5, -1], (-1, 0], ..., (0, 1], (1, 5], (0, 1], (0, 1], (-5, -1]]
Length: 20
Categories (4, object): [(-5, -1] < (-1, 0] < (0, 1] < (1, 5]]
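As the output above suggests, the intervals are closed on the right by default, so a value equal to an edge lands in the lower bin. The following minimal sketch, using made-up values and hypothetical label names, shows that behaviour and how the labels argument can replace the interval names.

```python
import numpy as np
import pandas as pd

# Hypothetical values, several of which sit exactly on a bin edge.
values = np.array([-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 3.0])
edges = [-5, -1, 0, 1, 5]

# Default right=True: each bin is (left, right], so -1.0 belongs to (-5, -1].
binned = pd.cut(values, edges)
print(list(binned.codes))  # [0, 0, 1, 1, 2, 2, 3]

# labels= substitutes readable names for the interval notation.
# The label strings here are illustrative assumptions.
named = pd.cut(values, edges, labels=["very low", "low", "high", "very high"])
print(list(named))
```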
qcut() computes sample quantiles. For example, we could slice up some normally distributed data into equal-size
quartiles like so:
In [118]: arr = np.random.randn(30)
In [119]: factor = pd.qcut(arr, [0, .25, .5, .75, 1])
In [120]: factor
Out[120]:
[(-0.139, 1.00736], (1.00736, 1.976], (1.00736, 1.976], [-1.0705, -0.439], [-1.0705, -0.439], ..., (1.00736, 1.976], [-1.0705, -0.439], (-0.439, -0.139], (-0.439, -0.139], (-0.439, -0.139]]
Length: 30
Categories (4, object): [[-1.0705, -0.439] < (-0.439, -0.139] < (-0.139, 1.00736] < (1.00736, 1.976]]
In [121]: pd.value_counts(factor)
Out[121]:
(1.00736, 1.976]     8
[-1.0705, -0.439]    8
(-0.139, 1.00736]    7
(-0.439, -0.139]     7
dtype: int64
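The difference between the two functions is easiest to see on skewed data. This is a small sketch with invented data, not taken from the session above: qcut() is given an integer number of quantiles (equivalent here to passing [0, .25, .5, .75, 1]) and yields bins with equal counts, while cut() with the same number of equal-width bins puts almost everything in the first one.

```python
import numpy as np
import pandas as pd

# Hypothetical skewed data: mostly small values plus two large outliers.
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100, 1000])

# Quantile-based bins: edges are chosen so each bin holds the same count.
by_quantile = pd.qcut(data, 4)
quantile_counts = np.bincount(by_quantile.codes, minlength=4)
print(quantile_counts)  # [3 3 3 3]

# Value-based bins: four equal-width intervals across the range 1..1000.
by_value = pd.cut(data, 4)
value_counts = np.bincount(by_value.codes, minlength=4)
print(value_counts)  # [11  0  0  1]
```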
We can also pass infinite values to define the bins:
In [122]: arr = np.random.randn(20)
In [123]: factor = pd.cut(arr, [-np.inf, 0, np.inf])
In [124]: factor
Out[124]:
[(-inf, 0], (0, inf], (0, inf], (0, inf], (-inf, 0], ..., (-inf, 0], (0, inf], (-inf, 0], (-inf, 0], (0, inf]]
Length: 20
Categories (2, object): [(-inf, 0] < (0, inf]]
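Infinite edges are useful because values that fall outside every bin are otherwise returned as missing. A brief sketch under assumed sample data and label names:

```python
import numpy as np
import pandas as pd

# Hypothetical data with values beyond the finite edges used below.
values = np.array([-10.0, 0.5, 10.0])

# With finite edges only, out-of-range values get no bin (code -1, shown as NaN).
finite = pd.cut(values, [-5, -1, 0, 1, 5])
print(list(finite.codes))  # [-1, 2, -1]

# Open-ended bins cover the whole real line, so every value is assigned.
# The label strings are illustrative assumptions.
open_ended = pd.cut(values, [-np.inf, 0, np.inf], labels=["non-positive", "positive"])
print(list(open_ended))  # ['non-positive', 'positive', 'positive']
```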