Studying a bit of data wrangling with Pandas

Going through the Python data wrangling book

Had some weird behavior in my sklearn pipelines yesterday, which was due to the strange way pandas sometimes copies and sometimes uses references (apparently even when I explicitly tell it to deep copy). So, to clarify it a bit, I decided to take a short detour into pandas internals.
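One documented way a "deep" copy can still share state is with object-dtype columns: `DataFrame.copy(deep=True)` copies the data, but Python objects stored in the cells (lists, dicts) are not copied recursively. I can't say for sure this is what bit me in the pipeline, so the sketch below is only an illustration of that one case. The frame and the column names `id` and `tags` are made up.

```python
import copy

import pandas as pd

# Hypothetical frame: one plain numeric column, one column holding Python lists.
df = pd.DataFrame({"id": [1, 2], "tags": [["a"], ["b"]]})
dup = df.copy(deep=True)

# Plain values really are copied: changing the copy leaves the original alone.
dup.loc[0, "id"] = 100
original_id = int(df.loc[0, "id"])

# The lists inside the object column are shared between both frames,
# so an in-place mutation through the copy is visible in the original.
dup["tags"].iloc[0].append("changed")
original_tags = df["tags"].iloc[0]

# One possible workaround: copy the nested objects element by element.
safe = df.copy(deep=True)
safe["tags"] = df["tags"].map(copy.deepcopy)
```

After the `map(copy.deepcopy)` step, the lists in `safe` are separate objects from the ones in `df`.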

Pandas Series

A Series is a one-dimensional array-like object containing a sequence of values (of similar types to NumPy types) and an associated array of data labels, called its index.
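A minimal sketch of the two parts of a Series, the values and the index. The numbers and labels are arbitrary.

```python
import pandas as pd

# Values plus an explicit index of labels.
s = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

values = s.to_numpy()      # the underlying NumPy array
labels = list(s.index)     # the associated data labels

# Select a single value by its label rather than by position.
value_at_a = s["a"]
```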

A Series can also be thought of as a fixed-length, ordered dict, as it is a mapping of index values to data values. It can be used in many contexts where you might use a dict.
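To make the dict analogy concrete, here is a small sketch with made-up data: a Series built from a dict, a membership test (which checks the index, not the values), and the round trip back to a dict.

```python
import pandas as pd

# Made-up counts, keyed by label.
counts = {"apples": 3, "pears": 5, "plums": 2}
s = pd.Series(counts)

# `in` tests the index labels, just like `in` tests the keys of a dict.
has_pears = "pears" in s
has_kiwis = "kiwis" in s

# Convert back to a plain dict.
round_trip = s.to_dict()
```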

Both the Series object itself and its index have a name attribute, which integrates with other key areas of pandas functionality.
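A quick sketch of where the two name attributes show up. The names `count` and `letter` are just placeholders. When the Series is turned into a DataFrame with `reset_index`, the index name and the Series name become the column labels.

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=["a", "b", "c"], name="count")
s.index.name = "letter"

# Both names carry over as column labels.
frame = s.reset_index()
columns = list(frame.columns)
```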

DataFrame

The DataFrame has both a row and column index; it can be thought of as a dict of Series all sharing the same index.
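The "dict of Series" picture can be sketched directly, assuming two small made-up Series `x` and `y` that share the same index. Selecting a column gives back a Series on the row index, and selecting a row gives a Series on the column index.

```python
import pandas as pd

index = ["a", "b", "c"]
x = pd.Series([1, 2, 3], index=index)
y = pd.Series([10.0, 20.0, 30.0], index=index)

# Build the DataFrame from a dict of Series.
df = pd.DataFrame({"x": x, "y": y})

column = df["x"]     # a Series indexed by the row labels
row = df.loc["b"]    # a Series indexed by the column labels

same_index = column.index.equals(df.index)
row_labels = list(row.index)
```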

Essentials

Split-apply-combine

In the first stage of the process, data contained in a pandas object, whether a Series, DataFrame, or otherwise, is split into groups based on one or more keys that you provide. The splitting is performed on a particular axis of an object. For example, a DataFrame can be grouped on its rows (axis=0) or its columns (axis=1). Once this is done, a function is applied to each group, producing a new value. Finally, the results of all those function applications are combined into a result object. The form of the resulting object will usually depend on what’s being done to the data.
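A minimal sketch of the three stages on a toy frame (the column names `key` and `value` are made up). The first version lets `groupby` do everything in one go. The second spells out split, apply and combine by hand so the stages are visible. It only groups on rows; as far as I know, the `axis` argument of `groupby` is deprecated in recent pandas releases, so I left column-wise grouping out.

```python
import pandas as pd

df = pd.DataFrame({
    "key": ["a", "b", "a", "b", "c"],
    "value": [1, 2, 3, 4, 5],
})

# All three stages in a single call.
result = df.groupby("key")["value"].sum()

# The same thing step by step.
pieces = {}
for key, group in df.groupby("key"):      # split: one sub-frame per key
    pieces[key] = group["value"].sum()    # apply: reduce each group to one value
manual = pd.Series(pieces, name="value")  # combine: gather the results
```

Both `result` and `manual` end up as a Series indexed by the group keys.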

NOTE: continue in actual notebook to test the examples

Time-Series