TIDY DATA: A foundation for wrangling in pandas

In a tidy data set:
  Each variable is saved in its own column.
  Each observation is saved in its own row.

Tidy data complements pandas' vectorized operations. pandas will automatically preserve observations as you manipulate variables. No other format works as intuitively.

INGESTING AND RESHAPING DATA: Change the layout of a data set

gdf = cudf.read_csv(filename, delimiter=",", names=col_names, dtype=col_types)
  Read a delimited text file into a DataFrame.

df.pivot(columns='var', values='val')
  Spread rows into columns.

cudf.concat([gdf1, gdf2])
  Append rows of DataFrames.

gdf.add_column('name', gdf1['name'])
  Append columns of DataFrames.

gdf.sort_values('mpg')
  Order rows by values of a column (low to high).

Planned for Future Release
  Order rows by values of a column (high to low).

df.rename(columns={'y': 'year'})
  Rename the columns of a DataFrame.
  (Planned for Future Release)

gdf.sort_index()
  Sort the index of a DataFrame.

gdf.set_index()
  Return a new DataFrame with a new index.

gdf.drop_column('Length')
  Drop column from DataFrame.
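The reshaping entries above can be tried out with a short sketch. It uses pandas rather than cuDF so that it runs without a GPU; the data and the column names (model, mpg, y) are hypothetical, and whether each call is available in a given cuDF release is an assumption to check against that release.

```python
import pandas as pd

# Hypothetical sample data; the column names are illustrative only.
cars_a = pd.DataFrame({"model": ["a", "b"], "mpg": [30.0, 18.5], "y": [2001, 1999]})
cars_b = pd.DataFrame({"model": ["c"], "mpg": [24.0], "y": [2005]})

# Append the rows of two DataFrames and renumber the index.
cars = pd.concat([cars_a, cars_b], ignore_index=True)

# Order rows by the values of a column (low to high).
by_mpg = cars.sort_values("mpg")

# Rename one column, then drop another.
renamed = by_mpg.rename(columns={"y": "year"}).drop(columns=["model"])
```

Each call returns a new DataFrame, so cars_a and cars_b are left unchanged.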


     a   b   c
1    4   7   10
2    5   8   11
3    6   9   12

gdf = cudf.DataFrame([
    ("a", [4, 5, 6]),
    ("b", [7, 8, 9]),
    ("c", [10, 11, 12])
])
  Specify values for each column.

gdf = cudf.DataFrame.from_records(
    [[4, 7, 10],
     [5, 8, 11],
     [6, 9, 12]],
    index=[1, 2, 3],
    columns=['a', 'b', 'c'])
  Specify values for each row.

gdf.query('Length > 7')
  Extract rows that meet logical criteria.

df.drop_duplicates()
  Remove duplicate rows (only considers columns).

df.head(n)
  Select first n rows.

df.tail(n)
  Select last n rows.

df.sample(frac=0.5)
  Randomly select fraction of rows.

df.sample(n=10)
  Randomly select n rows.
  (Planned for Future Release)

df.iloc[10:20]
  Select rows by position.
  (Planned for Future Release)

gdf.nlargest(n, 'value')
  Select and order top n entries.

gdf.nsmallest(n, 'value')
  Select and order bottom n entries.

gdf[['width', 'length', 'species']]
  Select multiple columns with specific names.

gdf['width'] or gdf.width
  Select single column with specific name.

df.filter(regex='regex')
  Select columns whose name matches regular expression regex.

REGEX (REGULAR EXPRESSIONS) EXAMPLES
(Planned for Future Release)
'\.'                 Matches strings containing a period '.'
'Length$'            Matches strings ending with word 'Length'
'^Sepal'             Matches strings beginning with the word 'Sepal'
'^x[1-5]$'           Matches strings beginning with 'x' and ending with 1,2,3,4,5
'^(?!Species$).*'    Matches strings except the string 'Species'

gdf.loc[2:5, ['x2', 'x4']]
  Get rows from index 2 to index 5 from the 'x2' and 'x4' columns.

df.iloc[:, [1, 2, 5]]
  Select columns in positions 1, 2 and 5 (first column is 0).
  (Planned for Future Release)

df.loc[df['a'] > 10, ['a', 'c']]
  Select rows meeting logical condition, and only the specific columns.

LOGIC IN PYTHON (AND PANDAS)
<     Less than                 !=                           Not equal to
>     Greater than              df.column.isin(values)       Group membership
==    Equals                    pd.isnull(obj)               Is NaN
<=    Less than or equals       pd.notnull(obj)              Is not NaN
>=    Greater than or equals    &,|,~,^,df.any(),df.all()    Logical and, or, not, xor, any, all

METHOD CHAINING
Most pandas methods return a DataFrame so another pandas method can be applied to the result. This improves readability of code.

gdf = (cudf.from_pandas(df)
       .query('val >= 200'))
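As an illustration of the logic operators and of method chaining, here is a minimal sketch in pandas (chosen so it runs without a GPU). It reuses the small a/b/c table shown earlier; the filter thresholds are made up for the example, and equivalent cuDF behavior is assumed rather than confirmed.

```python
import pandas as pd

# The same 3x3 table used in the DataFrame creation examples.
df = pd.DataFrame({"a": [4, 5, 6], "b": [7, 8, 9], "c": [10, 11, 12]},
                  index=[1, 2, 3])

# Combine conditions with & (and); each condition needs its own parentheses
# when it contains a comparison operator.
mask = (df["a"] > 4) & df["c"].isin([11, 12])
subset = df.loc[mask, ["a", "c"]]

# Method chaining: query() returns a DataFrame, so nlargest() can follow it.
top = (df.query("b >= 8")
         .nlargest(1, "c"))
```

Wrapping the chain in parentheses lets each method sit on its own line without backslashes.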
gdf['w'].value_counts()
  Count number of rows with each unique value of variable.

len(gdf)
  # of rows in DataFrame.

Planned for Future Release
  # of distinct values in a column.

Planned for Future Release
  Basic descriptive statistics for each column (or GroupBy).

df.dropna()
  Drop rows with any column having NA/null data.
  (Planned for Future Release)

gdf['length'].fillna(value)
  Replace all NA/null data with value.

df.assign(Area=lambda df: df.Length*df.Height)
  Compute and append one or more new columns.
  (Planned for Future Release)

gdf['Volume'] = gdf.Length*gdf.Height*gdf.Depth
  Add single column.

pd.qcut(df.col, n, labels=False)
  Bin column into n buckets.
  (Planned for Future Release)

Pygdf provides a set of summary functions that operate on different kinds of pandas objects (DataFrame columns, Series, GroupBy) and produce single values for each of the groups. When applied to a DataFrame, the result is returned as a pandas Series for each column. Examples:

sum()                    Sum values of each object.
count()                  Count non-NA/null values of each object.
median()                 Median value of each object.
                         (Planned for Future Release)
quantile([0.25,0.75])    Quantiles of each object.
applymap(function)       Apply function to each object.
min()                    Minimum value in each object.
max()                    Maximum value in each object.
mean()                   Mean value of each object.
var()                    Variance of each object.
std()                    Standard deviation of each object.

pandas provides a large set of vector functions that operate on all columns of a DataFrame or a single selected column (cuDF Series). These functions produce vectors of values for each of the columns, or a single Series for the individual Series. Examples:

max(axis=1)                   Element-wise max.
min(axis=1)                   Element-wise min.
clip(lower=-10,upper=10)      Trim values at input thresholds.
abs()                         Absolute value.

gdf1          gdf2
x1  x2        x1  x3
A   1         A   T
B   2         B   F
C   3         D   T

gdf1.merge(gdf2, how='left', on='x1')
  Join matching rows from gdf2 to gdf1.
    x1  x2  x3
    A   1   T
    B   2   F
    C   3   NaN

gdf1.merge(gdf2, how='right', on='x1')
  Join matching rows from gdf1 to gdf2.
    x1  x2   x3
    A   1.0  T
    B   2.0  F
    D   NaN  T

gdf1.merge(gdf2, how='inner', on='x1')
  Join data. Retain only rows in both sets.
    x1  x2  x3
    A   1   T
    B   2   F

gdf1.merge(gdf2, how='outer', on='x1')
  Join data. Retain all values, all rows.
    x1  x2   x3
    A   1    T
    B   2    F
    C   3    NaN
    D   NaN  T

Apply row functions

Define a kernel function:
>>> import numpy as np
>>> def kernel(in1, in2, in3, out1, out2, extra1, extra2):
...     for i, (x, y, z) in enumerate(zip(in1, in2, in3)):
...         out1[i] = extra2 * x - extra1 * y
...         out2[i] = y - extra1 * z

Call the kernel with apply_rows:

>>> outdf = gdf.apply_rows(kernel,
...                        incols=['in1', 'in2', 'in3'],
...                        outcols=dict(out1=np.float64,
...                                     out2=np.float64),
...                        kwargs=dict(extra1=2.3, extra2=3.4))

FILTERING JOINS

Planned for Future Release
  All rows in adf that have a match in bdf.
    x1  x2
    A   1
    B   2

adf[~adf.x1.isin(bdf.x1)]
  All rows in adf that do not have a match in bdf.
    x1  x2
    C   3

GROUP DATA

gdf.groupby("col")
  Return a GroupBy object, grouped by values in column named "col".

df.groupby(level="ind")
  Return a GroupBy object, grouped by values in index level named "ind".
  (Planned for Future Release)
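The filtering joins and the groupby call can be sketched as follows. This is a pandas version (so it runs without a GPU) built on the small x1/x2/x3 tables from the join examples; the sales table and its column names are hypothetical, and the sketch does not claim that cuDF supports every call shown.

```python
import pandas as pd

adf = pd.DataFrame({"x1": ["A", "B", "C"], "x2": [1, 2, 3]})
bdf = pd.DataFrame({"x1": ["A", "B", "D"], "x3": ["T", "F", "T"]})

# Rows of adf whose x1 value also occurs in bdf.
matched = adf[adf.x1.isin(bdf.x1)]

# Rows of adf whose x1 value does not occur in bdf (~ negates the mask).
unmatched = adf[~adf.x1.isin(bdf.x1)]

# Hypothetical table: group rows by "col" and sum "val" within each group.
sales = pd.DataFrame({"col": ["u", "v", "u"], "val": [1, 2, 3]})
totals = sales.groupby("col")["val"].sum()
```

Unlike a merge, a filtering join keeps only the columns of adf; bdf is used solely to decide which rows survive.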
All of the summary functions listed above can be applied to a group. Additional GroupBy functions:

Planned for Future Release
  Size of each group.

agg(function)
  Aggregate group using function.

Planned for Future Release
  Return an Expanding object allowing summary functions to be applied cumulatively.

df.rolling(n)
  Return a Rolling object allowing summary functions to be applied to windows of length n.

The examples below can also be applied to groups. In this case, the function is applied on a per-group basis, and the returned vectors are of the length of the original DataFrame.

shift(1)                Copy with values shifted by 1.
shift(-1)               Copy with values lagged by 1.
rank(method='dense')    Ranks with no gaps.
                        (Planned for Future Release)
rank(method='min')      Ranks. Ties get min rank.
rank(pct=True)          Ranks rescaled to interval [0, 1].
rank(method='first')    Ranks. Ties go to first value.
cumsum()                Cumulative sum.
cummax()                Cumulative max.
cummin()                Cumulative min.
cumprod()               Cumulative product.

CuDF can convert pandas category data types into one-hot encoded or dummy variables easily.

pet_owner = [1, 2, 3, 4, 5]
pet_type = ['fish', 'dog', 'fish', 'bird', 'fish']
df = pd.DataFrame({'pet_owner': pet_owner, 'pet_type': pet_type})
df.pet_type = df.pet_type.astype('category')
my_gdf = cudf.DataFrame.from_pandas(df)
my_gdf['pet_codes'] =
codes = my_gdf.pet_codes.unique()
enc_gdf = my_gdf.one_hot_encoding('pet_codes', 'pet_dummy', codes)

SET-LIKE OPERATIONS

gdf1          gdf2
x1  x2        x1  x2
A   1         B   2
B   2         C   3
C   3         D   4

gdf1.merge(gdf2, how='inner')
  Rows that appear in both gdf1 and gdf2 (Intersection).
    x1  x2
    B   2
    C   3

gdf1.merge(gdf2, how='outer')
  Rows that appear in either or both gdf1 and gdf2 (Union).
    x1  x2
    A   1
    B   2
    C   3
    D   4

(pd.merge(ydf, zdf, how='outer', indicator=True)
   .query('_merge == "left_only"')
   .drop(columns=['_merge']))
  Rows that appear in ydf but not zdf (Setdiff).
  (Planned for Future Release)
    x1  x2
    A   1
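The three set-like operations can be checked with a small sketch. It is written in pandas so that it runs without a GPU, using the ydf/zdf names from the Setdiff entry and the same two tables; nothing here is meant to imply that the indicator-based version is available in cuDF.

```python
import pandas as pd

ydf = pd.DataFrame({"x1": ["A", "B", "C"], "x2": [1, 2, 3]})
zdf = pd.DataFrame({"x1": ["B", "C", "D"], "x2": [2, 3, 4]})

# With no "on" argument, merge joins on all shared columns (x1 and x2),
# so whole rows are compared.
both = pd.merge(ydf, zdf, how="inner")      # intersection
either = pd.merge(ydf, zdf, how="outer")    # union

# indicator=True adds a _merge column recording where each row came from;
# keeping only "left_only" rows yields the set difference ydf minus zdf.
only_y = (pd.merge(ydf, zdf, how="outer", indicator=True)
            .query('_merge == "left_only"')
            .drop(columns=["_merge"]))
```

Because the join runs over every shared column, two rows count as equal only when all of their values match.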

