Use RAPIDS on a single GPU
This notebook describes a machine learning training workflow using the famous NYC Taxi Dataset. That dataset contains information on taxi trips in New York City.
In this exercise, we attempt to answer this classification question:
based on characteristics that can be known at the beginning of a trip, will this trip result in a high tip?
RAPIDS is a collection of libraries which enable you to take advantage of NVIDIA GPUs to accelerate machine learning workflows. This exercise uses the following RAPIDS packages to execute code on a GPU, rather than a CPU:
- cudf: dataframe manipulation, similar to pandas
- cuml: machine learning training and evaluation, similar to scikit-learn
Load data
The code below loads the data into a cudf dataframe. This is similar to a pandas dataframe, but it lives in GPU memory and most operations on it are done on the GPU.
import cudf
taxi = cudf.read_csv(
"https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-01.csv",
parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"],
)
Many dataframe operations that you would execute on a pandas dataframe also work for a cudf dataframe:
len(taxi)
taxi.head()
Train model
Now that the data has been loaded, it’s time to build a model!
For this task, we’ll use the RandomForestClassifier from cuml. If you’ve never used a random forest or need a refresher, consult “Forests of randomized trees” in the scikit-learn documentation. We cast to 32-bit types for compatibility with older versions of cuml.
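from cuml.ensemble import RandomForestClassifier

# Features: pickup zone, drop-off zone, and passenger count; missing values become -1
X = taxi[["PULocationID", "DOLocationID", "passenger_count"]].astype("float32").fillna(-1)
# Label: 1 for a "high tip" (tip_amount greater than 1), otherwise 0
y = (taxi["tip_amount"] > 1).astype("int32")

rfc = RandomForestClassifier(n_estimators=100)
_ = rfc.fit(X, y)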
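To make the feature and label preparation above concrete, here is a minimal CPU-only sketch in plain Python that mirrors the same two steps on a few made-up rows: missing feature values are replaced with the sentinel -1, and the label is 1 only when tip_amount is strictly greater than 1. The rows and the helper name prepare are hypothetical and exist only for illustration; the actual workflow does this with cudf on the GPU.

```python
# Illustrative only: made-up rows, not real taxi data.
FEATURES = ["PULocationID", "DOLocationID", "passenger_count"]

rows = [
    {"PULocationID": 151, "DOLocationID": 239, "passenger_count": 1, "tip_amount": 1.65},
    {"PULocationID": 239, "DOLocationID": 246, "passenger_count": None, "tip_amount": 0.0},
    {"PULocationID": 48, "DOLocationID": 48, "passenger_count": 2, "tip_amount": 1.0},
]


def prepare(rows):
    """Hypothetical helper: return (X, y) as plain Python lists."""
    # Cast every feature to float and use -1.0 as the missing-value sentinel.
    X = [
        [float(row[name]) if row[name] is not None else -1.0 for name in FEATURES]
        for row in rows
    ]
    # Strictly greater than 1, so a tip of exactly 1.0 is labeled 0.
    y = [int(row["tip_amount"] > 1) for row in rows]
    return X, y


X_demo, y_demo = prepare(rows)
print(X_demo)
print(y_demo)
```

Note that the comparison is strict, so a trip with a tip of exactly 1 falls into the negative class.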
Calculate metrics
We’ll use another month of taxi data as the test set and calculate the AUC score.
taxi_test = cudf.read_csv(
"https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-02.csv",
parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"],
)
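from cuml.metrics import roc_auc_score

# Prepare the test features and labels exactly as for the training data
X_test = taxi_test[["PULocationID", "DOLocationID", "passenger_count"]].astype("float32").fillna(-1)
y_test = (taxi_test["tip_amount"] > 1).astype("int32")

# Select column 1, the predicted probability of the positive (high-tip) class
preds = rfc.predict_proba(X_test)[1]
roc_auc_score(y_test, preds)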
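As a rough intuition for the score computed above: ROC AUC equals the probability that a randomly chosen positive example receives a higher predicted score than a randomly chosen negative one, with ties counting as one half. The sketch below computes that quantity directly in plain Python on made-up labels and scores. The helper pairwise_auc is hypothetical and not part of cuml, and its quadratic loop is only suitable for tiny inputs.

```python
def pairwise_auc(labels, scores):
    """Hypothetical helper: AUC as the fraction of (positive, negative)
    pairs in which the positive example gets the higher score."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Made-up example: 3 of the 4 (positive, negative) pairs are ordered correctly.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(labels, scores))  # 0.75
```

Under this reading, a score near 0.5 means the model ranks high-tip trips no better than chance, while 1.0 means every high-tip trip is scored above every other trip.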