Use RAPIDS on a single GPU
This notebook describes a machine learning training workflow using the NYC Taxi dataset, which contains information on taxi trips in New York City.
In this exercise, we attempt to answer this classification question:
based on characteristics that can be known at the beginning of a trip, will this trip result in a high tip?
RAPIDS is a collection of libraries which enable you to take advantage of NVIDIA GPUs to accelerate machine learning workflows. This exercise uses the following RAPIDS packages to execute code on a GPU, rather than a CPU:
- cudf: data frame manipulation, similar to pandas
- cuml: machine learning training and evaluation, similar to scikit-learn
Load data
The code below loads the data into a cudf data frame. This is similar to a pandas dataframe, but it lives in GPU memory and most operations on it are done on the GPU.
import cudf
taxi = cudf.read_csv(
"https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-01.csv",
parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"],
)
Many dataframe operations that you would execute on a pandas dataframe also work for a cudf dataframe:
len(taxi)
taxi.head()
Train model
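Now that the data are loaded, it’s time to build a model!
For this task, we’ll use the RandomForestClassifier from cuml. If you’ve never used a random forest or need a refresher, consult “Forests of randomized trees” in the scikit-learn documentation. The features are the pickup and drop-off location IDs and the passenger count, with missing values filled with -1. The target is 1 when tip_amount is greater than 1 and 0 otherwise. We cast to 32-bit types for compatibility with older versions of cuml.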
X = taxi[["PULocationID", "DOLocationID", "passenger_count"]].astype("float32").fillna(-1)
y = (taxi["tip_amount"] > 1).astype("int32")
from cuml.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=100)
_ = rfc.fit(X, y)
Calculate metrics
We’ll use another month of taxi data (February 2019) as the test set and calculate the AUC score.
taxi_test = cudf.read_csv(
"https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-02.csv",
parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"],
)
X_test = taxi_test[["PULocationID", "DOLocationID", "passenger_count"]].astype("float32").fillna(-1)
y_test = (taxi_test["tip_amount"] > 1).astype("int32")
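The training and test sets are prepared with the same column selection, casts, and fill value. One way to keep the two in sync is a small helper, sketched below. The names make_xy, FEATURES, and tip_threshold are hypothetical and not part of the original notebook. The demo uses a tiny pandas frame with made-up rows so it runs without a GPU. The helper only calls methods (column selection, astype, fillna, comparison) that the notebook already uses on cudf data frames, so it should apply to those unchanged.

```python
import pandas as pd  # CPU stand-in for the demo only; the notebook itself uses cudf

# Hypothetical constant: the feature columns used in this notebook.
FEATURES = ["PULocationID", "DOLocationID", "passenger_count"]


def make_xy(df, tip_threshold=1):
    """Build (X, y) from a taxi data frame using the notebook's preparation steps."""
    # Cast features to 32-bit floats and fill missing values with -1.
    X = df[FEATURES].astype("float32").fillna(-1)
    # Label a trip 1 when its tip exceeds the threshold, else 0.
    y = (df["tip_amount"] > tip_threshold).astype("int32")
    return X, y


# Made-up rows for illustration; one passenger_count is missing.
sample = pd.DataFrame(
    {
        "PULocationID": [151, 239, 236],
        "DOLocationID": [239, 246, 236],
        "passenger_count": [1, None, 2],
        "tip_amount": [1.65, 1.0, 0.0],
    }
)
X_demo, y_demo = make_xy(sample)
```

With a helper like this, the training and test preparation would reduce to make_xy(taxi) and make_xy(taxi_test). Note that the comparison is strict, so a tip of exactly 1 is labeled 0.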
from cuml.metrics import roc_auc_score
preds = rfc.predict_proba(X_test)[1]
roc_auc_score(y_test, preds)
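To make the AUC score easier to interpret: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The pure-Python sketch below computes it that way on four made-up labels and scores. The function name pairwise_auc is hypothetical, and this quadratic-time loop is only an illustration of the definition, not how cuml computes the metric.

```python
def pairwise_auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Made-up example: 3 of the 4 positive/negative pairs are ordered correctly.
auc_demo = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc_demo)  # 0.75
```

A score of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of high-tip trips above the rest.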