Begin by capturing images with your camera and performing the necessary preprocessing steps. Then, annotate the captured images to identify the materials present in each one. These annotations allow the model to learn which materials it must recognize during training. You must save annotations in COCO JSON format.
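As a rough illustration of the COCO JSON layout mentioned above, the following sketch builds a minimal annotation file in Python. The top-level `images`, `categories`, and `annotations` keys follow the general COCO convention. The file name, image size, category names, and coordinates are hypothetical placeholders, not values from the CircularNet dataset, and `check_references` is an illustrative helper, not part of the repository.

```python
import json

# Hypothetical placeholder values: replace with your own images and materials.
coco = {
    "images": [
        {"id": 1, "file_name": "frame_0001.jpg", "width": 1024, "height": 768},
    ],
    "categories": [
        {"id": 1, "name": "plastic"},
        {"id": 2, "name": "metal"},
    ],
    "annotations": [
        {
            "id": 1,
            "image_id": 1,      # must match an entry in "images"
            "category_id": 1,   # must match an entry in "categories"
            "bbox": [100.0, 150.0, 200.0, 120.0],  # [x, y, width, height] in pixels
            "area": 200.0 * 120.0,
            "iscrowd": 0,
            # One polygon, stored as a flat list of x, y coordinates.
            "segmentation": [
                [100.0, 150.0, 300.0, 150.0, 300.0, 270.0, 100.0, 270.0]
            ],
        },
    ],
}


def check_references(dataset):
    """Return ids of annotations that point to unknown images or categories."""
    image_ids = {img["id"] for img in dataset["images"]}
    category_ids = {cat["id"] for cat in dataset["categories"]}
    return [
        ann["id"]
        for ann in dataset["annotations"]
        if ann["image_id"] not in image_ids
        or ann["category_id"] not in category_ids
    ]


# Serialize to the JSON text that would be saved alongside the images.
coco_json = json.dumps(coco, indent=2)
```

A check such as `check_references(coco)` can catch dangling ids before you run the preprocessing scripts. Most annotation tools can export this format directly, so writing the file by hand is rarely necessary.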
The GitHub repository provides the required preprocessing scripts to prepare your data for retraining. These scripts convert your annotated images into TFRecords, the required input format for TensorFlow models.
The preprocessing scripts of the repository let you perform the following actions:
Before proceeding with the retraining pipeline, run the preprocessing scripts with your data on your local workstation, remote server, or database. Once your TFRecords are ready, upload them to a Cloud Storage bucket. The bucket can have two locations, one for the training dataset and another for the validation dataset.
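One possible way to organize the two locations is a separate prefix per dataset split. The sketch below only plans the destination URIs and does not upload anything. The bucket name, the `train` and `validation` prefixes, the file names, and the `plan_uploads` helper are all assumptions made for illustration.

```python
from pathlib import PurePosixPath

# Assumed split names; the bucket layout is your choice.
VALID_SPLITS = ("train", "validation")


def plan_uploads(local_files, bucket_name, split):
    """Map local TFRecord paths to gs:// destinations under a split prefix."""
    if split not in VALID_SPLITS:
        raise ValueError(f"split must be one of {VALID_SPLITS}, got {split!r}")
    return {
        path: f"gs://{bucket_name}/{split}/{PurePosixPath(path).name}"
        for path in local_files
    }


# Hypothetical file and bucket names.
train_plan = plan_uploads(
    ["out/train-00000-of-00002.tfrecord", "out/train-00001-of-00002.tfrecord"],
    "my-bucket",
    "train",
)
for source, destination in train_plan.items():
    print(source, "->", destination)
```

The actual copy can then be done with any Cloud Storage client, for example the `gcloud storage cp` command.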
Configure your model training by customizing values in the CircularNET_Vertex_AI_ReTraining_v1.ipynb script, including your project ID, bucket URI, and region. Provide the following information in the corresponding script variables:

- input_train_data_path: path to the TFRecords of your training dataset in the Cloud Storage bucket.
- input_validation_data_path: path to the TFRecords of your validation dataset in the Cloud Storage bucket.
- init_checkpoint_path: path to the initial checkpoints with the model's weights. You can use the open-source initial checkpoints in this variable from the configuration file.
- config_file_path: path to the configuration file containing parameters for fine-tuning the training. You can use the open-source configuration file in this variable.
- service_account: name of the Vertex AI service account with permissions to perform training jobs.

The script's placeholders, such as PROJECT_ID, REGION, and STAGING_BUCKET, must be replaced with your Google Cloud project, region, and Cloud Storage bucket, respectively.
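The variables above might be filled in along these lines. This is a sketch only: every value is a placeholder, the exact path shape the notebook expects (folder, file, or pattern) is not stated here, and storing the checkpoint and configuration file in the same bucket is an assumption. The `settings` dictionary and the `find_non_gcs_paths` helper are hypothetical and are not part of the notebook.

```python
# Placeholder values: replace with your own project, region, and bucket.
PROJECT_ID = "my-project"
REGION = "us-central1"
STAGING_BUCKET = "gs://my-bucket"

# Assumed layout: everything lives under the same bucket.
settings = {
    "input_train_data_path": f"{STAGING_BUCKET}/train/",
    "input_validation_data_path": f"{STAGING_BUCKET}/validation/",
    "init_checkpoint_path": f"{STAGING_BUCKET}/checkpoints/initial/",
    "config_file_path": f"{STAGING_BUCKET}/config/config.yaml",
    # Hypothetical account name in the usual service account email form.
    "service_account": f"trainer@{PROJECT_ID}.iam.gserviceaccount.com",
}


def find_non_gcs_paths(values):
    """Return names of *_path settings that do not start with gs://."""
    return [
        name
        for name, value in values.items()
        if name.endswith("_path") and not value.startswith("gs://")
    ]


problems = find_non_gcs_paths(settings)
if problems:
    print("Check these settings:", ", ".join(problems))
```

Running a check like this before you launch the training job can surface a mistyped path early, when it is still cheap to fix.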
Note: You can also customize other script variables, such as num_classes,
which refers to the number of categories for materials or other classes in your
annotated images.
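If you are unsure how many categories your annotations contain, you can count them from the COCO JSON file. The sketch below assumes the standard `categories` key; the sample data and the `count_categories` helper are hypothetical. Whether num_classes should also include an extra background class depends on the configuration file, so verify the expected value there before you set it.

```python
import json


def count_categories(coco_json_text):
    """Count the distinct category ids declared in a COCO JSON document."""
    dataset = json.loads(coco_json_text)
    return len({category["id"] for category in dataset["categories"]})


# Hypothetical annotation file content with three material categories.
sample = json.dumps(
    {
        "images": [],
        "annotations": [],
        "categories": [
            {"id": 1, "name": "plastic"},
            {"id": 2, "name": "metal"},
            {"id": 3, "name": "paper"},
        ],
    }
)

category_count = count_categories(sample)
print("categories in annotations:", category_count)
```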