---
title: "RF100 Dataset Catalog"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{RF100 Dataset Catalog}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Overview

The Roboflow 100 (RF100) catalog in this package consists of 34 object detection datasets organized into 6 collections. This vignette lists every dataset to help you find the right one for your task.

The RF100 datasets cover a wide range of domains:

- **Biology**: Microscopy, cells, bacteria, parasites (9 datasets)
- **Medical**: X-rays, MRI, pathology (8 datasets)
- **Infrared**: Thermal imaging, FLIR cameras (4 datasets)
- **Damage**: Defect detection, infrastructure inspection (3 datasets)
- **Underwater**: Marine life, coral, infrastructure (4 datasets)
- **Document**: OCR, document parsing, diagrams (6 datasets)

## Quick Search

The easiest way to find datasets is to use the search functions:

```{r eval=FALSE}
library(torchvision)

# Search for specific topics
search_rf100("cell")    # Find cell-related datasets
search_rf100("solar")   # Find solar panel datasets
search_rf100("x-ray")   # Find X-ray datasets

# List all datasets in a collection
search_rf100(collection = "biology")
search_rf100(collection = "medical")

# View complete catalog
catalog <- get_rf100_catalog()
View(catalog)
```

## Example: Finding a Photovoltaic Dataset

One of the motivations for this catalog was answering questions like: *"Is there a photovoltaic dataset in torchvision?"*

```{r eval=FALSE}
# Search for solar/photovoltaic datasets
search_rf100("solar")
search_rf100("photovoltaic")

# Result shows:
# - solar_panel in infrared collection
# - solar_panel in damage collection
```

## Complete Catalog

Here's the complete catalog of all RF100 datasets:

```{r eval=FALSE}
library(torchvision)
library(knitr)

catalog <- get_rf100_catalog()

# Display key columns
kable(catalog[, c("collection", "dataset",
                  "description", "total_size_mb", "estimated_images")])
```

## Collections

### Biology Collection (9 datasets)

Microscopy and biological imaging datasets for research and diagnostics:

```{r eval=FALSE}
search_rf100(collection = "biology")
```

**Available datasets:**

- `stomata_cell`: Plant stomata cells for biology research
- `blood_cell`: Blood cell detection (RBC, WBC, platelets)
- `parasite`: Parasite detection in microscopy images
- `cell`: General cell detection in microscopy
- `bacteria`: Bacteria detection in microscopy images
- `cotton_disease`: Cotton plant disease detection
- `mitosis`: Mitosis phase detection in cell images
- `phage`: Bacteriophage detection in microscopy
- `liver_disease`: Liver disease pathology detection

### Medical Collection (8 datasets)

Medical imaging datasets for clinical and research applications:

```{r eval=FALSE}
search_rf100(collection = "medical")
```

**Available datasets:**

- `radio_signal`: Radio signal detection in medical imaging
- `rheumatology`: Rheumatology X-ray abnormality detection
- `knee`: ACL and knee X-ray analysis
- `abdomen_mri`: Abdomen MRI organ detection
- `brain_axial_mri`: Brain axial MRI structure detection
- `gynecology_mri`: Gynecology MRI structure detection
- `brain_tumor`: Brain tumor detection in MRI scans
- `fracture`: Bone fracture detection in X-rays

### Infrared Collection (4 datasets)

Thermal and infrared imaging datasets:

```{r eval=FALSE}
search_rf100(collection = "infrared")
```

**Available datasets:**

- `thermal_dog_and_people`: Thermal imaging of dogs and people
- `solar_panel`: Solar panel detection in infrared imagery
- `thermal_cheetah`: Thermal imaging of cheetahs
- `ir_object`: FLIR camera object detection

### Damage Collection (3 datasets)

Infrastructure damage and defect detection:

```{r eval=FALSE}
search_rf100(collection = "damage")
```

**Available datasets:**

- `liquid_crystals`: 4-fold defect detection in LCD displays
- `solar_panel`: Solar panel defect and damage detection
- `asbestos`: Asbestos detection for safety inspection

### Underwater Collection (4 datasets)

Marine and underwater imaging datasets:

```{r eval=FALSE}
search_rf100(collection = "underwater")
```

**Available datasets:**

- `pipes`: Underwater pipe detection for infrastructure
- `aquarium`: Aquarium fish and species detection
- `objects`: Underwater object detection
- `coral`: Coral reef detection and monitoring

### Document Collection (6 datasets)

Document analysis and OCR datasets:

```{r eval=FALSE}
search_rf100(collection = "document")
```

**Available datasets:**

- `tweeter_post`: Twitter post element detection
- `tweeter_profile`: Twitter profile element detection
- `document_part`: Document structure and part detection
- `activity_diagram`: Activity diagram element detection
- `signature`: Signature detection in documents
- `paper_part`: Academic paper structure detection

## Usage Example

Once you've found a dataset, loading it is straightforward:

```{r eval=FALSE}
library(torchvision)

# Search for blood cell dataset
search_rf100("blood")

# Load the dataset
ds <- rf100_biology_collection(
  dataset = "blood_cell",
  split = "train",
  download = TRUE
)

# Inspect a sample
item <- ds[1]
print(item$y$labels)  # Object classes
print(item$y$boxes)   # Bounding boxes

# Visualize with bounding boxes
boxed <- draw_bounding_boxes(item)
tensor_image_browse(boxed)
```

## Dataset Statistics

```{r eval=FALSE}
catalog <- get_rf100_catalog()

# Total size of all datasets
sum(catalog$total_size_mb) / 1024  # In GB

# Datasets by size
catalog[order(-catalog$total_size_mb), c("dataset", "collection", "total_size_mb")]

# Smallest and largest datasets
catalog[which.min(catalog$total_size_mb), ]
catalog[which.max(catalog$total_size_mb), ]

# Average size by collection
aggregate(total_size_mb ~ collection, data = catalog, FUN = mean)
```

## Filtering and Exploration

The catalog is a regular data frame, so you can use standard R operations:

```{r eval=FALSE}
# Find small datasets (< 20 MB total)
subset(catalog, total_size_mb < 20)

# Find large datasets (> 200 MB total)
subset(catalog, total_size_mb > 200)

# Find datasets with specific keywords
subset(catalog, grepl("tumor|cancer|disease", description, ignore.case = TRUE))

# Datasets with all three splits
subset(catalog, has_train & has_test & has_valid)
```

## Additional Resources

- **Roboflow Universe**: Browse datasets at https://universe.roboflow.com/browse/
- **Collection Functions**: See `?rf100_biology_collection`, `?rf100_medical_collection`, etc.
- **Visualization**: See `?draw_bounding_boxes` for visualizing detections

## Citation

If you use RF100 datasets in your research, please cite:

```
@article{roboflow100,
  title={Roboflow 100: A Rich, Multi-Domain Object Detection Benchmark},
  author={Roboflow},
  journal={arXiv preprint},
  year={2022}
}
```
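
## Appendix: Trying the Filters Offline

The data-frame operations shown under "Filtering and Exploration" can be tried without installing or downloading anything. The sketch below builds a small stand-in data frame that reuses column names mentioned in this vignette (`collection`, `dataset`, `description`, `total_size_mb`, `has_train`, `has_test`, `has_valid`). It assumes the real catalog has these columns with these types. Every size and split flag in it is a made-up placeholder, not an actual RF100 statistic.

```r
# Stand-in for get_rf100_catalog(). All sizes and split flags below are
# illustrative placeholders, NOT real RF100 values.
catalog <- data.frame(
  collection    = c("biology", "biology", "medical", "infrared"),
  dataset       = c("blood_cell", "liver_disease", "brain_tumor", "solar_panel"),
  description   = c(
    "Blood cell detection (RBC, WBC, platelets)",
    "Liver disease pathology detection",
    "Brain tumor detection in MRI scans",
    "Solar panel detection in infrared imagery"
  ),
  total_size_mb = c(12, 250, 80, 35),
  has_train     = c(TRUE, TRUE, TRUE, TRUE),
  has_test      = c(TRUE, TRUE, FALSE, TRUE),
  has_valid     = c(TRUE, TRUE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Datasets under 20 MB
small <- subset(catalog, total_size_mb < 20)

# Datasets whose description matches any of the keywords (case-insensitive)
keyword <- subset(
  catalog,
  grepl("tumor|cancer|disease", description, ignore.case = TRUE)
)

# Datasets that provide train, test, and validation splits
complete <- subset(catalog, has_train & has_test & has_valid)

# Mean size per collection
mean_size <- aggregate(total_size_mb ~ collection, data = catalog, FUN = mean)

print(small$dataset)
print(keyword$dataset)
print(complete$dataset)
print(mean_size)
```

Swapping the stand-in for `catalog <- get_rf100_catalog()` should leave the remaining lines unchanged, provided the real catalog exposes the same column names.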