Convert raw dataset to D3M dataset

The d3m package currently requires Python 3.6.

pip install d3m
python create_d3m_dataset.py <train_data.csv> <test_data.csv> <label> <metric> -t classification <-t ...>

Example

Some examples of valid commands:

python create_d3m_dataset.py train_data.csv test_data.csv Label accuracy -t classification
python create_d3m_dataset.py train_data.csv test_data.csv Value meanSquaredError -t regression

Use the -t option, repeated as needed, to specify the task type(s) and data type(s); the metric is passed as the fourth positional argument. The script creates a directory named “raw” containing your dataset in D3M format. Use this directory as the input to ./scripts/start_container.sh

This is the structure created for a generated D3M dataset:

raw$ tree
.
├── TEST
│   ├── dataset_TEST
│   │   ├── datasetDoc.json
│   │   ├── metadata.json
│   │   └── tables
│   │       └── learningData.csv
│   └── problem_TEST
│       └── problemDoc.json
└── TRAIN
    ├── dataset_TRAIN
    │   ├── datasetDoc.json
    │   ├── metadata.json
    │   └── tables
    │       └── learningData.csv
    └── problem_TRAIN
        └── problemDoc.json

8 directories, 8 files
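
As a quick sanity check after running the script, the layout above can be verified programmatically. The following is a minimal sketch, not part of create_d3m_dataset.py; the helper names expected_paths and missing_files are hypothetical, and the file list is taken directly from the tree shown above.

```python
from pathlib import Path


def expected_paths(root="raw"):
    """Return the 8 files shown in the tree above, relative to `root`."""
    root = Path(root)
    paths = []
    for split in ("TRAIN", "TEST"):
        dataset_dir = root / split / f"dataset_{split}"
        problem_dir = root / split / f"problem_{split}"
        paths.append(dataset_dir / "datasetDoc.json")
        paths.append(dataset_dir / "metadata.json")
        paths.append(dataset_dir / "tables" / "learningData.csv")
        paths.append(problem_dir / "problemDoc.json")
    return paths


def missing_files(root="raw"):
    """Return the expected files that do not exist under `root`."""
    return [p for p in expected_paths(root) if not p.is_file()]


# Usage (run from the directory that contains "raw"):
#     for path in missing_files("raw"):
#         print("missing:", path)
```

An empty result means every file from the tree is present; media directories for image, audio, or video datasets are not covered by this check.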

Example of creating D3M dataset for image regression

python create_d3m_dataset.py train.csv test.csv WRISTBREADTH meanSquaredError -t regression -t image
Namespace(dataFileName='train.csv', metric='meanSquaredError', target='WRISTBREADTH', tasks=['regression', 'image'], testDataFileName='test.csv')
Going to create TRAIN files!
Going to create TEST files!
Please enter directory name for TRAIN media files: train_images
Please enter directory name for TEST media files: test_images
Please enter column name for media files: image_file
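
The Namespace line printed above suggests an argparse-style interface. The sketch below reproduces the same attribute names (dataFileName, testDataFileName, target, metric, tasks) to show how repeated -t flags accumulate into a list. It is an illustration built from that output, not the actual source of create_d3m_dataset.py; the help strings are assumptions.

```python
import argparse


def build_parser():
    """Build a parser whose attributes mirror the Namespace shown above."""
    parser = argparse.ArgumentParser(
        description="Illustrative CLI modelled on create_d3m_dataset.py"
    )
    parser.add_argument("dataFileName", help="training data CSV")
    parser.add_argument("testDataFileName", help="test data CSV")
    parser.add_argument("target", help="name of the label column")
    parser.add_argument("metric", help="evaluation metric, e.g. accuracy")
    # action="append" collects every -t occurrence into one list.
    parser.add_argument(
        "-t",
        dest="tasks",
        action="append",
        help="task type or data type; may be given more than once",
    )
    return parser


# Parse the same arguments as the image regression example.
args = build_parser().parse_args(
    ["train.csv", "test.csv", "WRISTBREADTH", "meanSquaredError",
     "-t", "regression", "-t", "image"]
)
print(args.tasks)
```

With this setup, task types and data types share the single tasks list, which matches tasks=['regression', 'image'] in the transcript.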

Note: Some task and data types may not be fully automated (e.g., object detection, graph problems). The TRAIN and TEST hierarchies will still be created, but datasetDoc.json may need to be customized to link resources/tables for the specific task. Example datasets are provided for reference.
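
Before customizing datasetDoc.json by hand, it can help to list the resources it currently declares and compare them with one of the example datasets. The sketch below assumes the usual D3M dataset schema, where resources live under a top-level "dataResources" list with "resID", "resType", and "resPath" keys; if a generated file uses different keys, the corresponding fields come back as None. The function name list_resources is hypothetical.

```python
import json


def list_resources(dataset_doc_path):
    """Return (resID, resType, resPath) for each declared data resource.

    Assumes the D3M datasetDoc layout with a "dataResources" list;
    missing keys yield None rather than raising.
    """
    with open(dataset_doc_path, encoding="utf-8") as handle:
        doc = json.load(handle)
    return [
        (res.get("resID"), res.get("resType"), res.get("resPath"))
        for res in doc.get("dataResources", [])
    ]


# Usage:
#     for res in list_resources("raw/TRAIN/dataset_TRAIN/datasetDoc.json"):
#         print(res)
```

Running it on both the generated file and a reference dataset makes missing or unlinked resources easy to spot.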

Valid task type(s)

linkPrediction, graphMatching, forecasting, classification, semiSupervised, clustering, collaborativeFiltering, regression, objectDetection, vertexNomination, communityDetection, vertexClassification

Valid data type(s)

Valid data types are: audio, image, video, text, timeSeries

Valid metrics

classification/linkPrediction/graphMatching/vertexNomination/vertexClassification: accuracy, f1Macro, f1Micro, rocAuc, rocAucMacro, rocAucMicro
regression/forecasting/collaborativeFiltering: rSquared, meanSquaredError, meanAbsoluteError
communityDetection/clustering: normalizedMutualInformation
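
The task-to-metric pairing above can be encoded as a lookup table to catch a mismatched metric before running the script. This is a minimal sketch based only on the list above; the names METRICS_BY_TASK, valid_metrics, and is_metric_valid are hypothetical, and the script itself may validate differently. Task types without listed metrics (for example objectDetection) and data types such as image contribute nothing to the allowed set.

```python
CLASSIFICATION_LIKE = (
    "classification", "linkPrediction", "graphMatching",
    "vertexNomination", "vertexClassification",
)
REGRESSION_LIKE = ("regression", "forecasting", "collaborativeFiltering")
CLUSTERING_LIKE = ("communityDetection", "clustering")

METRICS_BY_TASK = {}
for task in CLASSIFICATION_LIKE:
    METRICS_BY_TASK[task] = frozenset(
        ["accuracy", "f1Macro", "f1Micro", "rocAuc", "rocAucMacro", "rocAucMicro"]
    )
for task in REGRESSION_LIKE:
    METRICS_BY_TASK[task] = frozenset(
        ["rSquared", "meanSquaredError", "meanAbsoluteError"]
    )
for task in CLUSTERING_LIKE:
    METRICS_BY_TASK[task] = frozenset(["normalizedMutualInformation"])


def valid_metrics(tasks):
    """Union of the metrics listed for every recognized task type in `tasks`."""
    metrics = set()
    for task in tasks:
        # Data types and unlisted task types add no metrics.
        metrics |= METRICS_BY_TASK.get(task, frozenset())
    return metrics


def is_metric_valid(metric, tasks):
    """True if `metric` is listed for at least one task type in `tasks`."""
    return metric in valid_metrics(tasks)


# Usage: is_metric_valid("meanSquaredError", ["regression", "image"])
```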

Sample D3M dataset(s) for task type(s) and data type(s):

classification: 185_baseball_MIN_METADATA
regression: 196_autoMpg_MIN_METADATA
forecasting: LL1_736_stock_market_MIN_METADATA
audio: 31_urbansound_MIN_METADATA
video: LL1_VID_UCF11_MIN_METADATA
text: LL1_TXT_CLS_airline_opinion_MIN_METADATA
timeSeries: 66_chlorineConcentration_MIN_METADATA
image: 22_handgeometry_MIN_METADATA
collaborativeFiltering: 60_jester_MIN_METADATA
communityDetection: 6_70_com_amazon_MIN_METADATA
graphMatching: 49_facebook_MIN_METADATA
linkPrediction: 59_umls_MIN_METADATA
vertexClassification: LL1_VTXC_1343_cora_MIN_METADATA