Models & Artifacts
The models/ directory contains pre-trained machine learning models and preprocessing artifacts for exoplanet detection. All models are trained on NASA Exoplanet Archive Kepler mission data.
Model Artifacts
1. Random Forest Classifier
- File: random_forest_classifier.pkl
- Type: Scikit-learn RandomForestClassifier
- Size: ~5-10 MB
- Description: Ensemble classifier combining multiple decision trees for robust predictions
- Performance: ~90% accuracy, AUC-ROC 0.94-0.96
2. Decision Tree Classifier
- File: decision_tree_classifier.pkl
- Type: Scikit-learn DecisionTreeClassifier
- Size: ~1-3 MB
- Description: Single decision tree for interpretable rule-based classification
- Performance: ~85% accuracy, AUC-ROC 0.88-0.91
3. CNN Model
- File: cnn_model.h5
- Type: TensorFlow/Keras Sequential neural network
- Size: ~500 KB - 2 MB
- Description: Deep learning model with dense layers and dropout regularization
- Architecture: Input(9) → Dense(64) → Dropout(0.3) → Dense(32) → Dropout(0.3) → Dense(16) → Dropout(0.3) → Output(1)
- Performance: ~89% accuracy, AUC-ROC 0.92-0.95
4. k-Nearest Neighbors
- File: knn_model.pkl
- Type: Scikit-learn KNeighborsClassifier
- Size: ~10-20 MB (stores training data)
- Description: Instance-based classifier using k=5 nearest neighbors
- Performance: ~87% accuracy, AUC-ROC 0.90-0.93
5. Standard Scaler
- File: scaler.pkl
- Type: Scikit-learn StandardScaler
- Size: ~1-5 KB
- Description: Feature normalization to zero mean and unit variance
- Critical: Must use the same scaler for training and inference
6. Simple Imputer
- File: imputer.pkl
- Type: Scikit-learn SimpleImputer
- Size: ~1-5 KB
- Description: Fills missing values with the column mean (mean strategy)
- Critical: Must use the same imputer for training and inference
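The "same scaler, same imputer" requirement means the statistics learned from the training data are reused unchanged at inference. Below is a minimal pure-Python sketch of that idea, not the scikit-learn implementation. The function names `fit_preprocessing` and `apply_preprocessing` and the toy data are hypothetical, and the sketch assumes imputation runs before scaling.

```python
import math

def fit_preprocessing(train_rows):
    """Learn per-column fill values, means, and standard deviations from training data."""
    n_cols = len(train_rows[0])
    fill, mean, std = [], [], []
    for j in range(n_cols):
        present = [row[j] for row in train_rows if row[j] is not None]
        fill_j = sum(present) / len(present)            # mean-strategy fill value
        column = [row[j] if row[j] is not None else fill_j for row in train_rows]
        mean_j = sum(column) / len(column)
        var_j = sum((x - mean_j) ** 2 for x in column) / len(column)
        fill.append(fill_j)
        mean.append(mean_j)
        std.append(math.sqrt(var_j) or 1.0)             # avoid dividing by zero
    return {"fill": fill, "mean": mean, "std": std}

def apply_preprocessing(row, stats):
    """Impute, then standardize one row using statistics learned at training time."""
    out = []
    for j, x in enumerate(row):
        if x is None:
            x = stats["fill"][j]
        out.append((x - stats["mean"][j]) / stats["std"][j])
    return out

# Toy two-feature data (made up); None marks a missing value.
train = [[1.0, 10.0], [3.0, None], [5.0, 30.0]]
stats = fit_preprocessing(train)          # done once, at training time
print(apply_preprocessing([3.0, None], stats))  # reused as-is at inference
```

Refitting the statistics on inference data would shift the inputs relative to what the model saw during training, which is why the saved scaler.pkl and imputer.pkl have to travel with the model.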
Model Architectures & Hyperparameters
Random Forest
- n_estimators: 100
- max_depth: None
- threshold: 0.5
CNN
- hidden_layers: [64,32,16]
- dropout: 0.3
- epochs: 50
- threshold: 0.6
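As a sanity check on the layer sizes above, the trainable parameter count of the dense stack can be worked out by hand. The sketch below assumes every dense layer has a bias term and that the output is a single dense unit; dropout layers add no trainable parameters. The helper name `dense_param_count` is hypothetical.

```python
def dense_param_count(input_dim, layer_sizes):
    """Count trainable parameters in a stack of fully connected layers with biases."""
    total = 0
    prev = input_dim
    for units in layer_sizes:
        total += prev * units + units  # weight matrix plus one bias per unit
        prev = units
    return total

# 9 input features, hidden layers [64, 32, 16], then a single output unit
print(dense_param_count(9, [64, 32, 16, 1]))  # 640 + 2080 + 528 + 17 = 3265
```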
kNN
- n_neighbors: 5
- metric: euclidean
- threshold: 0.6
Decision Tree
- max_depth: 10
- criterion: gini
- threshold: 0.6
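The threshold values above turn a predicted probability into a class label. The following sketch shows one plausible reading, assuming the positive class is assigned when the probability is at or above the threshold. The dictionary keys, the function name, and the `>=` comparison are assumptions, not taken from the library.

```python
# Thresholds as listed in the hyperparameter lists above; the keys are hypothetical.
THRESHOLDS = {
    "random_forest": 0.5,
    "cnn": 0.6,
    "knn": 0.6,
    "decision_tree": 0.6,
}

def label_from_probability(probability, model_name):
    """Return 1 if the probability reaches the model's threshold, else 0."""
    return int(probability >= THRESHOLDS[model_name])

# The same probability can yield different labels under different thresholds.
print(label_from_probability(0.55, "random_forest"))  # 1
print(label_from_probability(0.55, "cnn"))            # 0
```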
Performance Comparison
| Model | Accuracy | AUC-ROC | Training Time | Inference Speed | Best For |
|---|---|---|---|---|---|
| Random Forest | ~90% | 0.94-0.96 | Medium (1-2 min) | Fast | Production, general use |
| CNN | ~89% | 0.92-0.95 | Slow (5-15 min CPU) | Fast | Complex patterns, research |
| kNN | ~87% | 0.90-0.93 | Fast (instant) | Slow | Quick prototyping |
| Decision Tree | ~85% | 0.88-0.91 | Fast (10-20 sec) | Fast | Education, interpretability |
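The two metrics in the table measure different things: accuracy depends on the chosen threshold, while AUC-ROC only depends on how the scores rank positives against negatives. A small pure-Python sketch on made-up labels and scores illustrates the difference. It is for illustration only and is not how the library computes its reported numbers.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def auc_roc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data (made up): two positives, two negatives.
y_true = [1, 0, 1, 0]
scores = [0.9, 0.4, 0.45, 0.6]
y_pred = [int(s >= 0.5) for s in scores]  # labels at a 0.5 threshold

print(accuracy(y_true, y_pred))  # 0.5
print(auc_roc(y_true, scores))   # 0.75
```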
Prediction (auto-load)
Models are loaded automatically on first prediction. The detector checks if the model is loaded and calls the corresponding load method if needed.
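The check-then-load behavior described above can be pictured with a small stand-in class. This is a hypothetical sketch of the pattern only: `LazyDetector`, its attributes, and the placeholder model are invented for illustration and do not reflect the internals of DetectExoplanet.

```python
def _placeholder_model(features):
    """Stand-in for a real model; returns the number of features it received."""
    return len(features)

class LazyDetector:
    """Hypothetical detector that loads its model the first time it is needed."""

    def __init__(self):
        self.rf_model = None
        self.load_calls = 0  # counts how many times the load step ran

    def load_rf_model(self):
        # Stand-in for reading the pickled model from disk.
        self.load_calls += 1
        self.rf_model = _placeholder_model

    def random_forest(self, features):
        if self.rf_model is None:  # check whether the model is loaded
            self.load_rf_model()   # load on first use only
        return self.rf_model(features)

detector = LazyDetector()
detector.random_forest([0.0] * 9)
detector.random_forest([0.0] * 9)
print(detector.load_calls)  # 1: the second call reuses the loaded model
```

With the library itself, the same behavior is triggered by an ordinary prediction call: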
from exobengal import DetectExoplanet
detector = DetectExoplanet()
sample = [365.0, 1.0, 288.0, 1.0, 4.44, 5778, 0.1, 5.0, 100.0]
print(detector.random_forest(sample))

Explicit Pre-loading
from exobengal import DetectExoplanet
detector = DetectExoplanet()
detector.load_rf_model()  # Pre-load so the first prediction skips the loading step
# The model is already in memory, so this call does not trigger a load
result = detector.random_forest([365.0, 1.0, 288.0, 1.0, 4.44, 5778, 0.1, 5.0, 100.0])
print(result)

Retraining
Each train_* method overwrites its model file and updates scaler.pkl and imputer.pkl. Always back up existing models before retraining.
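For scripted workflows, the backup can also be taken from Python before calling a train_* method. The helper below is a hypothetical sketch using only the standard library. The name `backup_models` and the timestamped destination naming are assumptions, not part of the library.

```python
import shutil
import time
from pathlib import Path

def backup_models(models_dir="models", backup_root="."):
    """Copy the models directory to a timestamped backup and return the new path."""
    src = Path(models_dir)
    if not src.is_dir():
        raise FileNotFoundError(f"No models directory at {src}")
    stamp = time.strftime("%Y%m%d_%H%M%S")
    dest = Path(backup_root) / f"{src.name}_backup_{stamp}"
    shutil.copytree(src, dest)  # raises FileExistsError if dest already exists
    return dest
```

Calling `backup_models()` from the project root would copy models/ to a directory such as models_backup_20250101_120000, so earlier backups are not overwritten by later ones.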
Important Notes
- Each train_* overwrites its model and writes scaler.pkl and imputer.pkl
- Always back up existing models before retraining:
  cp -r models/ models_backup/
- Training automatically handles data loading, preprocessing, splitting (80/20), and evaluation
- Outputs include classification report, confusion matrix, and AUC-ROC score
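The 80/20 split and the confusion matrix mentioned in the notes above can be sketched in plain Python. Whether the library shuffles, stratifies, or fixes a random seed is not stated here, so the shuffling and the `seed` parameter below are assumptions. The matrix layout follows the scikit-learn convention of true labels as rows and predicted labels as columns.

```python
import random

def split_80_20(rows, seed=42):
    """Shuffle a copy of the rows and split it into 80% train and 20% test."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]

def confusion_matrix(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for binary labels 0 and 1."""
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

train_rows, test_rows = split_80_20(list(range(10)))
print(len(train_rows), len(test_rows))                  # 8 2
print(confusion_matrix([1, 0, 1, 0], [1, 0, 0, 1]))     # [[1, 1], [1, 1]]
```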
Retraining Example
from exobengal import DetectExoplanet
detector = DetectExoplanet()
# Retrain Random Forest with custom hyperparameters
detector.train_random_forest(
    data_path="data/cumulative.csv",
    n_estimators=200,
    max_depth=20
)

Loading Custom Models
You can specify custom model paths when initializing the detector:
from exobengal import DetectExoplanet
detector = DetectExoplanet(
    rf_model_path="/path/to/my_rf_model.pkl",
    scaler_path="/path/to/my_scaler.pkl",
    imputer_path="/path/to/my_imputer.pkl"
)

Troubleshooting
Common Issues
- Model file not found: Train the models first or download them from the repository
- TensorFlow warnings: Safe to ignore, or install tensorflow-cpu instead
- Inconsistent predictions: Ensure the same scaler and imputer are used for training and inference
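A quick way to diagnose the "model file not found" case is to list which of the artifact files described earlier are absent. The helper below is a hypothetical sketch. The file names come from the artifact list in this document, while the function name and the default "models" directory location are assumptions.

```python
from pathlib import Path

# Artifact file names as listed under Model Artifacts.
ARTIFACTS = [
    "random_forest_classifier.pkl",
    "decision_tree_classifier.pkl",
    "cnn_model.h5",
    "knn_model.pkl",
    "scaler.pkl",
    "imputer.pkl",
]

def missing_artifacts(models_dir="models"):
    """Return the artifact file names that are absent from models_dir."""
    root = Path(models_dir)
    return [name for name in ARTIFACTS if not (root / name).is_file()]
```

An empty list from `missing_artifacts()` means every expected file is present. Otherwise the returned names show which models need to be trained or downloaded.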