Unable to log XGBoost model in MLflow - Databricks - 37957

raghagra · ‎07-19-2023

I have been trying to log a model with MLflow, but it does not seem to work. It only logs the last (and also the worst) run.

# ---------------------------------------- 13.0 ML XGBoost ----------------------------------------
#train_df = train_df.limit(188123)
from hyperopt import fmin, tpe, Trials, hp
import numpy as np
import mlflow
import mlflow.spark
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml import Pipeline
from xgboost.spark import SparkXGBRegressor
from pyspark.ml.evaluation import RegressionEvaluator
import numpy as np
from mlflow.models.signature import infer_signature

vec_assembler = VectorAssembler(inputCols=train_df.columns[1:], outputCol="features")

xgb = SparkXGBRegressor(num_workers=1, label_col="price", missing=0.0)

#pipeline = Pipeline(stages=[vec_assembler, xgb])
pipeline = Pipeline(stages=[ordinal_encoder, vec_assembler, xgb])
regression_evaluator = RegressionEvaluator(predictionCol="prediction", labelCol="price")

def objective_function(params):
    # set the hyperparameters that we want to tune
    max_depth = params["max_depth"]
    n_estimators = params["n_estimators"]

    with mlflow.start_run(nested=True):
        estimator = pipeline.copy({xgb.max_depth: max_depth, xgb.n_estimators: n_estimators})
        model = estimator.fit(train_df)

        preds = model.transform(test_df)
        rmse = regression_evaluator.evaluate(preds)
        #r2 = regression_evaluator.setMetricName("r2").evaluate(preds)
        mlflow.log_metric("rmse", rmse)
        #mlflow.log_metric("r2", r2)

    return rmse

search_space = {
    "max_depth": hp.choice("max_depth", np.arange(12, 15, dtype=int)),
    "n_estimators": hp.choice("n_estimators", np.arange(80, dtype=int)),
}

mlflow.pyspark.ml.autolog(log_models=True, log_datasets=False)
#mlflow.sklearn.autolog(log_models=False, log_datasets=False)
#mlflow.xgboost.autolog(log_models=True)
#mlflow.transformers.autolog(log_models=False)

num_evals = 1
trials = Trials()
best_hyperparam = fmin(fn=objective_function,
                       space=search_space,
                       algo=tpe.suggest,
                       max_evals=num_evals,
                       trials=trials,
                       rstate=np.random.default_rng(42))

# Retrain model on train & validation dataset and evaluate on test dataset
with mlflow.start_run():
    best_max_depth = best_hyperparam["max_depth"]
    best_n_estimators = best_hyperparam["n_estimators"]
    estimator = pipeline.copy({xgb.max_depth: best_max_depth, xgb.n_estimators: best_n_estimators})
    #combined_df = train_df.union(test_df) # Combine train & validation together

    pipeline_model = estimator.fit(train_df)
    pred_df = pipeline_model.transform(test_df)
    #signature = infer_signature(test_df, pred_df)
    rmse = regression_evaluator.evaluate(pred_df)
    r2 = regression_evaluator.setMetricName("r2").evaluate(pred_df)

    # Log param and metrics for the final model
    mlflow.log_param("maxdepth", best_max_depth)
    mlflow.log_param("n_estimators", best_n_estimators)
    mlflow.log_metric("rmse", rmse)
    mlflow.log_metric("r2", r2)
    #mlflow.transformers.log_model(pipeline_model, "model", input_example=test_df.select(old_cols_list).limit(1).toPandas())
    mlflow.spark.log_model(pipeline_model, "model", input_example=test_df.select(old_cols_list).limit(1).toPandas())
    #mlflow.xgboost.log_model(pipeline_model, "model", input_example=test_df.select(old_cols_list).limit(1).toPandas())
    #mlflow.sklearn.log_model(pipeline_model, "model", input_example=test_df.select(old_cols_list).limit(1).toPandas())

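Kumaran · a month ago

Hi @raghagra,

Thank you for posting your question in the Databricks Community.

The reason the code only logs the last run is that you call mlflow.start_run() inside objective_function(). This means that every time objective_function() is called, it starts a new run. mlflow.spark.log_model() only logs the model to the current run, so the model is only recorded in the last run.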

To fix this, you can move the mlflow.start_run() call outside objective_function(). This should ensure that a model is logged for each run.
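As a rough sketch of that suggestion, the search can be wrapped in a single parent run, so that every trial started with mlflow.start_run(nested=True) inside objective_function() shows up as a child of it. The helper name search_in_parent_run, the run name "xgb_hyperopt" and the metric name "best_rmse" are made up for illustration; objective and search_space stand for the objects defined in the question, and the objective is assumed to return the RMSE.

```python
def search_in_parent_run(objective, search_space, max_evals=1):
    """Run a hyperopt search inside one parent MLflow run (illustrative sketch)."""
    # Imported here so the helper can be defined even where these packages are missing.
    import mlflow
    from hyperopt import Trials, fmin, tpe

    trials = Trials()
    # The parent run is opened outside the objective; runs that the objective
    # starts with nested=True are attached to it as children.
    with mlflow.start_run(run_name="xgb_hyperopt") as parent_run:
        best = fmin(
            fn=objective,
            space=search_space,
            algo=tpe.suggest,
            max_evals=max_evals,
            trials=trials,
        )
        # Record the best loss on the parent so it is easy to find in the UI.
        mlflow.log_metric("best_rmse", trials.best_trial["result"]["loss"])
    return parent_run.info.run_id, best
```

A call such as search_in_parent_run(objective_function, search_space, max_evals=10) would then return the parent run ID together with the raw fmin result.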

Please check whether this works.
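One more thing that may be worth checking, offered as an assumption about the question's code rather than a confirmed diagnosis: for parameters defined with hp.choice, fmin returns the index of the selected option, not the option itself. If that index is passed straight to the final model, max_depth becomes an index such as 0, 1 or 2 rather than a depth between 12 and 14, which could explain why the final run looks like the worst one. The pure-Python sketch below shows the mapping; the helper name and sample values are hypothetical, and hyperopt's space_eval(search_space, best) performs the same lookup on a real search space.

```python
# Options offered to hp.choice for max_depth in the question: np.arange(12, 15).
MAX_DEPTH_OPTIONS = [12, 13, 14]


def resolve_choice_indices(best, options_by_name):
    """Map the indices returned by fmin for hp.choice back to the real values."""
    return {name: options_by_name[name][index] for name, index in best.items()}


# Hypothetical fmin result: index 0 was selected, not a depth of 0.
best_from_fmin = {"max_depth": 0}
resolved = resolve_choice_indices(best_from_fmin, {"max_depth": MAX_DEPTH_OPTIONS})
print(resolved)  # {'max_depth': 12}
```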

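raghagra · a month ago

@Kumaran It is still not working. I am getting the following error:
2023/07/22 11:30:21 INFO mlflow.spark: Inferring pip requirements by reloading the logged model from the databricks artifact repository, which can be time-consuming. To speed up, explicitly specify the conda_env or the pip_requirements when calling log_model().
2023/07/22 11:31:02 WARNING mlflow.utils.environment: Encountered an unexpected error while inferring pip requirements (model URI: dbfs:/databricks/mlflow-tracking/590967242928602/e3bd64c64535425192a510bd4ee66dec/artifacts/xgb_model/sparkml, flavor: spark), fall back to return ['pyspark==3.4.0']. Set logging level to DEBUG to see the full traceback.
/databricks/python/lib/python3.10/site-packages/_distutils_hack/__init__.py:33: UserWarning: Setuptools is replacing distutils.
  warnings.warn("Setuptools is replacing distutils.")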
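Note that the messages above are logged at INFO and WARNING level rather than as errors, and the first one suggests passing pip_requirements (or conda_env) to log_model() explicitly. A minimal sketch of doing that follows; the helper names and the package list are assumptions chosen for illustration, not something taken from the thread.

```python
from importlib import metadata


def pinned_requirements(packages):
    """Build 'name==version' pins for the packages that are installed."""
    pins = []
    for name in packages:
        try:
            pins.append(f"{name}=={metadata.version(name)}")
        except metadata.PackageNotFoundError:
            # Skip packages that are not installed in this environment.
            continue
    return pins


def log_spark_model_with_pins(pipeline_model, artifact_path="model"):
    """Log a fitted Spark pipeline with explicit pip requirements."""
    # Imported here so pinned_requirements() works without MLflow installed.
    import mlflow.spark

    # Assumed package list; adjust it to what the pipeline actually needs.
    requirements = pinned_requirements(["pyspark", "xgboost", "mlflow"])
    return mlflow.spark.log_model(
        pipeline_model, artifact_path, pip_requirements=requirements
    )
```

Passing the requirements explicitly should let MLflow skip the inference step that produced the warning.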

Kumaran · 4 weeks ago

Hi @raghagra,

You can try the code below (please modify it according to your needs) to log the model:

import mlflow

with mlflow.start_run(experiment_id="1234") as run:
    mlflow.set_tag("status", "started")
    mlflow.log_param("git_hash", "1234")
    mlflow.log_param("env", "stg")
    mlflow.log_param("pipeline_id", "es_s3_to_raw")
    run_id = run.info.run_uuid
    mlflow.log_param("run_id", run_id)
    mlflow.set_tag("run_url", "URL of the model")
    mlflow.log_param("id_in_job", "1726769")
    mlflow.log_param("context.user", "email id")

raghagra · 3 weeks ago

@Kumaran I ran this code, but is there any specific log I should look for?
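One way to check, sketched under the assumption that the model was logged with one of the log_model() calls: a logged model appears among the run's artifacts as a directory containing a file named MLmodel. The helper names below are hypothetical; MlflowClient.list_artifacts is used to walk the artifact tree of a run.

```python
import posixpath


def logged_model_dirs(artifact_paths):
    """Return the artifact directories that contain an MLmodel file."""
    return sorted(
        posixpath.dirname(path)
        for path in artifact_paths
        if posixpath.basename(path) == "MLmodel"
    )


def list_artifact_paths(run_id, root=None):
    """Recursively collect the file paths of all artifacts in a run."""
    # Imported here so logged_model_dirs() works without MLflow installed.
    from mlflow.tracking import MlflowClient

    client = MlflowClient()
    paths = []
    for info in client.list_artifacts(run_id, root):
        if info.is_dir:
            paths.extend(list_artifact_paths(run_id, info.path))
        else:
            paths.append(info.path)
    return paths
```

For a given run, logged_model_dirs(list_artifact_paths(run_id)) returning an empty list would suggest that no model was attached to that run.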
