
Complex variable can't reference complex variable #1593

Open
dinjazelena opened this issue Jul 13, 2024 · 7 comments
Assignees: andrewnester
Labels: DABs (DABs related issues), Enhancement (New feature or request)

Comments

@dinjazelena

Hey, so I have a job with a lot of tasks, and it's the same for every target; the only difference between targets is the parameters of the first task.
Since I want to put the task definitions into a complex variable, I need a way to pass the parameters of that task as a variable as well.

Will referencing a complex variable within a complex variable be possible?
Do you have an idea how to tackle this problem?

Is it possible to do something like this?


resources:
  jobs:
    inference:
      name: "${bundle.name}-${var.segment}-inference"
      job_clusters:
        - job_cluster_key: inference
          new_cluster: ${var.inference_cluster}
      tasks:
        - task_key: sensor
          ....
        ${var.inference_tasks}
dinjazelena added the DABs label Jul 13, 2024
@ribugent

Similar issue here. Before complex variables, we wrote a lot of boilerplate to declare all of our job clusters.

We wanted to reduce the boilerplate as much as possible, and we tried this before discovering the limitation:

variables:
  cluster_xxl:
    description: Spark big cluster
    type: complex
    default:
      spark_version: ${var.spark_version}
      spark_conf: ${var.spark_conf}
      num_workers: 1
      aws_attributes: ${var.aws_attributes}
      node_type_id: m6g.2xlarge
      spark_env_vars: ${var.spark_env_vars}
      enable_local_disk_encryption: true

  cluster_xl:
    description: Spark medium cluster
    type: complex
    default:
      spark_version: ${var.spark_version}
      spark_conf: ${var.spark_conf}
      num_workers: 1
      aws_attributes: ${var.aws_attributes}
      node_type_id: m6g.xlarge
      spark_env_vars: ${var.spark_env_vars}
      enable_local_disk_encryption: true

# and we have more ....

We worked around it by duplicating the values of the referenced complex variables (sketched below), but it would be nice if you could remove this limitation.
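
For reference, the workaround simply inlines the shared values into each complex variable. A rough sketch (the literal values below are illustrative placeholders, not our real settings):

variables:
  cluster_xxl:
    description: Spark big cluster
    type: complex
    default:
      spark_version: 14.3.x-scala2.12         # literal copied instead of ${var.spark_version}
      spark_conf:
        spark.sql.shuffle.partitions: "200"   # literals copied instead of ${var.spark_conf}
      num_workers: 1
      aws_attributes:
        first_on_demand: 1                    # literals copied instead of ${var.aws_attributes}
      node_type_id: m6g.2xlarge
      spark_env_vars:
        ENVIRONMENT: prod                     # literals copied instead of ${var.spark_env_vars}
      enable_local_disk_encryption: true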

Thanks

@pietern
Contributor

pietern commented Jul 15, 2024

@dinjazelena

I have a job with a lot of tasks and it's the same for every target

If the job is the same for every target, can't it be defined without variables? You could still use a complex variable for the parameters of the first task and specify overrides for this variable to customize it per target.

resources:
  jobs:
    job_with_parameters:
      name: job_with_parameters

      tasks:
        - task_key: task_a
          spark_python_task:
            python_file: ../src/task_a.py
            parameters: ${var.first_task_parameters}

        - task_key: task_b
          depends_on:
            - task_key: task_a
          spark_python_task:
            python_file: ../src/task_a.py
            parameters:
              - "--foo=bar"

To customize the parameters, you can define a different value per target:

targets:
  dev:
    variables:
      first_task_parameters:
        - "--mode=dev"
        - "--something=else"

  prod:
    variables:
      first_task_parameters:
        - "--mode=prod"
        - "--hello=world"

@pietern
Contributor

pietern commented Jul 15, 2024

@ribugent Thanks for commenting on your use case, we'll take it into consideration.

I agree it would be nice and in line with expectations, but it takes a bit of an investment to make it possible, and as such we need to trade off the priority. First in line was getting complex variables out in their current form.

@dinjazelena
Author

Hi @pietern, thanks for the help. I need to isolate each job as a target: I have different jobs that are the same for every target but are deployed with different owners, and since DABs tries to deploy everything under resources, I have to isolate each job as its own target.

@andrewnester
Contributor

One other potential option here could be to use YAML anchors in combination with complex variables. YAML anchors can help avoid duplicating configuration for the common parts of complex variables; see the sketch below.
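
A rough sketch of that idea, assuming the bundle's YAML loader resolves standard anchors and aliases (the variable names and values here are made up for illustration):

variables:
  cluster_small:
    type: complex
    default:
      node_type_id: m6g.xlarge
      spark_conf: &shared_spark_conf          # anchor the shared conf block once
        spark.sql.shuffle.partitions: "200"
  cluster_large:
    type: complex
    default:
      node_type_id: m6g.2xlarge
      spark_conf: *shared_spark_conf          # reuse it via an alias

Keep in mind that anchors are resolved per YAML file, so definitions that share an anchor have to live in the same file.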

andrewnester self-assigned this Jul 15, 2024
@dgarridoa

Same issue here.

I am trying to create two clusters that share most of their definition except the runtime. They are used in different jobs. As an example:

variables:
  catalog:
    default: hive_metastore
  spark_conf:
    default:
      spark.databricks.sql.initial.catalog.name: ${var.catalog}
  etl_cluster_config:
    type: complex
    default:
      spark_version: 14.3.x-scala2.12
      runtime_engine: PHOTON
      spark_conf: ${var.spark_conf}
  ml_cluster_config:
    type: complex
    default:
      spark_version: 14.3.x-cpu-ml-scala2.12
      spark_conf: ${var.spark_conf}

If there is another way to do this, please let me know.

Thanks!

andrewnester added the Enhancement label Aug 1, 2024
@yb-yu

yb-yu commented Sep 24, 2024

I'm experiencing a similar issue. I have many jobs (~ 200) with the same settings except for the number of workers, which is different for each job. I have tuned the number of workers according to the resources each job uses, so I'd like to manage it this way.

However, this is currently not possible with complex variables. I'm looking for a way to override the setting like the following:

resources:
  jobs:
    aa:
      ...
      job_clusters:
      - job_cluster_key: default
        new_cluster: 
          <<: ${var.legacy-multi-node-job}
          num_workers: 1
    bb:
      ...
      job_clusters:
      - job_cluster_key: default
        new_cluster:
          <<: ${var.legacy-multi-node-job}
          num_workers: 7

    ... # many jobs with different num of workers

While it is possible to use native YAML anchors, the jobs are spread across multiple YAML files, and since anchors have to be declared in each file, they are hard to maintain when the configuration changes, so I prefer not to use them.
