First and foremost, thank you for writing this paper; it was very intriguing and informative. I have a question that arose during my reading.
What are the conceptual benefits when the supervisor model (GPT-4V) is included in the LVLM pool? Wouldn't this approach inherently bias the outcomes towards the decisions of GPT-4V? If so, how does the ensemble benefit in this scenario?
Thank you for engaging with our paper. We appreciate your thoughtful question and the opportunity to clarify the inclusion of the GPT-4V model in our study.
GPT-4V is included in our LVLM pool because it is a representative, readily accessible commercial LVLM. As highlighted in the preliminary study on GPT-4V (refer to link), it stands out as one of the most powerful LVLMs currently available. Its strong performance is also the basis for its role as the annotator in our ensemble.
Concerning the potential bias towards GPT-4V outcomes, particularly in the annotated ratings, we acknowledge that GPT-4V annotations may be unreliable or biased. To address this, we conducted a correlation analysis (refer to Paragraph 3 in Sec 2.4) comparing annotations from human annotators with those from GPT-4V. This analysis revealed an average agreement rate of 83.1%, demonstrating substantial alignment between human and GPT-4V annotations.
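For concreteness, an agreement rate like the 83.1% figure above is typically computed as the fraction of items on which two annotators give the same label. The sketch below is illustrative only (the function name and label format are assumptions, not taken from the paper):

```python
def agreement_rate(human_labels, model_labels):
    """Fraction of items where the human and model annotations match.

    Both arguments are equal-length sequences of labels (e.g. ratings).
    This is a plain percent-agreement measure; the paper may use a
    different or more elaborate correlation statistic.
    """
    if len(human_labels) != len(model_labels):
        raise ValueError("annotation lists must have the same length")
    matches = sum(h == m for h, m in zip(human_labels, model_labels))
    return matches / len(human_labels)
```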
Moreover, in the experiments involving DPO, we implemented a "GPT-4V always best" strategy, in which the GPT-4V response was consistently chosen as the preferred one in each DPO pair. Notably, even this simple heuristic significantly outperformed the original backbone model. This outcome suggests that biasing decisions towards GPT-4V is not a one-size-fits-all solution for performance improvement, underscoring the nuanced nature of model-ensemble dynamics.
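To illustrate the heuristic, the sketch below builds DPO preference pairs with the supervisor's response always taken as "chosen" and the other pool models' responses as "rejected". The function and data layout are hypothetical, not the authors' actual pipeline:

```python
def build_dpo_pairs(responses_by_prompt, supervisor="gpt-4v"):
    """Build (prompt, chosen, rejected) triples under the
    "GPT-4V always best" heuristic.

    responses_by_prompt: {prompt: {model_name: response_text}}
    The supervisor's response is always "chosen"; every other
    model's response becomes a "rejected" counterpart.
    """
    pairs = []
    for prompt, responses in responses_by_prompt.items():
        chosen = responses[supervisor]  # supervisor is always preferred
        for model, rejected in responses.items():
            if model != supervisor:
                pairs.append({"prompt": prompt,
                              "chosen": chosen,
                              "rejected": rejected})
    return pairs
```

Each resulting dict matches the common (prompt, chosen, rejected) format that DPO training expects, with one pair per non-supervisor model in the pool.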
We hope this provides clarity on the conceptual benefits of incorporating GPT-4V into our LVLM pool and how potential biases are addressed and validated in our study. If you have any further questions or require additional information, please feel free to ask.