Since its popularization by the Apple Macintosh in 1984, the graphical user interface has become the dominant medium for human-computer interaction. Multimodal language models can understand these interfaces, but struggle to "point" to visual controls with high precision.
The Open Interpreter Project has developed (and freely hosts) an API that locates visual controls with single-pixel precision.
To get the coordinates of any on-screen control, provide its natural language description via the "query" parameter, along with a base64-encoded representation of the computer's display via the "base64" parameter.
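One way to build the "base64" value from a screenshot is to encode the raw PNG bytes as a data URI. This is a minimal sketch; the helper name is illustrative, and only the data-URI prefix shown in the example below comes from this page:

```python
import base64 as b64

def to_data_uri(png_bytes: bytes) -> str:
    """Encode raw PNG bytes as a data URI suitable for the "base64" parameter."""
    encoded = b64.b64encode(png_bytes).decode('ascii')
    return 'data:image/png;base64,' + encoded
```

Any screenshot library that yields PNG bytes (reading a saved screenshot file, for instance) can feed this helper.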
import requests

# Natural-language description of the on-screen control to locate
query = 'blue bell icon'

# Base64-encoded screenshot of the display, as a data URI
base64 = 'data:image/png;base64,...'

response = requests.post(
    'https://api.openinterpreter.com/v0/point/',
    params={'query': query, 'base64': base64},
)
response.raise_for_status()
response.json()  # [(x, y)], a list of relative coordinates, one per matching instance
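The returned coordinates are relative. Assuming they are normalized to the [0, 1] range (an assumption, not something this page specifies), they can be mapped onto a concrete display resolution like so:

```python
def to_pixels(relative_point, screen_width, screen_height):
    """Map a relative (x, y) pair onto a display resolution.

    Assumes relative coordinates are normalized to [0, 1]; this is an
    assumption, not something the API documentation states.
    """
    x, y = relative_point
    return (round(x * screen_width), round(y * screen_height))

# e.g. the first match on a 1920x1080 display:
# to_pixels((0.5, 0.25), 1920, 1080) -> (960, 270)
```

The resulting pixel coordinates can then be passed to whatever automation layer performs the click.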
This API was designed in Seattle for code-interpreting language models. Last modified March 10th, 2024.