Empty results of some probes #212

alexey-yarmosh · 2022-10-05T20:35:34Z

Sometimes some of the probes of finished measurements have empty result fields. Maybe the data is not sent by the probe, or the API didn't handle it properly. We need to address it.

Steps to reproduce (may not reproduce every time):

send measurement request with the body

{
    "target": "yarmosh.by",
    "type": "ping",
    "measurementOptions": {
        "packets": 16
    },
    "limit": 500,
    "locations": []
}

wait until measurement status is finished
check all the result fields
Expected value:
some valid result data
Actual value:

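The check in the last step ("check all the result fields") can be sketched roughly as follows. This is only an illustration: the response shape is simplified and hypothetical, `findEmptyResults` is a made-up helper, and only the `rawOutput` field name comes from this thread.

```typescript
// Hypothetical, simplified shape of a finished measurement response.
// Only `rawOutput` is taken from the discussion; the rest is assumed.
interface ProbeResult {
  probe: { city: string; country: string };
  result: { rawOutput?: string };
}

interface Measurement {
  status: string;
  results: ProbeResult[];
}

// Returns the probes whose result carries no usable output.
function findEmptyResults(measurement: Measurement): ProbeResult[] {
  return measurement.results.filter(
    ({ result }) => (result.rawOutput ?? '').trim() === '',
  );
}

const sample: Measurement = {
  status: 'finished',
  results: [
    { probe: { city: 'Warsaw', country: 'PL' }, result: { rawOutput: 'PING yarmosh.by ...' } },
    { probe: { city: 'Berlin', country: 'DE' }, result: { rawOutput: '' } },
    { probe: { city: 'Paris', country: 'FR' }, result: {} },
  ],
};

const emptyOnes = findEmptyResults(sample);
console.log(emptyOnes.map(r => r.probe.city)); // [ 'Berlin', 'Paris' ]
```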

MartinKolarik · 2022-10-06T13:32:00Z

Isn't this a case of the measurement timing out? I believe we don't have any explicit way to indicate per-measurement errors so this is what happens then.

jimaek · 2022-10-06T13:37:24Z

The question is why it timed out. Our current probes should be reliable enough to work. If you give me some probes I can try to pull the local logs.

We probably need a status field to clearly mark what happened in each result. Did the whole probe time out because of bad internet, meaning the target endpoint is not at fault? Did it time out because of an exception, in which case we need to fix it? Did the API fail, so we need to fix something? Or did the target endpoint fail, so the user can safely mark it as down?

patrykcieszkowski · 2022-10-06T13:53:00Z

Isn't this a case of the measurement timing out? I believe we don't have any explicit way to indicate per-measurement errors so this is what happens then.

No. A measurement timeout is handled by the server, and the server doesn't include any other value than rawOutput.

alexey-yarmosh · 2023-01-06T13:31:37Z

Such output may occur for several reasons:

stdout of the result is empty (code)
stdout of the execa error is empty (code)
error happened but it is not execa error (code)

We also put the reason for a failed measurement (e.g. "Private IP ranges are not allowed") in rawOutput, which is otherwise used for the execa stdout.

I think we should add a status field to the measurement, use rawOutput only for execa output, and add a message field for the reason of the failure. E.g.:

successful measurement

"result": {
  "status": "success",
  "message": "",
  "rawOutput": "PING ...",
  "resolvedAddress": "...",
  "resolvedHostname": "...",
  "timings": [...],
  "stats": {...}
}

stdout of the execa error is empty

"result": {
  "status": "failed",
  "message": "execa error",
  "rawOutput": "",
  "resolvedAddress": null,
  "resolvedHostname": null,
  "timings": [],
  "stats": {...}
}

error happened but it is not execa error

"result": {
  "status": "failed",
  "message": "cannot get property a of undefined",
  "rawOutput": "PING ...",
  "resolvedAddress": null,
  "resolvedHostname": null,
  "timings": [],
  "stats": {...}
}

private ip

"result": {
  "status": "failed",
  "message": "Private IP ranges are not allowed",
  "rawOutput": "",
  "resolvedAddress": null,
  "resolvedHostname": null,
  "timings": [],
  "stats": {...}
}
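The four examples above share one shape, which could be written down as a type. This is only a sketch of the proposal: the field names come from the examples, while the `PingResult` type and the `failedResult` helper are hypothetical names introduced for illustration.

```typescript
// Field names follow the proposal above; the type and helper names are assumptions.
type ResultStatus = 'success' | 'failed';

interface PingResult {
  status: ResultStatus;
  message: string; // reason for the failure, empty on success
  rawOutput: string; // command output only
  resolvedAddress: string | null;
  resolvedHostname: string | null;
  timings: unknown[];
  stats: Record<string, unknown>;
}

// Builds a failed result with the failure reason kept out of rawOutput.
function failedResult(message: string, rawOutput = ''): PingResult {
  return {
    status: 'failed',
    message,
    rawOutput,
    resolvedAddress: null,
    resolvedHostname: null,
    timings: [],
    stats: {},
  };
}

const privateIp = failedResult('Private IP ranges are not allowed');
const execaError = failedResult('execa error');
console.log(privateIp.status, JSON.stringify(privateIp.rawOutput));
```

With this shape a client can branch on `status` alone and treat `message` as optional detail.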

alexey-yarmosh · 2023-01-09T07:32:32Z

@jimaek @MartinKolarik what do you think?

MartinKolarik · 2023-01-09T07:42:37Z

Not sure if the message is necessary. Also, in the third case, we shouldn't expose the error message, only log it. Probe-level status maybe makes sense, just need to think how it interacts with the overall measurement status.

alexey-yarmosh · 2023-01-09T07:54:09Z

Not sure if the message is necessary

The idea is not to mix the reason for the status with the raw output.

Also, in the third case, we shouldn't expose the error message, only log it

@jimaek, how can we get logs from the probe?

Not sure if the message is necessary. Also, in the third case, we shouldn't expose the error message, only log it. Probe-level status maybe makes sense, just need to think how it interacts with the overall measurement status.

I think overall measurement status logic may be left as is. We can just add "in progress" (or "not finished") status for progress messages. So when the overall measurement is finished we understand that all "in progress" measurements mean timeout.

jimaek · 2023-01-09T07:59:56Z

Probe logs are local and if we don't control it we have no way to access them

patrykcieszkowski · 2023-01-18T15:29:26Z

Probe logs are local and if we don't control it we have no way to access them

maybe that's a potential idea for another feature - save probe logs to file, and make it accessible through the API?

jimaek · 2023-01-18T15:32:59Z

It sounds nice, but I can imagine it would be too complicated. e.g. hardware probes are read-only, there is no way for the user to interact. We could build a system to trigger log upload on adopted probes via the dashboard but that sounds like a huge project to implement correctly and securely

alexey-yarmosh · 2023-01-27T12:01:22Z

So the new field that we are adding is status: "not-finished" | "success" | "failed". At the very start all probes are "not-finished", and when the global measurement is "finished" we stop all probe updates. So we are able to see that some probes are "success", some are "failed", and some were not finished within 30 sec and still have the "not-finished" status, which means "timeout".

Pushing "Private IP ranges are not allowed" to rawOutput makes sense if some globalping-based app shows rawOutput to the end user. If that is true, keeping the message there seems valid.

Regarding sending logs, maybe we can use socket.io for that? On the probe side it can be a simple socket.emit('probe:log:info', someData) or socket.emit('probe:log:error', error) in error handlers. On the API side we will listen and log those messages with a special prefix. Or something like the winston-socket.io transport can be used as well.
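A minimal sketch of that idea, with the socket replaced by a plain function so the example stays self-contained. The event names come from the comment above; `forwardLog`, `formatProbeLog`, the `Emit` type and the prefix format are hypothetical.

```typescript
// Stand-in for the emitting side of a socket. It is called the same way as
// socket.emit(event, payload) in the comment above.
type Emit = (event: string, payload: unknown) => void;

type LogLevel = 'info' | 'error';

// Probe side: forward a log entry to the API.
function forwardLog(emit: Emit, level: LogLevel, data: unknown): void {
  // JSON.stringify(new Error('x')) yields '{}', so copy the useful fields.
  const payload = data instanceof Error
    ? { name: data.name, message: data.message, stack: data.stack }
    : data;
  emit(`probe:log:${level}`, payload);
}

// API side: format a received entry with a prefix identifying the probe.
function formatProbeLog(probeId: string, level: LogLevel, payload: unknown): string {
  return `[probe ${probeId}] [${level}] ${JSON.stringify(payload)}`;
}

// Usage with an in-memory "socket" that records what was emitted.
const sent: Array<{ event: string; payload: unknown }> = [];
const fakeEmit: Emit = (event, payload) => { sent.push({ event, payload }); };

forwardLog(fakeEmit, 'info', { command: 'ping', target: 'yarmosh.by' });
forwardLog(fakeEmit, 'error', new Error('spawn failed'));

const lines = sent.map(({ event, payload }) =>
  formatProbeLog('probe-1', event === 'probe:log:error' ? 'error' : 'info', payload));
console.log(lines.join('\n'));
```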

MartinKolarik · 2023-01-30T14:12:13Z

Let's use values in-progress, finished, and failed. When the global measurement timer runs out, change all probes that are in-progress to failed and maybe add a message "The request timed out" to rawOutput. That way the clients can easily differentiate between success/error and also provide details to the users in a similar way like for the "Private IP ranges are not allowed" case.

alexey-yarmosh · 2023-01-30T14:58:02Z

It seems we could then overwrite a long-running script's rawOutput with the "The request timed out" value; not sure if that is OK.

MartinKolarik · 2023-01-30T15:00:48Z

Assuming the rawOutput is empty when the timeout is reached, it definitely makes sense to me. If there's already some output, maybe add it as the last line?
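The timeout handling discussed in the last few comments could look roughly like this. The status values and the message text come from the thread; `markTimedOut` and the reduced `ProbeResult` type are hypothetical, and the blank line used as separator is an assumption.

```typescript
type ProbeStatus = 'in-progress' | 'finished' | 'failed';

// Reduced to the two fields this sketch needs.
interface ProbeResult {
  status: ProbeStatus;
  rawOutput: string;
}

const TIMEOUT_MESSAGE = 'The request timed out';

// Marks a still-running result as failed. Partial output is kept and the
// timeout message is added as the last line.
function markTimedOut(result: ProbeResult): ProbeResult {
  if (result.status !== 'in-progress') {
    return result; // finished or failed results are left untouched
  }
  const rawOutput = result.rawOutput === ''
    ? TIMEOUT_MESSAGE
    : `${result.rawOutput}\n\n${TIMEOUT_MESSAGE}`;
  return { ...result, status: 'failed', rawOutput };
}

const emptyCase = markTimedOut({ status: 'in-progress', rawOutput: '' });
const partialCase = markTimedOut({ status: 'in-progress', rawOutput: 'PING yarmosh.by ...' });
const doneCase = markTimedOut({ status: 'finished', rawOutput: 'PING ok' });
console.log(emptyCase, partialCase, doneCase);
```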

alexey-yarmosh · 2023-01-30T15:04:05Z

or timeout may be one of the status values

MartinKolarik · 2023-01-30T15:07:00Z

I considered that but don't think we should have a special type for each type of error. From the client's perspective, it's simply a failure and the raw output has the message with additional details that can be shown to the user.

alexey-yarmosh · 2023-01-30T15:16:10Z

OK, another thing is the extra load on Redis. As I see it, the only way to do that is:

const measurement = this.getMeasurement(id);
const ids = measurement.results.filter(filterInProgress);
await Promise.all(ids.map(id => this.redis.json.set(key, `$.results.${id}.result.status`, 'failed')));

So under high load, when timeouts happen more frequently, we will have even more Redis writes.

MartinKolarik · 2023-01-30T15:21:50Z

We can do one set on the whole JSON document, can't we? Not sure how exactly redis implements the individual field operations, but I imagine if there are more than a few fields to update, the whole document set might be faster.

alexey-yarmosh · 2023-01-30T15:26:32Z

Yes, it seems a whole-document set is not a perfect option, but a better one.
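A rough sketch of the whole-document variant: update every unfinished probe in memory, then issue a single write. The document shape (results keyed by probe id) is an assumption based on the JSON path in the snippet above, and `failInProgress` and `WriteJson` are hypothetical names. The write is modelled as a synchronous callback to keep the example small; a real Redis client would be asynchronous.

```typescript
type ProbeStatus = 'in-progress' | 'finished' | 'failed';

// Assumed document shape: results keyed by probe id.
interface Measurement {
  results: Record<string, { result: { status: ProbeStatus; rawOutput: string } }>;
}

// Stand-in for a JSON document write at the given path.
type WriteJson = (path: string, value: unknown) => void;

// Fails all in-progress probes and writes the document once.
function failInProgress(measurement: Measurement, write: WriteJson): number {
  let changed = 0;
  for (const id of Object.keys(measurement.results)) {
    const entry = measurement.results[id];
    if (entry && entry.result.status === 'in-progress') {
      entry.result.status = 'failed';
      changed += 1;
    }
  }
  // One write, regardless of how many probes were updated.
  if (changed > 0) write('$', measurement);
  return changed;
}

const doc: Measurement = {
  results: {
    a: { result: { status: 'in-progress', rawOutput: '' } },
    b: { result: { status: 'finished', rawOutput: 'PING ...' } },
    c: { result: { status: 'in-progress', rawOutput: '' } },
  },
};

let writes = 0;
const changed = failInProgress(doc, () => { writes += 1; });
console.log(changed, writes); // 2 1
```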

alexey-yarmosh · 2023-02-10T08:56:40Z

So, we added statuses and they help with timeouts, but there are still a few cases when we are able to get the same {status: 'failed', rawOutput: ''} result:

stdout of the result is empty (code)
stdout of the execa error is empty (code)
error happened but it is not execa error (code)

We need to differentiate them, possible options are:

Option 1: Add messages to the rawOutput.

stdout of the result is empty: remain as is.
stdout of the execa error is empty: add a message to the rawOutput field, like result.rawOutput += '\n\nExeca error happened.'
error happened but it is not an execa error: change the response from {status: 'failed', rawOutput: ''} to {status: 'failed', rawOutput: 'Runtime error happened.'}

Option 2: Use exitCode from execa response.

stdout of the result is empty: add an exitCode field to the result (it should be 0 all the time).
stdout of the execa error is empty: add an exitCode field to the result (it should be non-0 all the time).
error happened but it is not an execa error: change the response from {status: 'failed', exitCode: 0, rawOutput: ''} to {status: 'failed', exitCode: 0, rawOutput: 'Runtime error happened.'}

Option 3: Use other execa fields.

Possible fields are failed, timedOut, isCanceled, killed. We can use them in the same way exitCode is used in option 2.

Addendum: here is an example of an execa error object:

{
  shortMessage: "Command failed with exit code 68: unbuffer ping -c 3 -i 0.2 asdf.asdf",
  command: "unbuffer ping -c 3 -i 0.2 asdf.asdf",
  escapedCommand: "unbuffer ping -c 3 -i 0.2 asdf.asdf",
  exitCode: 68, // for some reason it is 68 for ping of unknown host
  signal: undefined,
  signalDescription: undefined,
  stdout: "ping: cannot resolve asdf.asdf: Unknown host",
  stderr: "",
  failed: true,
  timedOut: false,
  isCanceled: false,
  killed: false,
}

Here is an example of a successful execa result object:

{
  command: "unbuffer ping -c 3 -i 0.2 google.com",
  escapedCommand: "unbuffer ping -c 3 -i 0.2 google.com",
  exitCode: 0,
  stdout: "PING google.com (142.250.75.14): 56 data bytes\n64 bytes from 142.250.75.14: icmp_seq=0 ttl=120 time=7.826 ms\n64 bytes from 142.250.75.14: icmp_seq=1 ttl=120 time=6.043 ms\n64 bytes from 142.250.75.14: icmp_seq=2 ttl=120 time=7.785 ms\n\n--- google.com ping statistics ---\n3 packets transmitted, 3 packets received, 0.0% packet loss\nround-trip min/avg/max/stddev = 6.043/7.218/7.826/0.831 ms",
  stderr: "",
  all: undefined,
  failed: false,
  timedOut: false,
  isCanceled: false,
  killed: false,
}

The best option from my perspective is to use the exitCode field - we are not polluting rawOutput, and we can clearly see whether the problem is in the JS code or in the execa command.
@jimaek @MartinKolarik please share your thoughts.

MartinKolarik · 2023-02-16T09:29:50Z

From the user's perspective, what matters is that the test failed and that there is an indicative error message. That means failed status and a message added to rawOutput because that's what the tools will show.
The exit code, on the other hand, is worthless to the user and also binds us to a specific implementation. What if we later reimplement the DNS command so that it's executed within node and there are no exit codes at all?

Option 1 makes the most sense to me. The message itself may change depending on the exit code (or depending on the execa fields as in option 3) if you feel it explains the problem better to the user / helps us in debugging.
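As an illustration of that approach, the message appended to rawOutput could be derived from the execa fields listed earlier. This is a sketch only: the field names match the execa objects shown above, while `failureMessage`, `toRawOutput`, the `CommandOutcome` type and the message wording are hypothetical.

```typescript
// Subset of the fields shown in the execa examples above.
interface CommandOutcome {
  stdout: string;
  exitCode?: number;
  failed: boolean;
  timedOut: boolean;
  isCanceled: boolean;
  killed: boolean;
}

// Hypothetical wording; the thread does not fix the exact messages.
function failureMessage(outcome: CommandOutcome): string {
  if (outcome.timedOut) return 'The command timed out.';
  if (outcome.isCanceled) return 'The command was canceled.';
  if (outcome.killed) return 'The command was terminated.';
  return `The command failed with exit code ${outcome.exitCode ?? 'unknown'}.`;
}

// Builds the rawOutput value: stdout for a success, stdout plus a message for a failure.
function toRawOutput(outcome: CommandOutcome): string {
  if (!outcome.failed) return outcome.stdout;
  const message = failureMessage(outcome);
  return outcome.stdout === '' ? message : `${outcome.stdout}\n\n${message}`;
}

const unknownHost: CommandOutcome = {
  stdout: 'ping: cannot resolve asdf.asdf: Unknown host',
  exitCode: 68,
  failed: true,
  timedOut: false,
  isCanceled: false,
  killed: false,
};

const silentTimeout: CommandOutcome = { ...unknownHost, stdout: '', timedOut: true };

console.log(toRawOutput(unknownHost));
console.log(toRawOutput(silentTimeout));
```

The client still sees only status and rawOutput, so the exit code stays an implementation detail that can change if a command is later reimplemented without execa.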

jimaek · 2023-02-16T10:23:35Z

Maybe we can also locally log with a lot more detail when issues like that happen? To make it easier to debug in case a user complains that a test failed.

alexey-yarmosh · 2023-02-16T11:22:56Z

Sounds valid, I'll prepare a draft PR for a ping command.

alexey-yarmosh · 2023-02-16T14:00:03Z

Here it is:
jsdelivr/globalping-probe#114

alexey-yarmosh · 2023-02-20T16:59:54Z

So, there are only 4 faulty probes; they constantly respond with “Test failed. Please try again.” for dns, ping, traceroute and mtr measurements. The http type works fine. It seems these probes are behind a firewall or something else that blocks everything except http.

Issue is still valid so I am not closing it. But to proceed we should either:

implement Get logs from the probes #289 to explicitly see what is going on there (it is unlikely we will be able to fix them anyway);
implement Exponential backoff for probes #52 and remove these probes from the system or from specific measurement type.

alexey-yarmosh · 2023-03-22T10:11:49Z

Fixed by #52

alexey-yarmosh changed the title ~~Lost data from a probe~~ Empty results of some probes Oct 5, 2022

jimaek assigned alexey-yarmosh Dec 19, 2022

This was referenced Feb 1, 2023

feat: measurement statuses jsdelivr/globalping-probe#108

Merged

feat: add measurement statuses #283

Merged

alexey-yarmosh mentioned this issue Feb 13, 2023

RequestError: Timeout awaiting 'request' for 15000ms #288

Closed

alexey-yarmosh mentioned this issue Feb 16, 2023

feat: add logging of errors or unexpected behavior jsdelivr/globalping-probe#114

Merged

alexey-yarmosh closed this as completed Mar 22, 2023
