Beta API questions

Hi! I have some new questions about the Beta API:

  1. Is the API documentation available in any easy-to-parse format, like JSON schema or anything else? (so that I can make a quick wrapper for the API, instead of copy-pasting from the webpage)

  2. The vocabupdates endpoint documentation mentions the fields parameter (for Vocabs!), but it doesn’t seem to work (neither for Vocab fields nor for VocabUpdates).

  3. It wasn’t very obvious for me that to add a new word for a user (like Quick Add on the legacy website), I need to modify some list. I thought instead of using Vocabs (which are neutral common entities), I have to do something with Items and therefore I spent some time trying to use POST on the items endpoint (it says “Creates new Items”).
    So I just suggest to clarify this a bit somewhere in the docs. And to explain in general what is the conceptual difference between Vocabs and Items.

  4. Related question: is there any simple way to Quick Add a word? Or otherwise, how does it work exactly? I suppose that it takes the latest custom user list and chooses some section in it. But it takes several requests to do, so I thought that there might be a simpler way.

  5. From responses I see that if a Vocab has an audio attribute, it also has audioURL, which holds the same value, and a more interesting audios attribute, which is an array of objects (one is shown below). This is undocumented, and before asking any questions about it, I wonder whether it’s something that is going to stay in the API.

{
  "source": "tan",
  "reading": "hui4",
  "mp3": "http://storage.googleapis.com/skritter_audio/zh/tan/5730999004561408.mp3",
  "writing": null,
  "id": "5730999004561408"
}
  6. The Authorization page mentions an option to get an “anonymous” access token. Does it mean that you can do basically everything, except adding new items to the queue?

Get a token with no specific user authorization. The API calls you make and data you receive is of course limited to data that’s accessible to all Skritter users, but depending on your needs this may be the simplest solution.

  7. Is there any simple way to get all Vocab IDs that a user has learned? As I understand it, requests to the vocabs endpoint are influenced by the user settings, but there is no reference to whether a user has this word in a queue, or has ever learned it. So probably I should call items with the ids_only parameter and then extract Vocab IDs from them (I still have questions about it). Is this correct?

Thanks for making this API publicly available! :clap:
And I understand that it’s still in beta and may change any moment, so some of these questions probably don’t have a proper answer…

  1. Nope, just the website documentation.
  2. The fields (and perhaps limit) parameters probably shouldn’t be listed for that endpoint. It was only built to return an array of Vocab IDs that were changed given a specific date offset.
  3. Two important things to keep in mind when it comes to items are that they are spawned from vocabs, and that vocabs must be bound to a list.
  4. By quick add, do you mean quick adding a word to a vocablist? The only way to do it right now is to GET a vocablist with sections, add the word to a section and then PUT the changes. It’s up to you to determine where these values get added. The way the legacy website works is by using the Miscellaneous list and putting the word in a section with a Month/Year formatted name.
  5. audioURL is the one we typically use and was added because we migrated our audio a while back, but wanted to keep the other audio property around. It’s possible that in the future we might just have an audios property.
  6. Yes, you’ll be able to get basic user profiles, vocabs and a few other things that aren’t specific to a user.
  7. Yes, if you only want learned Vocab IDs you’ll need to get items first; as you’ve mentioned, ids_only would be the fastest. The reason vocabs might not be a good indicator of active items is that when people reset accounts or remove vocablists we still keep copies of the vocab to maintain any custom definitions they might have added.
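
For anyone following along, here is a rough Python sketch of the "add the word to a section" step from answer 4, done on an already-fetched vocablist before it is PUT back. It is only an illustration under assumptions: the field names `sections`, `name`, `rows` and `vocabId`, the "March 2017" style of section name, and the function name `quick_add` are all guesses, not something confirmed in this thread or the docs, and the GET/PUT calls themselves are left out.

```python
import copy
from datetime import date

# Fixed English month names so the section name does not depend on the locale.
MONTHS = (
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
)


def quick_add(vocablist, vocab_id, today):
    """Return a copy of `vocablist` with `vocab_id` in this month's section.

    Assumed (hypothetical) shape: {"sections": [{"name": str, "rows": [{"vocabId": str}]}]}
    """
    updated = copy.deepcopy(vocablist)  # leave the caller's dict untouched
    section_name = f"{MONTHS[today.month - 1]} {today.year}"  # assumed name format

    sections = updated.setdefault("sections", [])
    for section in sections:
        if section.get("name") == section_name:
            break
    else:
        # No section for this month yet, so create one at the end of the list.
        section = {"name": section_name, "rows": []}
        sections.append(section)

    rows = section.setdefault("rows", [])
    # Skip the append if the word is already in this section.
    if all(row.get("vocabId") != vocab_id for row in rows):
        rows.append({"vocabId": vocab_id})
    return updated


# Placeholder IDs; real Vocab IDs will look different.
misc_list = {"name": "Miscellaneous", "sections": [
    {"name": "February 2017", "rows": [{"vocabId": "vocab-a"}]},
]}
changed = quick_add(misc_list, "vocab-b", date(2017, 3, 5))
```

The returned dict would then be the body of the PUT request. Working on a copy keeps the fetched list available for comparison if the PUT fails.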

Thanks for the answers @josh!

  1. OK :disappointed:
  2. OK
  3. OK
  4. OK, although I think currently Quick Add on the legacy website adds it to the most recently used user-list. I don’t know whether it chooses the last section or the one user currently adds words from (currentSection attribute), but I can try and figure that out.
  5. The problem with audioURL is that it offers only one pronunciation, while a syllable may have multiple readings (though there may not be audio for every reading). So if it’s fine, I’m going to rely on the audios property. Also, I just discovered that there is an undocumented endpoint http://beta.skritter.com/api/v0/audio, which I can query by reading. Is it OK to use it?
  6. Cool!
  7. OK

I don’t think the audios property will ever contain individual syllables for multiple-character words, but rather all the recordings we have for that word. If you want to get the individual recordings you’ll need to fetch the individual character vocabs (or use the audio thing you discovered below). I think there might be a hidden parameter called include_contained that does this automatically.

Yes, the api/v0/audio endpoint was originally designed for our internal usage with our client for recording more audio, but you can use it for querying by reading.


I know this :ok_hand: I meant a different thing though: some characters (and probably words?) have multiple readings. For example, 会 has reading: "hui4, kuai4" and audioURL refers to the file for hui4, while audios has both pronunciations:

{
  "writing": "会",
  "reading": "hui4, kuai4",
  "audio": "http://storage.googleapis.com/skritter_audio/zh/tan/5730999004561408.mp3",
  "audioURL": "http://storage.googleapis.com/skritter_audio/zh/tan/5730999004561408.mp3",
  "audios": [
    {
      "source": "tan",
      "reading": "hui4",
      "mp3": "http://storage.googleapis.com/skritter_audio/zh/tan/5730999004561408.mp3",
      "writing": null,
      "id": "5730999004561408"
    },
    {
      "source": "tan",
      "reading": "kuai4",
      "mp3": "http://storage.googleapis.com/skritter_audio/zh/tan/6166047784697856.mp3",
      "writing": null,
      "id": "6166047784697856"
    }
  ]
}
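
In case it helps to show what I mean, this is a minimal Python sketch of how I would group the audios entries by reading and spot readings that have no recording. It only relies on the fields visible in the response above (`reading`, `audios`, `mp3`); the function names are my own, and since the audios property is undocumented its shape may change.

```python
from collections import defaultdict


def audio_urls_by_reading(vocab):
    """Map each reading to the list of mp3 URLs found in vocab["audios"]."""
    by_reading = defaultdict(list)
    for audio in vocab.get("audios") or []:
        by_reading[audio["reading"]].append(audio["mp3"])
    return dict(by_reading)


def readings_without_audio(vocab):
    """List the readings named in vocab["reading"] that have no audios entry."""
    listed = [r.strip() for r in vocab.get("reading", "").split(",") if r.strip()]
    covered = {audio["reading"] for audio in vocab.get("audios") or []}
    return [r for r in listed if r not in covered]


# Trimmed copy of the response above.
sample_vocab = {
    "reading": "hui4, kuai4",
    "audios": [
        {"source": "tan", "reading": "hui4",
         "mp3": "http://storage.googleapis.com/skritter_audio/zh/tan/5730999004561408.mp3"},
        {"source": "tan", "reading": "kuai4",
         "mp3": "http://storage.googleapis.com/skritter_audio/zh/tan/6166047784697856.mp3"},
    ],
}

urls = audio_urls_by_reading(sample_vocab)
missing = readings_without_audio(sample_vocab)
```

Keeping a list per reading leaves room for more than one recording of the same reading.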

Great :thumbsup:


An unrelated general question about the API: have you considered providing a GraphQL interface for the database?
I have no idea, of course, how you store the data and how feasible it would be. It just seems that with the current REST API any meaningful application requires making a chain of several ping-pong requests to different endpoints. If I understand it right (I’ve never worked with it before), a GraphQL API would allow the client to shape the data it wants in the request query.

Ah, looks like I misread your comment regarding the audioURL and audios. Yes, the audios will contain readings for characters that have multiple and it will also include any duplicates we have from different speakers (described in the source property).

I haven’t looked into GraphQL too much, but just glancing at the website again it looks pretty cool. Right now it’s probably not feasible because our data is mostly running from a Google Cloud Datastore (https://cloud.google.com/datastore/). It was a good choice about 10 years ago, but it’s kind of a slow, expensive dinosaur in this day and age. We’re planning on moving to MongoDB in the nearish future, which should increase our query flexibility quite a bit.


Interesting. I wasn’t familiar with Google Cloud-based DB options. It seems that this Cloud Datastore is generally not a bad option, but you say it’s not a good fit for Skritter data… Another similar option I see is the Firebase Realtime Database (EDIT: or is it just for caching?), which seems to be fast and to have a simple (JSON-based) API, but probably isn’t a good option due to the scale/price.
By the way, how big is your data? (if it’s not a secret, of course)

The datastore thing is not great for a slew of reasons. I guess for smaller apps with more static data it might be alright, but for larger apps with lots of moving components it’s rather lacking. 1.) They don’t provide any feasible built-in backup solutions and charge you for using their hacked-together one. For a company with any amount of data this can cost thousands of dollars for a single full backup and take hours. 2.) In comparison to other NoSQL databases it’s very limited in how you can query and access data. 3.) We’ve noticed that even for some simple queries there is just some inherent lag when compared with other services. It might only be 50-200ms extra, but across a large volume of requests that is a huge performance hit.

The realtime database is cool if you’ve got things like leaderboards, but would be a bad choice for storing large amounts of data. It’d essentially be like the datastore, but even more limited.

OK. I see your reasons. Thanks for explaining.
You mentioned the scale of the data, but didn’t say anything about the size explicitly, which makes me even more curious about it :sweat_smile: