Victor Lin edited this page Oct 4, 2023 · 7 revisions

In biosample_set.xml.gz, each biosample may carry a mix of coordinate and/or text attributes that provide geospatial context. The following describes a method to extract as much of that geospatial data as possible in the form of x/y coordinates.

Search pattern

BioSample attributes are not standardized, so a search must be done to (1) find potentially useful attributes (e.g. location, province) and (2) remove false-positive values (e.g. a location of a tumor or tissue is not useful). The search pattern is configured here.
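
As a rough illustration of the two-part search, the sketch below filters one biosample's attributes in Python. The two patterns and the name `candidate_attributes` are hypothetical and chosen only to mirror the examples above; the configured search pattern may differ, and treating "tumor" and "tissue" as value-level exclusions is one reading of the false-positive rule.

```python
import re

# Hypothetical patterns for illustration; not the project's configured search pattern.
KEY_PATTERN = re.compile(r"location|province|geo_loc_name", re.IGNORECASE)
FALSE_POSITIVE_VALUE = re.compile(r"\b(tumor|tissue)\b", re.IGNORECASE)


def candidate_attributes(attributes):
    """Keep (key, value) pairs whose key looks geospatial and whose value is not a known false positive."""
    return [
        (key, value)
        for key, value in attributes.items()
        if KEY_PATTERN.search(key) and not FALSE_POSITIVE_VALUE.search(value)
    ]
```

Matching on the key first keeps the false-positive check cheap, since only values of plausible attributes are inspected.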

SQL tables

  1. biosample -> coordinate_x/y(nullable), geo_text(nullable)
  2. geo_text -> coordinate_x/y
  3. biosample -> coordinate_x/y, geocoded_text(nullable)

Steps

  1. (src) Extract from raw data. For each biosample:
    1. (src) For each potential coordinate attribute:
      1. Check for the existence of a digit in the value. If no digit is present, no coordinates can be extracted; continue to the next potential attribute.
      2. Take the first value with a digit*. Try extracting coordinates from it by regular expression.
    2. [If no coordinate values are taken] For each potential text attribute:
      1. Convert text value to lowercase and remove null-encoding substrings.
      2. Take the first non-empty value*.
    3. If no values were extracted in steps above, then no geospatial data can be associated with this biosample.
    4. Upload biosample entry to SQL table 1.
  2. Reduce values of text attributes to a minimal set. For each value in the set:
    1. (src) Geocode with Amazon Location Service. This returns a list of potential coordinates. Take the first result.
    2. Upload entry to SQL table 2.
  3. Apply geocoding to applicable biosamples. For each biosample in the first SQL table that has geospatial data extracted:
    1. If the data is a text value, look up its coordinates in SQL table 2 and set geocoded_text to the text value.
    2. Upload biosample entry to SQL table 3.

* No method of determining precedence has been established. Taking the first result is the most naive way to get a value.
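
The three steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the linked source: the coordinate regular expression, the `NULL_SUBSTRINGS` list, and all function names are hypothetical, `geocode` is a stand-in callable for the Amazon Location Service request, and the SQL tables are represented as plain dicts. It also assumes that when the first digit-bearing value fails to parse, extraction falls through to the text attributes.

```python
import re

# Assumed format: two decimals, each optionally followed by a hemisphere letter,
# e.g. "49.28 N 123.12 W" or "-33.9, 18.4". The project's real expression may differ.
COORD_RE = re.compile(
    r"(-?\d+(?:\.\d+)?)\s*([NS])?[\s,]+(-?\d+(?:\.\d+)?)\s*([EW])?",
    re.IGNORECASE,
)
NULL_SUBSTRINGS = ("not applicable", "not collected", "missing")  # assumed list


def parse_coordinates(value):
    """Return (x, y) = (longitude, latitude), or None if nothing parses."""
    match = COORD_RE.search(value)
    if match is None:
        return None
    lat, ns, lon, ew = match.groups()
    lat, lon = float(lat), float(lon)
    if ns and ns.upper() == "S":
        lat = -abs(lat)
    if ew and ew.upper() == "W":
        lon = -abs(lon)
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        return None
    return lon, lat


def extract_geo(coordinate_values, text_values):
    """Step 1: build the table 1 fields for one biosample, or None."""
    for value in coordinate_values:
        if not any(ch.isdigit() for ch in value):
            continue  # no digit, so no coordinates; try the next attribute
        coords = parse_coordinates(value)  # first value with a digit
        if coords is not None:
            return {"coordinate_x": coords[0], "coordinate_y": coords[1], "geo_text": None}
        break  # assumed reading: only the first digit-bearing value is tried
    for value in text_values:
        text = value.lower()
        for null in NULL_SUBSTRINGS:
            text = text.replace(null, "")
        text = text.strip()
        if text:
            return {"coordinate_x": None, "coordinate_y": None, "geo_text": text}
    return None


def build_geocode_table(rows, geocode):
    """Step 2: geocode each distinct text value once (table 2)."""
    texts = sorted({row["geo_text"] for row in rows if row["geo_text"]})
    table = {}
    for text in texts:
        results = geocode(text)  # stand-in callable returning a list of (x, y)
        if results:
            table[text] = results[0]  # take the first result
    return table


def apply_geocoding(rows, table):
    """Step 3: resolve text rows through table 2 to produce table 3 rows."""
    out = []
    for row in rows:
        text = row["geo_text"]
        if text is None:
            xy = (row["coordinate_x"], row["coordinate_y"])
        else:
            xy = table.get(text)
        if xy is not None:
            out.append({"coordinate_x": xy[0], "coordinate_y": xy[1], "geocoded_text": text})
    return out
```

Callers are expected to drop the `None` results of `extract_geo` before passing rows to the later steps. Geocoding the distinct text values rather than every biosample keeps the number of geocoding requests proportional to the size of the minimal set.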

Ideas for improvement

  1. Add a dictionary of null value keywords (helps primarily with text)
  2. Support pairs of coordinate attributes (e.g. latitude,longitude)
  3. Determine precedence for multiple coordinate or text attributes.
  4. Combine text attributes (see section below)
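
Idea 2 could look roughly like the Python sketch below, which reads latitude and longitude from two separate attributes. The key names in `lat_keys` and `lon_keys` and the function name `pair_coordinates` are assumptions for illustration; the keys that actually occur in BioSample data would need to be surveyed first.

```python
def pair_coordinates(attributes, lat_keys=("latitude", "lat"), lon_keys=("longitude", "lon")):
    """Return (x, y) = (longitude, latitude) from separate attributes, or None."""

    def first_float(keys):
        # Return the first value under any of the keys that parses as a number.
        for key in keys:
            try:
                return float(attributes[key])
            except (KeyError, TypeError, ValueError):
                continue
        return None

    lat = first_float(lat_keys)
    lon = first_float(lon_keys)
    if lat is None or lon is None:
        return None  # a pair needs both halves
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        return None  # out of range (this also rejects NaN)
    return lon, lat
```

Requiring both halves avoids uploading a half-filled coordinate; a biosample with only one of the two attributes would fall back to text extraction.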

Getting useful text

Using geo_loc_name whenever it is available isn't always a good idea. In the following example, location is the more informative attribute:

{"location": "Sayward_Estuary", "geo_loc_name": "USA: GAZ"}

Different keys can be combined, but this may result in a larger set of values to geocode.
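
A small Python sketch of that trade-off, under the assumption that combining simply joins the available values with a comma, broadest attribute first. The `key_order` default and the function name `combine_text` are hypothetical.

```python
def combine_text(geo_text, key_order=("geo_loc_name", "location")):
    """Join the available text attributes into one geocoding query string."""
    parts = [geo_text[key].strip() for key in key_order if geo_text.get(key)]
    return ", ".join(parts)


def distinct_queries(records, combine):
    """Count how many distinct strings would have to be geocoded."""
    return len({combine(record) for record in records})
```

Two biosamples that share a geo_loc_name but differ in location collapse to one geocoding query when only geo_loc_name is used, and to two when the keys are combined, which is the larger set mentioned above.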

All potential attribute keys, listed as each combination of keys together with the number of biosamples that use it:

select count(*), cardinality(keys), keys
from (
	select array(select jsonb_object_keys(geo_text)) as keys
	from biosample3
) as subquery
group by keys
order by count(*) desc