Method
In `biosample_set.xml.gz`, each biosample has a mix of potential coordinate and/or text attributes that provide geospatial context. The following describes a method to extract as much geospatial data as possible in the form of x/y coordinates.
BioSample attributes are not standardized, so a search must be done to (1) find potentially useful attributes (e.g. `location`, `province`) and (2) remove false-positive values (e.g. a `location` of `tumor` or `tissue` is not useful). The search pattern is configured here.
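As a rough illustration of the two-part search described above, the sketch below filters a biosample's attributes by key and then drops false-positive values. The pattern contents and the function name `candidate_text_attributes` are hypothetical; the project's configured search pattern may differ.

```python
import re

# Hypothetical patterns for illustration only; not the project's configured search pattern.
KEY_PATTERN = re.compile(r"location|province|geo_loc_name", re.IGNORECASE)
FALSE_POSITIVE_VALUES = re.compile(r"\b(tumor|tissue)\b", re.IGNORECASE)


def candidate_text_attributes(attributes: dict[str, str]) -> dict[str, str]:
    """Keep attributes whose key looks geographic and whose value is not a known false positive."""
    return {
        key: value
        for key, value in attributes.items()
        if KEY_PATTERN.search(key) and not FALSE_POSITIVE_VALUES.search(value)
    }
```

For example, an attribute `location` with the value `tumor` passes the key check but is removed by the value check, while an unrelated key such as `host` never passes the key check.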
biosample -> coordinate_x/y(nullable), geo_text(nullable)
geo_text -> coordinate_x/y
biosample -> coordinate_x/y, geocoded_text(nullable)
- (src) Extract from raw data. For each biosample:
  - (src) For each potential coordinate attribute:
    - Check for the existence of a digit in the value. If no digit is present, no coordinates can be extracted. Continue to the next potential attribute.
    - Take the first value with a digit*. Try extracting coordinates by regular expression.
  - [If no coordinate values are taken] For each potential text attribute:
    - Convert the text value to lowercase and remove null-encoding substrings.
    - Take the first non-empty value*.
  - If no values were extracted in the steps above, then no geospatial data can be associated with this biosample.
  - Upload the biosample entry to SQL table 1.
- Reduce values of text attributes to a minimal set. For each value in the set:
  - (src) Geocode with Amazon Location Service. This returns a list of potential coordinates. Take the first result.
  - Upload entry to SQL table 2.
- Apply geocoding to applicable biosamples. For each biosample in the first SQL table that has geospatial data extracted:
  - If the data is a text value, look up its coordinates in SQL table 2 and set `geocoded_text` to the text value.
  - Upload the biosample entry to SQL table 3.
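The extraction step for a single biosample might look roughly like the sketch below. It is not the project's implementation: the regular expression, the null tokens, and the function names are hypothetical. It also assumes latitude precedes longitude in the raw value, that x is longitude and y is latitude, and that `geo_text` is a plain string.

```python
import re

# Hypothetical pattern: two decimals separated by whitespace or punctuation,
# each optionally followed by a hemisphere letter (e.g. "38.98 N 77.11 W").
COORD_PATTERN = re.compile(
    r"(-?\d+(?:\.\d+)?)\s*°?\s*([NS])?[\s,;/]+(-?\d+(?:\.\d+)?)\s*°?\s*([EW])?",
    re.IGNORECASE,
)
# Hypothetical null-encoding substrings.
NULL_TOKENS = ("not applicable", "not collected", "missing")


def extract_coordinates(values):
    """Return (x, y) from the first value that contains a digit, else None."""
    for value in values:
        if not any(ch.isdigit() for ch in value):
            continue  # no digit, so no coordinates can be extracted
        match = COORD_PATTERN.search(value)
        if match is None:
            return None  # only the first digit-bearing value is tried
        lat, ns, lon, ew = match.groups()
        y = -float(lat) if ns and ns.upper() == "S" else float(lat)
        x = -float(lon) if ew and ew.upper() == "W" else float(lon)
        return x, y
    return None


def extract_text(values):
    """Return the first value that is non-empty after lowercasing and null removal."""
    for value in values:
        text = value.lower()
        for token in NULL_TOKENS:
            text = text.replace(token, "")
        text = text.strip(" ,;:")
        if text:
            return text
    return None


def extract_geodata(coordinate_values, text_values):
    """Build a table-1 style row: coordinates if found, otherwise text, otherwise None."""
    coords = extract_coordinates(coordinate_values)
    if coords is not None:
        return {"coordinate_x": coords[0], "coordinate_y": coords[1], "geo_text": None}
    text = extract_text(text_values)
    if text is not None:
        return {"coordinate_x": None, "coordinate_y": None, "geo_text": text}
    return None  # no geospatial data can be associated with this biosample
```

Text attributes are only consulted when no coordinates were taken, which mirrors the bracketed condition in the list.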
* No method of determining precedence has been established. Taking the first result is the most naive way to get a value.
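The geocoding and lookup steps could be sketched as follows. This is an illustration under assumptions, not the project's code: the function names are hypothetical, in-memory dictionaries stand in for the SQL tables, and biosamples whose text fails to geocode are simply skipped. `client` is expected to behave like `boto3.client("location")`, whose `search_place_index_for_text` call returns results with a `[longitude, latitude]` point.

```python
def geocode_first(client, index_name, text):
    """Return (x, y) of the first geocoding result, or None if nothing was found."""
    response = client.search_place_index_for_text(
        IndexName=index_name, Text=text, MaxResults=1
    )
    results = response.get("Results", [])
    if not results:
        return None
    lon, lat = results[0]["Place"]["Geometry"]["Point"]  # [longitude, latitude]
    return lon, lat


def build_geocode_table(client, index_name, text_values):
    """Geocode each distinct text value once (stand-in for SQL table 2)."""
    unique = sorted({value for value in text_values if value})
    return {text: geocode_first(client, index_name, text) for text in unique}


def apply_geocoding(row, geocode_table):
    """Turn a table-1 style row into a table-3 style row."""
    if row["geo_text"] is None:
        # Coordinates were extracted directly; nothing to look up.
        return {
            "coordinate_x": row["coordinate_x"],
            "coordinate_y": row["coordinate_y"],
            "geocoded_text": None,
        }
    coords = geocode_table.get(row["geo_text"])
    if coords is None:
        return None  # assumption: rows without a geocoding result are skipped
    return {
        "coordinate_x": coords[0],
        "coordinate_y": coords[1],
        "geocoded_text": row["geo_text"],
    }
```

Geocoding the deduplicated set rather than every biosample keeps the number of service calls proportional to the number of distinct text values.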
- Add a dictionary of null value keywords (helps primarily with text)
- Support pairs of coordinate attributes (e.g. `latitude`, `longitude`)
- Determine precedence for multiple coordinate or text attributes.
- Combine text attributes (see section below)
Using `geo_loc_name` if available isn't always a good idea:
{"location": "Sayward_Estuary", "geo_loc_name": "USA: GAZ"}
Different keys can be combined, but this may result in a larger geocoding set.
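To see why combining keys can enlarge the geocoding set, consider this small sketch. The helper `geocoding_set`, the `", "` separator, and the second and third records are hypothetical and only serve the illustration.

```python
def geocoding_set(records, keys):
    """Unique query strings built by joining the chosen keys of each record."""
    queries = set()
    for record in records:
        parts = [record[key] for key in keys if record.get(key)]
        if parts:
            queries.add(", ".join(parts))
    return queries


records = [
    {"location": "Sayward_Estuary", "geo_loc_name": "USA: GAZ"},
    {"location": "Lake Erie", "geo_loc_name": "USA: GAZ"},  # made-up record
    {"geo_loc_name": "USA: GAZ"},  # made-up record
]

single = geocoding_set(records, ["geo_loc_name"])
combined = geocoding_set(records, ["location", "geo_loc_name"])
```

With `geo_loc_name` alone, all three records collapse into one query string. Joining `location` and `geo_loc_name` yields three distinct strings, so three geocoding requests would be needed instead of one.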
All potential attribute keys:
select count(*), cardinality(keys), keys
from (
select array(select jsonb_object_keys(geo_text)) as keys
from biosample3
) as subquery
group by keys
order by count(*) desc